LLM/Quantization
Qx: x-bit quantization (x is the nominal number of bits per weight)
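As a minimal sketch of what x-bit quantization means: a toy symmetric absmax quantizer in Python. The function names are made up for this illustration; real ggml Qx types quantize per block of 32-256 weights (not per tensor) and pack the integers tightly.

```python
import numpy as np

def quantize(w: np.ndarray, bits: int):
    # One absmax scale for the whole array (toy version; ggml uses per-block scales)
    qmax = 2 ** (bits - 1) - 1                 # e.g. 7 for 4-bit signed values
    scale = float(np.abs(w).max()) / qmax or 1.0
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(8).astype(np.float32)
q, s = quantize(w, bits=4)
print(w)
print(dequantize(q, s))                        # coarse reconstruction of w
```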
Formats
llama.cpp format (GGUF/ggml; a header-reading sketch follows this list)
GPTQ
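A small sketch of the GGUF container mentioned above: reading the fixed-size header. This assumes the v2+ layout from the GGUF spec (4-byte magic "GGUF", uint32 version, uint64 tensor count, uint64 metadata-KV count; v1 used uint32 counts), and the file name is hypothetical.

```python
import struct

def read_gguf_header(path: str) -> dict:
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file: magic={magic!r}")
        version, = struct.unpack("<I", f.read(4))            # little-endian uint32
        tensor_count, kv_count = struct.unpack("<QQ", f.read(16))  # two uint64s
    return {"version": version, "tensors": tensor_count, "metadata_kvs": kv_count}

print(read_gguf_header("model-q5_k_m.gguf"))   # hypothetical file name
```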
Notes
- q5_K_M seems to strike a good balance (between file size and quality; see the size sketch below)
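A rough size estimate supports this. The bits-per-weight values below are derived from the base GGML k-quant block layouts (e.g. Q5_K stores 176 bytes per 256 weights); the _S/_M/_L ftypes mix types per tensor, so real files deviate slightly.

```python
# Approximate bits-per-weight of the base GGML k-quant types
BITS_PER_WEIGHT = {
    "Q2_K": 2.625, "Q3_K": 3.4375, "Q4_K": 4.5,
    "Q5_K": 5.5, "Q6_K": 6.5625,
}

def estimated_size_gb(n_params: float, qtype: str) -> float:
    return n_params * BITS_PER_WEIGHT[qtype] / 8 / 1e9

for qtype in BITS_PER_WEIGHT:                  # e.g. for a 7B-parameter model
    print(qtype, f"~{estimated_size_gb(7e9, qtype):.1f} GB")
```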
Notes
- Variants with "K" in the name are models quantized with the newer "k-quant" method (a sketch of its super-block idea follows this list)
- K
- K_S (S = small: smaller file, lower-quality mix)
- K_M (M = medium: balanced mix)
- K_L (L = large: larger file, higher-quality mix)
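A toy sketch of the super-block idea behind the k-quants: two levels of scales, i.e. small quantized per-sub-block scales under one floating-point super-scale. Real GGML_TYPE_Q4_K also stores per-sub-block minimums and packs 256 weights into 144 bytes; this sketch omits all of that.

```python
import numpy as np

def quantize_superblock(w: np.ndarray, bits: int = 4, sub: int = 32):
    assert w.size == 256
    qmax = 2 ** (bits - 1) - 1                        # 7 for 4-bit signed
    blocks = w.reshape(-1, sub)
    sub_scales = np.abs(blocks).max(axis=1) / qmax    # fp scale per sub-block
    d = float(sub_scales.max()) / 63 or 1.0           # super-scale: sub-scales then fit in 6 bits
    q_scales = np.round(sub_scales / d).astype(np.uint8)   # quantized sub-scales, 0..63
    eff = (q_scales * d)[:, None]                     # reconstructed sub-scales
    q = np.clip(np.round(blocks / np.where(eff > 0, eff, 1.0)), -qmax, qmax)
    return q.astype(np.int8), q_scales, d

def dequantize_superblock(q, q_scales, d):
    return (q * (q_scales * d)[:, None]).reshape(-1).astype(np.float32)

w = np.random.randn(256).astype(np.float32)
err = np.abs(w - dequantize_superblock(*quantize_superblock(w))).max()
print(f"max abs error: {err:.4f}")
```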
Notes
From the llama.cpp k-quants pull request:
- LLAMA_FTYPE_MOSTLY_Q2_K: uses GGML_TYPE_Q4_K for the attention.wv and feed_forward.w2 tensors, GGML_TYPE_Q2_K for the other tensors
- LLAMA_FTYPE_MOSTLY_Q3_K_S: uses GGML_TYPE_Q3_K for all tensors
- LLAMA_FTYPE_MOSTLY_Q3_K_M: uses GGML_TYPE_Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K
- LLAMA_FTYPE_MOSTLY_Q3_K_L: uses GGML_TYPE_Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K
- LLAMA_FTYPE_MOSTLY_Q4_K_S: uses GGML_TYPE_Q4_K for all tensors
- LLAMA_FTYPE_MOSTLY_Q4_K_M: uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q4_K
- LLAMA_FTYPE_MOSTLY_Q5_K_S: uses GGML_TYPE_Q5_K for all tensors
- LLAMA_FTYPE_MOSTLY_Q5_K_M: uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q5_K
- LLAMA_FTYPE_MOSTLY_Q6_K: uses 6-bit quantization (GGML_TYPE_Q6_K) for all tensors
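A hypothetical helper restating the list above as a lookup. Names and structure are illustrative only (llama.cpp makes this choice internally in C++), and the "half of the tensors" cases can only be approximated here.

```python
def ggml_type_for(tensor: str, ftype: str) -> str:
    wide = tensor in ("attention.wv", "attention.wo", "feed_forward.w2")
    narrow = tensor in ("attention.wv", "feed_forward.w2")
    table = {
        "Q2_K":   "Q4_K" if narrow else "Q2_K",
        "Q3_K_S": "Q3_K",
        "Q3_K_M": "Q4_K" if wide else "Q3_K",
        "Q3_K_L": "Q5_K" if wide else "Q3_K",
        "Q4_K_S": "Q4_K",
        "Q4_K_M": "Q6_K" if narrow else "Q4_K",   # in reality only half of these tensors
        "Q5_K_S": "Q5_K",
        "Q5_K_M": "Q6_K" if narrow else "Q5_K",   # likewise, half
        "Q6_K":   "Q6_K",
    }
    return table[ftype]

print(ggml_type_for("attention.wv", "Q5_K_M"))   # -> Q6_K
print(ggml_type_for("attention.wq", "Q5_K_M"))   # -> Q5_K
```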