Last updated: 2024-04-10 (Wed) 13:18:16

LLM / Quantization

Qx - x-bit quantization
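For intuition about what "x-bit" means, here is a minimal sketch of symmetric round-to-nearest quantization. This is a toy illustration only, not ggml's actual scheme (ggml quantizes block-wise, with a separate scale per block of weights):

```python
import numpy as np

def quantize(x: np.ndarray, bits: int):
    """Toy symmetric round-to-nearest quantization to signed `bits`-bit ints.

    Not ggml's real block-wise kernels: here one scale covers the
    whole tensor, whereas ggml stores a scale per small block.
    """
    qmax = 2 ** (bits - 1) - 1                 # e.g. 7 for 4-bit
    scale = np.abs(x).max() / qmax             # single scale for the tensor
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=1000).astype(np.float32)
for bits in (2, 4, 8):
    q, s = quantize(w, bits)
    err = np.abs(w - dequantize(q, s)).mean()
    print(f"Q{bits}: mean abs error {err:.4f}")  # error shrinks as bits grow
```

Fewer bits means smaller files and faster memory-bound inference, at the cost of larger reconstruction error, which is the trade-off all the Qx variants below navigate.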

Formats

llama.cpp format (GGUF/ggml)
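GGUF files begin with a small fixed binary preamble. A minimal parsing sketch, assuming the layout from the GGUF spec (4-byte magic "GGUF", then little-endian uint32 version, uint64 tensor count, uint64 metadata key-value count); the file name below is synthetic:

```python
import os
import struct
import tempfile

def read_gguf_header(path: str) -> dict:
    """Read the fixed-size GGUF preamble (magic, version, counts).

    Assumed layout per the GGUF spec: b'GGUF' magic, uint32 version,
    uint64 tensor_count, uint64 metadata_kv_count, all little-endian.
    """
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file (magic={magic!r})")
        version, n_tensors, n_kv = struct.unpack("<IQQ", f.read(20))
    return {"version": version, "tensors": n_tensors, "metadata_kv": n_kv}

# Demo against a synthetic header (no real model file needed):
path = os.path.join(tempfile.gettempdir(), "fake.gguf")
with open(path, "wb") as f:
    f.write(b"GGUF" + struct.pack("<IQQ", 3, 291, 24))
print(read_gguf_header(path))
# → {'version': 3, 'tensors': 291, 'metadata_kv': 24}
```

The rest of a real file (metadata key-value pairs, tensor infos, tensor data) follows this preamble and is what tools like llama.cpp actually consume.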

GPTQ

Notes

  • q5_K_M seems to strike a good balance (between size and quality)
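For reference, producing a q5_K_M file from an f16 GGUF is a one-liner with llama.cpp's quantize tool. The binary name varies by llama.cpp version (older builds call it `quantize`, newer ones `llama-quantize`), and the file names here are placeholders:

```shell
# Quantize an f16 GGUF model to Q5_K_M (file names are placeholders).
./llama-quantize ggml-model-f16.gguf ggml-model-Q5_K_M.gguf Q5_K_M
```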

Notes

  • Models with "K" in the name are quantized with the newer "k-quant" method
    • K
    • K_S
    • K_M
    • K_L

Notes

  • LLAMA_FTYPE_MOSTLY_Q2_K uses GGML_TYPE_Q4_K for the attention.vw and feed_forward.w2 tensors, GGML_TYPE_Q2_K for the other tensors.
  • LLAMA_FTYPE_MOSTLY_Q3_K_S uses GGML_TYPE_Q3_K for all tensors
  • LLAMA_FTYPE_MOSTLY_Q3_K_M uses GGML_TYPE_Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K
  • LLAMA_FTYPE_MOSTLY_Q3_K_L uses GGML_TYPE_Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K
  • LLAMA_FTYPE_MOSTLY_Q4_K_S uses GGML_TYPE_Q4_K for all tensors
  • LLAMA_FTYPE_MOSTLY_Q4_K_M uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q4_K
  • LLAMA_FTYPE_MOSTLY_Q5_K_S uses GGML_TYPE_Q5_K for all tensors
  • LLAMA_FTYPE_MOSTLY_Q5_K_M uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q5_K
  • LLAMA_FTYPE_MOSTLY_Q6_K uses 6-bit quantization (GGML_TYPE_Q8_K) for all tensors
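The S/M/L mixes above trade size for quality by upgrading a few sensitive tensors to a higher-bit type. A toy estimate of the resulting effective bits-per-weight; the per-type bpw values are approximate figures for ggml's k-quant types, and the tensor-size shares are made-up illustrative numbers, not measured from a real model:

```python
# Approximate bits-per-weight of ggml k-quant tensor types (assumed values).
BPW = {"Q3_K": 3.4375, "Q4_K": 4.5, "Q5_K": 5.5, "Q6_K": 6.5625}

def effective_bpw(shares: dict) -> float:
    """Weighted average bpw given each type's share of total weights."""
    assert abs(sum(shares.values()) - 1.0) < 1e-9
    return sum(share * BPW[t] for t, share in shares.items())

# Q3_K_S: everything stays Q3_K.
q3_k_s = {"Q3_K": 1.0}
# Q3_K_M upgrades attention.wv, attention.wo and feed_forward.w2 to Q4_K;
# assume (illustratively) those account for 40% of the weights.
q3_k_m = {"Q4_K": 0.4, "Q3_K": 0.6}

print(f"Q3_K_S ≈ {effective_bpw(q3_k_s):.2f} bpw")
print(f"Q3_K_M ≈ {effective_bpw(q3_k_m):.2f} bpw")  # larger: 40% upgraded
```

This is why each _M variant sits between its _S sibling and the next bit width in file size.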

TheBloke

Notes

Related

References