llama.cpp/convert_hf_to_gguf.py
A script that converts Hugging Face models into the GGUF format used by llama.cpp
https://github.com/ggerganov/llama.cpp/blob/master/convert_hf_to_gguf.py
Notes
- torch~=2.2.1
- With Python 3.13, the NumPy dependency fails with an error
- Python 3.11 works fine
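A minimal setup sketch, assuming a Python 3.11 virtual environment and the requirements file shipped in the llama.cpp repository:
python3.11 -m venv .venv
source .venv/bin/activate
# installs torch, numpy, gguf, etc. needed by the conversion scripts
pip install -r requirements.txt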
Example
python convert_hf_to_gguf.py <path to Hugging Face model> --outtype <output format>
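A concrete invocation sketch (the local model directory and output filename below are hypothetical):
# convert a locally downloaded Hugging Face model to a 16-bit GGUF file
python convert_hf_to_gguf.py ./models/SmolLM2-1.7B-Instruct --outtype f16 --outfile SmolLM2-1.7B-Instruct-f16.gguf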
Usage
--vocab-only
    extract only the vocab
--outfile OUTFILE
    path to write to; default: based on input. {ftype} will be replaced by the outtype.
--outtype {f32,f16,bf16,q8_0,tq1_0,tq2_0,auto}
    output format - use f32 for float32, f16 for float16, bf16 for bfloat16, q8_0 for Q8_0, tq1_0 or tq2_0 for ternary, and auto for the highest-fidelity 16-bit float type depending on the first loaded tensor type (default: f16)
--bigendian
    model is executed on big endian machine
--use-temp-file
    use the tempfile library while processing (helpful when running out of memory, process killed)
--no-lazy
    use more RAM by computing all outputs before writing (use in case lazy evaluation is broken)
--model-name MODEL_NAME
    name of the model
--verbose
    increase output verbosity
--split-max-tensors SPLIT_MAX_TENSORS
    max tensors in each split
--split-max-size SPLIT_MAX_SIZE
    max size per split N(M|G)
--dry-run
    only print out a split plan and exit, without writing any new files
--no-tensor-first-split
    do not add tensors to the first split (disabled by default)
--metadata METADATA
    Specify the path for an authorship metadata override file
--print-supported-models
    Print the supported models
--remote
    (Experimental) Read safetensors file remotely without downloading to disk. Config and tokenizer files will still be downloaded. To use this feature, you need to specify Hugging Face model repo name instead of a local directory. For example: 'HuggingFaceTB/SmolLM2-1.7B-Instruct'. Note: To access gated repo, set HF_TOKEN environment variable to your Hugging Face token.
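For instance, the experimental --remote flag reads tensors straight from the Hub, and --dry-run previews a split plan without writing anything. A sketch of both (the repo name follows the help text above; the local path and 4G split size are made-up values):
# read safetensors remotely from the Hub; pass a repo name instead of a local path
# (for a gated repo, also export HF_TOKEN with your Hugging Face token)
python convert_hf_to_gguf.py HuggingFaceTB/SmolLM2-1.7B-Instruct --remote --outtype f16
# preview how the output would be sharded into splits of at most 4 GB, without writing any files
python convert_hf_to_gguf.py ./models/SmolLM2-1.7B-Instruct --split-max-size 4G --dry-run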
--outtype
"--outtype", type=str, choices=["f32", "f16", "bf16", "q8_0", "tq1_0", "tq2_0", "auto"], default="f16", output format - use f32 for float32, f16 for float16, bf16 for bfloat16, q8_0 for Q8_0, tq1_0 or tq2_0 for ternary, and auto for the highest-fidelity 16-bit float type depending on the first loaded tensor type