Last updated: 2026-02-26 (Thu) 22:17:17

llama-bench

https://github.com/ggml-org/llama.cpp/tree/master/tools/llama-bench

usage

  • usage: llama-bench [options]
    
    options:
      -h, --help
      --numa <distribute|isolate|numactl>       numa mode (default: disabled)
      -r, --repetitions <n>                     number of times to repeat each test (default: 5)
      --prio <0|1|2|3>                          process/thread priority (default: 0)
      --delay <0...N> (seconds)                 delay between each test (default: 0)
      -o, --output <csv|json|jsonl|md|sql>      output format printed to stdout (default: md)
      -oe, --output-err <csv|json|jsonl|md|sql> output format printed to stderr (default: none)
      --list-devices                            list available devices and exit
      -v, --verbose                             verbose output
      --progress                                print test progress indicators
      -rpc, --rpc <rpc_servers>                 register RPC devices (comma separated)
    
    test parameters:
      -m, --model <filename>                    (default: models/7B/ggml-model-q4_0.gguf)
      -p, --n-prompt <n>                        (default: 512)
      -n, --n-gen <n>                           (default: 128)
      -pg <pp,tg>                               (default: )
      -d, --n-depth <n>                         (default: 0)
      -b, --batch-size <n>                      (default: 2048)
      -ub, --ubatch-size <n>                    (default: 512)
      -ctk, --cache-type-k <t>                  (default: f16)
      -ctv, --cache-type-v <t>                  (default: f16)
      -t, --threads <n>                         (default: system dependent)
      -C, --cpu-mask <hex,hex>                  (default: 0x0)
      --cpu-strict <0|1>                        (default: 0)
      --poll <0...100>                          (default: 50)
      -ngl, --n-gpu-layers <n>                  (default: 99)
      -ncmoe, --n-cpu-moe <n>                   (default: 0)
      -sm, --split-mode <none|layer|row>        (default: layer)
      -mg, --main-gpu <i>                       (default: 0)
      -nkvo, --no-kv-offload <0|1>              (default: 0)
      -fa, --flash-attn <0|1>                   (default: 0)
      -dev, --device <dev0/dev1/...>            (default: auto)
      -mmp, --mmap <0|1>                        (default: 1)
      -embd, --embeddings <0|1>                 (default: 0)
      -ts, --tensor-split <ts0/ts1/..>          (default: 0)
      -ot --override-tensors <tensor name pattern>=<buffer type>;...
                                                (default: disabled)
      -nopo, --no-op-offload <0|1>              (default: 0)
    
    Multiple values can be given for each parameter by separating them with ','
    or by specifying the parameter multiple times. Ranges can be given as
    'first-last' or 'first-last+step' or 'first-last*mult'.
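
The range notation in the help text above is terse, so here is a minimal Python sketch of how such a value list might expand. It is not the tool's own parser. The helper names `expand_range` and `expand_values` are hypothetical, and the sketch assumes that `last` is inclusive, that a bare `first-last` steps by 1, and that comma-separated items are expanded independently and concatenated.

```python
import re

# Assumed grammar (inferred from the help text): first[-last[(+|*)step]]
_RANGE = re.compile(r"(\d+)(?:-(\d+)(?:([+*])(\d+))?)?")


def expand_range(spec: str) -> list[int]:
    """Expand one 'first-last', 'first-last+step' or 'first-last*mult' item."""
    m = _RANGE.fullmatch(spec.strip())
    if m is None:
        raise ValueError(f"invalid range: {spec!r}")
    first = int(m.group(1))
    # A bare number is treated as a range containing only itself.
    last = int(m.group(2)) if m.group(2) is not None else first
    op = m.group(3) or "+"  # assumption: additive step when no operator is given
    step = int(m.group(4)) if m.group(4) is not None else 1  # assumption: default step is 1

    values = []
    i = first
    while i <= last:  # assumption: 'last' is inclusive
        values.append(i)
        nxt = i + step if op == "+" else i * step
        if nxt <= i:
            # Guards against non-advancing steps such as '+0', '*1' or '0-8*2'.
            raise ValueError(f"range does not advance: {spec!r}")
        i = nxt
    return values


def expand_values(arg: str) -> list[int]:
    """Expand a comma-separated argument such as '512,1024-4096*2'."""
    out = []
    for item in arg.split(","):
        out.extend(expand_range(item))
    return out


if __name__ == "__main__":
    for arg in ("1-4", "0-16+4", "128-512*2", "512,1024-4096*2"):
        print(arg, "->", expand_values(arg))
```

Under these assumptions, `-p 128-512*2` would request the same prompt sizes as `-p 128,256,512`, and `-p 0-16+4` the same as `-p 0,4,8,12,16`.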