How to run the GGUF model?
Hi,
I see the GGUF files model-q6k.gguf and model-q4k.gguf. How do I run them?
It looks like the original llama.cpp does not support MADLAD?
Waiting for a solution too.
It looks like llama.cpp T5 support was merged two days ago (2024-07-04), but these GGUF files are missing required metadata (the loader finds 0 key-value pairs, so the model architecture is empty):
$ ~/projects/llama.cpp/build/bin/llama-cli --model /mnt/e/cache/huggingface/hub/models--google--madlad400-10b-mt/snapshots/9f2797629c31e69617186dbe5f0ca43bf662f36d/model-q6k.gguf --prompt "<2pt> I love pizza!"
Log start
main: build = 3325 (87e25a1d)
main: built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
main: seed = 1720258911
WARNING: Behavior may be unexpected when allocating 0 bytes for ggml_calloc!
llama_model_loader: loaded meta data with 0 key-value pairs and 742 tensors from /mnt/e/cache/huggingface/hub/models--google--madlad400-10b-mt/snapshots/9f2797629c31e69617186dbe5f0ca43bf662f36d/model-q6k.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - type f32: 164 tensors
llama_model_loader: - type q6_K: 578 tensors
llama_model_load: error loading model: error loading model architecture: unknown model architecture: ''
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '/mnt/e/cache/huggingface/hub/models--google--madlad400-10b-mt/snapshots/9f2797629c31e69617186dbe5f0ca43bf662f36d/model-q6k.gguf'
main: error: unable to load model
$
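To check whether a GGUF file carries any metadata without loading it in llama.cpp, you can read the fixed-size header directly. This is a minimal sketch, assuming the GGUF v2/v3 header layout (4-byte magic "GGUF", uint32 version, uint64 tensor count, uint64 metadata key-value count, all little-endian); the function name read_gguf_header is mine, not part of any library.

```python
import struct


def read_gguf_header(path):
    """Read the fixed-size GGUF header and return its counters.

    Assumes the GGUF v2/v3 layout, little-endian:
      4 bytes  magic "GGUF"
      uint32   version
      uint64   tensor count
      uint64   metadata key-value count
    """
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file (magic={magic!r})")
        (version,) = struct.unpack("<I", f.read(4))
        if version < 2:
            # GGUF v1 stored the counts as 32-bit values; not handled here.
            raise ValueError(f"unsupported GGUF version {version}")
        tensor_count, kv_count = struct.unpack("<QQ", f.read(16))
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


if __name__ == "__main__":
    import sys

    header = read_gguf_header(sys.argv[1])
    print(header)
    if header["metadata_kv_count"] == 0:
        print("No metadata: llama.cpp cannot determine the architecture.")
```

For the files discussed here, the log above suggests this would report version 2, 742 tensors and a metadata count of 0, which is why the loader ends up with an empty architecture name.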
You'll have to convert the model to GGUF and quantize it yourself; then it works (tested with a Q8_0 quant, CUDA-accelerated):
$ ~/projects/llama.cpp/build/bin/llama-cli --n-gpu-layers 40 --model model-q8_0.gguf --prompt "<2pt> I love pizza!"
.. skipped a lot of logs ..
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 512, n_batch = 2048, n_predict = -1, n_keep = 0
Eu amo pizza! [end of text]
llama_print_timings: load time = 6691,80 ms
llama_print_timings: sample time = 0,74 ms / 5 runs ( 0,15 ms per token, 6784,26 tokens per second)
llama_print_timings: prompt eval time = 71,99 ms / 8 tokens ( 9,00 ms per token, 111,13 tokens per second)
llama_print_timings: eval time = 131,41 ms / 4 runs ( 32,85 ms per token, 30,44 tokens per second)
llama_print_timings: total time = 378,46 ms / 12 tokens
Log end
In case anyone wants to try llama.cpp with GGUF weights, I've uploaded them here:
https://huggingface.co/thirteenbit/madlad400-10b-mt-gguf/tree/main
GGUF weights were made by following this guide:
https://github.com/ggerganov/llama.cpp/blob/master/examples/quantize/README.md
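For reference, the two steps in that guide (convert the Hugging Face checkpoint to an f16 GGUF, then quantize it) can be sketched roughly as below. Treat this as an assumption-laden outline, not the exact commands I ran: the script name convert_hf_to_gguf.py, its --outfile/--outtype options, the build/bin/llama-quantize location and all paths are assumptions that depend on your llama.cpp checkout (older checkouts used different script and binary names). The helper build_commands is hypothetical.

```python
import os
import shlex


def build_commands(llama_cpp_dir, hf_model_dir, f16_path, out_path,
                   quant_type="Q8_0"):
    """Build the convert and quantize command lines as argument lists.

    Assumptions (check against your llama.cpp checkout):
      - the converter is <llama_cpp_dir>/convert_hf_to_gguf.py
      - the quantizer is <llama_cpp_dir>/build/bin/llama-quantize
    """
    convert = [
        "python3",
        os.path.join(llama_cpp_dir, "convert_hf_to_gguf.py"),
        hf_model_dir,
        "--outfile", f16_path,
        "--outtype", "f16",
    ]
    quantize = [
        os.path.join(llama_cpp_dir, "build", "bin", "llama-quantize"),
        f16_path,
        out_path,
        quant_type,
    ]
    return convert, quantize


if __name__ == "__main__":
    # All paths below are placeholders; replace them with your own.
    convert, quantize = build_commands(
        "llama.cpp", "madlad400-10b-mt", "model-f16.gguf", "model-q8_0.gguf"
    )
    # Print the commands for review instead of running them blindly.
    print(shlex.join(convert))
    print(shlex.join(quantize))
```

The script only prints the two commands so you can check them first; once they look right, run them in that order (for example with subprocess.run(cmd, check=True)).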
These worked with llama-cli as noted above (a single prompt from the command line).
llama-server failed to start, and llama-cli in interactive (chat) mode outputs garbage.
I have not tried other frontends.