GGUF?

#11
by johnnnna - opened

Please ^-^

Google org

there is a GGUF file provided in the repo

it's quite large, can we get a quant?

Please @TheBloke

That seems really large for a GGUF file. I have enough memory, but why is it so large? Is it FP16 etc.? Other variants should be provided. I found these, though I haven't tested them:
https://huggingface.co/mlabonne/gemma-2b-GGUF

He also has a 7B version, but the repo seems empty:
https://huggingface.co/mlabonne/gemma-7b-it-GGUF

Download at your own risk of course

You can run quantize (included in the llama.cpp repo) to get Q8_0 versions. I expect the community will come up with various quantized versions very soon too.

why is it so large? Is it FP16 etc.?

Yes, it is float 32.
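To put numbers on that, here is a rough back-of-the-envelope sketch of how file size scales with bits per weight. The parameter count (about 8.5 billion for the 7B model) is an assumption rather than something stated in this thread, and the estimate ignores metadata and any tensors that a quantizer keeps at higher precision, so treat the results as approximate.

```python
# Approximate bits per weight for a few GGUF tensor types.
# Q8_0 and Q4_0 store 32 weights per block plus one fp16 scale,
# hence 34 bytes (8.5 bpw) and 18 bytes (4.5 bpw) per block.
BITS_PER_WEIGHT = {
    "F32": 32.0,
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q4_0": 4.5,
}


def estimate_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Return the approximate file size in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    n_params = 8.5e9  # assumed parameter count, not taken from the thread
    for name, bpw in BITS_PER_WEIGHT.items():
        print(f"{name}: ~{estimate_size_gb(n_params, bpw):.1f} GB")
```

K-quants such as Q4_K_M mix several block types, so their effective bits per weight vary by model and are left out of this table.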

These are the commands to quantize to 8-bit or 4-bit. They assume you have llama.cpp built and installed.

8-bit: quantize gemma-7b.gguf ./gemma-7b-Q8_0.gguf Q8_0
4-bit: quantize gemma-7b.gguf ./gemma-7b-Q4_K_M.gguf Q4_K_M
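If you want several quantization types in one go, the commands above can be scripted. This is only a sketch: it assumes the `quantize` binary from llama.cpp is on your PATH, and the helper names `quantize_command` and `run_quantize` are made up for illustration.

```python
import subprocess
from pathlib import Path


def quantize_command(src, qtype, binary="quantize"):
    """Build the argument list: <binary> <input.gguf> <output.gguf> <type>."""
    src = Path(src)
    # Name the output after the input, e.g. gemma-7b.gguf -> gemma-7b-Q8_0.gguf
    dst = src.with_name(f"{src.stem}-{qtype}.gguf")
    return [binary, str(src), str(dst), qtype]


def run_quantize(src, qtype, binary="quantize"):
    """Run the quantize binary and raise if it exits with an error."""
    subprocess.run(quantize_command(src, qtype, binary), check=True)


if __name__ == "__main__":
    for qtype in ("Q8_0", "Q4_K_M"):
        run_quantize("gemma-7b.gguf", qtype)
```

Pass `binary="./quantize"` (or a full path) if you are running from the llama.cpp build directory instead.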

I tried the GGUF from https://huggingface.co/rahuldshetty/gemma-7b-it-gguf-quantized in Ollama, but it crashes! Is anyone else facing the same issue?

@aptha a dumb question, but are you compiling from the latest Ollama source, including updating its llama.cpp submodule?

@aptha try with an 8-bit quantized version. Ollama crashes if you're out of memory.

llm = CTransformers(model="mlabonne/Gemmalpaca-2B-GGUF", model_file="gemmalpaca-2b.Q8_0.gguf", model_type="gemma", gpu_layers=0)

This one doesn't work. Is there a generic way to open GGUF files with CTransformers?
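Not an answer for CTransformers specifically, but when a loader rejects a file it can help to first confirm that the download is a well-formed GGUF file. The sketch below reads only the fixed header (magic bytes, version, tensor count, metadata key/value count) using the standard library; `read_gguf_header` is a hypothetical helper name, and it assumes a little-endian GGUF v2 or later file.

```python
import struct

GGUF_MAGIC = b"GGUF"


def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) from a GGUF file."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != GGUF_MAGIC:
            raise ValueError(f"not a GGUF file: magic={magic!r}")
        # Version is a little-endian uint32 right after the magic.
        (version,) = struct.unpack("<I", f.read(4))
        if version < 2:
            raise ValueError("this sketch only handles GGUF v2 or later")
        # From v2 on, both counts are little-endian uint64 values.
        tensor_count, kv_count = struct.unpack("<QQ", f.read(16))
    return version, tensor_count, kv_count


if __name__ == "__main__":
    print(read_gguf_header("gemmalpaca-2b.Q8_0.gguf"))
```

If the header reads cleanly, the file itself is probably fine, and the failure is more likely down to the loader not supporting the model architecture.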

johnnnna changed discussion status to closed
