Please check these quantizations.
I don't have enough resources to run all tests, but I came up with a slightly different way to quantize models.
As you will see, the f16.q6 and f16.q5 files are smaller than the q8_0 and their output is very similar to the pure f16.
https://huggingface.co/ZeroWw/Mistral-7B-Instruct-v0.3-GGUF/tree/main
These are my own quantizations (updated almost daily).
This is how I did it:
echo Quantizing f16/q5
./build/bin/llama-quantize --allow-requantize --output-tensor-type f16 --token-embedding-type f16 ${model_name}.f16.gguf ${model_name}.f16.q5.gguf q5_k $(nproc) &>/dev/null
echo Quantizing f16/q6
./build/bin/llama-quantize --allow-requantize --output-tensor-type f16 --token-embedding-type f16 ${model_name}.f16.gguf ${model_name}.f16.q6.gguf q6_k $(nproc) &>/dev/null
echo Quantizing q8_0
./build/bin/llama-quantize --allow-requantize --pure ${model_name}.f16.gguf ${model_name}.q8.gguf q8_0 $(nproc) &>/dev/null
I kept the output and token-embedding tensors at f16 and quantized the remaining tensors to q5_k or q6_k.
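For anyone starting from a raw Hugging Face checkpoint, the f16 base file those commands expect can usually be produced with llama.cpp's converter; a rough sketch (where ${model_dir} is just a placeholder for the downloaded model folder):
# Sketch only: build the f16 base GGUF with llama.cpp's converter script.
# ${model_dir} is a placeholder for the Hugging Face model directory.
python convert_hf_to_gguf.py ${model_dir} --outtype f16 --outfile ${model_name}.f16.gguf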
If someone could test them more thoroughly, that would be great.
P.S.
Even the f16/q5 is not that different from the pure f16, and it is way better than the q8_0.
Please start posting some side-by-side comparisons; we really need to see how the model output differs. There's no sense asking for this everywhere without proof that there's a difference.
@bartowski as I said, I test the models by chatting with them. I have no equipment (not even a decent GPU) to do any kind of testing... but many people here can...
@ZeroWw sure, but can you do a chat with one and then a chat with the other with the exact same prompt and show the results? Otherwise just saying "this chat is better" is a bit useless. Not to take anything away from it; I've been releasing with the new --output-tensor-type f16 --token-embedding-type f16 and have a bunch of models up with those quants, but no concrete feedback that they're better yet.
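A comparison along those lines could be as simple as running both quants on the same prompt with greedy sampling, for example (the prompt and model file names below are only placeholders):
# Hypothetical side-by-side check: identical prompt, fixed seed, temperature 0
# so both runs are deterministic. Model paths and prompt are placeholders.
./build/bin/llama-cli -m ${model_name}.f16.q6.gguf -p "Explain quantization in one paragraph." -n 256 --temp 0 --seed 42
./build/bin/llama-cli -m ${model_name}.q8.gguf -p "Explain quantization in one paragraph." -n 256 --temp 0 --seed 42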
I have very, very little in the way of resources... imagine that I made all those quants on Google Colab :D
Just test any of the models in my profile (they are all quantized this way) and you will notice that f16/q6 is (imho) almost indistinguishable from the pure f16 at almost half the size.
Also with f16/q5 I don't notice any particular degradation... I have only done a few perplexity tests.
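For anyone who wants to reproduce that kind of check, a perplexity run with llama.cpp looks roughly like this (wiki.test.raw is a placeholder for whatever evaluation text you use):
# Sketch of a perplexity comparison between two of the quants above.
# wiki.test.raw is a placeholder for the evaluation text file.
./build/bin/llama-perplexity -m ${model_name}.f16.q6.gguf -f wiki.test.raw -t $(nproc)
./build/bin/llama-perplexity -m ${model_name}.q8.gguf -f wiki.test.raw -t $(nproc)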
Most people here have better compute resources than me.
I just try to optimize things until the trade-off is fair.