Can you share your quantization code?

Can you share your quantization code? I would like to have a 4.5- or 5-bit quantized model...

I ran the following command:
python convert.py -i C:\users\pc\Ex2bot\models\Athene\ -o C:\users\pc\exl2\ -cf C:\users\pc\atheneexl\ -b 3.5
This is using the convert.py script from the exllamav2 project, as documented here: https://github.com/turboderp/exllamav2/blob/master/doc/convert.md.
It took about 3 hours on my machine.
Please let me know if you have any other questions.
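
For a 4.5- or 5.0-bit quant, the only change should be the -b value, which takes fractional bits per weight (I used 3.5 above). For example, with your own paths substituted for mine:

python convert.py -i C:\users\pc\Ex2bot\models\Athene\ -o C:\users\pc\exl2\ -cf C:\users\pc\atheneexl\ -b 5.0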

Thanks! I did a 5.0-bit version; it just fits into an A6000 Ada GPU, and the results are much better...
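
(That lines up with a rough estimate: weight size ≈ parameter count × bpw ÷ 8, so a 70B model at 5.0 bpw is about 70e9 × 5.0 ÷ 8 ≈ 43.7 GB, leaving only a few GB of the 48 GB for KV cache and context.)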

Good to hear! So far I'm liking Llama 3.1 70B a bit better than Athene.

Really! Llama 3.1 is as good as they promised ;) I will do a 70B 5.0-bit exl2 and see if that improves my use case. Thanks for the knowledge!!! ;)

Thanks! Wow, they are fast!

I have encountered the known error raised by others as well, "Value for eos_token_id is not of expected type <class 'int'>", so I can't test the above Llama 3.1 (5 bpw) model from the link... Did you not have that issue?

Yes, I should have mentioned that. You need to install the dev branch of exllamav2 to use Llama 3.1. If you're not sure how to do that, just wait a few days; the main exllamav2 branch should be fixed by then, and you can simply update to that.
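
If you'd rather not wait, something like this should work to get the dev branch (assuming pip and git are set up; the @dev suffix just tells pip which branch to build from):

pip install git+https://github.com/turboderp/exllamav2@dev

Alternatively, clone the repo, git checkout dev, and run pip install . from the source tree.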

In my use case (complex customer knowledge extraction), my old favorite (https://huggingface.co/gbueno86/Meta-LLama-3-Cat-A-LLama-70b-exl2-5.0bpw) is way better than Athene-70B-5.0bpw.

Will update you regarding Llama 3.1 in a few days...

I am searching for the best model for a single-GPU run (A6000 48GB); no luck with the Qwen2 model either...

I tried using Qwen2 before Athene and also had some serious problems with it.
