can you share your quantization code?
Can you share your quantization code? I would like to have a 4.5 or 5 bit quantized model...
I ran the following command:
python convert.py -i C:\users\pc\Ex2bot\models\Athene\ -o C:\users\pc\exl2\ -cf C:\users\pc\atheneexl\ -b 3.5
This is using the convert.py program in the exllamav2 project as documented here: https://github.com/turboderp/exllamav2/blob/master/doc/convert.md.
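If you want a 4.5 or 5.0 bit model instead, the same command should work with just the -b value changed (a sketch, reusing the same paths as above; -b sets the target average bits per weight):

python convert.py -i C:\users\pc\Ex2bot\models\Athene\ -o C:\users\pc\exl2\ -cf C:\users\pc\atheneexl\ -b 5.0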
It took about 3 hours on my machine.
Please let me know if you have any other questions.
Thanks, I did the 5.0 bit version; it just fits into an A6000 Ada GPU, and the results are much better...
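(For reference, a rough back-of-the-envelope estimate of why 5.0 bpw just fits in 48 GB; this sketch ignores the KV cache, activations, and any tensors kept at higher precision:)

# Rough VRAM estimate for a 70B model quantized to 5.0 bits per weight.
params = 70e9   # parameter count
bpw = 5.0       # average bits per weight in the exl2 quant
weight_bytes = params * bpw / 8
print(f"{weight_bytes / 2**30:.1f} GiB for weights")  # ~40.7 GiB

That leaves only a few GiB of headroom for the KV cache on a 48 GB card, which matches the "just fits" experience.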
Good to hear! So far I'm liking Llama 3.1 70B a bit better than Athene.
Really! Llama 3.1 is as good as they promised ;) I will do a 70B 5.0 bit exl2 and see if that improves my use case. Thanks for the knowledge!!! ;)
Turboderp has you covered: https://huggingface.co/turboderp/Llama-3.1-70B-Instruct-exl2
Thanks! Wow, they are fast. Thank you!
I have encountered the known error that others have raised as well, "Value for eos_token_id is not of expected type <class 'int'>", so I can't test the above Llama 3.1 (5bpw) model from the link... Did you not have that issue?
Yes, I should have mentioned that. You need to install the dev branch of exllamav2 to use Llama 3.1. If you're not sure how to do that, just wait a few days; the main exllamav2 branch should be fixed soon, and then you can simply update to that.
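If you'd rather not wait, installing straight from the dev branch should look something like this (a sketch, assuming the branch is named dev on the GitHub repo):

pip install git+https://github.com/turboderp/exllamav2@dev

or, equivalently, clone it and do an editable install:

git clone -b dev https://github.com/turboderp/exllamav2
cd exllamav2
pip install -e .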
Thanks!
In my use case (complex customer knowledge extraction), my old favorite (https://huggingface.co/gbueno86/Meta-LLama-3-Cat-A-LLama-70b-exl2-5.0bpw) is way better than Athene-70B-5.0bpw.
I will update you regarding Llama 3.1 in a few days...
I am searching for the best model to run on a single GPU (A6000 48GB); I had no luck with the Qwen2 model either...
I tried using Qwen2 before Athene and also had some serious problems with it.