Error while loading
I tried this model that you kindly shared, and this is what I get on Ooba, with an RTX 3090:
127.0.0.1 - - [16/Sep/2023 06:27:45] "GET /api/v1/model HTTP/1.1" 200 -
2023-09-16 06:28:16 INFO:Loading Panchovix_airoboros-l2-70b-gpt4-1.4.1_2.5bpw-h6-exl2...
2023-09-16 06:28:31 ERROR:Failed to load the model.
Traceback (most recent call last):
File "U:\oobabooga_windows\text-generation-webui\modules\ui_model_menu.py", line 194, in load_model_wrapper
shared.model, shared.tokenizer = load_model(shared.model_name, loader)
File "U:\oobabooga_windows\text-generation-webui\modules\models.py", line 77, in load_model
output = load_func_map[loader](model_name)
File "U:\oobabooga_windows\text-generation-webui\modules\models.py", line 338, in ExLlamav2_loader
model, tokenizer = Exllamav2Model.from_pretrained(model_name)
File "U:\oobabooga_windows\text-generation-webui\modules\exllamav2.py", line 40, in from_pretrained
model.load(split)
File "U:\oobabooga_windows\installer_files\env\lib\site-packages\exllamav2\model.py", line 233, in load
for module in self.modules: module.load()
File "U:\oobabooga_windows\installer_files\env\lib\site-packages\exllamav2\mlp.py", line 44, in load
self.up_proj.load()
File "U:\oobabooga_windows\installer_files\env\lib\site-packages\exllamav2\linear.py", line 37, in load
if w is None: w = self.load_weight()
File "U:\oobabooga_windows\installer_files\env\lib\site-packages\exllamav2\module.py", line 79, in load_weight
qtensors = self.load_multi(["q_weight", "q_invperm", "q_scale", "q_scale_max", "q_groups", "q_perm"])
File "U:\oobabooga_windows\installer_files\env\lib\site-packages\exllamav2\module.py", line 69, in load_multi
tensors[k] = st.get_tensor(self.key + "." + k).to(self.device())
RuntimeError: [enforce fail at …\c10\core\impl\alloc_cpu.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 88080384 bytes.
Meanwhile, this other model loads fine:
2023-09-16 06:34:32 INFO:Loading turboderp_LLama2-70B-chat-2.55bpw-h6-exl2...
2023-09-16 06:35:01 INFO:Loaded the model in 28.95 seconds.
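For reference, the webui's ExLlamav2 loader is essentially just wrapping the standalone exllamav2 API. Here is a minimal sketch of that loading path (the model path, context length and gpu_split value are placeholders I chose, not the webui's actual settings), in case it helps reproduce the failure outside of Ooba:

```python
# Minimal sketch of loading an EXL2 quant with the standalone exllamav2 API.
# Model path, max_seq_len and the gpu_split list are illustrative assumptions.
from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config, ExLlamaV2Tokenizer

config = ExLlamaV2Config()
config.model_dir = "models/Panchovix_airoboros-l2-70b-gpt4-1.4.1_2.5bpw-h6-exl2"
config.prepare()               # reads config.json and locates the .safetensors shards
config.max_seq_len = 1792      # smaller context -> smaller cache, less VRAM

model = ExLlamaV2(config)
model.load([22])               # GB of VRAM to use per GPU (single 3090 here);
                               # this is the model.load(split) call from the traceback

tokenizer = ExLlamaV2Tokenizer(config)
cache = ExLlamaV2Cache(model)  # KV cache, sized from max_seq_len
```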
The DefaultCPUAllocator: not enough memory error means there isn't enough system RAM while loading the model.
You could try increasing the swap file size and trying again.
I can load the model without any warning on 64 GB of RAM (and a 200 GB swap file), but the RAM alone should be enough for a model of this size.
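If it helps to diagnose, the traceback shows the tensors being read with safetensors' get_tensor(...) into CPU memory before being moved to the GPU, so it is system RAM plus swap that runs out here even though the model fits in VRAM. A quick sketch to check headroom right before loading (psutil may need to be installed separately, so treat that as an assumption):

```python
# Check free system RAM and swap before attempting to load the model.
# Assumes psutil is available (pip install psutil if it isn't).
import psutil

vm = psutil.virtual_memory()
sm = psutil.swap_memory()

print(f"RAM  total: {vm.total / 2**30:6.1f} GiB, available: {vm.available / 2**30:6.1f} GiB")
print(f"Swap total: {sm.total / 2**30:6.1f} GiB, free: {sm.free / 2**30:6.1f} GiB")

# The failed allocation in the traceback was 88080384 bytes (84 MiB), i.e. the
# loader ran out of CPU-side memory while staging a tensor, not GPU VRAM.
```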
Thanks Panchovix, it works now. I'm surprised, though, that this quant needs swap while Turboderp's 2.55bpw (split into 3 safetensors files) doesn't. But all of this is still at the alpha stage, so it's already great.
I can't wait to be able to quantize models myself; I'm using 3072 context right now.
Once again, thank you very much!
Edit: I was a bit optimistic. I can stay at around 10 tokens/s, but only with 1792 ctx. It works, though.
Edit 2: I guess I'll have to buy a second 3090 to get decent output! :D