How come I can't split this model across multiple GPUs?
Is there any rhyme or reason why, out of all the LoneStriker exl2 models I have downloaded from 70B to 120B, this is the only one that refuses to span across multiple GPUs? I'm running 3 A40s and have split countless models across them, but this one refuses to split: it fills GPU 0 and then goes out of memory.
You generally have to set the first GPU to use much less VRAM than the others to reserve room for the context (KV cache). The Qwen models have a much larger vocab size, so they will use significantly more memory than a LLaMA- or Mistral-type model. You can also drop the context size lower (try 2k, for example) to get it to load and test the model.
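For concreteness, here's a minimal sketch of a manual split using the exllamav2 Python API; the model path and the exact split values are illustrative assumptions, not settings confirmed in this thread:

```python
# Sketch: loading an exl2 quant across three A40s with exllamav2,
# leaving extra headroom on GPU 0. Path and numbers are hypothetical.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache

config = ExLlamaV2Config()
config.model_dir = "/models/Qwen-72B-exl2"  # hypothetical local path
config.prepare()
config.max_seq_len = 2048  # start small to see the real VRAM footprint

model = ExLlamaV2(config)
# Per-GPU VRAM budgets in GB. GPU 0 is capped well under its 48 GB so
# the cache and the large Qwen vocab have room to fit.
model.load(gpu_split=[30, 48, 48])

cache = ExLlamaV2Cache(model)
```

If you're loading through text-generation-webui instead, the same idea applies via its `--gpu-split` option (a comma-separated list of GB per device, e.g. `30,48,48`).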
Okay, I'll try that.
I usually start with 2k context anyway, just to see how much VRAM it will require, but I think I had it set to 45 of 48 on GPU 0. I'll try lower.
I guess I'm out of my depth here. I can't get it to even attempt to span across GPUs, no matter what I set GPU 0 to; it just keeps filling GPU 0 to 100% and never splits. This is the first time I've had this issue, so I guess it's what you said: it's not LLaMA- or Mistral-based, and I've had no trouble loading dozens of different 70-120B models of those types. Ah well, I appreciate you trying to help, but it gets expensive troubleshooting one model on rented GPUs when so many others work fine.
Thanks for trying!