Getting very low tokens per second (under 1 t/s) on M2 Ultra 192GB

#6
by j4ys0n - opened

I've tried unloading and reloading the model and restarting the machine, with no difference in response speed. I'm using LM Studio to host the model and interacting with it via Open WebUI. When I use the 4-bit quant instead, I get a more normal tokens-per-second rate. I know that 4-bit will generally respond a little faster, but I typically get 7-9 t/s with other 70B models, like Llama 3.1 70B, on this machine.
