Getting very low tokens per second (under 1 t/s) on M2 Ultra 192GB
#6, opened by j4ys0n
I've tried unloading and reloading the model and restarting the machine, with no difference in response speed. I'm using LM Studio to host the model and interacting with it via Open WebUI. When I use the 4-bit quant instead, I get a more normal tokens-per-second rate. I know that 4-bit will generally respond a little faster, but I typically get 7-9 t/s with other 70B models, like Llama 3.1 70B, on this machine.