Getting very low tokens per second (under 1 t/s) on M2 Ultra 192GB

by j4ys0n - opened Nov 26, 2024


I've tried unloading and reloading the model and restarting the machine, with no difference in response speed. I'm using LM Studio to host the model and interacting with it via Open WebUI. When I use the 4-bit quant instead, I get a more normal tokens-per-second rate. I know that 4-bit will generally respond a little faster, but I typically get 7-9 t/s with other 70B models, like Llama 3.1 70B, on this machine.
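For anyone who wants to compare numbers, here is a rough sketch of how a t/s figure like the ones above can be measured: time one generation and divide the token count by the elapsed seconds. The `generate` callable and `fake_generate` are hypothetical stand-ins, not part of LM Studio or Open WebUI; swap in whatever call actually sends a request to your server and returns the number of tokens produced.

```python
import time
from typing import Callable


def tokens_per_second(generate: Callable[[], int]) -> float:
    """Time one generation call and return tokens produced per second.

    `generate` is a hypothetical callable (an assumption for this sketch)
    that runs a single completion and returns how many tokens it produced.
    """
    start = time.perf_counter()
    n_tokens = generate()
    elapsed = time.perf_counter() - start
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return n_tokens / elapsed


def fake_generate() -> int:
    # Stand-in for a real request to the model server (assumption):
    # pretends that producing 10 tokens took about 50 ms.
    time.sleep(0.05)
    return 10


if __name__ == "__main__":
    print(f"{tokens_per_second(fake_generate):.1f} t/s")
```

Note that this measures wall-clock time for the whole call, so prompt processing time is included; with long prompts the result will read lower than the pure generation speed.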
