Has anybody been able to run their chat.py on a Mac?

#3
by neodymion - opened

Thanks for uploading. But I am struggling to get chat.py to run on an M2 Pro with 32 GB.

It won't run with Apple Silicon MPS because it uses bfloat16. I tried changing that to float32, but then it did not run either. :D
Now it is running on the CPU, but it takes ages to reply. All I entered was "hi".
Isn't this model supposed to be faster? Is there anything I need to change?

My modifications to chat.py:

import torch
from generate import generate
from transformers import AutoTokenizer, AutoModel

def chat():
    device = 'cpu'  # <-- force CPU use
    model = AutoModel.from_pretrained('GSAI-ML/LLaDA-8B-Instruct', trust_remote_code=True, torch_dtype=torch.bfloat16).to(device).eval()
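In case it helps, here is a minimal sketch of what I would try next (my own guess, not code from the repo): load the weights in float16 instead of bfloat16, since MPS supports float16, and enable PyTorch's CPU fallback for any ops MPS is missing. The float32 failure may have been an unsupported-op error rather than the dtype itself, but I have not verified that.

import os
# Assumption: let PyTorch fall back to the CPU for ops MPS does not implement.
# Set it before importing torch to be safe.
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

import torch
from transformers import AutoTokenizer, AutoModel

device = torch.device('mps' if torch.backends.mps.is_available() else 'cpu')

tokenizer = AutoTokenizer.from_pretrained('GSAI-ML/LLaDA-8B-Instruct', trust_remote_code=True)
model = AutoModel.from_pretrained(
    'GSAI-ML/LLaDA-8B-Instruct',
    trust_remote_code=True,
    torch_dtype=torch.float16,  # float16 instead of bfloat16; MPS rejects bfloat16 weights
).to(device).eval()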

Linux Fedora / CPU: it works on only 1 thread; a dual Xeon never answered (5-10 minute waits).

Okay, so it's the threading. Thanks for the reply.

Changing the thread count to 1 did not help. After a 30-minute wait, there is still no output.

import os
import torch

# Set single-thread environment variables
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["VECLIB_MAXIMUM_THREADS"] = "1"
os.environ["NUMEXPR_NUM_THREADS"] = "1"
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Configure PyTorch thread settings
torch.set_num_threads(1)
torch.set_num_interop_threads(1)

# Check for MPS availability (macOS 12.3+ and PyTorch 1.12+ required)
device = torch.device('mps' if torch.backends.mps.is_available() else 'cpu')

from generate import generate
from transformers import AutoTokenizer, AutoModel

def chat():
    # Load the model in bfloat16 on CPU first to avoid MPS dtype issues
    model = AutoModel.from_pretrained(
        'GSAI-ML/LLaDA-8B-Instruct',
        trust_remote_code=True,
        torch_dtype=torch.bfloat16,  # Load weights in bfloat16
    ).to('cpu').eval()
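The block above stops after loading on the CPU. A possible continuation inside chat() (my guess at the intended next step, not code from the repo) would be to load the tokenizer and, if MPS is available, cast the weights to float16 before moving them over, since MPS does not accept bfloat16:

    tokenizer = AutoTokenizer.from_pretrained('GSAI-ML/LLaDA-8B-Instruct', trust_remote_code=True)

    # Assumption: float16 is the MPS-friendly dtype; keep bfloat16 when staying on the CPU.
    if device.type == 'mps':
        model = model.to(torch.float16)
    model = model.to(device)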

:)
yes :))))

CPU cores x 2 gives the correct thread count; for example, 24 cores x 2 = 48 threads.
With only 1 thread you use 1/48 of that, about 2-4% CPU load.

When I run chat.py it uses only one thread, not the maximum number of threads available.
I will try your code later.
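If the bottleneck really is intra-op parallelism, a minimal sketch (my own suggestion, not from the repo) would be to point the thread settings at all logical cores instead of pinning them to 1:

import os
import torch

n_threads = os.cpu_count() or 1                 # logical cores, e.g. 48 on a dual 24-core Xeon
os.environ["OMP_NUM_THREADS"] = str(n_threads)
os.environ["MKL_NUM_THREADS"] = str(n_threads)
torch.set_num_threads(n_threads)                # intra-op parallelism (matmuls, attention)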

I'm working on MLX; stand by.

GSAI-ML org

I'm extremely sorry. I'm not very familiar with running our code on a Mac, and I'm eagerly looking forward to more help from the community!
