Has anybody been able to run their chat.py on a Mac?

#3
by neodymion - opened

Thanks for uploading. But I am struggling to get chat.py to run on an M2 Pro with 32 GB.

It won't run with Apple Silicon MPS because it uses bfloat16. I tried changing that to float32, but then it did not run either. :D
Now it is running on the CPU, but it takes ages to reply. All I entered was "hi".
Isn't this model supposed to be faster? Is there anything I need to change?

My modifications to chat.py:

import torch
from generate import generate
from transformers import AutoTokenizer, AutoModel

def chat():
    device = 'cpu'  # <-- force CPU use
    model = AutoModel.from_pretrained('GSAI-ML/LLaDA-8B-Instruct', trust_remote_code=True, torch_dtype=torch.bfloat16).to(device).eval()
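In case it helps, here is a minimal sketch of what I would try next (my own guess, not code from the repo): load the weights in float16 instead of bfloat16, since MPS supports float16, and enable PyTorch's CPU fallback for any ops MPS is missing. The float32 failure may have been an unsupported-op error rather than the dtype itself, but I have not verified that.

import os
# Assumption: let PyTorch fall back to the CPU for ops MPS does not implement.
# Set it before importing torch to be safe.
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

import torch
from transformers import AutoTokenizer, AutoModel

device = torch.device('mps' if torch.backends.mps.is_available() else 'cpu')

tokenizer = AutoTokenizer.from_pretrained('GSAI-ML/LLaDA-8B-Instruct', trust_remote_code=True)
model = AutoModel.from_pretrained(
    'GSAI-ML/LLaDA-8B-Instruct',
    trust_remote_code=True,
    torch_dtype=torch.float16,  # float16 instead of bfloat16; MPS rejects bfloat16 weights
).to(device).eval()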

Linux Fedora / CPU: it works on only 1 thread; a dual Xeon never answered (5-10 minute waits).

Okay, so it's the threading. Thanks for the reply.

Changing the thread count to 1 did not help. After a 30-minute wait, there is still no output.

import os
import torch

# Set single-thread environment variables
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["VECLIB_MAXIMUM_THREADS"] = "1"
os.environ["NUMEXPR_NUM_THREADS"] = "1"
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Configure PyTorch thread settings
torch.set_num_threads(1)
torch.set_num_interop_threads(1)

# Check for MPS availability (macOS 12.3+ and PyTorch 1.12+ required)
device = torch.device('mps' if torch.backends.mps.is_available() else 'cpu')

from generate import generate
from transformers import AutoTokenizer, AutoModel

def chat():
    # Load the model in bfloat16 on CPU first to avoid MPS dtype issues
    model = AutoModel.from_pretrained(
        'GSAI-ML/LLaDA-8B-Instruct',
        trust_remote_code=True,
        torch_dtype=torch.bfloat16,  # Load weights in bfloat16
    ).to('cpu').eval()
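The block above stops after loading on the CPU. A possible continuation inside chat() (my guess at the intended next step, not code from the repo) would be to load the tokenizer and, if MPS is available, cast the weights to float16 before moving them over, since MPS does not accept bfloat16:

    tokenizer = AutoTokenizer.from_pretrained('GSAI-ML/LLaDA-8B-Instruct', trust_remote_code=True)

    # Assumption: float16 is the MPS-friendly dtype; keep bfloat16 when staying on the CPU.
    if device.type == 'mps':
        model = model.to(torch.float16)
    model = model.to(device)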

:)
yes :))))

CPU cores x 2 gives the correct thread count; for example, 24 cores x 2 = 48 threads.
With only 1 thread you use 1/48 of that, about 2-4% CPU load.

When I run chat.py it uses only one thread, not the maximum number of threads available.
I will try your code later.
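If the bottleneck really is intra-op parallelism, a minimal sketch (my own suggestion, not from the repo) would be to point the thread settings at all logical cores instead of pinning them to 1:

import os
import torch

n_threads = os.cpu_count() or 1                 # logical cores, e.g. 48 on a dual 24-core Xeon
os.environ["OMP_NUM_THREADS"] = str(n_threads)
os.environ["MKL_NUM_THREADS"] = str(n_threads)
torch.set_num_threads(n_threads)                # intra-op parallelism (matmuls, attention)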

I'm working on MLX; stand by.

GSAI-ML org

I'm extremely sorry. I'm not very familiar with running our code on a Mac, and I'm eagerly looking forward to more help from the community!
