Phi-3.5-mini-instruct: CPU vs. GPU Mismatch Error in modeling_phi3.py During Generation
Hello,
I’m working with the microsoft/Phi-3.5-mini-instruct model locally. While loading it from a local cache and moving the model to GPU (e.g. model.to("cuda")), I keep hitting a runtime error in modeling_phi3.py:
"
Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0!
(when checking argument for argument mat2 in method wrapper_CUDA_bmm)
"
A quick trace shows it happens around line 368 in modeling_phi3.py:
"
freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
"
It looks like inv_freq_expanded or position_ids_expanded is allocated on CPU while the rest of the model is on GPU. Even though all my parameters and inputs are on cuda:0 (verified by printing parameter devices), this snippet tries to multiply a CPU tensor by a GPU tensor—hence the mismatch.
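One caveat on that check: printing parameter devices does not cover registered buffers or plain tensor attributes, and rotary-embedding frequencies are often stored that way. A minimal sketch of a broader check follows; `device_summary` and the `Toy` module are hypothetical names used only for illustration, and the real model would be passed in instead of `Toy()`.

```python
from collections import Counter

import torch
from torch import nn


def device_summary(model: nn.Module) -> dict:
    """Count parameters and buffers per device (hypothetical helper)."""
    params = Counter(str(p.device) for _, p in model.named_parameters())
    buffers = Counter(str(b.device) for _, b in model.named_buffers())
    return {"parameters": dict(params), "buffers": dict(buffers)}


class Toy(nn.Module):
    """Stand-in module: one linear layer plus a non-persistent buffer."""

    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 4)
        # Buffers are not parameters, so a parameter-only check misses them.
        self.register_buffer("inv_freq", torch.ones(2), persistent=False)


summary = device_summary(Toy())
print(summary)
```

If the buffers line reports a device that the parameters line does not, that is a likely candidate for the tensor behind the error. Tensors stored as plain Python attributes will not show up in either list.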
I’ve tried:
- Removing device_map="auto", loading in half precision, and explicitly calling model.to("cuda").
- Setting trust_remote_code=False.
- Inspecting the config file to remove references to modeling_phi3.py.
- Using a minimal test script that just loads the model and runs model.generate(...).
In all cases, the custom code from modeling_phi3.py is still used, and the mismatch persists whenever that code references a CPU tensor.
Has anyone resolved this by patching modeling_phi3.py so these positional embeddings are allocated on the same device as the input? For example, forcibly calling something like:
"
device = hidden_states.device
inv_freq_expanded = inv_freq_expanded.to(device)
position_ids_expanded = position_ids_expanded.to(device)
"
Or is there an official update/PR that fixes this in the repo? I couldn’t find any documented solution.
Any guidance on how to handle the Phi3ForCausalLM custom code would be greatly appreciated—especially if you have tips on maintaining the special rope scaling logic without forcing partial offload or CPU usage.
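To make the patch idea concrete, here is a standalone sketch of the frequency computation with the device alignment applied before the matrix multiply. It is not the actual modeling_phi3.py code: the function name `rope_freqs` is hypothetical, and it assumes `position_ids` (rather than `hidden_states`) is the tensor whose device should win.

```python
import torch


def rope_freqs(inv_freq: torch.Tensor, position_ids: torch.Tensor) -> torch.Tensor:
    """Compute rotary frequencies with both operands on one device.

    inv_freq:     shape (dim / 2,)
    position_ids: shape (batch, seq_len)
    returns:      shape (batch, seq_len, dim / 2)
    """
    # Assumption: follow the device of position_ids, i.e. the model input.
    device = position_ids.device
    inv_freq = inv_freq.to(device)

    # (batch, dim / 2, 1)
    inv_freq_expanded = inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1)
    # (batch, 1, seq_len)
    position_ids_expanded = position_ids[:, None, :].float()

    # Same expression as the line from the traceback, now device-safe.
    return (inv_freq_expanded @ position_ids_expanded).transpose(1, 2)


# Small CPU example; on a GPU machine position_ids would live on cuda:0.
freqs = rope_freqs(torch.tensor([1.0, 0.5]), torch.arange(3)[None, :])
print(freqs.shape)
```

Calling `.to(device)` on a tensor that is already on the target device is a no-op, so the extra lines cost nothing in the working case.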
Thanks in advance
How are you loading the model? In my case, it works fine when I load the model directly onto the GPU, as follows:
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
"microsoft/Phi-3.5-mini-instruct",
device_map="cuda",
torch_dtype="half",
trust_remote_code=True,
)
I'm having this same issue. Here is how I am loading the model:
from transformers import AutoTokenizer, Phi3ForCausalLM
model = Phi3ForCausalLM.from_pretrained('microsoft/Phi-3.5-mini-instruct', device_map='cuda:0')
tokenizer = AutoTokenizer.from_pretrained('microsoft/Phi-3.5-mini-instruct')
prompt = "Hey, are you conscious? Can you talk to me?"
inputs = tokenizer(prompt, return_tensors="pt").to('cuda:0')
# Generate
generate_ids = model.generate(inputs.input_ids, max_length=30)
tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
Try setting the data type to half precision (torch_dtype="half"). In my case, with the default settings the model was automatically loaded on the CPU, probably because my GPU only has 8 GB of VRAM.
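A rough back-of-the-envelope check shows why precision matters on an 8 GB card. The sketch below assumes roughly 3.8 billion parameters for Phi-3.5-mini (the figure given on the model card) and counts weights only, ignoring activations and the KV cache; `weight_gib` is a hypothetical helper.

```python
def weight_gib(num_params: float, bytes_per_param: int) -> float:
    """Approximate size of the model weights in GiB."""
    return num_params * bytes_per_param / 2**30


NUM_PARAMS = 3.8e9  # assumption: approximate parameter count from the model card

fp32_gib = weight_gib(NUM_PARAMS, 4)  # float32: 4 bytes per parameter
fp16_gib = weight_gib(NUM_PARAMS, 2)  # float16: 2 bytes per parameter

print(f"fp32: {fp32_gib:.1f} GiB, fp16: {fp16_gib:.1f} GiB")
# prints "fp32: 14.2 GiB, fp16: 7.1 GiB"
```

Under these assumptions the float32 weights cannot fit in 8 GB at all, and the float16 weights fit only with very little headroom, so some CPU placement would not be surprising.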