GPU is not fully utilized
Only ~20% GPU utilization (CUDA 11.4) with the following sample code:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model across available GPUs; dtype is taken from the checkpoint config.
model = AutoModelForCausalLM.from_pretrained(
    "01-ai/Yi-34B", device_map="auto", torch_dtype="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("01-ai/Yi-34B", trust_remote_code=True)

inputs = tokenizer("There's a place where time stands still. A place of breath taking wonder, but also", return_tensors="pt")
max_length = 256

# Generate a single sequence from a single prompt.
outputs = model.generate(
    inputs.input_ids.cuda(),
    max_length=max_length,
    eos_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Sorry, but we can't do much about that... Single-batch autoregressive decoding is memory-bandwidth-bound rather than compute-bound, so low GPU utilization is expected here.
If you want higher GPU utilization, you can try running inference with a larger batch size, as in the sketch below.
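For example, a minimal batched-generation sketch with transformers, reusing the model and tokenizer from the snippet above (the prompt list is invented for illustration, and the pad-token setup may differ for your tokenizer):

# Feeding several prompts at once gives the GPU more work per decoding step.
tokenizer.pad_token = tokenizer.eos_token  # causal LM tokenizers often lack a pad token
tokenizer.padding_side = "left"            # left-pad so generation continues from real tokens

prompts = [
    "There's a place where time stands still.",
    "Once upon a time in a quiet village,",
    "The history of computing begins with",
]
batch = tokenizer(prompts, return_tensors="pt", padding=True)

outputs = model.generate(
    input_ids=batch.input_ids.cuda(),
    attention_mask=batch.attention_mask.cuda(),
    max_length=256,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
)
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))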
This is just sample code for inference; consider switching to another inference engine (like vLLM) if you need to run the model efficiently~
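For reference, a rough sketch of the vLLM path (not tested here; SamplingParams options and the need for tensor parallelism depend on your vLLM version and hardware):

from vllm import LLM, SamplingParams

# vLLM batches requests continuously and uses paged attention, which is
# what drives GPU utilization up compared to plain transformers generate().
# tensor_parallel_size may be needed to fit the 34B weights, depending on GPU memory.
llm = LLM(model="01-ai/Yi-34B", trust_remote_code=True)

sampling_params = SamplingParams(temperature=0.8, max_tokens=256)
outputs = llm.generate(
    ["There's a place where time stands still. A place of breath taking wonder, but also"],
    sampling_params,
)
for out in outputs:
    print(out.outputs[0].text)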
@Yhyu13 no lol, ExLlamaV2 was created by turboderp, with help from other contributors. ExLlamaV2 is specifically designed for the fastest single-batch inference, and it's in fact probably the fastest one.
vLLM is better for batching, but most people aren't going to feed in tens or hundreds of input prompts at once.