Slow inference / low GPU utilization #27
opened by hmanju
I have noticed low GPU utilization on several LLMs when using the Hugging Face pipeline API or the AutoModelForCausalLM API.
My setup is 8 x H100. During inference, utilization fluctuates between 10% and 15% on each GPU.
How can I improve utilization (and throughput)? This is the code I am running:
import torch
import transformers

model_id = "meta-llama/Meta-Llama-3.1-70B-Instruct"

# Shard the model across all available GPUs with device_map="auto".
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

# `reviews` is a list of review strings collected earlier.
reviews = "\n".join(reviews)

messages = [
    {"role": "system", "content": "Your goal is to summarize text."},
    {"role": "user", "content": f"Summarize the reviews below.\n{reviews}"},
]

outputs = pipeline(
    messages,
    max_new_tokens=1256,
)
print(outputs[0]["generated_text"][-1])
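For reference, the AutoModelForCausalLM path shows the same behaviour. The snippet below is a rough sketch of that variant (the exact generation arguments are illustrative, not copied from my original script):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-70B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Same sharding as the pipeline version: bfloat16 weights spread over all GPUs.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Same `reviews` string as in the pipeline snippet above.
messages = [
    {"role": "system", "content": "Your goal is to summarize text."},
    {"role": "user", "content": f"Summarize the reviews below.\n{reviews}"},
]

# Build the chat prompt and move it onto the device of the first shard.
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=1256)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))

Per-GPU utilization looks the same here as with the pipeline call above.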