Multimodal templates
Multimodal model chat templates expect a format similar to text-only models. They need messages where each message is a dictionary with a role and content. Multimodal templates are included in the Processor class and require an additional type key for specifying whether the included content is an image, video, or text.
This guide will show you how to format chat templates for multimodal models as well as some best practices for configuring the template.
ImageTextToTextPipeline
ImageTextToTextPipeline is a high-level image and text generation class with a “chat mode”. Chat mode is enabled when a conversational model is detected and the chat prompt is properly formatted.
Start by building a chat history with the following two roles.

- system describes how the model should behave and respond when you're chatting with it. This role isn't supported by all chat models.
- user is where you enter your first message to the model.
messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a friendly chatbot who always responds in the style of a pirate"}],
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
            {"type": "text", "text": "What are these?"},
        ],
    },
]
Create an ImageTextToTextPipeline and pass the chat to it. For large models, setting device_map="auto" helps load the model more quickly and automatically places it on the fastest available device. Changing the data type to torch.bfloat16 also helps save memory.
The ImageTextToTextPipeline accepts chats in the OpenAI format to make inference easier and more accessible.
import torch
from transformers import pipeline
pipeline = pipeline("image-text-to-text", model="llava-hf/llava-onevision-qwen2-0.5b-ov-hf", device="cuda", torch_dtype=torch.float16)
pipeline(text=messages, max_new_tokens=50, return_full_text=False)
[{'input_text': [{'role': 'system',
'content': [{'type': 'text',
'text': 'You are a friendly chatbot who always responds in the style of a pirate'}]},
{'role': 'user',
'content': [{'type': 'image',
'url': 'http://images.cocodataset.org/val2017/000000039769.jpg'},
{'type': 'text', 'text': 'What are these?'}]}],
'generated_text': 'The image shows two cats lying on a pink surface, which appears to be a cushion or a soft blanket. The cat on the left has a striped coat, typical of tabby cats, and is lying on its side with its head resting on the'}]
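The pipeline returns one entry per chat, with the reply stored under the generated_text key, so the text can be pulled out directly. A minimal sketch based on the output structure shown above.

outputs = pipeline(text=messages, max_new_tokens=50, return_full_text=False)
# The reply text lives under "generated_text" in each returned entry
reply = outputs[0]["generated_text"]
print(reply)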
Image inputs
For multimodal models that accept images like LLaVA, include the following in content as shown below.

- The content "type" can be an "image" or "text".
- For images, it can be a link to the image ("url"), a file path ("path"), or "base64" (see the sketch after this list). Images are automatically loaded, processed, and prepared into pixel values as inputs to the model.
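For instance, the same image could be referenced as a local file or as base64-encoded data instead of a URL. The snippet below is a sketch of both alternatives; cats.jpg is a hypothetical local file.

import base64

# Hypothetical local file; replace with a path to your own image
with open("cats.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

# Either reference the image by file path ...
image_by_path = {"type": "image", "path": "cats.jpg"}
# ... or pass the raw data as base64-encoded bytes
image_by_base64 = {"type": "image", "base64": image_b64}

content = [image_by_path, {"type": "text", "text": "What are these?"}]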
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration
model = LlavaOnevisionForConditionalGeneration.from_pretrained("llava-hf/llava-onevision-qwen2-0.5b-ov-hf")
processor = AutoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-0.5b-ov-hf")
messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a friendly chatbot who always responds in the style of a pirate"}],
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
            {"type": "text", "text": "What are these?"},
        ],
    },
]
Pass messages to apply_chat_template() to tokenize the input content and return the input_ids and pixel_values.
processed_chat = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt")
print(processed_chat.keys())
These inputs are now ready to be used in generate().
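For example, the processed inputs could be passed straight to generate() and decoded with the processor. A minimal sketch; max_new_tokens is an arbitrary choice.

# Generate a reply from the processed chat and decode it back to text
generated_ids = model.generate(**processed_chat, max_new_tokens=50)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])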
Video inputs
Some vision models also support video inputs. The message format is very similar to the format for image inputs.
- The content "type" should be "video" to indicate the content is a video.
- For videos, it can be a link to the video ("url") or a file path ("path"). Videos loaded from a URL can only be decoded with PyAV or Decord.

Loading a video from "url" is only supported by the PyAV or Decord backends.
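The full example below loads the video from a URL; if neither backend is available, a local file could be referenced instead with the "path" key. A sketch; local_video.mp4 is a hypothetical path.

# Hypothetical local file; the "path" key avoids URL decoding entirely
{"type": "video", "path": "local_video.mp4"}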
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration
model_id = "llava-hf/llava-onevision-qwen2-0.5b-ov-hf"
model = LlavaOnevisionForConditionalGeneration.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)
messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a friendly chatbot who always responds in the style of a pirate"}],
    },
    {
        "role": "user",
        "content": [
            {"type": "video", "url": "https://test-videos.co.uk/vids/bigbuckbunny/mp4/h264/720/Big_Buck_Bunny_720_10s_10MB.mp4"},
            {"type": "text", "text": "What do you see in this video?"},
        ],
    },
]
Pass messages to apply_chat_template() to tokenize the input content. There are a few extra parameters to include in apply_chat_template() that control the sampling process.

The video_load_backend parameter refers to a specific framework to load a video. It supports PyAV, Decord, OpenCV, and torchvision. The example below uses Decord as the backend because it is a bit faster than PyAV.

The num_frames parameter controls how many frames to uniformly sample from the video. Each checkpoint has a maximum frame count it was pretrained with and exceeding this count can significantly lower generation quality. It's important to choose a frame count that fits both the model capacity and your hardware resources. If num_frames isn't specified, the entire video is loaded without any frame sampling.
processed_chat = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    num_frames=32,
    video_load_backend="decord",
)
print(processed_chat.keys())
These inputs are now ready to be used in generate().
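As with the image example, these could be passed to generate(); slicing off the prompt tokens keeps only the newly generated reply. A minimal sketch; max_new_tokens is an arbitrary choice.

generated_ids = model.generate(**processed_chat, max_new_tokens=100)
# Keep only the tokens generated after the prompt
new_tokens = generated_ids[:, processed_chat["input_ids"].shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])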
Template configuration
You can create a custom chat template with Jinja and set it with apply_chat_template(). Refer to the Template writing guide for more details.
For example, to enable a template to handle a list of content from multiple modalities while still supporting plain strings for text-only inference, specify how to handle the content['type'] if it is an image or text as shown below in the Llama 3.2 Vision Instruct template.
{% for message in messages %}
{% if loop.index0 == 0 %}{{ bos_token }}{% endif %}
{{ '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n' }}
{% if message['content'] is string %}
{{ message['content'] }}
{% else %}
{% for content in message['content'] %}
{% if content['type'] == 'image' %}
{{ '<|image|>' }}
{% elif content['type'] == 'text' %}
{{ content['text'] }}
{% endif %}
{% endfor %}
{% endif %}
{{ '<|eot_id|>' }}
{% endfor %}
{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}{% endif %}
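Once written, such a template could be passed to apply_chat_template() to render the chat. A sketch; template_str is assumed to hold the Jinja template above.

# template_str is assumed to hold the Jinja template shown above
prompt = processor.apply_chat_template(
    messages,
    chat_template=template_str,
    add_generation_prompt=True,
)
print(prompt)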