Processors
Multimodal models require a preprocessor capable of handling inputs that combine more than one modality. Depending on the input modality, a processor converts text into an array of tensors, images into pixel values, and audio into an array of tensors with the correct sampling rate.
For example, PaliGemma is a vision-language model that uses the SigLIP image processor and the Llama tokenizer. A ProcessorMixin class wraps both of these preprocessors, providing a single, unified processor class for a multimodal model.
Call from_pretrained() to load a processor. Pass the text and image inputs to the processor to generate the expected model inputs, input_ids and pixel_values.
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from PIL import Image
import requests
processor = AutoProcessor.from_pretrained("google/paligemma-3b-pt-224")
prompt = "answer en Where is the cow standing?"
url = "https://huggingface.co/gv-hf/PaliGemma-test-224px-hf/resolve/main/cow_beach_1.png"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(text=prompt, images=image, return_tensors="pt")
inputs
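The processor output can be passed directly to the model. As a minimal sketch (the max_new_tokens value is an arbitrary choice), you could generate an answer with the PaliGemmaForConditionalGeneration class imported above and decode it with the same processor.
model = PaliGemmaForConditionalGeneration.from_pretrained("google/paligemma-3b-pt-224")
# generate an answer from the combined text and image inputs
output = model.generate(**inputs, max_new_tokens=20)
# decode only the newly generated tokens, skipping the prompt
print(processor.batch_decode(output[:, inputs["input_ids"].shape[-1]:], skip_special_tokens=True))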
This guide describes the processor class and how to preprocess multimodal inputs.
Processor classes
All processors inherit from the ProcessorMixin class which provides methods like from_pretrained(), save_pretrained(), and push_to_hub() for loading, saving, and sharing processors to the Hub.
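For example, a loaded processor could be saved and shared as sketched below (the directory and repository names are placeholders).
# write the processor files (tokenizer and image processor configs) to a local directory
processor.save_pretrained("./my-processor")
# upload the processor to a repository on the Hub (requires authentication)
processor.push_to_hub("my-username/my-processor")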
There are two ways to load a processor, with an AutoProcessor and with a model-specific processor class.
The AutoClass API provides a simple interface to load a processor without directly specifying the model class it belongs to.
Use from_pretrained() to load a processor.
from transformers import AutoProcessor
processor = AutoProcessor.from_pretrained("google/paligemma-3b-pt-224")
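A model-specific processor class can also be used directly. For example, the same PaliGemma checkpoint could be loaded with PaliGemmaProcessor.
from transformers import PaliGemmaProcessor

processor = PaliGemmaProcessor.from_pretrained("google/paligemma-3b-pt-224")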
Preprocess
Processors preprocess multimodal inputs into the expected Transformers format. There are several combinations of input modalities that a processor can handle, such as text and audio or text and image.
Automatic speech recognition (ASR) tasks require a processor that can handle text and audio inputs. Load a dataset and take a look at the audio and text columns (you can remove the other columns which aren't needed).
from datasets import load_dataset
dataset = load_dataset("lj_speech", split="train")
dataset = dataset.map(remove_columns=["file", "id", "normalized_text"])
dataset[0]["audio"]
{'array': array([-7.3242188e-04, -7.6293945e-04, -6.4086914e-04, ...,
7.3242188e-04, 2.1362305e-04, 6.1035156e-05], dtype=float32),
'path': '/root/.cache/huggingface/datasets/downloads/extracted/917ece08c95cf0c4115e45294e3cd0dee724a1165b7fc11798369308a465bd26/LJSpeech-1.1/wavs/LJ001-0001.wav',
'sampling_rate': 22050}
dataset[0]["text"]
'Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition'
Remember to resample the audio to match the pretrained model's required sampling rate.
from datasets import Audio

dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
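Accessing the audio column again decodes it at the new rate, so you can check that sampling_rate now reports 16000.
dataset[0]["audio"]["sampling_rate"]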
Load a processor and pass the audio array and text columns to it.
from transformers import AutoProcessor
processor = AutoProcessor.from_pretrained("openai/whisper-tiny")
def prepare_dataset(example):
    audio = example["audio"]
    # extract audio features and tokenize the transcript in a single processor call
    example.update(processor(audio=audio["array"], text=example["text"], sampling_rate=16000))
    return example
Apply the prepare_dataset function to preprocess the dataset. The processor returns input_features for the audio column and labels for the text column.
prepare_dataset(dataset[0])
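To preprocess the entire dataset rather than a single example, a minimal sketch with the dataset's map() method could look like this (the remove_columns list assumes the remaining original columns are audio and text).
# apply the preprocessing function to every example and drop the raw columns
dataset = dataset.map(prepare_dataset, remove_columns=["audio", "text"])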