Feature extractors

Feature extractors preprocess audio data into the correct format for a given model. A feature extractor takes the raw audio signal and converts it into a tensor that can be fed to the model. The tensor shape depends on the model, but the feature extractor correctly preprocesses the audio data for whichever model you’re using. Feature extractors also include methods for padding, truncation, and resampling.

Call from_pretrained() to load a feature extractor and its preprocessor configuration from the Hugging Face Hub or a local directory. The feature extractor and preprocessor configuration are saved in a preprocessor_config.json file.

Pass the audio signal, typically stored in array, to the feature extractor and set the sampling_rate parameter to the pretrained audio model’s sampling rate. It is important that the sampling rate of your audio data matches the sampling rate of the data the pretrained audio model was trained on.

from datasets import load_dataset
from transformers import AutoFeatureExtractor

dataset = load_dataset("PolyAI/minds14", name="en-US", split="train")
feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
processed_sample = feature_extractor(dataset[0]["audio"]["array"], sampling_rate=16000)
processed_sample
{'input_values': [array([ 9.4472744e-05,  3.0777880e-03, -2.8888427e-03, ...,
       -2.8888427e-03,  9.4472744e-05,  9.4472744e-05], dtype=float32)]}

The feature extractor returns an input, input_values, that is ready for the model to consume.

This guide walks you through the feature extractor classes and how to preprocess audio data.

Feature extractor classes

Transformers feature extractors inherit from the base SequenceFeatureExtractor class which subclasses FeatureExtractionMixin.

There are two ways to load a feature extractor: with AutoFeatureExtractor or with a model-specific feature extractor class.


The AutoClass API automatically loads the correct feature extractor for a given model.

Use from_pretrained() to load a feature extractor.

from transformers import AutoFeatureExtractor

feature_extractor = AutoFeatureExtractor.from_pretrained("openai/whisper-tiny")

Preprocess

An audio model expects its input as a tensor of a certain shape. The exact input shape can vary depending on the specific audio model you’re using.

For example, Whisper expects input_features to be a tensor of shape (batch_size, feature_size, sequence_length) but Wav2Vec2 expects input_values to be a tensor of shape (batch_size, sequence_length).

The feature extractor generates the correct input shape for whichever audio model you’re using.
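As a quick sketch of this difference (assuming a dummy one-second signal at 16kHz), you can compare the shapes the two feature extractors produce when asked for PyTorch tensors with return_tensors="pt".

import numpy as np
from transformers import AutoFeatureExtractor

dummy = np.zeros(16000, dtype=np.float32)  # one second of silence at 16kHz

whisper_fe = AutoFeatureExtractor.from_pretrained("openai/whisper-tiny")
wav2vec2_fe = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")

# Whisper converts audio to log-mel features over a fixed 30-second window
whisper_fe(dummy, sampling_rate=16000, return_tensors="pt").input_features.shape  # (batch_size, feature_size, sequence_length)

# Wav2Vec2 keeps the raw waveform, so the length follows the input
wav2vec2_fe(dummy, sampling_rate=16000, return_tensors="pt").input_values.shape  # (batch_size, sequence_length)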

A feature extractor also sets the sampling rate (the number of audio signal values taken per second) of the audio files. The sampling rate of your audio data must match the sampling rate of the dataset a pretrained model was trained on. This value is typically given in the model card.

Load a dataset with load_dataset() and a feature extractor with from_pretrained().

from datasets import load_dataset, Audio
from transformers import AutoFeatureExtractor

dataset = load_dataset("PolyAI/minds14", name="en-US", split="train")
feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")

Check out the first example from the dataset and access the audio column, which contains array, the raw audio signal.

dataset[0]["audio"]["array"]
array([ 0.        ,  0.00024414, -0.00024414, ..., -0.00024414,
        0.        ,  0.        ])

The feature extractor preprocesses array into the expected input format for a given audio model. Use the sampling_rate parameter to set the appropriate sampling rate.

processed_dataset = feature_extractor(dataset[0]["audio"]["array"], sampling_rate=16000)
processed_dataset
{'input_values': [array([ 9.4472744e-05,  3.0777880e-03, -2.8888427e-03, ...,
       -2.8888427e-03,  9.4472744e-05,  9.4472744e-05], dtype=float32)]}
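The output above holds NumPy arrays. To get tensors you can feed directly to a PyTorch model, set the return_tensors parameter to "pt".

processed_dataset = feature_extractor(
    dataset[0]["audio"]["array"], sampling_rate=16000, return_tensors="pt"
)
processed_dataset["input_values"].shape
torch.Size([1, 86699])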

Padding

Audio sequences of different lengths are an issue because Transformers expects all sequences in a batch to have the same length; sequences of uneven lengths can’t be batched.

dataset[0]["audio"]["array"].shape
(86699,)

dataset[1]["audio"]["array"].shape
(53248,)

Padding adds a special padding value to ensure all sequences have the same length. The feature extractor pads array with 0s (interpreted as silence). Set padding=True to pad sequences to the longest sequence length in the batch.

def preprocess_function(examples):
    audio_arrays = [x["array"] for x in examples["audio"]]
    inputs = feature_extractor(
        audio_arrays,
        sampling_rate=16000,
        padding=True,
    )
    return inputs

processed_dataset = preprocess_function(dataset[:5])
processed_dataset["input_values"][0].shape
(86699,)

processed_dataset["input_values"][1].shape
(86699,)
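When sequences are padded, it is often useful for the model to know which positions are real audio and which are padding. Many feature extractors can return this information as an attention_mask; a minimal sketch, reusing the first five examples from above:

audio_arrays = [x["array"] for x in dataset[:5]["audio"]]
inputs = feature_extractor(
    audio_arrays,
    sampling_rate=16000,
    padding=True,
    return_attention_mask=True,
)
inputs["attention_mask"][1]  # 1 marks real samples, 0 marks padded positions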

Truncation

Models can only process sequences up to a certain length; inputs that exceed it raise an error.

Truncation is a strategy for removing excess values from a sequence to ensure it doesn’t exceed the maximum length. Set truncation=True to truncate a sequence to the length given by the max_length parameter.

def preprocess_function(examples):
    audio_arrays = [x["array"] for x in examples["audio"]]
    inputs = feature_extractor(
        audio_arrays,
        sampling_rate=16000,
        max_length=50000,
        truncation=True,
    )
    return inputs

processed_dataset = preprocess_function(dataset[:5])
processed_dataset["input_values"][0].shape
(50000,)

processed_dataset["input_values"][1].shape
(50000,)
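Padding and truncation can also be combined to produce batches of one fixed length: padding="max_length" pads shorter sequences up to max_length, while truncation=True trims longer ones down to it. A minimal variation of the function above:

def preprocess_function(examples):
    audio_arrays = [x["array"] for x in examples["audio"]]
    inputs = feature_extractor(
        audio_arrays,
        sampling_rate=16000,
        max_length=50000,
        padding="max_length",  # pad short sequences up to max_length
        truncation=True,       # trim long sequences down to max_length
    )
    return inputs

Every processed sequence then has exactly 50000 values.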

Resampling

The Datasets library can also resample audio data to match an audio model’s expected sampling rate. Audio is resampled on the fly when an example is loaded, which can be faster than resampling the entire dataset upfront.

The audio dataset you’ve been working with has a sampling rate of 8kHz, but the pretrained model expects 16kHz.

dataset[0]["audio"]
{'path': '/root/.cache/huggingface/datasets/downloads/extracted/f507fdca7f475d961f5bb7093bcc9d544f16f8cab8608e772a2ed4fbeb4d6f50/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
 'array': array([ 0.        ,  0.00024414, -0.00024414, ..., -0.00024414,
         0.        ,  0.        ]),
 'sampling_rate': 8000}

Call cast_column on the audio column to upsample the sampling rate to 16kHz.

dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))

When you load the dataset sample, it is now resampled to 16kHz.

dataset[0]["audio"]
{'path': '/root/.cache/huggingface/datasets/downloads/extracted/f507fdca7f475d961f5bb7093bcc9d544f16f8cab8608e772a2ed4fbeb4d6f50/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
 'array': array([ 1.70562416e-05,  2.18727451e-04,  2.28099874e-04, ...,
         3.43842403e-05, -5.96364771e-06, -1.76846661e-05]),
 'sampling_rate': 16000}
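With the dataset resampled, a preprocess_function like the ones defined above can be applied across the entire dataset with Datasets’ map method; batched=True processes multiple examples at once.

processed_dataset = dataset.map(preprocess_function, batched=True)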