---
library_name: keras-hub
license: mit
tags:
- speech-recognition
- keras
- automatic-speech-recognition
pipeline_tag: automatic-speech-recognition
---
## Model Overview
⚠️ Whisper is currently only available via the `keras-hub-nightly` package. Use `pip install keras-hub-nightly` to try this model.

A Whisper encoder-decoder network for speech.

This class implements a Transformer-based encoder-decoder model as described in ["Robust Speech Recognition via Large-Scale Weak Supervision"](https://arxiv.org/abs/2212.04356). It includes the embedding lookups and transformer layers, but not the head for predicting the next token.

The default constructor gives a fully customizable, randomly initialized Whisper model with any number of layers, heads, and embedding dimensions. To load preset architectures and weights, use the `from_preset()` constructor.

Disclaimer: Pre-trained models are provided on an "as is" basis, without warranties or conditions of any kind. The underlying model is provided by a third party and subject to a separate license, available [here](https://github.com/openai/whisper).

__Arguments__

- __vocabulary_size__: int. The size of the token vocabulary.
- __num_layers__: int. The number of transformer encoder layers and transformer decoder layers.
- __num_heads__: int. The number of attention heads for each transformer. The hidden size must be divisible by the number of attention heads.
- __hidden_dim__: int. The size of the transformer encoding and pooler layers.
- __intermediate_dim__: int. The output dimension of the first Dense layer in a two-layer feedforward network for each transformer.
- __num_mels__: int. The number of mel-frequency filters. Defaults to `80`.
- __dropout__: float. Dropout probability for the Transformer encoder.
- __max_encoder_sequence_length__: int. The maximum sequence length that the audio encoder can consume.
  Since the second convolutional layer in the encoder reduces the sequence length by half (stride of 2), we use `max_encoder_sequence_length // 2` as the sequence length for the positional embedding layer.
- __max_decoder_sequence_length__: int. The maximum sequence length that the text decoder can consume.

## Example Usage
```python
import keras_hub
import keras
import numpy as np
```

```python
input_data = {
    # Mel-spectrogram features are floating point, shaped
    # (batch, frames, num_mels).
    "encoder_features": np.ones(shape=(1, 12, 80), dtype="float32"),
    "decoder_token_ids": np.ones(shape=(1, 12), dtype="int32"),
    # 1 marks a real token, 0 marks padding.
    "decoder_padding_mask": np.array(
        [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]]
    ),
}

# Randomly initialized Whisper encoder-decoder model with a custom config.
model = keras_hub.models.WhisperBackbone(
    vocabulary_size=51864,
    num_layers=4,
    num_heads=4,
    hidden_dim=256,
    intermediate_dim=512,
    max_encoder_sequence_length=128,
    max_decoder_sequence_length=128,
)
model(input_data)
```
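The constraints listed under Arguments can be checked before constructing a backbone. The following is a minimal sketch in plain Python, not part of keras-hub: `check_whisper_config` is a hypothetical helper that verifies `hidden_dim` is divisible by `num_heads` and reports the positional embedding length the encoder would use, `max_encoder_sequence_length // 2`. The per-head size shown assumes the usual multi-head attention split of `hidden_dim // num_heads`.

```python
def check_whisper_config(hidden_dim, num_heads, max_encoder_sequence_length):
    """Hypothetical helper (not a keras-hub API) that validates a custom config."""
    # The hidden size must be divisible by the number of attention heads.
    if hidden_dim % num_heads != 0:
        raise ValueError(
            f"hidden_dim={hidden_dim} is not divisible by num_heads={num_heads}"
        )
    return {
        # Assumption: each head gets an equal slice of the hidden size.
        "head_dim": hidden_dim // num_heads,
        # The stride-2 convolution halves the encoder sequence length, so the
        # positional embedding covers max_encoder_sequence_length // 2 steps.
        "encoder_position_length": max_encoder_sequence_length // 2,
    }


summary = check_whisper_config(
    hidden_dim=256, num_heads=4, max_encoder_sequence_length=128
)
print(summary)  # {'head_dim': 64, 'encoder_position_length': 64}
```

With the values from the custom config above, the encoder's positional embedding spans 64 positions even though `max_encoder_sequence_length` is 128.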