	---
	license: cc-by-nc-4.0
	language:
	- en
	library_name: transformers
	---
	# Whisper Multitask Analyzer

	A transformer encoder-decoder model for automatic audio captioning. As opposed to speech-to-text, captioning describes the content and features of audio clips.

	- Model, codebase & card adapted from: MU-NLPC/whisper-small-audio-captioning
	- Model type: Whisper encoder-decoder transformer
	- Language(s) (NLP): en
	- License: cc-by-nc-4.0
	- Parent Model: openai/whisper-small

	## Usage

	The encoder takes an audio clip of up to 30 s as input, and the decoder takes the caption style as a forced prefix.
	The forced prefix is an integer task ID. The ID-to-task mapping is defined in the model config and can be read from the model (`model.task_mapping`, or `model.named_task_mapping` for the inverse).
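
For recordings longer than the 30 s limit, one option is to caption them clip by clip. The following is a minimal sketch of that idea, not part of the model's codebase: `split_into_clips` is a hypothetical helper, and the 16 kHz sampling rate is an assumption (in practice, use `feature_extractor.sampling_rate`).

```python
from typing import Sequence

SAMPLING_RATE = 16_000  # assumption: replace with feature_extractor.sampling_rate
MAX_SECONDS = 30        # clip length limit stated in this card


def split_into_clips(
    audio: Sequence[float],
    sampling_rate: int = SAMPLING_RATE,
    max_seconds: int = MAX_SECONDS,
) -> list:
    """Split a mono waveform into consecutive clips of at most max_seconds."""
    clip_len = sampling_rate * max_seconds
    return [audio[start:start + clip_len] for start in range(0, len(audio), clip_len)]


# 70 s of silence becomes clips of 30 s, 30 s and 10 s.
clips = split_into_clips([0.0] * (70 * SAMPLING_RATE))
print([len(clip) / SAMPLING_RATE for clip in clips])  # [30.0, 30.0, 10.0]
```

The helper only uses slicing, so it should work the same way on the NumPy array returned by `librosa.load`. Each clip can then be passed through the feature extractor separately.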

	The task mapping of the current model is:

	| Task     | ID | Description                                             |
	| -------- | -- | ------------------------------------------------------- |
	| tags     | 0  | General descriptions, can include genres and features.  |
	| genre    | 1  | Estimated musical genres.                               |
	| mood     | 2  | Estimated emotional feeling.                            |
	| movement | 3  | Estimated audio pace and expression.                    |
	| theme    | 4  | Estimated audio usage (not very accurate)               |
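
As a rough illustration of how a task name turns into the decoder prefix text, the sketch below hard-codes the table above and builds the `"<id>: "` string used in this card's usage example. `build_style_prefix` is a hypothetical helper introduced here for illustration; with a loaded model, read the mapping from `model.named_task_mapping` instead of copying it.

```python
# Task mapping copied from the table above (ID -> task name).
TASK_MAPPING = {0: "tags", 1: "genre", 2: "mood", 3: "movement", 4: "theme"}

# Inverse mapping (task name -> ID).
NAMED_TASK_MAPPING = {name: task_id for task_id, name in TASK_MAPPING.items()}


def build_style_prefix(task: str) -> str:
    """Return the decoder prefix text for a task name, e.g. 'mood' -> '2: '."""
    if task not in NAMED_TASK_MAPPING:
        raise ValueError(
            f"unknown task {task!r}; expected one of {sorted(NAMED_TASK_MAPPING)}"
        )
    return f"{NAMED_TASK_MAPPING[task]}: "


print(repr(build_style_prefix("mood")))  # '2: '
```

The resulting string still has to be tokenized before it is passed to `generate`.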

	Minimal example:

	```python
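	import librosa
	import transformers

	# WhisperForAudioCaptioning is not part of the transformers library; import it
	# from the model repository's code (the module path depends on your setup).

	# Load model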
	checkpoint = "DionTimmer/whisper-small-multitask-analyzer"
	model = WhisperForAudioCaptioning.from_pretrained(checkpoint)
	tokenizer = transformers.WhisperTokenizer.from_pretrained(checkpoint, language="en", task="transcribe")
	feature_extractor = transformers.WhisperFeatureExtractor.from_pretrained(checkpoint)

	# Load and preprocess audio
	input_file = "..."
	audio, sampling_rate = librosa.load(input_file, sr=feature_extractor.sampling_rate)
	features = feature_extractor(audio, sampling_rate=sampling_rate, return_tensors="pt").input_features

	# Mappings by ID
	print(model.task_mapping) # {0: 'tags', 1: 'genre', 2: 'mood', 3: 'movement', 4: 'theme'}

	# Inverted
	print(model.named_task_mapping) # {'tags': 0, 'genre': 1, 'mood': 2, 'movement': 3, 'theme': 4}

	# Prepare caption style
	style_prefix = f"{model.named_task_mapping['tags']}: "
	style_prefix_tokens = tokenizer("", text_target=style_prefix, return_tensors="pt", add_special_tokens=False).labels

	# Generate caption
	model.eval()
	outputs = model.generate(
	    inputs=features.to(model.device),
	    forced_ac_decoder_ids=style_prefix_tokens,
	    max_length=100,
	)

	print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
	```

	Example output:
	0: advertising, beautiful, beauty, bright, cinematic, commercial, corporate, emotional, epic, film, heroic, hopeful, inspiration, inspirational, inspiring, love, love story, movie, orchestra, orchestral, piano, positive, presentation, romantic, sentimental
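
The decoded caption is a single string of the form `<task id>: <comma-separated labels>`, as in the example output above. A minimal sketch for splitting it into a task ID and a list of labels could look like this; `parse_caption` is a hypothetical helper, and it assumes the output always follows that format.

```python
def parse_caption(caption: str) -> tuple[int, list[str]]:
    """Split '<task id>: a, b, c' into (task_id, ['a', 'b', 'c'])."""
    task_id_text, _, body = caption.partition(":")
    task_id = int(task_id_text.strip())  # raises ValueError if the ID is missing
    labels = [label.strip() for label in body.split(",") if label.strip()]
    return task_id, labels


task_id, labels = parse_caption("0: advertising, beautiful, love story")
print(task_id, labels)  # 0 ['advertising', 'beautiful', 'love story']
```

Captions that do not start with an integer ID raise a `ValueError`, which is usually preferable to silently returning malformed labels.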

	WhisperTokenizer must be initialized with `language="en"` and `task="transcribe"`.

	The model class `WhisperForAudioCaptioning` can be found in the git repository or in the model repository here on the Hugging Face Hub. The class overrides the default Whisper `generate` method to support forcing a decoder prefix.
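
To illustrate what forcing a decoder prefix means, here is a toy greedy decoder in plain Python. It is a conceptual sketch only, not the implementation in `WhisperForAudioCaptioning`: `greedy_decode_with_prefix` and `dummy_next_token` are hypothetical names, and the dummy function stands in for a real model's next-token prediction.

```python
from typing import Callable, Sequence


def greedy_decode_with_prefix(
    next_token: Callable[[Sequence[int]], int],
    forced_prefix: Sequence[int],
    eos_token: int,
    max_length: int,
) -> list[int]:
    """Start decoding from a fixed prefix instead of predicting the first tokens."""
    tokens = list(forced_prefix)[:max_length]  # prefix is copied in, never predicted
    while len(tokens) < max_length:
        token = next_token(tokens)  # the "model" only continues after the prefix
        tokens.append(token)
        if token == eos_token:
            break
    return tokens


def dummy_next_token(tokens: Sequence[int]) -> int:
    """Stand-in for a model: always predicts the previous token plus one."""
    return tokens[-1] + 1


print(greedy_decode_with_prefix(dummy_next_token, forced_prefix=[1, 2], eos_token=5, max_length=10))
# [1, 2, 3, 4, 5]
```

In this card's usage example, the tokens of the style prefix (for example `"0: "`) play the role of `forced_prefix`, so the generated caption continues from the chosen task ID.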


	## Licence

	The model weights are published under the non-commercial CC BY-NC 4.0 license because the model was fine-tuned on a dataset restricted to non-commercial use.