DionTimmer/whisper-small-multitask-analyzer

DionTimmer committed on Jul 20, 2024

Commit 8680be5 (verified), 1 parent: 69943db

Update README.md

README.md +75 -78

@@ -1,79 +1,76 @@
----
-license: cc-by-nc-4.0
-language:
-- en
-library_name: transformers
----
-# Whisper Multitask Analyzer
-A transformer encoder-decoder model for automatic audio captioning. As opposed to speech-to-text, captioning describes the content and features of audio clips.
-- **Model, codebase & card adapted from:** MU-NLPC/whisper-small-audio-captioning
-- **Model type:** Whisper encoder-decoder transformer
-- **Language(s) (NLP):** en
-- **License:** cc-by-4.0
-- **Parent Model:** openai/whisper-small
-## Usage
-The model expects an audio clip (up to 30s) to the encoder as an input and information about caption style as forced prefix to the decoder.
-The forced prefix is an integer which is mapped to various tasks. This mapping is defined in the model config and can be retrieved with a function.
-The tag mapping of the current model is:
-| Task     | ID | Description                                            |
-| -------- | -- | ------------------------------------------------------ |
-| tags     | 0  | General descriptions, can include genres and features. |
-| genre    | 1  | Estimated musical genres.                              |
-| mood     | 2  | Estimated emotional feeling.                           |
-| movement | 3  | Estimated audio pace and expression.                   |
-| theme    | 4  | Estimated audio usage (not very accurate)              |
-```
-Minimal example:
-```python
-# Load model
-checkpoint = "DionTimmer/whisper-small-multitask-analyzer"
-model = WhisperForAudioCaptioning.from_pretrained(checkpoint)
-tokenizer = transformers.WhisperTokenizer.from_pretrained(checkpoint, language="en", task="transcribe")
-feature_extractor = transformers.WhisperFeatureExtractor.from_pretrained(checkpoint)
-# Load and preprocess audio
-input_file = "..."
-audio, sampling_rate = librosa.load(input_file, sr=feature_extractor.sampling_rate)
-features = feature_extractor(audio, sampling_rate=sampling_rate, return_tensors="pt").input_features
-# Mappings by ID
-print(model.task_mapping) # {0: 'tags', 1: 'genre', 2: 'mood', 3: 'movement', 4: 'theme'}
-# Inverted
-print(model.named_task_mapping) # {'tags': 0, 'genre': 1, 'mood': 2, 'movement': 3, 'theme': 4}
-# Prepare caption style
-style_prefix = f"{model.named_task_mapping['tags']}: "
-style_prefix_tokens = tokenizer("", text_target=style_prefix, return_tensors="pt", add_special_tokens=False).labels
-# Generate caption
-model.eval()
-outputs = model.generate(
-    inputs=features.to(model.device),
-    forced_ac_decoder_ids=style_prefix_tokens,
-    max_length=100,
-)
-print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
-```
-Example output:
-*0: advertising, beautiful, beauty, bright, cinematic, commercial, corporate, emotional, epic, film, heroic, hopeful, inspiration, inspirational, inspiring, love, love story, movie, orchestra, orchestral, piano, positive, presentation, romantic, sentimental*
-WhisperTokenizer must be initialized with `language="en"` and `task="transcribe"`.
-The model class `WhisperForAudioCaptioning` can be found in the git repository or here on the HuggingFace Hub in the model repository. The class overrides default Whisper `generate` method to support forcing decoder prefix.
-## Licence
 The model weights are published under the non-commercial CC BY-NC 4.0 license, as the model was fine-tuned on a dataset for non-commercial use.

+---
+license: cc-by-nc-4.0
+language:
+- en
+library_name: transformers
+---
+# Whisper Multitask Analyzer
+A transformer encoder-decoder model for automatic audio captioning. As opposed to speech-to-text, captioning describes the content and features of audio clips.
+- **Model, codebase & card adapted from:** MU-NLPC/whisper-small-audio-captioning
+- **Model type:** Whisper encoder-decoder transformer
+- **Language(s) (NLP):** en
+- **License:** cc-by-nc-4.0
+- **Parent Model:** openai/whisper-small
+## Usage
+The encoder takes an audio clip (up to 30 s) as input, and the decoder takes the caption style as a forced prefix.
+The forced prefix is an integer task ID. The mapping between IDs and tasks is defined in the model config and can be read from the model as `model.task_mapping` (or `model.named_task_mapping` for the inverse).
+The task mapping of the current model is:
+| Task     | ID | Description                                            |
+| -------- | -- | ------------------------------------------------------ |
+| tags     | 0  | General descriptions, can include genres and features. |
+| genre    | 1  | Estimated musical genres.                              |
+| mood     | 2  | Estimated emotional feeling.                           |
+| movement | 3  | Estimated audio pace and expression.                   |
+| theme    | 4  | Estimated audio usage (not very accurate).             |
+Minimal example:
+```python
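+import librosa
+import transformers
+# WhisperForAudioCaptioning is not part of transformers; import it from the code in the model repository
+# Load model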
+checkpoint = "DionTimmer/whisper-small-multitask-analyzer"
+model = WhisperForAudioCaptioning.from_pretrained(checkpoint)
+tokenizer = transformers.WhisperTokenizer.from_pretrained(checkpoint, language="en", task="transcribe")
+feature_extractor = transformers.WhisperFeatureExtractor.from_pretrained(checkpoint)
+# Load and preprocess audio
+input_file = "..."
+audio, sampling_rate = librosa.load(input_file, sr=feature_extractor.sampling_rate)
+features = feature_extractor(audio, sampling_rate=sampling_rate, return_tensors="pt").input_features
+# Mappings by ID
+print(model.task_mapping) # {0: 'tags', 1: 'genre', 2: 'mood', 3: 'movement', 4: 'theme'}
+# Inverted
+print(model.named_task_mapping) # {'tags': 0, 'genre': 1, 'mood': 2, 'movement': 3, 'theme': 4}
+# Prepare caption style
+style_prefix = f"{model.named_task_mapping['tags']}: "
+style_prefix_tokens = tokenizer("", text_target=style_prefix, return_tensors="pt", add_special_tokens=False).labels
+# Generate caption
+model.eval()
+outputs = model.generate(
+    inputs=features.to(model.device),
+    forced_ac_decoder_ids=style_prefix_tokens,
+    max_length=100,
+)
+print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
+```
+Example output:
+*0: advertising, beautiful, beauty, bright, cinematic, commercial, corporate, emotional, epic, film, heroic, hopeful, inspiration, inspirational, inspiring, love, love story, movie, orchestra, orchestral, piano, positive, presentation, romantic, sentimental*
+The `WhisperTokenizer` must be initialized with `language="en"` and `task="transcribe"`.
+The model class `WhisperForAudioCaptioning` can be found in the git repository or here on the HuggingFace Hub in the model repository. The class overrides the default Whisper `generate` method to support a forced decoder prefix.
+## Licence
 The model weights are published under the non-commercial CC BY-NC 4.0 license, as the model was fine-tuned on a dataset for non-commercial use.
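
The usage section in the diff above builds the decoder prefix as the task ID followed by a colon and a space, and the example output starts with that same prefix. The sketch below shows one way to build the prefix text and to split a decoded caption back into a task name and a list of items, without loading the model. The helper names `build_style_prefix` and `parse_caption` are hypothetical and not part of the model repository; the mapping is copied from the README table, and the parser assumes the caption always has the form `<task id>: item, item, ...`, which the README shows for one example only.

```python
# Task mapping copied from the README table (assumed to match model.task_mapping).
TASK_MAPPING = {0: "tags", 1: "genre", 2: "mood", 3: "movement", 4: "theme"}
NAMED_TASK_MAPPING = {name: task_id for task_id, name in TASK_MAPPING.items()}


def build_style_prefix(task_name: str) -> str:
    """Return the decoder prefix text for a task, e.g. 'tags' -> '0: '."""
    return f"{NAMED_TASK_MAPPING[task_name]}: "


def parse_caption(caption: str) -> tuple[str, list[str]]:
    """Split a decoded caption such as '0: piano, romantic' into (task, items).

    Assumes the caption starts with '<task id>:' and that items are comma-separated.
    """
    prefix, _, body = caption.partition(":")
    task_name = TASK_MAPPING[int(prefix.strip())]
    # Drop surrounding whitespace and skip empty fragments
    items = [item.strip() for item in body.split(",") if item.strip()]
    return task_name, items


if __name__ == "__main__":
    print(repr(build_style_prefix("mood")))
    print(parse_caption("0: orchestral, piano, romantic"))
```

In the README example, the string returned by `build_style_prefix` would take the place of `style_prefix` before tokenization, and `parse_caption` would be applied to the decoded output string.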