hajekad committed
Commit eac9c9c
1 Parent(s): 857aba7

Update README.md

Files changed (1)
  1. README.md +129 -0
README.md CHANGED
---
datasets:
- AudioSet
- AudioCaps
- Clotho-v2.1
metrics:
-
model-index:
- name: whisper-small-audio-captioning
  results:
  - task:
      type: audio-captioning
      name: Audio Captioning
    dataset:
      type: clotho-v2.1
      name: Clotho
      split: evaluation
    metrics:
    - type: SPICE
      value: 0.1234
    - type: CIDEr
      value: 0.4142
    - type: SPIDEr
      value: 0.2687
    - type: METEOR
      value: 0.3781
    - type: SacreBLEU
      value: 15.76
license: cc-by-nc-4.0
language:
- en
---

# Model Card for Whisper Audio Captioning

A transformer encoder-decoder model for automatic audio captioning. As opposed to speech-to-text, captioning describes the content of audio clips, such as prominent sounds or environmental noises. This task has numerous practical applications, e.g., providing access to audio information for people with hearing impairments or improving the searchability of audio content.

- **Model type:** Whisper encoder-decoder transformer
- **Language(s) (NLP):** en
- **License:** cc-by-nc-4.0
- **Parent Model:** openai/whisper-
- **Resources for more information:**
  - [GitHub Repo](https://github.com/prompteus/audio-captioning)
  - [Technical Report](TODO)


## Usage

The model expects an audio clip (up to 30s) at the encoder input and information about the caption style as a forced prefix to the decoder.

Minimal example:

```python
import audiocap  # model class from https://github.com/prompteus/audio-captioning
import librosa
import transformers

# Load model
architecture = "openai/whisper-small"
checkpoint = "MU-NLPC/whisper-small-audio-captioning"
model = audiocap.WhisperForAudioCaptioning.from_pretrained(checkpoint)
tokenizer = transformers.WhisperTokenizer.from_pretrained(checkpoint, language="en", task="transcribe")
feature_extractor = transformers.WhisperFeatureExtractor.from_pretrained(architecture)

# Load and preprocess audio
input_file = "..."
audio, sampling_rate = librosa.load(input_file, sr=feature_extractor.sampling_rate)
features = feature_extractor(audio, sampling_rate=sampling_rate, return_tensors="pt").input_features

# Prepare caption style
style_prefix = "clotho > caption: "
style_prefix_tokens = tokenizer("", text_target=style_prefix, return_tensors="pt", add_special_tokens=False).labels

# Generate caption
model.eval()
outputs = model.generate(
    inputs=features.to(model.device),
    forced_ac_decoder_ids=style_prefix_tokens,
    max_length=100,
)

print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```

Example output:

*clotho > caption: Rain is pouring down and thunder is rumbling in the background.*

The style prefix influences the style of the caption. The model knows 3 styles: `audioset > keywords: `, `audiocaps > caption: `, and `clotho > caption: `. It was finetuned on Clotho, and that is the intended "default" style.

WhisperTokenizer must be initialized with `language="en"` and `task="transcribe"`.

Our model class `WhisperForAudioCaptioning` can be found in our git repository or here on the HuggingFace Hub in the model repository. The class overrides the default Whisper `generate` method to support forcing the decoder prefix.
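
As a rough illustration of what forcing a decoder prefix means (this is not the actual `WhisperForAudioCaptioning` implementation, and the token ids below are invented): decoding starts from the fixed style-prefix tokens instead of an empty sequence, and generation continues after them.

```python
# Illustrative sketch only; the ids are made up for the example.
def force_decoder_prefix(decoder_start_ids, style_prefix_ids):
    """Pin the first generated positions to the style prefix so that
    every caption begins with e.g. "clotho > caption: "."""
    return list(decoder_start_ids) + list(style_prefix_ids)

forced = force_decoder_prefix([50258], [1723, 29, 2192])  # hypothetical token ids
print(forced)  # -> [50258, 1723, 29, 2192]
```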


## Training details

The model was initialized from the original speech-to-text `openai/whisper-small` weights. Then, it was pretrained on a mix of (1) a subset of AudioSet with synthetic labels, (2) the AudioCaps captioning dataset, and (3) the Clotho v2.1 captioning dataset. Finally, it was finetuned on Clotho v2.1 to focus the model on the specific style of its captions. For each training input, the model was informed about the source of the data, so it can mimic the caption style of all 3 sources.
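
The source conditioning can be pictured as prepending the corresponding style prefix from the usage section to each training target. A hypothetical sketch (the exact preprocessing code may differ):

```python
# Hypothetical helper; the prefixes match those listed in the usage section.
def make_target(source: str, text: str) -> str:
    """Prepend the style prefix that tells the model which dataset a target comes from."""
    prefix = {
        "audioset": "audioset > keywords: ",
        "audiocaps": "audiocaps > caption: ",
        "clotho": "clotho > caption: ",
    }[source]
    return prefix + text

print(make_target("clotho", "Rain pours and thunder rumbles."))
# -> clotho > caption: Rain pours and thunder rumbles.
```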

During pretraining, the ratio of samples in each batch was approximately 12:3:1 (AudioSet:AudioCaps:Clotho). Pretraining took 19800 steps with batch size 32 and learning rate 2e-5. Finetuning was done on Clotho only, for 1500 steps with batch size 32 and learning rate 4e-6. All layers except the *fc1* layers were frozen during finetuning.
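
The freezing step could be sketched as follows. This is an illustrative reconstruction, not the actual training code; a tiny randomly initialized config stands in for `openai/whisper-small`, and it relies on the fact that the Hugging Face Whisper implementation names the feed-forward projections `fc1` and `fc2`:

```python
from transformers import WhisperConfig, WhisperForConditionalGeneration

# Tiny random config as a stand-in for the real checkpoint (illustrative only).
config = WhisperConfig(
    d_model=64, encoder_layers=2, decoder_layers=2,
    encoder_attention_heads=2, decoder_attention_heads=2,
    encoder_ffn_dim=128, decoder_ffn_dim=128,
)
model = WhisperForConditionalGeneration(config)

# Freeze everything except the fc1 feed-forward projections,
# mirroring "all layers except fc1 were frozen during finetuning".
for name, param in model.named_parameters():
    param.requires_grad = ".fc1." in name

trainable = [name for name, param in model.named_parameters() if param.requires_grad]
print(len(trainable))  # one weight + one bias per fc1 layer
```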

For more information about the training regime, see the [technical report](TODO).


## Evaluation details

Metrics reported in the metadata were computed on the Clotho v2.1 evaluation split with captions generated using beam search with 5 beams.

|           | whisper-tiny | whisper-small | whisper-large-v2 |
|-----------|--------------|---------------|------------------|
| SacreBLEU | 13.77        | 15.76         | 16.50            |
| METEOR    | 0.3452       | 0.3781        | 0.3782           |
| CIDEr     | 0.3404       | 0.4142        | 0.4331           |
| SPICE     | 0.1077       | 0.1234        | 0.1257           |
| SPIDEr    | 0.2240       | 0.2687        | 0.2794           |


## Limitations

The captions generated by the model can be misleading or untruthful, even if they appear convincing. Hallucination occurs especially in domains that were not present in the finetuning data.

While the original speech-to-text checkpoints by OpenAI were trained on multilingual data, our training contains only English captions, so the model is not expected to support other languages.


## Licence

The model weights are published under the non-commercial license CC BY-NC 4.0, as the model was finetuned on a dataset released for non-commercial use.


## Contact

If you'd like to chat about this, please get in touch with us via email at kadlcik`<at>`mail.muni.cz or ahajek`<at>`mail.muni.cz.