Labbeti committed on
Commit
a353501
·
1 Parent(s): 88c61c9

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +34 -4
README.md CHANGED
@@ -31,16 +31,46 @@ model = CoNeTTEModel.from_pretrained("Labbeti/conette", config=config)
31
 
32
  path = "/my/path/to/audio.wav"
33
  outputs = model(path)
34
- cands = outputs["cands"][0]
35
- print(cands)
36
  ```
37
 
38
- ## Single model performance
39
  | Dataset | SPIDEr (%) | SPIDEr-FL (%) | FENSE (%) |
40
  | ------------- | ------------- | ------------- | ------------- |
41
  | AudioCaps | 44.14 | 43.98 | 60.81 |
42
  | Clotho | 30.97 | 30.87 | 51.72 |
43
 
44
  ## Citation
45
  The preprint version of the paper describing CoNeTTE is available on arxiv: https://arxiv.org/pdf/2309.00454.pdf
46
 
@@ -60,6 +90,6 @@ The preprint version of the paper describing CoNeTTE is available on arxiv: http
60
  ## Additional information
61
 
62
  The encoder part of the architecture is based on a ConvNeXt model for audio classification, available here: https://huggingface.co/topel/ConvNeXt-Tiny-AT.
63
- The encoder weights used are named "convnext_tiny_465mAP_BL_AC_70kit.pth", available on Zenodo: https://zenodo.org/record/8020843.
64
 
65
  It was created by [@Labbeti](https://hf.co/Labbeti).
 
31
 
32
  path = "/my/path/to/audio.wav"
33
  outputs = model(path)
34
+ candidate = outputs["cands"][0]
35
+ print(candidate)
36
  ```
37
 
38
+ The model can also accept several audio files at once (list[str]), or a list of pre-loaded audio waveforms (list[Tensor]). In this second case, you also need to provide the sampling rate of each file:
39
+
40
+ ```py
41
+ import torchaudio
42
+
43
+ path_1 = "/my/path/to/audio_1.wav"
44
+ path_2 = "/my/path/to/audio_2.wav"
45
+
46
+ audio_1, sr_1 = torchaudio.load(path_1)
47
+ audio_2, sr_2 = torchaudio.load(path_2)
48
+
49
+ outputs = model([audio_1, audio_2], sr=[sr_1, sr_2])
50
+ candidates = outputs["cands"]
51
+ print(candidates)
52
+ ```
53
+
54
+ The model can also produce different captions using a Task Embedding input, which indicates the dataset caption style. The default task is "clotho".
55
+
56
+ ```py
57
+ outputs = model(path, task="clotho")
58
+ candidate = outputs["cands"][0]
59
+ print(candidate)
60
+
61
+ outputs = model(path, task="audiocaps")
62
+ candidate = outputs["cands"][0]
63
+ print(candidate)
64
+ ```
65
+
66
+ ## Performance
67
  | Dataset | SPIDEr (%) | SPIDEr-FL (%) | FENSE (%) |
68
  | ------------- | ------------- | ------------- | ------------- |
69
  | AudioCaps | 44.14 | 43.98 | 60.81 |
70
  | Clotho | 30.97 | 30.87 | 51.72 |
71
 
72
+ This model checkpoint was trained for the Clotho dataset, but it can also reach good performance on AudioCaps when the "audiocaps" task is used.
73
+
74
  ## Citation
75
  The preprint version of the paper describing CoNeTTE is available on arxiv: https://arxiv.org/pdf/2309.00454.pdf
76
 
 
90
  ## Additional information
91
 
92
  The encoder part of the architecture is based on a ConvNeXt model for audio classification, available here: https://huggingface.co/topel/ConvNeXt-Tiny-AT.
93
+ More precisely, the encoder weights used are named "convnext_tiny_465mAP_BL_AC_70kit.pth", available on Zenodo: https://zenodo.org/record/8020843.
94
 
95
  It was created by [@Labbeti](https://hf.co/Labbeti).
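
The task switch added in this commit (`task="clotho"` vs. `task="audiocaps"`) lends itself to a side-by-side comparison of caption styles. The sketch below is only an illustration of that call pattern: `caption_per_task` is a hypothetical helper, and `fake_model` is a stand-in that mimics the `model(audio, task=...)` call and the `outputs["cands"]` key shown in the README, so the example runs without downloading any weights. Whether the real model accepts `task` together with a list input is an assumption here.

```python
from typing import Any, Callable, Dict, List, Sequence


def fake_model(audio: Any, task: str = "clotho", **kwargs: Any) -> Dict[str, List[str]]:
    """Hypothetical stand-in for the loaded CoNeTTE model (not the real API).

    It only reproduces the call shape from the README: one candidate caption
    per input, returned under the "cands" key.
    """
    n_items = len(audio) if isinstance(audio, (list, tuple)) else 1
    return {"cands": [f"<{task} caption {i}>" for i in range(n_items)]}


def caption_per_task(
    model: Callable[..., Dict[str, List[str]]],
    audio: Any,
    tasks: Sequence[str] = ("clotho", "audiocaps"),
) -> Dict[str, List[str]]:
    """Call the model once per task and collect the candidates for each task."""
    return {task: model(audio, task=task)["cands"] for task in tasks}


if __name__ == "__main__":
    # With the real model, pass the CoNeTTEModel instance instead of fake_model.
    results = caption_per_task(fake_model, "/my/path/to/audio.wav")
    for task, cands in results.items():
        print(task, cands[0])
```

Swapping `fake_model` for the real model object should be the only change needed, provided the call signature matches the README examples.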