---
language: en
license: mit
tags:
- audio
- captioning
- text
- audio-captioning
- automated-audio-captioning
task_categories:
- audio-captioning
---

# CoNeTTE (ConvNext-Transformer with Task Embedding) for Automated Audio Captioning

<font color='red'>This model is currently in development, and not all of the required files are available yet.</font>

This model generates a short textual description of any audio file.

## Installation
```bash
pip install conette
```

## Usage
```py
from conette import CoNeTTEConfig, CoNeTTEModel

# Load the configuration and the pretrained weights from the Hugging Face Hub
config = CoNeTTEConfig.from_pretrained("Labbeti/conette")
model = CoNeTTEModel.from_pretrained("Labbeti/conette", config=config)

# Caption a single audio file given its path
path = "/my/path/to/audio.wav"
outputs = model(path)

# Take the first entry of the returned candidates
cands = outputs["cands"][0]
print(cands)
```

## Single model performance

| Dataset | SPIDEr (%) | SPIDEr-FL (%) | FENSE (%) |
| ------------- | ------------- | ------------- | ------------- |
| AudioCaps | 44.14 | 43.98 | 60.81 |
| Clotho | 30.97 | 30.87 | 51.72 |

## Citation
The preprint version of the paper describing CoNeTTE is available on arXiv: https://arxiv.org/pdf/2309.00454.pdf

```
@misc{labbé2023conette,
	title        = {CoNeTTE: An efficient Audio Captioning system leveraging multiple datasets with Task Embedding},
	author       = {Étienne Labbé and Thomas Pellegrini and Julien Pinquier},
	year         = 2023,
	journal      = {arXiv preprint arXiv:2309.00454},
	url          = {https://arxiv.org/pdf/2309.00454.pdf},
	eprint       = {2309.00454},
	archiveprefix = {arXiv},
	primaryclass = {cs.SD}
}
```

## Additional information
The encoder part of the architecture is based on a ConvNeXt model for audio classification, available here: https://huggingface.co/topel/ConvNeXt-Tiny-AT.
The encoder weights used are named "convnext_tiny_465mAP_BL_AC_70kit.pth" and are available on Zenodo: https://zenodo.org/record/8020843.

This model was created by [@Labbeti](https://hf.co/Labbeti).