alefiury
/

wav2vec2-xls-r-300m-pt-br-spontaneous-speech-emotion-recognition


alefiury committed on Mar 25, 2022

Commit

6906af3

•

1 Parent(s): cd9c831

Upload README.md

Files changed (1)

README.md +62 -0

README.md ADDED

	@@ -0,0 +1,62 @@

+---
+language: pt
+datasets:
+- coraa_ser
+- emovo
+- ravdess
+- baved
+metrics:
+- f1
+tags:
+- audio
+- speech
+- wav2vec2
+- pt
+- portuguese-speech-corpus
+- italian-speech-corpus
+- english-speech-corpus
+- arabic-speech-corpus
+- spontaneous
+- PyTorch
+license: apache-2.0
+model_index:
+  name: wav2vec2-xls-r-300m-pt-br-spontaneous-speech-emotion-recognition
+---
+# Wav2vec 2.0 XLS-R For Spontaneous Speech Emotion Recognition
+This model took first place in the SER track of the Automatic Speech Recognition for spontaneous and prepared speech & Speech Emotion Recognition in Portuguese (SE&R 2022) Workshop.
+The following datasets were used for training:
+- [CORAA SER v1.0](https://github.com/rmarcacini/ser-coraa-pt-br/): a dataset of spontaneous Portuguese speech comprising approximately 50 minutes of audio segments labeled in three classes: neutral, non-neutral female, and non-neutral male.
+- [EMOVO Corpus](https://aclanthology.org/L14-1478/): a database of emotional speech for the Italian language, built from the voices of up to 6 actors who played 14 sentences simulating 6 emotional states (disgust, fear, anger, joy, surprise, sadness) plus the neutral state.
+- [RAVDESS](https://zenodo.org/record/1188976#.YO6yI-gzaUk): a dataset of 1440 recordings of actors performing 8 different emotions in English: angry, calm, disgust, fearful, happy, neutral, sad, and surprised.
+- [BAVED](https://github.com/40uf411/Basic-Arabic-Vocal-Emotions-Dataset): a collection of audio recordings of Arabic words spoken with varying degrees of emotion. The dataset contains seven words: like, unlike, this, file, good, neutral, and bad, which are spoken at three emotional levels: low emotion (tired or feeling down), neutral emotion (the way the speaker speaks daily), and high emotion (positive or negative emotions such as happiness, joy, sadness, anger).
+The test set is a portion of CORAA SER v1.0 that was set aside for this purpose.
+The model achieves the following results on this test set:
+- Accuracy: 0.9090
+- Macro Precision: 0.8171
+- Macro Recall: 0.8397
+- Macro F1-Score: 0.8187
+## Dataset Details
+The following image shows the overall distribution of the datasets:
+![distribution](https://docs.google.com/spreadsheets/d/e/2PACX-1vTUvuMLRnoFv3MBkStOcMQE5GuiqqyrvpyEtIiwoQEg8uA6dWvfZM-faHORLFNmPYJUzDbO6TZ2a9Zb/pubchart?oid=446282973&format=image)
+The following image shows the number of instances by label:
+![numberInstances](https://docs.google.com/spreadsheets/d/e/2PACX-1vS7PUbW6J3Hnof1D2l492KW0sbF4BzWCeaiGQm53w-9EZck_Y14feE48HtcBvmjjZKsTJWP1RZpdh_v/pubchart?oid=1904097403&format=image)
+## Repository
+The repository that implements training and testing of the model is available [here](https://github.com/alefiury/SE-R-2022-SER-Track).
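
The README reports accuracy together with macro-averaged precision, recall, and F1-score on the CORAA SER test set. As a minimal sketch of how such scores are commonly derived from predictions, the function below computes them in plain Python. It is not taken from the linked repository: the function name `macro_metrics`, the label strings, and the toy predictions are all assumptions made for illustration, and macro F1 is taken here as the unweighted mean of the per-class F1 scores.

```python
from collections import Counter

# The three classes follow the CORAA SER description in the README;
# the exact label strings are an assumption made for this sketch.
LABELS = ["neutral", "non-neutral-female", "non-neutral-male"]


def macro_metrics(y_true, y_pred, labels=LABELS):
    """Return accuracy and macro-averaged precision, recall and F1."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("need at least one example")

    # Per-class true positives, false positives and false negatives.
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1
            fn[truth] += 1

    precisions, recalls, f1s = [], [], []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        # A class with no predictions (or no examples) scores 0.0.
        p = tp[label] / p_den if p_den else 0.0
        r = tp[label] / r_den if r_den else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "macro_precision": sum(precisions) / n,
        "macro_recall": sum(recalls) / n,
        "macro_f1": sum(f1s) / n,
    }


# Toy data (hypothetical, not from the CORAA SER test set).
y_true = ["neutral"] * 4 + ["non-neutral-female", "non-neutral-male"]
y_pred = ["neutral"] * 3 + ["non-neutral-female"] * 2 + ["non-neutral-male"]

for name, value in macro_metrics(y_true, y_pred).items():
    print(f"{name}: {value:.4f}")
```

Macro averaging weights every class equally regardless of how many examples it has, which makes it stricter than accuracy when one class dominates the test set. Under the per-class convention used in this sketch, macro F1 does not have to equal the harmonic mean of macro precision and macro recall.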