vpelloin/MEDIA_NLU-flaubert_oral_asr
This is a Natural Language Understanding (NLU) model for the French MEDIA benchmark. It maps each input word to an output concept tag (76 tags available).
This model was trained using nherve/flaubert-oral-asr as its initial checkpoint. It obtained a Concept Error Rate (CER) of 12.43% (lower is better) on the MEDIA test set, in our Interspeech 2023 publication, using Kaldi ASR transcriptions.
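For readers unfamiliar with the metric, here is a minimal sketch of how an error rate over concept sequences can be computed. It assumes that CER denotes Concept Error Rate and is computed like a word error rate, as the edit distance between reference and predicted concept sequences divided by the number of reference concepts. The function names and the concept labels in the usage note are illustrative, and this is not the scoring script used for the reported results.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of concept labels."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference concept
                curr[j - 1] + 1,     # insertion of a spurious concept
                prev[j - 1] + cost,  # substitution (or match when cost is 0)
            ))
        prev = curr
    return prev[-1]


def concept_error_rate(references, hypotheses):
    """Error rate in percent over a corpus of concept sequences (assumed definition)."""
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total = sum(len(r) for r in references)
    return 100.0 * errors / total
```

With one reference sequence `["command-tache", "localisation-ville", "temps-date"]` and the prediction `["command-tache", "temps-date"]` (illustrative labels), one concept out of three is deleted, so the sketch returns one third, expressed in percent.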
Available MEDIA NLU models:
- vpelloin/MEDIA_NLU-flaubert_base_cased: MEDIA NLU model trained using flaubert/flaubert_base_cased. Obtains 13.20% CER on MEDIA test.
- vpelloin/MEDIA_NLU-flaubert_base_uncased: MEDIA NLU model trained using flaubert/flaubert_base_uncased. Obtains 12.40% CER on MEDIA test.
- vpelloin/MEDIA_NLU-flaubert_oral_ft: MEDIA NLU model trained using nherve/flaubert-oral-ft. Obtains 11.98% CER on MEDIA test.
- vpelloin/MEDIA_NLU-flaubert_oral_mixed: MEDIA NLU model trained using nherve/flaubert-oral-mixed. Obtains 12.47% CER on MEDIA test.
- vpelloin/MEDIA_NLU-flaubert_oral_asr: MEDIA NLU model trained using nherve/flaubert-oral-asr. Obtains 12.43% CER on MEDIA test.
- vpelloin/MEDIA_NLU-flaubert_oral_asr_nb: MEDIA NLU model trained using nherve/flaubert-oral-asr_nb. Obtains 12.24% CER on MEDIA test.
Usage with Pipeline
from transformers import pipeline

# Load the model as a token-classification pipeline
generator = pipeline(
    model="vpelloin/MEDIA_NLU-flaubert_oral_asr",
    task="token-classification"
)

sentences = [
    "je voudrais réserver une chambre à paris pour demain et lundi",
    "d'accord pour l'hôtel à quatre vingt dix euros la nuit",
    "deux nuits s'il vous plait",
    "dans un hôtel avec piscine à marseille"
]

# Print (token, concept tag) pairs for each sentence
for sentence in sentences:
    print([(tok['word'], tok['entity']) for tok in generator(sentence)])
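The pipeline returns one tag per token, so a downstream application will often want to merge consecutive tokens that carry the same concept. The helper below is a minimal sketch of that step. It assumes the tags either are plain concept names or use `B-`/`I-` prefixes, and that `"O"` marks tokens outside any concept; these are assumptions about the label scheme, not something stated in this card. `group_concepts` is a hypothetical name, and the sketch joins tokens with spaces without merging subword pieces.

```python
def group_concepts(tokens, outside_label="O"):
    """Merge consecutive tokens sharing a concept into (concept, text) spans.

    `tokens` is a list of dicts with 'word' and 'entity' keys, as returned
    by the token-classification pipeline. The B-/I- prefix handling and the
    "O" outside label are assumptions about the label scheme.
    """
    spans = []
    previous = None  # concept of the previous token, None after an outside token
    for tok in tokens:
        label = tok["entity"]
        prefix, sep, name = label.partition("-")
        if sep and prefix in ("B", "I"):
            concept, begins = name, prefix == "B"
        else:
            concept, begins = label, False
        if concept == outside_label:
            previous = None
            continue
        if concept == previous and not begins:
            spans[-1][1].append(tok["word"])  # extend the current span
        else:
            spans.append((concept, [tok["word"]]))  # start a new span
        previous = concept
    return [(concept, " ".join(words)) for concept, words in spans]
```

It can be applied directly to the pipeline output, for example `group_concepts(generator(sentence))`.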
Usage with AutoTokenizer/AutoModel
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification
)

tokenizer = AutoTokenizer.from_pretrained(
    "vpelloin/MEDIA_NLU-flaubert_oral_asr"
)
model = AutoModelForTokenClassification.from_pretrained(
    "vpelloin/MEDIA_NLU-flaubert_oral_asr"
)

sentences = [
    "je voudrais réserver une chambre à paris pour demain et lundi",
    "d'accord pour l'hôtel à quatre vingt dix euros la nuit",
    "deux nuits s'il vous plait",
    "dans un hôtel avec piscine à marseille"
]

# Tokenize all sentences as one padded batch
inputs = tokenizer(sentences, padding=True, return_tensors='pt')
outputs = model(**inputs).logits

# Most likely tag at every position (special and padding tokens included)
print([
    [model.config.id2label[i] for i in b]
    for b in outputs.argmax(dim=-1).tolist()
])
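Because the batch is padded, the printed lists contain tags for padding positions too. The following sketch shows one way to drop them using the attention mask returned by the tokenizer. `decode_labels` is a hypothetical helper operating on plain Python lists; it does not remove the tokenizer's special tokens, which would need separate handling.

```python
def decode_labels(pred_ids, attention_mask, id2label):
    """Map predicted label ids to label names, skipping padded positions.

    pred_ids and attention_mask are nested lists of shape (batch, seq_len);
    a mask value of 0 marks a padding position.
    """
    return [
        [id2label[i] for i, keep in zip(ids, mask) if keep]
        for ids, mask in zip(pred_ids, attention_mask)
    ]
```

With the variables from the snippet above, it could be called as `decode_labels(outputs.argmax(dim=-1).tolist(), inputs["attention_mask"].tolist(), model.config.id2label)`.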
Reference
If you use this model for your scientific publication, or if you find the resources in this repository useful, please cite the following paper:
@inproceedings{pelloin22_interspeech,
  author={Valentin Pelloin and Franck Dary and Nicolas Hervé and Benoit Favre and Nathalie Camelin and Antoine LAURENT and Laurent Besacier},
  title={ASR-Generated Text for Language Model Pre-training Applied to Speech Tasks},
  year=2022,
  booktitle={Proc. Interspeech 2022},
  pages={3453--3457},
  doi={10.21437/Interspeech.2022-352}
}