	---
	language: fr
	license: mit
	tags:
	- bert
	- language-model
	- flaubert
	- french
	- flaubert-base
	- uncased
	- asr
	- speech
	- oral
	- natural language understanding
	- NLU
	- spoken language understanding
	- SLU
	- understanding
	---

	# FlauBERT-Oral models: Using ASR-Generated Text for Spoken Language Modeling

	The FlauBERT-Oral models are French BERT models trained on a very large corpus of automatically transcribed speech, drawn from 350,000 hours of diverse French TV shows. They were trained with the [FlauBERT software](https://github.com/getalp/Flaubert) using the same parameters as the [flaubert-base-uncased](https://huggingface.co/flaubert/flaubert_base_uncased) model (12 layers, 12 attention heads, 768 hidden dimensions, 137M parameters, uncased).
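As a rough sanity check on the 137M figure, the parameter count can be approximated from the architecture alone. This is only a sketch: the vocabulary size (67,542), the number of positions (512) and the feed-forward width (3,072) are assumptions taken to match flaubert-base-uncased and are not stated in this README, and the output head is ignored.

```python
# Assumed values (not stated in this README), chosen to match
# flaubert-base-uncased.
VOCAB_SIZE = 67_542
MAX_POSITIONS = 512
FFN_DIM = 3_072

# Values stated in the description above.
N_LAYERS = 12
EMB_DIM = 768


def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias."""
    return n_in * n_out + n_out


def layer_norm(dim):
    """Parameters of a layer norm: one scale and one shift per dimension."""
    return 2 * dim


def count_parameters(vocab_size=VOCAB_SIZE, max_positions=MAX_POSITIONS,
                     emb_dim=EMB_DIM, ffn_dim=FFN_DIM, n_layers=N_LAYERS):
    # Token embeddings, position embeddings and the embedding layer norm.
    embeddings = (vocab_size * emb_dim
                  + max_positions * emb_dim
                  + layer_norm(emb_dim))
    # Query, key, value and output projections.
    attention = 4 * linear(emb_dim, emb_dim)
    # Two dense layers of the feed-forward block.
    feed_forward = linear(emb_dim, ffn_dim) + linear(ffn_dim, emb_dim)
    per_layer = (attention + layer_norm(emb_dim)
                 + feed_forward + layer_norm(emb_dim))
    return embeddings + n_layers * per_layer


total = count_parameters()
print(f"{total / 1e6:.1f}M parameters")
```

Under these assumptions the estimate lands close to the quoted 137M, with the token embedding matrix accounting for more than a third of the total.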

	## Available FlauBERT-Oral models

	- `flaubert-oral-asr`: trained from scratch on ASR data, keeping the BPE tokenizer and vocabulary of flaubert-base-uncased
	- `flaubert-oral-asr_nb`: trained from scratch on ASR data, with a BPE tokenizer also trained on the same corpus
	- `flaubert-oral-mixed`: trained from scratch on a mixed corpus of ASR and text data, with a BPE tokenizer also trained on the same corpus
	- `flaubert-oral-ft`: flaubert-base-uncased fine-tuned for a few epochs on ASR data

	## Usage for sequence classification
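	```python
	from transformers import FlaubertForSequenceClassification, FlaubertTokenizer

	# Load the tokenizer and a classification model with 14 output labels
	flaubert_tokenizer = FlaubertTokenizer.from_pretrained("nherve/flaubert-oral-asr")
	flaubert_classif = FlaubertForSequenceClassification.from_pretrained("nherve/flaubert-oral-asr", num_labels=14)
	# Summarize each sequence by averaging its token representations
	flaubert_classif.sequence_summary.summary_type = 'mean'
	# Then, train your model
	```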
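To make the `summary_type = 'mean'` setting more concrete, here is a minimal sketch of the two pooling strategies in plain Python. It is not the `transformers` implementation: lists stand in for tensors, and `summarize` is a hypothetical helper written for illustration only. With `'first'` the classifier sees only the first token's vector, while with `'mean'` it sees the average over all positions.

```python
def summarize(hidden_states, summary_type="first"):
    """Reduce one sequence of per-token vectors to a single vector.

    hidden_states is a list of token vectors (lists of floats), all of
    the same length. This is an illustrative stand-in, not library code.
    """
    if summary_type == "first":
        # Keep only the representation of the first token.
        return list(hidden_states[0])
    if summary_type == "mean":
        # Average each dimension over all token positions.
        n_tokens = len(hidden_states)
        return [sum(column) / n_tokens for column in zip(*hidden_states)]
    raise ValueError(f"unsupported summary_type: {summary_type!r}")


# Three tokens, two hidden dimensions each.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(summarize(tokens, "first"))  # [1.0, 2.0]
print(summarize(tokens, "mean"))   # [3.0, 4.0]
```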

	## References
	If you use the FlauBERT-Oral models in a scientific publication, or if you find the resources in this repository useful, please cite the following paper:
	```
	@InProceedings{herve2022flaubertoral,
	author = {Herv\'{e}, Nicolas and Pelloin, Valentin and Favre, Benoit and Dary, Franck and Laurent, Antoine and Meignier, Sylvain and Besacier, Laurent},
	title = {Using ASR-Generated Text for Spoken Language Modeling},
	booktitle = {Proceedings of "Challenges \& Perspectives in Creating Large Language Models" ACL 2022 Workshop},
	month = {May},
	year = {2022}
	}
	```