---
language: en
datasets:
- timit_asr
tags:
- audio
- automatic-speech-recognition
- speech
license: apache-2.0
---

# Wav2Vec2-Large-LV60-TIMIT

Fine-tuned [facebook/wav2vec2-large-lv60](https://huggingface.co/facebook/wav2vec2-large-lv60)
on the [timit_asr dataset](https://huggingface.co/datasets/timit_asr).
When using this model, make sure that your speech input is sampled at 16kHz.

## Usage

The model can be used directly (without a language model) as follows:

```python
import soundfile as sf
import torch
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model_name = "elgeish/wav2vec2-large-lv60-timit-asr"
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name)
model.eval()

dataset = load_dataset("timit_asr", split="test").shuffle().select(range(10))
char_translations = str.maketrans({"-": " ", ",": "", ".": "", "?": ""})

def prepare_example(example):
    example["speech"], _ = sf.read(example["file"])
    example["text"] = example["text"].translate(char_translations)
    example["text"] = " ".join(example["text"].split())  # clean up whitespaces
    example["text"] = example["text"].lower()
    return example

dataset = dataset.map(prepare_example, remove_columns=["file"])
inputs = processor(dataset["speech"], sampling_rate=16000, return_tensors="pt", padding="longest")

with torch.no_grad():
    predicted_ids = torch.argmax(model(inputs.input_values).logits, dim=-1)
predicted_ids[predicted_ids == -100] = processor.tokenizer.pad_token_id  # see fine-tuning script
predicted_transcripts = processor.tokenizer.batch_decode(predicted_ids)

for reference, predicted in zip(dataset["text"], predicted_transcripts):
    print("reference:", reference)
    print("predicted:", predicted)
    print("--")
```

Here's the output:

```
reference: the emblem depicts the acropolis all aglow
predicted: the amblum depicts the acropolis all a glo
--
reference: don't ask me to carry an oily rag like that
predicted: don't ask me to carry an oily rag like that
--
reference: they enjoy it when i audition
predicted: they enjoy it when i addition
--
reference: set aside to dry with lid on sugar bowl
predicted: set aside to dry with a litt on shoogerbowl
--
reference: a boring novel is a superb sleeping pill
predicted: a bor and novel is a suberb sleeping peel
--
reference: only the most accomplished artists obtain popularity
predicted: only the most accomplished artists obtain popularity
--
reference: he has never himself done anything for which to be hated which of us has
predicted: he has never himself done anything for which to be hated which of us has
--
reference: the fish began to leap frantically on the surface of the small lake
predicted: the fish began to leap frantically on the surface of the small lake
--
reference: or certain words or rituals that child and adult go through may do the trick
predicted: or certain words or rituals that child an adult go through may do the trick
--
reference: are your grades higher or lower than nancy's
predicted: are your grades higher or lower than nancies
--
```

## Fine-Tuning Script

You can find the script used to produce this model
[here](https://github.com/elgeish/transformers/blob/8ee49e09c91ffd5d23034ce32ed630d988c50ddf/examples/research_projects/wav2vec2/finetune_large_lv60_timit_asr.sh).

**Note:** This model can be fine-tuned further;
[trainer_state.json](https://huggingface.co/elgeish/wav2vec2-large-lv60-timit-asr/blob/main/trainer_state.json)
shows useful details, namely the last state (this checkpoint):

```json
{
  "epoch": 29.51,
  "eval_loss": 25.424150466918945,
  "eval_runtime": 182.9499,
  "eval_samples_per_second": 9.183,
  "eval_wer": 0.1351704233095107,
  "step": 8500
}
```
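
The `eval_wer` value above is a word error rate. As a rough illustration of what that number measures, here is a minimal, dependency-free sketch that scores reference/predicted pairs such as the ones printed in the Usage section. It assumes the common corpus-level definition (total word-level edits divided by total reference words); the fine-tuning script may compute the metric differently, and the helper names `edit_distance` and `word_error_rate` are hypothetical, not part of any library.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    # previous[j] holds the distance between the reference prefix seen so far
    # and the first j words of the hypothesis.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def word_error_rate(references, predictions):
    """Corpus-level WER (assumed definition): total edits / total reference words."""
    total_edits = 0
    total_words = 0
    for reference, predicted in zip(references, predictions):
        ref_words = reference.split()
        total_edits += edit_distance(ref_words, predicted.split())
        total_words += len(ref_words)
    return total_edits / total_words


# Three pairs copied from the sample output in the Usage section.
references = [
    "they enjoy it when i audition",
    "are your grades higher or lower than nancy's",
    "only the most accomplished artists obtain popularity",
]
predictions = [
    "they enjoy it when i addition",
    "are your grades higher or lower than nancies",
    "only the most accomplished artists obtain popularity",
]

wer = word_error_rate(references, predictions)
print(f"WER on these three examples: {wer:.4f}")
```

In the Usage snippet, the same function could be called as `word_error_rate(dataset["text"], predicted_transcripts)`. A score over ten shuffled examples will vary from run to run and is not expected to match the `eval_wer` reported for the full evaluation set.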