---
language:
- ar
metrics:
- wer
- cer
tags:
- Quran
- speech
- arabic
- asr
---
# Quran syllables recognition with tashkeel
This is a fine-tuned wav2vec2 model for recognizing Quran syllables from speech.\
The model was trained on a private dataset along with part of the Tarteel dataset, after cleaning the transcriptions and converting them into syllables.\
A 5-gram language model is provided with the model.

The model transcribes speech audio into syllables.\
For instance, when presented with audio whose transcription is "ู…ูู†ูŽ ุงู„ู’ุฌูู†ูŽู‘ุฉู ูˆูŽุงู„ู†ูŽู‘ุงุณู", the expected model output is
"ู…ู ู†ูŽู„ู’ ุฌูู†ู’ ู†ูŽ ุชู ูˆูŽู†ู’ ู†ูŽุงู’ุณู’".\
To try it out:

```
!pip install datasets transformers
!pip install https://github.com/kpu/kenlm/archive/master.zip pyctcdecode
```

```
from transformers import Wav2Vec2ForCTC, Wav2Vec2ProcessorWithLM

# The processor bundles the feature extractor, the tokenizer and the 5-gram LM decoder
processor = Wav2Vec2ProcessorWithLM.from_pretrained("IbrahimSalah/Wav2vecLarge_quran_syllables_recognition")
model = Wav2Vec2ForCTC.from_pretrained("IbrahimSalah/Wav2vecLarge_quran_syllables_recognition")
```
```
import pandas as pd
from datasets import Dataset

# Build a one-row dataset holding the path of the audio file to transcribe
path = '/content/908-33.wav'   # audio path
dftest = pd.DataFrame({'audio': [path]})
dataset = Dataset.from_pandas(dftest)
```
```
import torch
import torchaudio

def speech_file_to_array_fn(batch):
    # Load the audio and resample it to the 16 kHz rate the model expects
    speech_array, sampling_rate = torchaudio.load(batch["audio"])
    resampler = torchaudio.transforms.Resample(sampling_rate, 16_000)
    batch["audio"] = resampler(speech_array).squeeze().numpy()
    return batch
```
```
# Run inference on the dataset and decode with the 5-gram language model
test_dataset = dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["audio"], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values).logits

# Wav2Vec2ProcessorWithLM.batch_decode expects numpy logits and runs CTC beam search with the LM
transcription = processor.batch_decode(logits.numpy()).text
print("Prediction:", transcription[0])
```
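
The 5-gram language model mentioned above is applied inside `processor.batch_decode`. As a point of comparison, here is a minimal sketch of greedy (argmax) CTC decoding without the language model, reusing `logits` and `processor` from the snippet above; it is not part of the original recipe, only a way to see what the LM contributes:

```
import torch

# Greedy CTC decoding: take the most likely token at each frame, then let the
# tokenizer collapse repeats and remove blanks. No language model is involved.
pred_ids = torch.argmax(logits, dim=-1)
greedy_transcription = processor.tokenizer.batch_decode(pred_ids)
print("Greedy prediction:", greedy_transcription[0])
```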

You can try the model with live recording using this Google Colab notebook: [Live Recording Recognition](https://colab.research.google.com/drive/1WYFG03o93-CBFNHhAuAo3MNmzgo4nLEJ?usp=sharing)


Sample audios and outputs:

1- <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/645098004f731658826cfe57/tZLR9sn6VnsjYS5xT2qFB.wav"></audio>

2- <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/645098004f731658826cfe57/_6p54hi5JRt_PIqxY-P0I.wav"></audio>

Outputs:
```
1- ุกููˆู’ ู„ูŽุงู’ ุกู ูƒูŽ ู„ูŽู…ู’ ูŠูŽ ูƒููˆู’ ู†ููˆู’ ู…ูุนู’ ุฌู ุฒููŠู’ ู†ูŽ ููู„ู’ ุกูŽุฑู’ ุถู ูˆูŽ ู…ูŽุงู’ ูƒูŽุงู’ ู†ูŽ ู„ูŽ ฺพูู…ู’ ู…ูู†ู’ ุกูŽูˆู’ ู„ู ูŠูŽุงู’
2- ุกูุฐู’ ู‚ูŽุงู’ ู„ูŽ ูŠููˆู’ ุณู ูู ู„ู ุกูŽูŠู’ ุจููŠู’ ฺพู ูŠูŽุงู’ ุกูŽ ุจูŽ ุชู ุกูู†ู’ ู†ููŠู’ ุฑูŽ ุกูŽูŠู’ ุชู ุกูŽ ุญูŽ ุฏูŽ ุนูŽ ุดูŽ ุฑูŽ ูƒูŽูˆู’ ูƒูŽ ุจูŽู„ู’ ูˆูŽุดู’ ุดูŽู…ู’ ุณูŽ ูˆูŽู„ู’ ู‚ูŽ ู…ูŽ ุถูŽ ุฑูŽ ุกูŽูŠู’ ุชู ฺพูู…ู’ ู„ููŠู’ ุณูŽุงู’ ุฌู ุฏููŠู’ู†ู’
```
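
To score predictions against reference syllable transcriptions with the WER and CER metrics listed in the metadata, a minimal sketch using the `jiwer` library is shown below (`jiwer`, the reference string, and the variable names are assumptions for illustration, not part of the original instructions):

```
import jiwer  # pip install jiwer

# Hypothetical ground-truth syllables for the test clip; replace with your own reference
reference = "ู…ู ู†ูŽู„ู’ ุฌูู†ู’ ู†ูŽ ุชู ูˆูŽู†ู’ ู†ูŽุงู’ุณู’"
prediction = transcription[0]  # output of processor.batch_decode above

print("WER:", jiwer.wer(reference, prediction))  # word (syllable-token) error rate
print("CER:", jiwer.cer(reference, prediction))  # character error rate
```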