
Model Card for mamatjan/xls-r-uyghur-cv18

This model is a fine-tuned version of lucio/xls-r-uyghur-cv7, which is itself based on facebook/wav2vec2-xls-r-300m. The MOZILLA-FOUNDATION/COMMON_VOICE_18_0 - UG dataset was used for fine-tuning.

It achieves the following results:

Loss: 1.0882

Model Details

For details of the base model, see facebook/wav2vec2-xls-r-300m.

Model Description

The model vocabulary consists of the alphabetic characters of the Perso-Arabic script for the Uyghur language, with punctuation removed.
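
A quick way to inspect this vocabulary is to load the processor and print its tokenizer's character set (a minimal sketch; the exact characters are whatever the checkpoint ships with):

from transformers import Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("lucio/xls-r-uyghur-cv7")

# The tokenizer vocabulary maps each character to a CTC label ID
vocab = processor.tokenizer.get_vocab()
print(len(vocab))                    # vocabulary size
print(sorted(vocab, key=vocab.get))  # tokens in label-ID order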

Intended uses & limitations

This model is expected to be of some utility for low-fidelity use cases such as:

Draft video captions

Indexing of recorded broadcasts

The model is not reliable enough to substitute for live captions for accessibility purposes, and it should not be used in a manner that would infringe the privacy of any of the contributors to the Common Voice dataset or of any other speakers.

Training and evaluation data

The combination of the official Common Voice train and dev splits was used as training data.

The official test split was used for final evaluation.
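
With the datasets library, that split arrangement might be set up as follows (a sketch, assuming access to the gated mozilla-foundation/common_voice_18_0 dataset and its "ug" configuration; on the Hub the dev split is named "validation"):

from datasets import load_dataset, concatenate_datasets

# Combine the official train and dev (validation) splits for training
train = load_dataset("mozilla-foundation/common_voice_18_0", "ug", split="train")
dev = load_dataset("mozilla-foundation/common_voice_18_0", "ug", split="validation")
train_data = concatenate_datasets([train, dev])

# Hold out the official test split for final evaluation
test_data = load_dataset("mozilla-foundation/common_voice_18_0", "ug", split="test")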

Training procedure

The convolutional featurization layers of the XLS-R model are frozen (via model.freeze_feature_encoder() in the script below), while the rest of the network, including the final CTC layer, is tuned on the Uyghur CV18 example sentences.
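
As a quick sanity check, the effect of the freeze can be seen by counting trainable parameters (a minimal sketch):

from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained("lucio/xls-r-uyghur-cv7")
model.freeze_feature_encoder()

# After the freeze, only the transformer layers and the CTC head remain trainable
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"{trainable:,} trainable of {total:,} total parameters")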

Training hyperparameters

The following hyperparameters were used during training:

group_by_length=True,
per_device_train_batch_size=8,
eval_strategy="no",
num_train_epochs=3,
fp16=True,
save_steps=500,
eval_steps=500,
logging_steps=500,
learning_rate=1e-4,
warmup_steps=500,
save_total_limit=2

How to Train

You may create a Python file named "fine_tune.py".

"fine_tune.py" shoud including the following contents:

import torchaudio
import torch
from datasets import load_dataset, Audio
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from transformers import TrainingArguments, Trainer
from dataclasses import dataclass
from typing import Dict, List, Union
import librosa

# Load the dataset (gated; requires accepting the dataset terms and authenticating with a Hugging Face token)
dataset = load_dataset("mozilla-foundation/common_voice_18_0", "ug", split="train")
dataset = dataset.cast_column("path", Audio())

# Load the processor
processor = Wav2Vec2Processor.from_pretrained("lucio/xls-r-uyghur-cv7")

def preprocess_function(batch):
    # The "path" column was cast to Audio, so this is a dict with "array" and "sampling_rate"
    audio = batch["path"]
    
    # Resample to 16 kHz if the clip was recorded at a different rate
    if audio["sampling_rate"] != 16000:
        resampler = torchaudio.transforms.Resample(audio["sampling_rate"], 16000)
        waveform = torch.tensor(audio["array"], dtype=torch.float32)
        audio["array"] = resampler(waveform).numpy()

    # Pad or truncate every clip to a fixed length (200,000 samples = 12.5 s at 16 kHz)
    audio_array = librosa.util.fix_length(audio["array"], size=200000)

    # Convert the audio array to a tensor
    audio_tensor = torch.from_numpy(audio_array).float()

    inputs = processor(
        audio_tensor,
        sampling_rate=16000,
        return_tensors="pt",
        padding="longest"
    )

    # Tokenize the transcript into CTC label IDs
    labels = processor(text=batch["sentence"]).input_ids

    batch["input_values"] = inputs.input_values[0]  # 移除批次维度
    batch["labels"] = labels
    return batch


# Apply preprocessing
dataset = dataset.map(preprocess_function, remove_columns=["path", "sentence"])
model = Wav2Vec2ForCTC.from_pretrained("lucio/xls-r-uyghur-cv7", ctc_loss_reduction="mean", pad_token_id=processor.tokenizer.pad_token_id)

# Freeze the feature encoder parameters
model.freeze_feature_encoder()

training_args = TrainingArguments(

    output_dir="./wav2vec2_finetune",
    group_by_length=True,
    per_device_train_batch_size=8,
    eval_strategy="no",
    num_train_epochs=3,
    fp16=True,
    save_steps=500,
    eval_steps=500,
    logging_steps=500,
    learning_rate=1e-4,
    warmup_steps=500,
    save_total_limit=2,
)

@dataclass
class DataCollatorCTCWithPadding:
    processor: Wav2Vec2Processor
    padding: Union[bool, str] = True
    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # Extract all input_values and convert them to tensors
        input_features = [torch.tensor(feature["input_values"]) for feature in features]

        # Find the shortest sequence length in the batch
        min_length = min(map(len, input_features))

        # Truncate every input_values sequence to the shortest length
        input_features = [feature[:min_length] for feature in input_features]

        # Stack the (now equal-length) sequences into a single batch tensor
        input_features = torch.nn.utils.rnn.pad_sequence(input_features, batch_first=True)

        # Convert the label sequences to tensors
        label_features = [torch.tensor(feature["labels"]) for feature in features]

        # Pad labels with -100 so the padding positions are ignored by the CTC loss
        labels_batch = torch.nn.utils.rnn.pad_sequence(label_features, batch_first=True, padding_value=-100)

        batch = {
            "input_values": input_features,
            "labels": labels_batch,
        }

        return batch

# Use the custom data collator
data_collator = DataCollatorCTCWithPadding(processor=processor, padding=True)

# Set up the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=processor.feature_extractor,
    data_collator=data_collator
)

trainer.train()

model.save_pretrained("fine_tuned_wav2vec2_UGASR_model")  # directory name for the fine-tuned model

processor.save_pretrained("fine_tuned_wav2vec2_UGASR_model")  # Fine-tuning ends here; the saved "fine_tuned_wav2vec2_UGASR_model" can now be evaluated further.

The above is the full content of fine_tune.py.
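
The headline metric above is a loss value; for ASR it is also common to report word error rate (WER). A minimal evaluation sketch on the official test split (assuming the evaluate and jiwer packages are installed; the greedy decoding mirrors asr.py below):

import torch
import evaluate
from datasets import load_dataset, Audio
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained("fine_tuned_wav2vec2_UGASR_model")
processor = Wav2Vec2Processor.from_pretrained("fine_tuned_wav2vec2_UGASR_model")
wer = evaluate.load("wer")

test = load_dataset("mozilla-foundation/common_voice_18_0", "ug", split="test")
test = test.cast_column("audio", Audio(sampling_rate=16000))

predictions, references = [], []
for sample in test:
    inputs = processor(sample["audio"]["array"], sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    pred_ids = torch.argmax(logits, dim=-1)
    predictions.append(processor.batch_decode(pred_ids)[0])
    references.append(sample["sentence"])

print("WER:", wer.compute(predictions=predictions, references=references))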

  • Developed by: Mamajtan Abudkader 2024.9.10
  • Model type: ASR
  • Language(s) (NLP): Uyghur
  • License: Apache 2.0
  • Finetuned from model: lucio/xls-r-uyghur-cv7

Uses

This model performs automatic speech recognition of the Uyghur language written in the Perso-Arabic script.

How to Get Started with the Model

Use the code below to get started with the model. You may create a Python file named "asr.py".

"asr.py" should include the following contents:

from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import librosa
import torch
import time

stt = time.time()

# Model path or Hugging Face Hub ID
model_path = "mamatjan/xls-r-uyghur-cv18"

# Load the model and processor
model = Wav2Vec2ForCTC.from_pretrained(model_path)
processor = Wav2Vec2Processor.from_pretrained(model_path)

# Load the audio file and resample it to 16 kHz
audio_input, sampling_rate = librosa.load("example.mp3", sr=None)  # "example.mp3" is the audio file to transcribe; keep it in the same directory as asr.py or give its full path.
if sampling_rate != 16000:
    audio_input = librosa.resample(audio_input, orig_sr=sampling_rate, target_sr=16000)
    sampling_rate = 16000

# Process the audio with the processor
inputs = processor(audio_input, return_tensors="pt", sampling_rate=sampling_rate, padding=True)

# Run inference with the model
with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy CTC decoding of the predictions
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
waqit = time.time() - stt
print("======سەرىپ قىلغان ۋاقىت===============")  # print the elapsed-time banner ("time spent" in Uyghur)
print(f"ۋاقىت: {waqit:.2f} سىكۇنت")  # prints "time: *.** seconds" in Uyghur
print(transcription[0])  # print the transcribed Uyghur text; this is the end of asr.py

The above is the full content of asr.py.
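
Alternatively, the transformers pipeline API wraps the same steps (loading, resampling, greedy CTC decoding) in a single call. A sketch (decoding an mp3 this way requires ffmpeg on the system):

from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="mamatjan/xls-r-uyghur-cv18")
result = asr("example.mp3")
print(result["text"])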

Hardware

An NVIDIA GeForce RTX 3060 Ti was used for training on a Windows 10 system, for about 14 hours.
