Design and development of an automatic speech recognition system
Speech recognition is one of the research areas that has advanced considerably in recent years thanks to progress in deep neural networks. These advances have produced a range of recognition methods, commonly divided into three main categories: traditional methods, methods based on deep neural networks, and hybrid methods. An automatic speech recognition (ASR) system generally comprises four main parts: audio feature extraction, an acoustic model, a language model, and a search based on the Bayes decision rule (formalized in the sketch at the end of this section). Combining deep neural networks with traditional methods brought a major shift in these systems, and the more recently introduced end-to-end models have changed them fundamentally: a single neural network maps a sequence of speech inputs directly to a sequence of output tokens. End-to-end models have significant advantages over traditional hybrid models, including a unified objective function that matches the speech recognition task and allows the entire network to be optimized jointly.

In this project, the main focus has been on preparing a clean database and training automatic speech recognition models. Three methods for creating the dataset were investigated. The first uses open-source audio and text resources together with a data cleaning pipeline; here the CommonVoice-V16 dataset was used and a pipeline was designed to clean it (an illustrative sketch follows below). The second evaluates text-to-speech models for dataset construction; among the existing models, the Pertts model performed well, but it has been trained for only one speaker. The third builds a dataset from open audio and text sources that are not paired in advance, where correct alignment between the texts and the recordings is critical.

Training of different models was also considered at this stage, and the Whisper model was trained on the cleaned CommonVoice-V16 dataset (see the inference sketch below). The word error rate (WER) on this dataset reached about 2.5%, which indicates that clean datasets combined with advanced models yield substantial improvements in speech recognition. Overall, the project shows how appropriate data sources and modern neural network methods lead to significant gains in ASR systems. In this phase of the project, the input to the speech recognition system is a signal without noise or distortion; the system will be extended in the next phase to handle noisy speech.
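For reference, the Bayes decision rule that drives the search component can be written as follows. This is the standard textbook formulation, with X the acoustic feature sequence and W a candidate word sequence; it is not specific to this project.

```latex
% Bayes decision rule for ASR (requires amsmath): pick the word
% sequence with the highest posterior probability given the features X.
\hat{W} = \operatorname*{arg\,max}_{W} P(W \mid X)
        = \operatorname*{arg\,max}_{W} \, p(X \mid W)\, P(W)
% p(X | W) is the acoustic model and P(W) the language model;
% the evidence p(X) is dropped because it does not depend on W.
```

End-to-end models replace the separately trained factors p(X | W) and P(W) with a single network trained under one objective, which is the unified objective function mentioned above.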
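The section does not describe the CommonVoice-V16 cleaning pipeline in detail. The following is a minimal sketch of the kind of filters such a pipeline typically applies (duration bounds, transcript normalization, a character whitelist); the thresholds, the regular expression, and the function names are illustrative assumptions rather than the project's actual code.

```python
import re
import unicodedata

# Illustrative thresholds -- assumptions, not the project's actual settings.
MIN_DURATION_S = 1.0
MAX_DURATION_S = 30.0
ALLOWED_CHARS = re.compile(r"^[\w\s'\-.,?!]+$")  # hypothetical whitelist

def normalize_text(text: str) -> str:
    """Unicode-normalize, lowercase, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip().lower()

def keep_clip(duration_s: float, text: str) -> bool:
    """Return True if a (clip, transcript) pair passes the basic filters."""
    if not (MIN_DURATION_S <= duration_s <= MAX_DURATION_S):
        return False  # drop clips that are too short or too long
    if not text:
        return False  # drop empty transcripts
    if not ALLOWED_CHARS.match(text):
        return False  # drop transcripts containing unexpected symbols
    return True

def clean_dataset(rows):
    """rows: iterable of (audio_path, duration_s, raw_text) tuples."""
    for audio_path, duration_s, raw_text in rows:
        text = normalize_text(raw_text)
        if keep_clip(duration_s, text):
            yield audio_path, text
```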
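To make the model training and evaluation step concrete, below is a minimal inference sketch using the Whisper implementation in Hugging Face transformers. The checkpoint name openai/whisper-small is an assumption, since the section does not state which model size was used; the fine-tuning loop itself is omitted.

```python
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# "openai/whisper-small" is an assumed checkpoint; the report does not
# state which Whisper size was fine-tuned on CommonVoice-V16.
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
model.eval()

def transcribe(waveform, sampling_rate=16000):
    """waveform: 1-D float array of audio samples, expected at 16 kHz."""
    inputs = processor(waveform, sampling_rate=sampling_rate,
                       return_tensors="pt")
    with torch.no_grad():
        predicted_ids = model.generate(inputs.input_features)
    return processor.batch_decode(predicted_ids,
                                  skip_special_tokens=True)[0]
```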
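Finally, the reported 2.5% figure is a word error rate. A self-contained implementation of the standard WER metric (word-level Levenshtein distance divided by reference length, the same quantity libraries such as jiwer compute) looks like this:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / len(ref)."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# Example: one substituted word in a four-word reference -> WER = 0.25
print(wer("the cat sat down", "the cat sat down"))  # 0.0
print(wer("the cat sat down", "the cat sat up"))    # 0.25
```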