Whisper for Singaporean Aphasia
Aphasia is a language disorder that affects the ability to communicate, often resulting from brain injury or stroke. Patients with aphasia frequently experience difficulty with speech fluency, articulation, and comprehension, which poses challenges for both daily communication and effective engagement in therapeutic exercises.
This model is part of an innovative chatbot-based therapy tool designed to assist aphasic patients in practicing speech exercises remotely. By leveraging OpenAI’s Whisper small model and fine-tuning it for Singaporean English aphasic speech, this solution provides a tailored automatic speech recognition (ASR) system that can accurately transcribe atypical speech patterns associated with aphasia. This tool aims to give patients more flexibility in practicing speech tasks without needing frequent in-person visits, making it accessible and convenient for individuals with work, school, or mobility constraints.
The fine-tuned Whisper model enables the chatbot to understand and respond to aphasic speech with greater accuracy, helping patients practice specific speech exercises, such as describing objects or actions, in a supportive and flexible environment. This document outlines the model’s details, intended applications, and performance.
Model Details
This Whisper model is a fine-tuned variant of whisper-small.en. It has been tuned specifically to transcribe Singaporean English aphasic speech with improved accuracy, addressing the unique transcription challenges presented by atypical speech patterns. Aphasic speech, common among patients with language or speech disorders, often features softer, slower, or non-standard pronunciation, which can be difficult for traditional ASR models to transcribe accurately.
Model Description
- Developed by: Farhan Azmi
- Funded by: Singapore Institute of Technology
- Shared by: National University Health Systems (NUHS) Singapore
- Model type: Whisper-based Automatic Speech Recognition (ASR) model
- Language(s) (NLP): English (Singapore, normal and atypical speech)
- License: apache-2.0
- Finetuned from model: openai/whisper-small.en
Model Sources [optional]
- Repository: [More Information Needed]
- Paper [optional]: TODO
- Demo [optional]: TODO
Uses
- Primary Use Case: This model is a core component of a chatbot-based therapy tool for aphasia patients. The tool enables patients to practice speech tasks remotely, providing them with more flexibility and reducing their need to travel frequently for in-person appointments with a speech and language therapist. This is particularly beneficial for patients who face challenges such as work or school commitments or mobility issues (e.g., elderly patients).
- Extended Applications: The model can also be used in other healthcare and therapeutic settings that require automatic transcription of aphasic or atypical speech, as well as research on improving ASR for non-standard speech patterns.
Downstream Use [optional]
[More Information Needed]
Out-of-Scope Use
[More Information Needed]
Bias, Risks, and Limitations
[More Information Needed]
Recommendations
Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information is needed for further recommendations.
How to Get Started with the Model
Use the code below to get started with the model.
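The snippet below is a minimal sketch using the transformers pipeline API. It assumes the fine-tuned weights are available under the repository id f-azm17/whisper-small-singapore-aphasia, and path/to/audio.wav is a placeholder for a local recording.

```python
# Minimal sketch: load the fine-tuned model through the ASR pipeline.
# The repository id and audio path are placeholders/assumptions.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="f-azm17/whisper-small-singapore-aphasia",
)

# Transcribe a single recording; the pipeline resamples the audio internally
# before feature extraction, so common sample rates are accepted.
result = asr("path/to/audio.wav")
print(result["text"])
```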
Training Details
Training Data
The dataset used for fine-tuning this model (train, validation, and test) consists of 22 hours of audio data spanning 13,680 audio files and corresponding transcripts, collected from four different tasks involving descriptive prompts. These tasks required participants to describe images depicting either an object (noun) or an action (verb).
Of this dataset, 2,989 samples (7.2 hours) are recordings from real-world patients with aphasia of varying severity, providing authentic aphasic speech patterns. The remaining 11,170 samples (14.8 hours) are recordings from individuals without aphasia, representing typical speech from native Singaporean English speakers. This diverse dataset allows the model to generalize better across both aphasic and typical speech patterns in a localized (Singaporean) English context.
The dataset was divided into training, validation, and test sets by grouping audio files by patient, so that all audio samples and corresponding transcriptions from a single patient are assigned entirely to one of the three splits. This approach prevents data leakage and provides a more accurate assessment of whether the fine-tuned model generalizes effectively to unseen speakers. The dataset split is detailed in the table below:
| Dataset Split | Audio Samples | Patients Involved (sample of IDs) | Total Audio Hours |
|---|---|---|---|
| Train | 9,422 | al_e026, al_e028, ...[^1] | 15.2 hours |
| Validation | 2,047 | al_e048, al_e106, ...[^2] | 3.2 hours |
| Test | 2,211 | al_e092, al_e155, ...[^3] | 3.6 hours |
[^1]: Full list of patients involved in Train:
al_e026, al_e028, al_e078, al_e085, al_e099, al_e100, al_e101, al_e117, al_e118, al_e122, al_e132, al_e179, hl_e002, hl_e003, hl_e005, hl_e006, hl_e007, hl_e008, hl_e010, hl_e011, hl_e013, hl_e014, hl_e015, hl_e016, hl_e017, hl_e018, hl_e019, hl_e020, hl_e021, hl_e024, hl_e025, hl_e023, hl_e031, hl_e027, hl_e032, hl_e033, hl_e034, hl_e035, hl_e037, hl_e043, hl_e044, hl_e045, hl_e046, hl_e047, hl_e050, hl_e051, hl_e049, hl_e052, hl_e053, hl_e057, hl_e059, hl_e058, hl_e060, hl_e062, hl_e065, hl_e066, hl_e069, hl_e071, hl_e070, hl_e072.
[^2]: Full list of patients involved in Validation:
al_e048, al_e106, al_e133, al_e180, hl_e001, hl_e012, hl_e030, hl_e036, hl_e040, hl_e042, hl_e055, hl_e064, hl_e068.
[^3]: Full list of patients involved in Test:
al_e092, al_e155, al_e137, hl_e004, hl_e009, hl_e029, hl_e038, hl_e039, hl_e041, hl_e054, hl_e056, hl_e061, hl_e063, hl_e067.
Note: Sample IDs beginning with 'al' belong to patients with aphasia, whereas IDs beginning with 'hl' belong to healthy speakers.
Training Procedure
Preprocessing [optional]
Preprocessing Audio (Features)
All waveforms were extracted from their respective audio files and downsampled to 16 kHz. Audio features were then obtained using the WhisperFeatureExtractor for whisper-small.en from the transformers library. This feature extractor pads audio waveforms of 30 seconds or less with zeros and truncates any audio longer than 30 seconds. To ensure consistency in training, audio files exceeding 30 seconds in length were excluded from the dataset. Following padding and truncation, the WhisperFeatureExtractor generates a log-mel spectrogram for each waveform, providing a time-frequency-amplitude representation of the audio that is tailored to human auditory perception.
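A minimal sketch of this audio preprocessing is shown below. The use of librosa for loading and resampling is an assumption for illustration, and the file path is a placeholder.

```python
# Sketch of the audio preprocessing: load a waveform, downsample to 16 kHz,
# drop clips longer than 30 s, and compute Whisper input features.
import librosa
from transformers import WhisperFeatureExtractor

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small.en")

# Load and downsample the waveform to 16 kHz (librosa is assumed here).
waveform, sampling_rate = librosa.load("path/to/audio.wav", sr=16000)

# Mirror the dataset filtering step: keep only clips of 30 seconds or less.
if len(waveform) / sampling_rate <= 30.0:
    # Pad/truncate to 30 s and compute the log-mel spectrogram features.
    features = feature_extractor(
        waveform, sampling_rate=sampling_rate, return_tensors="pt"
    ).input_features
    print(features.shape)  # (1, 80, 3000) for whisper-small.en
```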
Preprocessing Reference Transcripts (Labels)
Each of the reference transcriptions from the dataset were tokenized and converted into unique identifiers using the WhisperTokenizer
for whisper-small.en
from the transformers
library. The purpose of this process is to first separate the sequence into tokens, and then assign each token with a unique identifier (IDs). This process was necessary as the model will process the tokens in a numerical format, not raw text.
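A corresponding sketch of the label preprocessing is shown below; the example transcript is illustrative only.

```python
# Sketch of the label preprocessing: tokenize a reference transcript and map
# each token to its numeric ID, then decode to verify the round trip.
from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small.en")

transcript = "the boy is eating an apple"  # placeholder reference text

label_ids = tokenizer(transcript).input_ids
print(label_ids)

# Decoding without special tokens should recover the original transcript.
print(tokenizer.decode(label_ids, skip_special_tokens=True))
```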
Training Hyperparameters
- Learning Rate: 1e-5
- Batch Size: 16
- Gradient Accumulation Steps: 1
- Number of Epochs: 7
- Warmup Steps: 200
- Weight Decay: 0.0
- Dropout: 0.0
- Evaluation Strategy: epochs
- Save Strategy: epochs
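For reference, the sketch below shows how these hyperparameters would map onto Seq2SeqTrainingArguments in the transformers library. The output directory and the best-model-selection settings are assumptions, not the exact configuration used.

```python
# Sketch only: mirrors the hyperparameters listed above for a standard
# Seq2SeqTrainer setup. output_dir and the best-model settings are assumed.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-singapore-aphasia",  # assumed
    learning_rate=1e-5,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
    num_train_epochs=7,
    warmup_steps=200,
    weight_decay=0.0,
    eval_strategy="epoch",  # "evaluation_strategy" on older transformers versions
    save_strategy="epoch",
    predict_with_generate=True,
    load_best_model_at_end=True,  # assumed, given the best checkpoint reported below
    metric_for_best_model="wer",  # assumed
    greater_is_better=False,
)
# Dropout (0.0) is a model-configuration setting, left at the Whisper default.
```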
Best Model Checkpoint and Epochs
- Best Epoch: 6
- Best Checkpoint: 3534 steps
- Validation WER at Best Checkpoint: 69.14%
Speeds, Sizes, Times [optional]
[More Information Needed]
Evaluation
Testing Data, Factors & Metrics
Testing Data
The data used to evaluate the model's performance after fine-tuning is shown below:
| Dataset Split | Audio Samples | Patients Involved (sample of IDs) | Total Audio Hours |
|---|---|---|---|
| Test | 2,211 | al_e092, al_e155, ...[^3] | 3.6 hours |
The test set consists of 2,211 audio samples, comprising a mix of aphasic and healthy speech, and totals 3.6 hours of audio. This data allows for a comprehensive assessment of the model's ability to generalize across different speakers and speech patterns.
Metrics
The primary metric used to evaluate model performance is Word Error Rate (WER). WER is the standard metric in automatic speech recognition (ASR) for measuring transcription accuracy, defined as the total number of errors (substitutions, deletions, and insertions) divided by the total number of words in the reference transcription. Lower WER values indicate higher transcription accuracy. (An illustrative computation is shown after the list below.)
WER was chosen because:
- It provides a quantitative measure of transcription accuracy, making it easy to compare performance across different models.
- It accounts for all types of transcription errors, which is essential when working with atypical speech that may introduce unique pronunciation or articulation challenges.
- WER is widely used, allowing for benchmarking against baseline models and industry standards.
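As an illustration, WER can be computed with the Hugging Face evaluate library (whether this exact tooling was used is an assumption); the prediction and reference strings below are placeholders.

```python
# Illustrative WER computation (requires `pip install evaluate jiwer`).
import evaluate

wer_metric = evaluate.load("wer")

predictions = ["the boy is eating apple"]    # placeholder model output
references = ["the boy is eating an apple"]  # placeholder ground truth

# WER = (substitutions + deletions + insertions) / reference word count.
wer = wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {100 * wer:.2f}%")
```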
Results
The results on the test set are summarised below:
| Model | WER (Test Set) | WER Reduction % (vs. whisper-small.en) | WER Reduction % (vs. whisper-large-v2) |
|---|---|---|---|
| whisper-small.en | 55.0% | - | - |
| whisper-large-v2 | 52.0% | 5.45 | - |
| whisper-small-singapore-aphasia | 48.0% | 12.73 | 7.69 |
Performance Testing
Because our fine-tuned model is based on whisper-small.en, its smaller size should also yield faster inference than whisper-large-v2. Using the test set, we ran experiments to determine the following performance figures (a rough timing sketch is shown after the table below):
- Total execution time in seconds (entire test set)
- Average inference time in seconds (per audio sample)
- Throughput, in inferences per second
| Model | Total Execution Time (s) | Average Inference Time (s) | Throughput (inferences/s) | Hardware |
|---|---|---|---|---|
| whisper-large-v2 | 1495.49 | 0.69 | 1.46 | Nvidia RTX 4090 24GB |
| whisper-small-singapore-aphasia | 263.76 | 0.12 | 8.26 | Nvidia RTX 4090 24GB |
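The figures above were obtained on the hardware listed in the table; the snippet below is a rough sketch of such a timing harness, not the exact benchmark script. File names and the device index are assumptions.

```python
# Rough timing sketch: total execution time, average per-sample latency, and
# throughput over a list of test-set audio files (paths are placeholders).
import time
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="f-azm17/whisper-small-singapore-aphasia",
    device=0,  # assumes a CUDA-capable GPU
)

test_files = ["test_001.wav", "test_002.wav"]  # placeholder test-set paths

start = time.perf_counter()
for path in test_files:
    asr(path)
total_seconds = time.perf_counter() - start

print(f"Total execution time: {total_seconds:.2f} s")
print(f"Average inference time: {total_seconds / len(test_files):.2f} s")
print(f"Throughput: {len(test_files) / total_seconds:.2f} inferences/s")
```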
Summary
The fine-tuned whisper-small-singapore-aphasia model demonstrates substantial gains in both transcription accuracy and processing efficiency.
In terms of accuracy, it achieved a WER of 48.0% on the test set, a 12.73% relative reduction compared to whisper-small.en and a 7.69% relative reduction compared to the larger whisper-large-v2 model. These results highlight the effectiveness of domain-specific fine-tuning for Singaporean aphasic speech, enabling the model to handle atypical speech patterns with greater accuracy than even larger, more generalized models.
In terms of performance, the fine-tuned model, being based on whisper-small.en, also offers a notable improvement in inference speed. On the test set, it completed the transcription process in 263.76 seconds with an average inference time of 0.12 seconds per audio sample, achieving a throughput of 8.26 inferences per second. In comparison, whisper-large-v2 required 1495.49 seconds for the same test set, with an average inference time of 0.69 seconds and a throughput of only 1.46 inferences per second.
Together, these results demonstrate that whisper-small-singapore-aphasia provides a balanced solution, combining the speed benefits of a smaller model with enhanced accuracy for domain-specific speech.
Model Examination [optional]
[More Information Needed]
Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
- Hardware Type: [More Information Needed]
- Hours used: [More Information Needed]
- Cloud Provider: [More Information Needed]
- Compute Region: [More Information Needed]
- Carbon Emitted: [More Information Needed]
Technical Specifications [optional]
Model Architecture and Objective
[More Information Needed]
Compute Infrastructure
[More Information Needed]
Hardware
[More Information Needed]
Software
[More Information Needed]
Citation [optional]
BibTeX:
[More Information Needed]
APA:
[More Information Needed]
Glossary [optional]
[More Information Needed]
More Information [optional]
[More Information Needed]
Model Card Authors [optional]
[More Information Needed]
Model Card Contact
[More Information Needed]