Whisper for Singaporean Aphasia
Aphasia is a language disorder that affects the ability to communicate, often resulting from brain injury or stroke. Patients with aphasia frequently experience difficulty with speech fluency, articulation, and comprehension, which poses challenges for both daily communication and effective engagement in therapeutic exercises.
This model is part of an innovative chatbot-based therapy tool designed to assist aphasic patients in practicing speech exercises remotely. By leveraging OpenAI’s Whisper small model and fine-tuning it for Singaporean English aphasic speech, this solution provides a tailored automatic speech recognition (ASR) system that can accurately transcribe atypical speech patterns associated with aphasia. This tool aims to give patients more flexibility in practicing speech tasks without needing frequent in-person visits, making it accessible and convenient for individuals with work, school, or mobility constraints.
The fine-tuned Whisper model enables the chatbot to understand and respond to aphasic speech with greater accuracy, helping patients practice specific speech exercises, such as describing objects or actions, in a supportive and flexible environment. This document outlines the model’s details, intended applications, and performance.
Model Details
This Whisper model is a fine-tuned variant of whisper-small.en. It has been tuned specifically to transcribe Singaporean English aphasic speech with improved accuracy, addressing the unique transcription challenges presented by atypical speech patterns. Aphasic speech, common among patients with language or speech disorders, often features softer, slower, or non-standard pronunciation, which can be difficult for traditional ASR models to transcribe accurately.
Model Description
- Developed by: Farhan Azmi
- Funded by: Singapore Institute of Technology
- Shared by: National University Health Systems (NUHS) Singapore
- Model type: Whisper-based Automatic Speech Recognition (ASR) model
- Language(s) (NLP): English (Singapore, normal and atypical speech)
- License: apache-2.0
- Finetuned from model: openai/whisper-small.en
Model Sources [optional]
- Repository: [More Information Needed]
- Paper [optional]: TODO
- Demo [optional]: TODO
Uses
- Primary Use Case: This model is a core component of a chatbot-based therapy tool for aphasia patients. The tool enables patients to practice speech tasks remotely, providing them with more flexibility and reducing their need to travel frequently for in-person appointments with a speech and language therapist. This is particularly beneficial for patients who face challenges such as work or school commitments or mobility issues (e.g., elderly patients).
- Extended Applications: The model can also be used in other healthcare and therapeutic settings that require automatic transcription of aphasic or atypical speech, as well as research on improving ASR for non-standard speech patterns.
Downstream Use [optional]
[More Information Needed]
Out-of-Scope Use
[More Information Needed]
Bias, Risks, and Limitations
[More Information Needed]
Recommendations
Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information is needed for further recommendations.
How to Get Started with the Model
Use the code below to get started with the model.
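The snippet below is a minimal sketch using the transformers pipeline API. It assumes the fine-tuned weights are available under the repository id f-azm17/whisper-small-singapore-aphasia, and path/to/audio.wav is a placeholder for a local recording.

```python
# Minimal sketch: load the fine-tuned model through the ASR pipeline.
# The repository id and audio path are placeholders/assumptions.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="f-azm17/whisper-small-singapore-aphasia",
)

# Transcribe a single recording; the pipeline resamples the audio internally
# before feature extraction, so common sample rates are accepted.
result = asr("path/to/audio.wav")
print(result["text"])
```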
Training Details
Training Data
The dataset used for fine-tuning this model (train, validation, and test) consists of 22 hours of audio data spanning 13,680 audio files and corresponding transcripts, collected from four different tasks involving descriptive prompts. These tasks required participants to describe images depicting either an object (noun) or an action (verb).
Of this dataset, 2,989 samples (7.2 hours) are recordings from real-world patients with aphasia of varying severity, providing authentic aphasic speech patterns. The remaining 11,170 samples (14.8 hours) are recordings from individuals without aphasia, representing typical speech from native Singaporean English speakers. This diverse dataset allows the model to generalize better across both aphasic and typical speech patterns in a localized (Singaporean) English context.
The dataset was divided into training, validation, and test sets by grouping audio files by patient, so that all audio samples and corresponding transcriptions from a single patient are assigned entirely to one of the three splits. This approach prevents data leakage and provides a more accurate assessment of whether the fine-tuned model generalizes effectively to unseen speakers. The dataset split is detailed in the table below:
| Dataset Split | Audio Samples | Patients Involved (sample of IDs) | Total Audio Hours |
|---|---|---|---|
| Train | 9,422 | al_e026, al_e028, ...[^1] | 15.2 hours |
| Validation | 2,047 | al_e048, al_e106, ...[^2] | 3.2 hours |
| Test | 2,211 | al_e092, al_e155, ...[^3] | 3.6 hours |
[^1]: Full list of patients involved in Train:
al_e026, al_e028, al_e078, al_e085, al_e099, al_e100, al_e101, al_e117, al_e118, al_e122, al_e132, al_e179, hl_e002, hl_e003, hl_e005, hl_e006, hl_e007, hl_e008, hl_e010, hl_e011, hl_e013, hl_e014, hl_e015, hl_e016, hl_e017, hl_e018, hl_e019, hl_e020, hl_e021, hl_e024, hl_e025, hl_e023, hl_e031, hl_e027, hl_e032, hl_e033, hl_e034, hl_e035, hl_e037, hl_e043, hl_e044, hl_e045, hl_e046, hl_e047, hl_e050, hl_e051, hl_e049, hl_e052, hl_e053, hl_e057, hl_e059, hl_e058, hl_e060, hl_e062, hl_e065, hl_e066, hl_e069, hl_e071, hl_e070, hl_e072.
[^2]: Full list of patients involved in Validation:
al_e048, al_e106, al_e133, al_e180, hl_e001, hl_e012, hl_e030, hl_e036, hl_e040, hl_e042, hl_e055, hl_e064, hl_e068.
[^3]: Full list of patients involved in Test:
al_e092, al_e155, al_e137, hl_e004, hl_e009, hl_e029, hl_e038, hl_e039, hl_e041, hl_e054, hl_e056, hl_e061, hl_e063, hl_e067.
Note: Sample IDs beginning with 'al' belong to patients with aphasia, whereas IDs beginning with 'hl' belong to healthy speakers.
Training Procedure
Preprocessing [optional]
Preprocessing Audio (Features)
All waveforms were extracted from their respective audio files and downsampled to 16 kHz. Audio features were then obtained using the WhisperFeatureExtractor for whisper-small.en from the transformers library. This feature extractor pads audio waveforms of 30 seconds or less with zeros and truncates any audio longer than 30 seconds. To ensure consistency in training, audio files exceeding 30 seconds in length were excluded from the dataset. Following padding and truncation, the WhisperFeatureExtractor generates a log-mel spectrogram for each waveform, providing a time-frequency-amplitude representation of the audio that is tailored to human auditory perception.
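A minimal sketch of this audio preprocessing is shown below. The use of librosa for loading and resampling is an assumption for illustration, and the file path is a placeholder.

```python
# Sketch of the audio preprocessing: load a waveform, downsample to 16 kHz,
# drop clips longer than 30 s, and compute Whisper input features.
import librosa
from transformers import WhisperFeatureExtractor

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small.en")

# Load and downsample the waveform to 16 kHz (librosa is assumed here).
waveform, sampling_rate = librosa.load("path/to/audio.wav", sr=16000)

# Mirror the dataset filtering step: keep only clips of 30 seconds or less.
if len(waveform) / sampling_rate <= 30.0:
    # Pad/truncate to 30 s and compute the log-mel spectrogram features.
    features = feature_extractor(
        waveform, sampling_rate=sampling_rate, return_tensors="pt"
    ).input_features
    print(features.shape)  # (1, 80, 3000) for whisper-small.en
```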
Preprocessing Reference Transcripts (Labels)
Each of the reference transcriptions from the dataset were tokenized and converted into unique identifiers using the WhisperTokenizer
for whisper-small.en
from the transformers
library. The purpose of this process is to first separate the sequence into tokens, and then assign each token with a unique identifier (IDs). This process was necessary as the model will process the tokens in a numerical format, not raw text.
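A corresponding sketch of the label preprocessing is shown below; the example transcript is illustrative only.

```python
# Sketch of the label preprocessing: tokenize a reference transcript and map
# each token to its numeric ID, then decode to verify the round trip.
from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small.en")

transcript = "the boy is eating an apple"  # placeholder reference text

label_ids = tokenizer(transcript).input_ids
print(label_ids)

# Decoding without special tokens should recover the original transcript.
print(tokenizer.decode(label_ids, skip_special_tokens=True))
```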
Training Hyperparameters
- Learning Rate: 1e-5
- Batch Size: 16
- Gradient Accumulation Steps: 1
- Number of Epochs: 7
- Warmup Steps: 200
- Weight Decay: 0.0
- Dropout: 0.0
- Evaluation Strategy: epochs
- Save Strategy: epochs
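For reference, the sketch below shows how these hyperparameters would map onto Seq2SeqTrainingArguments in the transformers library. The output directory and the best-model-selection settings are assumptions, not the exact configuration used.

```python
# Sketch only: mirrors the hyperparameters listed above for a standard
# Seq2SeqTrainer setup. output_dir and the best-model settings are assumed.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-singapore-aphasia",  # assumed
    learning_rate=1e-5,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
    num_train_epochs=7,
    warmup_steps=200,
    weight_decay=0.0,
    eval_strategy="epoch",  # "evaluation_strategy" on older transformers versions
    save_strategy="epoch",
    predict_with_generate=True,
    load_best_model_at_end=True,  # assumed, given the best checkpoint reported below
    metric_for_best_model="wer",  # assumed
    greater_is_better=False,
)
# Dropout (0.0) is a model-configuration setting, left at the Whisper default.
```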
Best Model Checkpoint and Epochs
- Best Epoch: 6
- Best Checkpoint: 3534 steps
- Validation WER at Best Checkpoint: 69.14%
Speeds, Sizes, Times [optional]
[More Information Needed]
Evaluation
Testing Data, Factors & Metrics
Testing Data
The data used to evaluate the model's performance after fine-tuning is shown below:
| Dataset Split | Audio Samples | Patients Involved (sample of IDs) | Total Audio Hours |
|---|---|---|---|
| Test | 2,211 | al_e092, al_e155, ...[^3] | 3.6 hours |
The test set consists of 2,211 audio samples, comprising a mix of aphasic and healthy speech, and totals 3.6 hours of audio. This data allows for a comprehensive assessment of the model's ability to generalize across different speakers and speech patterns.
Metrics
The primary metric used to evaluate model performance is Word Error Rate (WER). WER is the standard metric in automatic speech recognition (ASR) for measuring transcription accuracy, defined as the total number of errors (substitutions, deletions, and insertions) divided by the total number of words in the reference transcription. Lower WER values indicate higher transcription accuracy. (An illustrative computation is shown after the list below.)
WER was chosen because:
- It provides a quantitative measure of transcription accuracy, making it easy to compare performance across different models.
- It accounts for all types of transcription errors, which is essential when working with atypical speech that may introduce unique pronunciation or articulation challenges.
- WER is widely used, allowing for benchmarking against baseline models and industry standards.
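As an illustration, WER can be computed with the Hugging Face evaluate library (whether this exact tooling was used is an assumption); the prediction and reference strings below are placeholders.

```python
# Illustrative WER computation (requires `pip install evaluate jiwer`).
import evaluate

wer_metric = evaluate.load("wer")

predictions = ["the boy is eating apple"]    # placeholder model output
references = ["the boy is eating an apple"]  # placeholder ground truth

# WER = (substitutions + deletions + insertions) / reference word count.
wer = wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {100 * wer:.2f}%")
```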
Results
The results on the test set are summarised below:
| Model | WER (Test Set) | WER Reduction % (vs. whisper-small.en) | WER Reduction % (vs. whisper-large-v2) |
|---|---|---|---|
| whisper-small.en | 55.0% | - | - |
| whisper-large-v2 | 52.0% | 5.45 | - |
| whisper-small-singapore-aphasia | 48.0% | 12.73 | 7.69 |
Performance Testing
Because our fine-tuned model is based on whisper-small.en, its smaller size should also yield faster inference than whisper-large-v2. Using the test set, we ran experiments to determine the following performance figures (a rough timing sketch is shown after the table below):
- Total execution time in seconds (entire test set)
- Average inference time in seconds (per audio sample)
- Throughput, in inferences per second
| Model | Total Execution Time (s) | Average Inference Time (s) | Throughput (inferences/s) | Hardware |
|---|---|---|---|---|
| whisper-large-v2 | 1495.49 | 0.69 | 1.46 | Nvidia RTX 4090 24GB |
| whisper-small-singapore-aphasia | 263.76 | 0.12 | 8.26 | Nvidia RTX 4090 24GB |
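The figures above were obtained on the hardware listed in the table; the snippet below is a rough sketch of such a timing harness, not the exact benchmark script. File names and the device index are assumptions.

```python
# Rough timing sketch: total execution time, average per-sample latency, and
# throughput over a list of test-set audio files (paths are placeholders).
import time
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="f-azm17/whisper-small-singapore-aphasia",
    device=0,  # assumes a CUDA-capable GPU
)

test_files = ["test_001.wav", "test_002.wav"]  # placeholder test-set paths

start = time.perf_counter()
for path in test_files:
    asr(path)
total_seconds = time.perf_counter() - start

print(f"Total execution time: {total_seconds:.2f} s")
print(f"Average inference time: {total_seconds / len(test_files):.2f} s")
print(f"Throughput: {len(test_files) / total_seconds:.2f} inferences/s")
```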
Summary
The fine-tuned whisper-small-singapore-aphasia model demonstrates substantial gains in both transcription accuracy and processing efficiency.
In terms of accuracy, it achieved a WER of 48.0% on the test set, a 12.73% relative reduction compared to whisper-small.en and a 7.69% relative reduction compared to the larger whisper-large-v2 model. These results highlight the effectiveness of domain-specific fine-tuning for Singaporean aphasic speech, enabling the model to handle atypical speech patterns with greater accuracy than even larger, more generalized models.
In terms of performance, the fine-tuned model, being based on whisper-small.en, also offers a notable improvement in inference speed. On the test set, it completed the transcription process in 263.76 seconds with an average inference time of 0.12 seconds per audio sample, achieving a throughput of 8.26 inferences per second. In comparison, whisper-large-v2 required 1495.49 seconds for the same test set, with an average inference time of 0.69 seconds and a throughput of only 1.46 inferences per second.
Together, these results demonstrate that whisper-small-singapore-aphasia provides a balanced solution, combining the speed benefits of a smaller model with enhanced accuracy for domain-specific speech.
Model Examination [optional]
[More Information Needed]
Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
- Hardware Type: [More Information Needed]
- Hours used: [More Information Needed]
- Cloud Provider: [More Information Needed]
- Compute Region: [More Information Needed]
- Carbon Emitted: [More Information Needed]
Technical Specifications [optional]
Model Architecture and Objective
[More Information Needed]
Compute Infrastructure
[More Information Needed]
Hardware
[More Information Needed]
Software
[More Information Needed]
Citation [optional]
BibTeX:
[More Information Needed]
APA:
[More Information Needed]
Glossary [optional]
[More Information Needed]
More Information [optional]
[More Information Needed]
Model Card Authors [optional]
[More Information Needed]
Model Card Contact
[More Information Needed]