---
title: Submission Template
emoji: 🔥
colorFrom: yellow
colorTo: green
sdk: docker
pinned: false
---

# Fine-tuned ELECTRA Model for Climate Disinformation Classification

## Model Description

This is our best-performing model for the Frugal AI Challenge 2024, specifically for the text classification task of identifying climate disinformation.

## Intended Use

- **Primary intended uses:** Comparison to a baseline (random selection of labels) for climate disinformation classification models
- **Primary intended users:** Researchers and developers participating in the Frugal AI Challenge
- **Out-of-scope use cases:** Not intended for production use or real-world classification tasks

## Training Data

The model uses a balanced version of the QuotaClimat/frugalaichallenge-text-train training dataset. The dataset originally had the following structure:

- **Size:** ~6000 examples
- **Split:** 80% train, 20% test
- **Labels:** 8 categories of climate disinformation claims

### Labels

1. No relevant claim detected
2. Global warming is not happening
3. Not caused by humans
4. Not bad or beneficial
5. Solutions harmful/unnecessary
6. Science is unreliable
7. Proponents are biased
8. Fossil fuels are needed

The dataset was balanced to improve accuracy. We used the MarianMT model to augment the dataset by translating sentences from the classes with the fewest samples into Spanish and back-translating them into English. The goal of this strategy was to generate sentences with similar meaning but different wording under the same label. To avoid the dataset containing more synthetic than original data, the target number of sentences per category was set to twice the size of the smallest category. After augmentation, we removed duplicate sentences produced by back-translation, in order to avoid pseudoreplication. We then split the dataset into training and test sets, keeping each original sentence and its back-translated versions in the same split to avoid data leakage.
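The balancing and leakage-avoidance logic described above can be sketched in plain Python. This is a minimal illustration, not the submitted pipeline: `balance_plan` and `group_split` are hypothetical helper names, and a real split would shuffle groups rather than take them in sorted order.

```python
from collections import Counter

def balance_plan(labels):
    """Target count per class: twice the size of the smallest class,
    so synthetic (back-translated) data never outweighs the original.
    Returns how many augmented sentences each class still needs."""
    counts = Counter(labels)
    target = 2 * min(counts.values())
    return {label: max(0, target - n) for label, n in counts.items()}

def group_split(pairs, test_frac=0.2):
    """Split (sentence, group_id) items so that an original sentence and
    its back-translations always land in the same split (no leakage).
    Illustrative only: a real pipeline would shuffle the groups first."""
    groups = sorted({g for _, g in pairs})
    n_test = max(1, int(len(groups) * test_frac))
    test_groups = set(groups[:n_test])
    train = [s for s, g in pairs if g not in test_groups]
    test = [s for s, g in pairs if g in test_groups]
    return train, test
```

Splitting by group ID rather than by sentence is what keeps a back-translated paraphrase out of the test set when its original is in the training set.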

## Performance

### Metrics

- **Accuracy:** Training - 0.91, Validation - 0.87, Testing -
- **Environmental Impact:**
  - Emissions tracked in gCO2eq
  - Energy consumption tracked in Wh

## Model Architecture

We fine-tuned a pre-trained ELECTRA model on our balanced dataset for five epochs. The first four layers of the model were frozen, and training was carried out on the last eight layers. The ELECTRA tokenizer was used to tokenize the sentences, with truncation enabled and padding to the maximum length. We used the Adam optimizer with a learning rate of 5e-5, epsilon of 1e-7, beta_1 of 0.9, and beta_2 of 0.999. The loss was sparse categorical cross-entropy, and the main metric was accuracy.
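To make the optimizer and loss configuration concrete, here is a pure-Python sketch of a single Adam update with the stated hyperparameters, and of sparse categorical cross-entropy for one example. This is illustrative only; in the actual training these are supplied by the deep learning framework, not hand-written.

```python
import math

# Hyperparameters stated in the model card
LR, BETA1, BETA2, EPS = 5e-5, 0.9, 0.999, 1e-7

def adam_step(param, grad, m, v, t):
    """One Adam update for a single scalar parameter (t is the 1-based step)."""
    m = BETA1 * m + (1 - BETA1) * grad       # first-moment (mean) estimate
    v = BETA2 * v + (1 - BETA2) * grad ** 2  # second-moment (variance) estimate
    m_hat = m / (1 - BETA1 ** t)             # bias correction
    v_hat = v / (1 - BETA2 ** t)
    param -= LR * m_hat / (math.sqrt(v_hat) + EPS)
    return param, m, v

def sparse_categorical_crossentropy(probs, label):
    """Loss for one example: negative log-probability of the true class index.
    'Sparse' means the label is an integer index, not a one-hot vector."""
    return -math.log(probs[label])
```

The sparse variant of the loss matches the dataset's integer-encoded labels (0-7), avoiding the need to one-hot encode the eight categories.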

## Environmental Impact

Environmental impact is tracked using CodeCarbon, measuring:

- Carbon emissions during inference
- Energy consumption during inference

## Limitations

- The dataset was small to begin with, and even after augmentation, training a neural network on it may lead to overfitting.
- While visual inspection of sample augmented sentences suggested that the MarianMT model back-translated successfully, validation by subject-matter experts is needed to guarantee that the augmented sentences preserve the label they were automatically assigned from the original sentence.

## Ethical Considerations

- The dataset contains sensitive topics related to climate disinformation
- Environmental impact is tracked to promote awareness of AI's carbon footprint