---
title: Submission Template
emoji: 🔥
colorFrom: yellow
colorTo: green
sdk: docker
pinned: false
---

# Fine-tuned ELECTRA Model for Climate Disinformation Classification

## Model Description

This is our best-performing model for the Frugal AI Challenge 2024, specifically for the text classification task of identifying climate disinformation.

## Intended Use

- **Primary intended uses:** Comparison to a baseline (random selection of labels) for climate disinformation classification models
- **Primary intended users:** Researchers and developers participating in the Frugal AI Challenge
- **Out-of-scope use cases:** Not intended for production use or real-world classification tasks

## Training Data

The model uses a balanced version of the QuotaClimat/frugalaichallenge-text-train training dataset. The dataset originally had the following structure:

- **Size:** ~6000 examples
- **Split:** 80% train, 20% test
- **Labels:** 8 categories of climate disinformation claims

### Labels

1. No relevant claim detected
2. Global warming is not happening
3. Not caused by humans
4. Not bad or beneficial
5. Solutions harmful/unnecessary
6. Science is unreliable
7. Proponents are biased
8. Fossil fuels are needed

The dataset was balanced to improve accuracy. We used the MarianMT model to augment the dataset by translating sentences from the classes with the fewest samples into Spanish and back-translating them into English. The goal of this strategy was to generate sentences with similar meaning but different wording under the same label. To avoid the dataset containing more synthetic than original data, the target number of sentences per category was set to twice the size of the smallest category. After augmentation, we removed duplicate sentences produced by back-translation, in order to avoid pseudoreplication. We then split the dataset into training and test sets, keeping each original sentence and its back-translated versions in the same split to avoid data leakage.
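The balancing and leakage-avoidance logic described above can be sketched in plain Python. This is a minimal illustration, not the submitted pipeline: `balance_plan` and `group_split` are hypothetical helper names, and a real split would shuffle groups rather than take them in sorted order.

```python
from collections import Counter

def balance_plan(labels):
    """Target count per class: twice the size of the smallest class,
    so synthetic (back-translated) data never outweighs the original.
    Returns how many augmented sentences each class still needs."""
    counts = Counter(labels)
    target = 2 * min(counts.values())
    return {label: max(0, target - n) for label, n in counts.items()}

def group_split(pairs, test_frac=0.2):
    """Split (sentence, group_id) items so that an original sentence and
    its back-translations always land in the same split (no leakage).
    Illustrative only: a real pipeline would shuffle the groups first."""
    groups = sorted({g for _, g in pairs})
    n_test = max(1, int(len(groups) * test_frac))
    test_groups = set(groups[:n_test])
    train = [s for s, g in pairs if g not in test_groups]
    test = [s for s, g in pairs if g in test_groups]
    return train, test
```

Splitting by group ID rather than by sentence is what keeps a back-translated paraphrase out of the test set when its original is in the training set.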

## Performance

### Metrics

- **Accuracy:** Training - 0.91, Validation - 0.87, Testing -
- **Environmental Impact:**
  - Emissions tracked in gCO2eq
  - Energy consumption tracked in Wh

## Model Architecture

We fine-tuned a pre-trained ELECTRA model on our balanced dataset for five epochs. The first four layers of the model were frozen, and training was carried out on the last eight layers. The ELECTRA tokenizer was used to tokenize the sentences, with truncation enabled and padding to the maximum length. We used the Adam optimizer with a learning rate of 5e-5, epsilon of 1e-7, beta_1 of 0.9, and beta_2 of 0.999. The loss was sparse categorical cross-entropy, and the main metric was accuracy.
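To make the optimizer and loss configuration concrete, here is a pure-Python sketch of a single Adam update with the stated hyperparameters, and of sparse categorical cross-entropy for one example. This is illustrative only; in the actual training these are supplied by the deep learning framework, not hand-written.

```python
import math

# Hyperparameters stated in the model card
LR, BETA1, BETA2, EPS = 5e-5, 0.9, 0.999, 1e-7

def adam_step(param, grad, m, v, t):
    """One Adam update for a single scalar parameter (t is the 1-based step)."""
    m = BETA1 * m + (1 - BETA1) * grad       # first-moment (mean) estimate
    v = BETA2 * v + (1 - BETA2) * grad ** 2  # second-moment (variance) estimate
    m_hat = m / (1 - BETA1 ** t)             # bias correction
    v_hat = v / (1 - BETA2 ** t)
    param -= LR * m_hat / (math.sqrt(v_hat) + EPS)
    return param, m, v

def sparse_categorical_crossentropy(probs, label):
    """Loss for one example: negative log-probability of the true class index.
    'Sparse' means the label is an integer index, not a one-hot vector."""
    return -math.log(probs[label])
```

The sparse variant of the loss matches the dataset's integer-encoded labels (0-7), avoiding the need to one-hot encode the eight categories.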

## Environmental Impact

Environmental impact is tracked using CodeCarbon, measuring:

- Carbon emissions during inference
- Energy consumption during inference

## Limitations

- The dataset was small to begin with, and even after augmentation, training a neural network on it may lead to overfitting.
- While visual inspection of sample augmented sentences suggested that the MarianMT model back-translated successfully, validation by subject-matter experts is needed to guarantee that the augmented sentences preserve the label they were automatically assigned from the original sentence.

## Ethical Considerations

- The dataset contains sensitive topics related to climate disinformation
- Environmental impact is tracked to promote awareness of AI's carbon footprint