Fake News Classifier - Finetuned: 'distilbert-base-cased'
LIAR Dataset
- This model is finetuned on a large dataset of hand-labeled short statements from politifact.com's API.
- Relevant columns of the data (speaker, statement, etc.) are concatenated and tokenized to create the model input.
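A minimal sketch of the concatenation step described above. The column names, their order, and the space separator are assumptions for illustration; the card does not spell out the exact recipe, and `build_model_input` is a hypothetical helper.

```python
# Hypothetical helper: joins selected columns of one record into a single string.
# Column names, order, and separator are assumptions, not the card's exact recipe.
def build_model_input(row, columns=("speaker", "statement"), sep=" "):
    parts = []
    for name in columns:
        value = row.get(name)
        # Skip missing or empty fields so no stray separators appear.
        if value is None or str(value).strip() == "":
            continue
        parts.append(str(value).strip())
    return sep.join(parts)


example = {"speaker": "jane-doe", "statement": "Taxes went up last year."}
text = build_model_input(example)
print(text)
```

The resulting string is what would then be passed to the tokenizer.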
DistilBERT Cased Tokenizer
- The text is tokenized using the 'distilbert-base-cased' HuggingFace tokenizer.
- For training, the text is truncated to a block size of 200 tokens.
- Padding to the maximum length is used so that every input has the same shape.
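To illustrate what truncation to a fixed block size plus max-length padding does to a sequence of token IDs, here is a small pure-Python sketch. The helper `pad_or_truncate` is hypothetical, and the default `pad_id=0` is an assumption. With the HuggingFace tokenizer the same effect is normally obtained by calling it with `padding="max_length", truncation=True, max_length=200`, which additionally keeps the special tokens intact when truncating.

```python
# Hypothetical helper showing fixed-length inputs; not the card's actual code.
def pad_or_truncate(token_ids, block_size=200, pad_id=0):
    # Truncate anything beyond the block size (this simple version cuts the tail).
    ids = list(token_ids[:block_size])
    # Attention mask: 1 for real tokens, 0 for padding positions.
    mask = [1] * len(ids)
    # Pad on the right up to the block size.
    n_pad = block_size - len(ids)
    ids.extend([pad_id] * n_pad)
    mask.extend([0] * n_pad)
    return ids, mask


ids, mask = pad_or_truncate([101, 1188, 1110, 102], block_size=8)
print(ids)   # every output has exactly block_size entries
print(mask)
```

Because every example comes out with the same length, batches can be stacked into a single tensor without a custom collate step.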
DistilBERT Cased Model
- The model that is finetuned is the DistilBERT model, 'distilbert-base-cased'.
- This is a small and fast text classifier, well suited to real-time inference.
- 40% fewer parameters than the base BERT model.
- 60% faster while preserving 95% of the base BERT model's performance.
- The intuition for using the cased model is to capture some patterns in the writing style (capitalization, punctuation).
- This information may be relevant for detecting fake news sources.
- Clickbait titles, for example, often rely on distinctive capitalization.
- This model performs well in flagging misinformation (fake news), especially if the format is similar to the training distribution.
- Overall, the performance is worse than that of the finetuned 'distilbert-base-uncased' model, as the training data is less clean.