metadata
language: it
thumbnail: https://neuraly.ai/static/assets/images/huggingface/thumbnail.png
tags:
  - sentiment
  - Italian
license: mit
widget:
  - text: Huggingface è un team fantastico!

🤗 + neuraly - Italian BERT Sentiment model

Model description

This model performs sentiment analysis on Italian sentences. It was initialized from an instance of bert-base-italian-cased and fine-tuned on a dataset of Italian tweets, reaching 82% accuracy on a held-out test split of that dataset.

Intended uses & limitations

How to use

import torch
from torch import nn  
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("neuraly/bert-base-italian-cased-sentiment")
# Load the model, use .cuda() to load it on the GPU
model = AutoModelForSequenceClassification.from_pretrained("neuraly/bert-base-italian-cased-sentiment")

sentence = 'Huggingface è un team fantastico!'
input_ids = tokenizer.encode(sentence, add_special_tokens=True)

# Create tensor, use .cuda() to transfer the tensor to GPU
tensor = torch.tensor(input_ids).long()
# Fake batch dimension
tensor = tensor.unsqueeze(0)

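# Call the model and take the logits, which are the first output
# (indexing with [0] works for both tuple and ModelOutput return types)
logits = model(tensor)[0]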

# Remove the fake batch dimension
logits = logits.squeeze(0)

# The model was trained with a combined log-likelihood + softmax loss, so we need a softmax over the logits tensor to extract probabilities
proba = nn.functional.softmax(logits, dim=0)

# Unpack the tensor to obtain negative, neutral and positive probabilities
negative, neutral, positive = proba
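
As a dependency-free illustration of the last two steps in the snippet above, the following sketch applies a softmax to three logits and picks the most probable class. The label order follows the unpacking shown above; the logit values and the helper names `softmax` and `top_label` are hypothetical and are not real model output or part of any library.

```python
import math

# Label order follows the unpacking in the usage snippet.
LABELS = ("negative", "neutral", "positive")

def softmax(logits):
    """Numerically stable softmax over a flat sequence of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_label(logits):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]

# Hypothetical logits, NOT real model output.
example_logits = [-1.2, 0.3, 2.5]
label, prob = top_label(example_logits)
```

In practice the same result comes from calling `argmax` on the `proba` tensor; the pure-Python version is only meant to make the arithmetic explicit.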

Limitations and bias

A possible drawback (or bias) of this model is that it was trained on a tweet dataset, with all the limitations that entails. The training domain is dominated by football players and teams, although the model works surprisingly well on other topics too.

Training data

We trained the model by combining the two tweet datasets taken from Sentipolc EVALITA 2016. Overall the dataset consists of 45K pre-processed tweets.

The model weights come from a pre-trained instance of bert-base-italian-cased. A huge "thank you" goes to that team, brilliant work!

Training procedure

Preprocessing

We tried to preserve as much information as possible, since BERT captures the semantics of complex text sequences extremely well. We removed only @mentions, URLs, and email addresses from each tweet and kept nearly everything else.
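
A cleaning step of this kind might look like the sketch below. The exact rules used for training are not given here, so the three regular expressions and the `clean_tweet` helper are assumptions made for illustration only.

```python
import re

# Assumed patterns; the original cleaning rules are not published here.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
# The lookbehind keeps this from matching the "@" inside an email address.
MENTION_RE = re.compile(r"(?<!\w)@\w+")

def clean_tweet(text):
    """Strip URLs, emails and @mentions, then collapse leftover whitespace."""
    for pattern in (URL_RE, EMAIL_RE, MENTION_RE):
        text = pattern.sub(" ", text)
    return " ".join(text.split())
```

Everything else, including punctuation, hashtags, and casing, is left untouched, which matches the stated goal of keeping as much of the tweet as possible.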

Hardware

  • GPU: Nvidia GTX1080ti
  • CPU: AMD Ryzen7 3700x 8c/16t
  • RAM: 64GB DDR4

Hyperparameters

  • Optimizer: AdamW with learning rate of 2e-5, epsilon of 1e-8
  • Max epochs: 5
  • Batch size: 32
  • Early Stopping: enabled with patience = 1

Early stopping was triggered after 3 epochs.
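
The early-stopping rule can be sketched as follows. This assumes the monitored quantity is a validation loss that should decrease, which is not stated above; the function name and the loss values are made up purely to show how patience = 1 can end a 5-epoch run after 3 epochs.

```python
def epochs_until_stop(val_losses, max_epochs=5, patience=1):
    """Return how many epochs run before early stopping (or max_epochs)."""
    best = float("inf")
    bad_epochs = 0
    epochs_run = 0
    for loss in val_losses[:max_epochs]:
        epochs_run += 1
        if loss < best:
            # Improvement: remember it and reset the patience counter.
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return epochs_run

# Made-up validation losses, chosen only to illustrate the mechanism.
illustrative_losses = [0.62, 0.48, 0.51, 0.47, 0.46]
stopped_after = epochs_until_stop(illustrative_losses)
```

With patience = 1, a single epoch without improvement is enough to stop, so training ends at the first epoch whose loss fails to beat the best one seen so far.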

Eval results

The model achieves an overall accuracy of 82% on the test set. The test set is a 20% split of the whole dataset.
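
For readers who want to reproduce this kind of evaluation, here is a minimal sketch of a 20% hold-out split and an accuracy computation. The shuffling, the seed, and both helper names are assumptions; how the original split was drawn (for example, whether it was stratified by class) is not documented here.

```python
import random

def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of items and split off test_fraction of them."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(predicted, gold):
    """Fraction of positions where the predicted label equals the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty set")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)
```

Accuracy here is the plain fraction of correctly labelled examples over the three classes, with no per-class weighting.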

About us

Neuraly is a young and dynamic startup committed to designing AI-driven solutions and services through the most advanced Machine Learning and Data Science technologies. You can find out more about who we are and what we do on our website.

Acknowledgments

Thanks to the generous support from the Hugging Face team, it is possible to download the model from their S3 storage and live test it from their inference API 🤗.