File size: 2,944 Bytes
7a0cada e924cf5 7a0cada 968fc69 34227e9 7a0cada 34227e9 7a0cada 968fc69 7a0cada 2508de5 f3e34b6 2508de5 88a0007 083e90c 2508de5 7a0cada 856553b d80dae4 856553b 7a0cada 2508de5 54186b7 2508de5 8bc605c 2508de5 7a0cada 2508de5 7a0cada 2508de5 7a0cada 2508de5 7a0cada 2508de5 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 |
---
language:
- multilingual
- en
- pt
- es
- ar
- ko
- ja
- id
- tl
- tr
- fr
- ru
- th
- it
- de
- fa
- pl
- hi
- nl
- ht
- et
- ud
- ca
- sv
- fi
- el
- cs
- eu
- he
- ta
- zh
- no
- da
- cy
- lv
- hu
- ro
- lt
- vi
- uk
- ne
- sl
- is
- sr
- ml
- bn
- bg
- mr
- si
- te
- kn
- ku
- ps
- gu
- my
- am
- hy
- or
- sd
- pa
- km
- ka
- lo
- dv
- ug
widget:
- text: "๐ค"
- text: "T'estimo! โค๏ธ"
- text: "I love you!"
- text: "I hate you ๐คฎ"
- text: "Mahal kita!"
- text: "์ฌ๋ํด!"
- text: "๋ ๋๊ฐ ์ซ์ด"
- text: "๐๐๐"
---
# twitter-XLM-roBERTa-base for Sentiment Analysis
This is a multilingual XLM-roBERTa-base model trained on ~198M tweets and finetuned for sentiment analysis. The sentiment fine-tuning was done on 8 languages (Ar, En, Fr, De, Hi, It, Sp, Pt) but it can be used for more languages (see paper for details).
- Paper: [XLM-T: A Multilingual Language Model Toolkit for Twitter](https://arxiv.org/abs/2104.12250).
- Git Repo: [XLM-T official repository](https://github.com/cardiffnlp/xlm-t).
## Example Pipeline
```python
from transformers import pipeline
model_path = "cardiffnlp/twitter-xlm-roberta-base-sentiment"
sentiment_task = pipeline("sentiment-analysis", model=model_path, tokenizer=model_path)
sentiment_task("T'estimo!")
```
```
[{'label': 'Positive', 'score': 0.6600581407546997}]
```
## Full classification example
```python
from transformers import AutoModelForSequenceClassification
from transformers import TFAutoModelForSequenceClassification
from transformers import AutoTokenizer, AutoConfig
import numpy as np
from scipy.special import softmax
# Preprocess text (username and link placeholders)
def preprocess(text):
new_text = []
for t in text.split(" "):
t = '@user' if t.startswith('@') and len(t) > 1 else t
t = 'http' if t.startswith('http') else t
new_text.append(t)
return " ".join(new_text)
MODEL = f"cardiffnlp/twitter-xlm-roberta-base-sentiment"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
config = AutoConfig.from_pretrained(MODEL)
# PT
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
model.save_pretrained(MODEL)
text = "Good night ๐"
text = preprocess(text)
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
scores = output[0][0].detach().numpy()
scores = softmax(scores)
# # TF
# model = TFAutoModelForSequenceClassification.from_pretrained(MODEL)
# model.save_pretrained(MODEL)
# text = "Good night ๐"
# encoded_input = tokenizer(text, return_tensors='tf')
# output = model(encoded_input)
# scores = output[0][0].numpy()
# scores = softmax(scores)
# Print labels and scores
ranking = np.argsort(scores)
ranking = ranking[::-1]
for i in range(scores.shape[0]):
l = config.id2label[ranking[i]]
s = scores[ranking[i]]
print(f"{i+1}) {l} {np.round(float(s), 4)}")
```
Output:
```
1) Positive 0.7673
2) Neutral 0.2015
3) Negative 0.0313
```
|