# Natural Language Processing

## Sentiment Analysis

![](https://c.files.bbci.co.uk/2A16/production/_115547701_gettyimages-1229654243.jpg)

Photo by Getty Images

___

Today's session covers sentiment analysis and introduces the Zindi NLP project.

# I. Sentiment Analysis

## I.1. Introduction

Sentiment analysis is a broad field with a single purpose: predicting how a person feels from a set of features. Those features can be a voice recording or a picture of a face, but most of the time in sentiment analysis they are text features.

Classic applications of sentiment analysis are the following:
* Is this product review positive, neutral or negative?
* Based on tweets on a topic, do people react positively?
* Is this customer review positive or negative?

So in most cases, sentiment analysis is a classification task.

The input features are derived from the input text, and the output targets are the classes to predict.
It can be a binary classification (e.g. positive or negative review) or a multiclass classification (e.g. a rating from 0 to 5 stars).

## I.2. Sentiment Analysis using NLP

Sentiment Analysis can be done using NLP. Historically, several methods have been developed.

Some basic methods would use the polarity of words:
* words like bad, wrong, lame, disgusting suggest negative polarity
* words like good, amazing, great, delightful suggest positive polarity

Unfortunately, language is more complicated than word polarity: for example, "not bad at all" contains the negative word "bad" and would be scored as negative, even though it actually expresses a favorable opinion.
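To make the limitation concrete, here is a minimal sketch of a word-polarity scorer. The lexicon `POLARITY` and the function `polarity_score` are hypothetical names invented for this illustration, not part of any library; real lexicon-based tools are considerably more elaborate.

```python
# Tiny hand-made polarity lexicon (illustrative only, not a real resource)
POLARITY = {
    "bad": -1, "wrong": -1, "lame": -1, "disgusting": -1,
    "good": 1, "amazing": 1, "great": 1, "delightful": 1,
}

def polarity_score(text):
    """Sum the polarity of every known word; unknown words count as 0."""
    words = text.lower().split()
    return sum(POLARITY.get(word.strip(".,!?"), 0) for word in words)

print(polarity_score("This movie is great"))  # 1
print(polarity_score("Not bad at all"))       # -1, although the opinion is favorable
```

The second sentence gets a negative score because the scorer looks at each word in isolation and has no notion of negation.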

Modern approaches instead train Machine Learning models on features extracted from the text with NLP techniques.

To do sentiment analysis, you already know all the tools:
* Text preprocessing (tokenization, punctuation, stopwords, stemming/lemmatization, n-grams)
* Feature computation (BOW, TF-IDF)
* Classification (SVM, logistic regression...)
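As a rough sketch, the three steps listed above can be chained in a single scikit-learn pipeline. The toy corpus and its labels are invented for this illustration, and the preprocessing is left to `TfidfVectorizer` (lowercasing, English stop words, unigrams and bigrams) rather than done by hand; it is one possible arrangement, not the only one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpus, invented for illustration (1 = positive, 0 = negative)
texts = [
    "great product, works really well",
    "amazing quality and delightful design",
    "good value, I love it",
    "bad quality, broke after a day",
    "disgusting smell and wrong size",
    "lame product, do not buy",
]
labels = [1, 1, 1, 0, 0, 0]

# Preprocessing and feature computation, followed by a linear classifier
clf = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english", ngram_range=(1, 2)),
    LogisticRegression(),
)
clf.fit(texts, labels)

# Each query only reuses words seen in a single class of the toy corpus
predictions = clf.predict(["amazing design, great value", "wrong size and bad smell"])
print(predictions)  # [1 0]
```

Wrapping the vectorizer and the classifier in one pipeline means raw text goes in and predictions come out, which also keeps the vocabulary fitted on the training data only.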

## I.3. Application: Zindi Covid-related Tweets Challenge

Let's now walk through a short example of sentiment analysis on the Zindi Covid-related Tweets Challenge.

In [18]:
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')

# Load the dataset and display some values
df = pd.read_csv('../data/Train.csv')
df.head(10)

| | tweet_id | safe_text | label | agreement |
|---|---|---|---|---|
| 0 | CL1KWCMY | Me &amp; The Big Homie meanboy3000 #MEANBOY #M... | 0.0 | 1.0 |
| 1 | E3303EME | I'm 100% thinking of devoting my career to pro... | 1.0 | 1.0 |
| 2 | M4IVFSMS | #whatcausesautism VACCINES, DO NOT VACCINATE Y... | -1.0 | 1.0 |
| 3 | 1DR6ROZ4 | I mean if they immunize my kid with something ... | -1.0 | 1.0 |
| 4 | J77ENIIE | Thanks to <user> Catch me performing at La Nui... | 0.0 | 1.0 |
| 5 | OVNPOAUX | <user> a nearly 67 year old study when mental ... | 1.0 | 0.666667 |
| 6 | JDA2QDV5 | Study of more than 95,000 kids finds no link b... | 1.0 | 0.666667 |
| 7 | S6UKR4OJ | psa: VACCINATE YOUR FUCKING KIDS | 1.0 | 1.0 |
| 8 | V6IJATBE | Coughing extra on the shuttle and everyone thi... | 1.0 | 0.666667 |
| 9 | VB25IDQK | AIDS vaccine created at Oregon Health &amp; Sc... | 1.0 | 0.666667 |


In [19]:
df.isna().sum()

tweet_id     0
safe_text    0
label        1
agreement    2
dtype: int64

In [22]:
# Drop the rows containing NaN values; .copy() avoids a SettingWithCopyWarning
# when new columns are added to the DataFrame later
df = df.dropna().copy()
df.head()

| | tweet_id | safe_text | label | agreement |
|---|---|---|---|---|
| 0 | CL1KWCMY | Me &amp; The Big Homie meanboy3000 #MEANBOY #M... | 0.0 | 1.0 |
| 1 | E3303EME | I'm 100% thinking of devoting my career to pro... | 1.0 | 1.0 |
| 2 | M4IVFSMS | #whatcausesautism VACCINES, DO NOT VACCINATE Y... | -1.0 | 1.0 |
| 3 | 1DR6ROZ4 | I mean if they immunize my kid with something ... | -1.0 | 1.0 |
| 4 | J77ENIIE | Thanks to <user> Catch me performing at La Nui... | 0.0 | 1.0 |


In [23]:
df.isna().sum()

tweet_id     0
safe_text    0
label        0
agreement    0
dtype: int64

As you can see, there are three classes:
* -1 for a negative sentiment
* 0 for a neutral sentiment
* 1 for a positive sentiment
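Before modeling, it can help to check how the tweets are distributed across these classes. The snippet below is a small sketch on a hypothetical miniature frame `toy` (its values are invented); on the real data the same two lines would be applied to `df['label']`.

```python
import pandas as pd

# Hypothetical miniature of the training frame (values invented for illustration)
toy = pd.DataFrame({"label": [0.0, 1.0, -1.0, -1.0, 0.0, 1.0, 1.0]})

# Map the numeric labels to readable names, then count each class
label_names = {-1.0: "negative", 0.0: "neutral", 1.0: "positive"}
counts = toy["label"].map(label_names).value_counts()
print(counts)
```

A strongly unbalanced distribution would be a reason to look beyond plain accuracy when evaluating the model.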

Each tweet is a short text of varying length. We will now follow the usual steps: preprocessing, TF-IDF and model building.

In [None]:
from nltk import download

# Download the stopword list; this only needs to run once, then the line can be commented out
download('stopwords')

In [24]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

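stop = set(stopwords.words('english'))  # a set makes membership tests faster than a list
stemmer = PorterStemmer()

# Perform preprocessing: tokenize, keep alphabetic tokens, remove stopwords, stem.
# Note: the stopword list is lowercase and the comparison is case-sensitive,
# so capitalized stopwords such as "The" are not removed at the 'stop' step.
df['tokens'] = df['safe_text'].apply(lambda text: word_tokenize(text, preserve_line=True))
df['alpha'] = df['tokens'].apply(lambda x: [item for item in x if item.isalpha()])
df['stop'] = df['alpha'].apply(lambda x: [item for item in x if item not in stop])
df['stemmed'] = df['stop'].apply(lambda x: [stemmer.stem(item) for item in x])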
df.head()

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/emmanuelkoupoh/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


| | tweet_id | safe_text | label | agreement | tokens | alpha | stop | stemmed |
|---|---|---|---|---|---|---|---|---|
| 0 | CL1KWCMY | Me &amp; The Big Homie meanboy3000 #MEANBOY #M... | 0.0 | 1.0 | [Me, &, amp, ;, The, Big, Homie, meanboy3000, ... | [Me, amp, The, Big, Homie, MEANBOY, MB, MBS, M... | [Me, amp, The, Big, Homie, MEANBOY, MB, MBS, M... | [me, amp, the, big, homi, meanboy, mb, mb, mmr... |
| 1 | E3303EME | I'm 100% thinking of devoting my career to pro... | 1.0 | 1.0 | [I, 'm, 100, %, thinking, of, devoting, my, ca... | [I, thinking, of, devoting, my, career, to, pr... | [I, thinking, devoting, career, proving, autis... | [i, think, devot, career, prove, autism, caus,... |
| 2 | M4IVFSMS | #whatcausesautism VACCINES, DO NOT VACCINATE Y... | -1.0 | 1.0 | [#, whatcausesautism, VACCINES, ,, DO, NOT, VA... | [whatcausesautism, VACCINES, DO, NOT, VACCINAT... | [whatcausesautism, VACCINES, DO, NOT, VACCINAT... | [whatcausesaut, vaccin, do, not, vaccin, your,... |
| 3 | 1DR6ROZ4 | I mean if they immunize my kid with something ... | -1.0 | 1.0 | [I, mean, if, they, immunize, my, kid, with, s... | [I, mean, if, they, immunize, my, kid, with, s... | [I, mean, immunize, kid, something, wo, secret... | [i, mean, immun, kid, someth, wo, secretli, ki... |
| 4 | J77ENIIE | Thanks to <user> Catch me performing at La Nui... | 0.0 | 1.0 | [Thanks, to, <, user, >, Catch, me, performing... | [Thanks, to, user, Catch, me, performing, at, ... | [Thanks, user, Catch, performing, La, Nuit, NY... | [thank, user, catch, perform, la, nuit, nyc, s... |


In [25]:
# Shape of the DataFrame after preprocessing
df.shape

(9999, 8)

In [26]:
from sklearn.feature_extraction.text import TfidfVectorizer

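# Compute the TF-IDF; the documents are already lists of stems,
# so the identity analyzer passes them through unchanged
vectorizer = TfidfVectorizer(lowercase=False, analyzer=lambda x: x)
tf_idf = vectorizer.fit_transform(df['stemmed']).toarray()
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
pd.DataFrame(tf_idf, columns=vectorizer.get_feature_names_out()).head()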

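| | a | aa | aaaaaaaand | aaasmtg | aack | aafpassembl | aap | aapglob | aaronhernandez | ab | ... | мне | написать | о | оптимизмом | с | смотрю | стране | тут | чем | 病院実習行くのにmmrと水疱瘡の抗体を調べたら |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |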


In [42]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.model_selection import train_test_split

# Split the data
X_train, X_test, y_train, y_test = train_test_split(tf_idf, df['label'], test_size=0.2, random_state=42)

# Train the model
lr = LogisticRegression()
lr.fit(X_train, y_train)

# Predict using the trained model
y_pred_lr = lr.predict(X_test)

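# Estimate some metrics (scikit-learn convention: y_true first, then y_pred)
print('accuracy:', accuracy_score(y_test, y_pred_lr))
# RMSE via NumPy, since the squared=False argument is deprecated in recent scikit-learn releases
print('rmse:', np.sqrt(mean_squared_error(y_test, y_pred_lr)))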

accuracy: 0.719
rmse: 0.6931810730249348


We built a very simple model here, yet it reaches an accuracy of about 72%. Feel free to improve this model as an exercise, using all your Machine Learning knowledge and experience.

In [37]:
# All distinct values that have been predicted
np.unique(y_pred_lr)

array([-1.,  0.,  1.])
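Checking which classes get predicted is a first sanity check; a confusion matrix goes one step further and shows how each true class is distributed over the predicted classes. The sketch below uses invented label arrays (`y_true` and `y_hat` are hypothetical names and values); with the notebook's variables, `y_test` and `y_pred_lr` would take their place.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

# Invented labels and predictions, for illustration only
y_true = np.array([-1, -1, 0, 0, 0, 1, 1, 1, 1, 1])
y_hat = np.array([0, 1, 0, 0, 1, 1, 1, 1, 0, 1])

# Rows are true classes, columns are predicted classes, in the order given by `labels`
cm = confusion_matrix(y_true, y_hat, labels=[-1, 0, 1])
print(cm)

# Per-class precision, recall and F1; zero_division=0 silences the warning
# raised when a class is never predicted
print(classification_report(y_true, y_hat, labels=[-1, 0, 1], zero_division=0))
```

In this toy case the first column of the matrix is all zeros: the negative class is never predicted, something a single accuracy figure would not reveal.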

In [41]:
# Train the model
model = SVC()
model.fit(X_train, y_train)

# Predict using the trained model
y_pred = model.predict(X_test)

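# Estimate some metrics (scikit-learn convention: y_true first, then y_pred)
print('accuracy:', accuracy_score(y_test, y_pred))
# RMSE via NumPy, since the squared=False argument is deprecated in recent scikit-learn releases
print('rmse:', np.sqrt(mean_squared_error(y_test, y_pred)))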

accuracy: 0.7225
rmse: 0.7024955515873392
