Natural Language Processing (NLP): Natural Language Processing is an interdisciplinary field that combines computer science, artificial intelligence, and linguistics. It is dedicated to the development of computational models for processing and comprehending natural language. These models are utilized for various tasks such as semantic word grouping, text-to-speech synthesis, and language translation.
Sentiment Analysis: Sentiment Analysis, also known as opinion mining, is a valuable practice in text analysis. It involves the interpretation and classification of emotions expressed in textual data. Organizations use Sentiment Analysis to gain insights into public sentiment regarding specific words or topics, which can inform their decision-making processes and strategies.
In the following sections, we will build a Sentiment Analysis model that classifies tweets as either Positive or Negative, making it useful for a range of downstream applications.
Before diving into the analysis or modeling, the first crucial step is to import the necessary dependencies. These dependencies are external Python libraries that provide additional functionalities not available in the standard library. Importing them at the beginning of our script ensures that all the required tools are available and prevents runtime errors due to missing modules.
# utilities
import re
import pickle
import numpy as np
import pandas as pd
# plotting
import seaborn as sns
from wordcloud import WordCloud
import matplotlib.pyplot as plt
# nltk
from nltk.stem import WordNetLemmatizer
# sklearn
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix, classification_report
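Note: the WordNetLemmatizer used in the preprocessing step relies on NLTK's WordNet corpus. If it is not already present on your machine, you will likely need to download it once before running the preprocessing; a minimal sketch, assuming a standard NLTK installation:
import nltk
nltk.download('wordnet')   # WordNet corpus required by WordNetLemmatizer
nltk.download('omw-1.4')   # may also be needed on newer NLTK versions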
The dataset utilized in this project is the renowned Sentiment140 dataset, comprising 1,600,000 tweets extracted via the Twitter API. Each tweet is annotated with a sentiment label (0 for Negative, 4 for Positive), offering a foundation for sentiment detection.
The labels in the training data were assigned automatically based on the emoticons in the tweets, which does not always accurately reflect the sentiment. Models trained on this dataset may therefore exhibit somewhat lower accuracy, since the labelling itself is not infallible.
The dataset includes the following six fields:
- sentiment: the polarity of the tweet (0 = Negative, 4 = Positive)
- ids: the id of the tweet
- date: the date of the tweet
- flag: the query used to collect the tweet (NO_QUERY if there was none)
- user: the user that tweeted
- text: the text of the tweet
For our purposes, only the sentiment and text fields are required, hence the other fields will be disregarded. Additionally, the sentiment field will be re-encoded to reflect sentiment more intuitively (0 for Negative, 1 for Positive).
# Importing the dataset.
DATASET_COLUMNS = ["sentiment", "ids", "date", "flag", "user", "text"]
DATASET_ENCODING = "ISO-8859-1"
dataset = pd.read_csv('data/training.1600000.processed.noemoticon.csv',
                      encoding=DATASET_ENCODING, names=DATASET_COLUMNS)

# Keeping only the columns we need.
dataset = dataset[['sentiment', 'text']]

# Re-encoding the positive label from 4 to 1 for readability.
dataset['sentiment'] = dataset['sentiment'].replace(4, 1)
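As a quick sanity check (purely illustrative, assuming the cell above has been run), we can peek at the data and confirm the re-encoded labels:
# Inspect a few rows and the label distribution after re-encoding.
print(dataset.head())
print(dataset['sentiment'].value_counts())   # expected: 800000 tweets per class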
sns.set_style("whitegrid")

# Plot the class distribution on an explicit figure/axes pair so that
# pandas does not open a second, empty figure.
fig, ax = plt.subplots(figsize=(10, 6), dpi=100)
dataset.groupby('sentiment').count().plot(
    kind='bar',
    legend=False,
    color=['#1f77b4', '#ff7f0e'],
    ax=ax
)
ax.set_xticklabels(['Negative', 'Positive'], rotation=0)
ax.set_xlabel('Sentiment', labelpad=20, weight='bold', size=12)
ax.set_ylabel('Count', labelpad=20, weight='bold', size=12)
ax.set_title('Distribution of Data by Sentiment', pad=20, weight='bold', size=14)

# Annotate each bar with its count.
for p in ax.patches:
    ax.annotate(f'{int(p.get_height())}',
                (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', fontsize=10, color='black',
                xytext=(0, 5), textcoords='offset points')

sns.despine()
plt.show()
# Storing data in lists.
text, sentiment = list(dataset['text']), list(dataset['sentiment'])
Text Preprocessing is traditionally an important step for Natural Language Processing (NLP) tasks. It transforms text into a more digestible form so that machine learning algorithms can perform better.
The preprocessing steps taken (each implemented in the preprocess function below) are:
- Lowercasing: every tweet is converted to lower case.
- Replacing URLs: links are replaced with the token 'URL'.
- Replacing emoticons: emoticons are replaced with 'EMOJI' followed by their meaning (e.g. ':)' becomes 'EMOJIsmile').
- Replacing usernames: '@' mentions are replaced with the token 'USER'.
- Removing non-alphanumeric characters: everything other than letters and digits is replaced with a space.
- Reducing repeated characters: three or more consecutive occurrences of the same character are reduced to two (e.g. 'heyyyy' becomes 'heyy').
- Removing short words: words of a single character are dropped.
- Lemmatizing: each remaining word is reduced to its base form using the WordNet lemmatizer.
# Defining dictionary containing all emojis with their meanings.
emojis = {':)': 'smile', ':-)': 'smile', ';d': 'wink', ':-E': 'vampire', ':(': 'sad',
':-(': 'sad', ':-<': 'sad', ':P': 'raspberry', ':O': 'surprised',
':-@': 'shocked', ':@': 'shocked',':-$': 'confused', ':\\': 'annoyed',
':#': 'mute', ':X': 'mute', ':^)': 'smile', ':-&': 'confused', '$_$': 'greedy',
'@@': 'eyeroll', ':-!': 'confused', ':-D': 'smile', ':-0': 'yell', 'O.o': 'confused',
'<(-_-)>': 'robot', 'd[-_-]b': 'dj', ":'-)": 'sadsmile', ';)': 'wink',
';-)': 'wink', 'O:-)': 'angel','O*-)': 'angel','(:-D': 'gossip', '=^.^=': 'cat'}
# Defining a list containing common English stopwords (the stopword filter is left commented out in preprocess below).
stopwordlist = ['a', 'about', 'above', 'after', 'again', 'ain', 'all', 'am', 'an',
'and','any','are', 'as', 'at', 'be', 'because', 'been', 'before',
'being', 'below', 'between','both', 'by', 'can', 'd', 'did', 'do',
'does', 'doing', 'down', 'during', 'each','few', 'for', 'from',
'further', 'had', 'has', 'have', 'having', 'he', 'her', 'here',
'hers', 'herself', 'him', 'himself', 'his', 'how', 'i', 'if', 'in',
'into','is', 'it', 'its', 'itself', 'just', 'll', 'm', 'ma',
'me', 'more', 'most','my', 'myself', 'now', 'o', 'of', 'on', 'once',
'only', 'or', 'other', 'our', 'ours','ourselves', 'out', 'own', 're',
's', 'same', 'she', "shes", 'should', "shouldve",'so', 'some', 'such',
't', 'than', 'that', "thatll", 'the', 'their', 'theirs', 'them',
'themselves', 'then', 'there', 'these', 'they', 'this', 'those',
'through', 'to', 'too','under', 'until', 'up', 've', 'very', 'was',
'we', 'were', 'what', 'when', 'where','which','while', 'who', 'whom',
'why', 'will', 'with', 'won', 'y', 'you', "youd","youll", "youre",
"youve", 'your', 'yours', 'yourself', 'yourselves']
def preprocess(textdata):
    processedText = []

    # Create the lemmatizer.
    wordLemm = WordNetLemmatizer()

    # Defining regex patterns.
    urlPattern = r"((http://)[^ ]*|(https://)[^ ]*|( www\.)[^ ]*)"
    userPattern = r'@[^\s]+'
    alphaPattern = r"[^a-zA-Z0-9]"
    sequencePattern = r"(.)\1\1+"
    seqReplacePattern = r"\1\1"

    for tweet in textdata:
        tweet = tweet.lower()

        # Replace all URLs with ' URL'.
        tweet = re.sub(urlPattern, ' URL', tweet)
        # Replace all emoticons with 'EMOJI' plus their meaning.
        for emoji in emojis.keys():
            tweet = tweet.replace(emoji, "EMOJI" + emojis[emoji])
        # Replace @USERNAME with ' USER'.
        tweet = re.sub(userPattern, ' USER', tweet)
        # Replace all non-alphanumeric characters with a space.
        tweet = re.sub(alphaPattern, " ", tweet)
        # Reduce 3 or more consecutive repetitions of a character to 2.
        tweet = re.sub(sequencePattern, seqReplacePattern, tweet)

        tweetwords = ''
        for word in tweet.split():
            # Checking if the word is a stopword (filter disabled here).
            # if word not in stopwordlist:
            if len(word) > 1:
                # Lemmatizing the word.
                word = wordLemm.lemmatize(word)
                tweetwords += (word + ' ')

        processedText.append(tweetwords)

    return processedText
import time
t = time.time()
processedtext = preprocess(text)
print(f'Text Preprocessing complete.')
print(f'Time Taken: {round(time.time()-t)} seconds')
Text Preprocessing complete. Time Taken: 77 seconds
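Before moving on, it can be useful to spot-check preprocess on a single made-up tweet (the example text below is illustrative only):
# Spot-check the preprocessing on one illustrative tweet.
sample = ["Loving the new update!!! :) http://example.com @someuser"]
print(preprocess(sample))
# Expected output (roughly): ['loving the new update EMOJIsmile URL USER ']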
data_neg = processedtext[:800000]
plt.figure(figsize = (20,20))
wc = WordCloud(max_words = 1000 , width = 1600 , height = 800,
collocations=False).generate(" ".join(data_neg))
plt.imshow(wc)
plt.axis('off')
plt.grid(False)
plt.show()
data_pos = processedtext[800000:]
wc = WordCloud(max_words = 1000 , width = 1600 , height = 800,
collocations=False).generate(" ".join(data_pos))
plt.figure(figsize = (20,20))
plt.imshow(wc)
plt.axis('off')
plt.grid(False)
plt.show()
X_train, X_test, y_train, y_test = train_test_split(processedtext, sentiment,
test_size = 0.05, random_state = 0)
print(f'Data Split done.')
Data Split done.
TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used to evaluate the importance of a word in the context of a corpus of documents. It reflects how crucial a word is in understanding the essence of the documents.
Consider an example where a collection of students' essays discusses 'Environmental Conservation.' Common words like 'the' appear frequently in these essays but carry little information about the topic, whereas less frequent, topic-specific words like 'biodiversity,' 'ecosystem,' or 'conservation' offer far more insight into the theme. TF-IDF assigns such words higher weights, capturing their greater thematic significance within the corpus.
The TF-IDF Vectoriser transforms a collection of text documents into a matrix of TF-IDF features. This is crucial for algorithms that operate on numerical input rather than raw text. Typically, the vectoriser is trained on the training data (X_train) to capture the vocabulary and IDF of the corpus.
The ngram_range parameter defines the range of word sequences considered. For instance, with an ngram_range of (1,2), the phrase 'climate change' would be treated as a single feature in addition to the individual words 'climate' and 'change'. This allows the model to capture context that single words may not convey.
The max_features parameter limits the number of features to the most frequently occurring words, thus reducing dimensionality and potentially improving model performance. For example, if set to 1000, only the top 1000 words by frequency would be used as features.
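To make the ngram_range behaviour concrete, here is a small illustrative sketch on a toy corpus (the sentences and the small max_features value are made up for demonstration only):
# Illustrating unigram + bigram features on a toy corpus.
toy_corpus = ["climate change is real", "climate policy drives change"]
toy_vec = TfidfVectorizer(ngram_range=(1, 2), max_features=20)
toy_vec.fit(toy_corpus)
# Both unigrams such as 'climate' and bigrams such as 'climate change'
# appear as separate features.
print(toy_vec.get_feature_names_out())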
vectoriser = TfidfVectorizer(ngram_range=(1,2), max_features=500000)
vectoriser.fit(X_train)
print(f'Vectoriser fitted.')
print('No. of feature_words: ', len(vectoriser.get_feature_names_out()))
Vectoriser fitted. No. of feature_words: 500000
X_train = vectoriser.transform(X_train)
X_test = vectoriser.transform(X_test)
print(f'Data Transformed.')
Data Transformed.
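A quick check of the transformed matrices (illustrative; the expected shapes follow from the 95/5 split above) shows that each tweet is now a sparse row vector over the 500,000 TF-IDF features:
# Shapes of the transformed sparse matrices.
print(X_train.shape)   # expected: (1520000, 500000)
print(X_test.shape)    # expected: (80000, 500000)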
We are building three distinct models for our sentiment analysis task:
- Bernoulli Naive Bayes (BernoulliNB)
- Linear Support Vector Classification (LinearSVC)
- Logistic Regression (LogisticRegression)
Given that our dataset is balanced, with equal numbers of Positive and Negative tweets, we use Accuracy as our evaluation metric. We also visualize the Confusion Matrix to see how each model performs on both classes.
def model_Evaluate(model):
    # Predict values for the test dataset.
    y_pred = model.predict(X_test)

    # Print the evaluation metrics for the dataset.
    print(classification_report(y_test, y_pred))

    # Compute and plot the confusion matrix.
    cf_matrix = confusion_matrix(y_test, y_pred)
    categories = ['Negative', 'Positive']
    group_names = ['True Neg', 'False Pos', 'False Neg', 'True Pos']
    group_percentages = ['{0:.2%}'.format(value) for value in
                         cf_matrix.flatten() / np.sum(cf_matrix)]
    labels = [f'{v1}\n{v2}' for v1, v2 in zip(group_names, group_percentages)]
    labels = np.asarray(labels).reshape(2, 2)
    sns.heatmap(cf_matrix, annot=labels, cmap='Blues', fmt='',
                xticklabels=categories, yticklabels=categories)
    plt.xlabel("Predicted values", fontdict={'size': 14}, labelpad=10)
    plt.ylabel("Actual values", fontdict={'size': 14}, labelpad=10)
    plt.title("Confusion Matrix", fontdict={'size': 18}, pad=20)
BNBmodel = BernoulliNB(alpha = 2)
BNBmodel.fit(X_train, y_train)
model_Evaluate(BNBmodel)
              precision    recall  f1-score   support

           0       0.81      0.79      0.80     39989
           1       0.80      0.81      0.80     40011

    accuracy                           0.80     80000
   macro avg       0.80      0.80      0.80     80000
weighted avg       0.80      0.80      0.80     80000
SVCmodel = LinearSVC()
SVCmodel.fit(X_train, y_train)
model_Evaluate(SVCmodel)
              precision    recall  f1-score   support

           0       0.82      0.81      0.82     39989
           1       0.81      0.83      0.82     40011

    accuracy                           0.82     80000
   macro avg       0.82      0.82      0.82     80000
weighted avg       0.82      0.82      0.82     80000
LRmodel = LogisticRegression(C = 2, max_iter = 1000, n_jobs=-1)
LRmodel.fit(X_train, y_train)
model_Evaluate(LRmodel)
              precision    recall  f1-score   support

           0       0.83      0.82      0.83     39989
           1       0.82      0.84      0.83     40011

    accuracy                           0.83     80000
   macro avg       0.83      0.83      0.83     80000
weighted avg       0.83      0.83      0.83     80000
The Logistic Regression model stands out as the strongest, reaching about 83% accuracy on tweet sentiment classification. It is worth noting, however, that the BernoulliNB model is the fastest to train and predict, while still achieving a respectable 80% accuracy.
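As a compact side-by-side comparison, a sketch like the following (using sklearn's accuracy_score and assuming the three fitted models above are still in memory) can be handy:
from sklearn.metrics import accuracy_score

# Compare held-out accuracy of the three fitted models.
for name, clf in [('BernoulliNB', BNBmodel),
                  ('LinearSVC', SVCmodel),
                  ('LogisticRegression', LRmodel)]:
    print(f'{name}: {accuracy_score(y_test, clf.predict(X_test)):.4f}')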
# Saving the vectoriser and the models for later use.
with open('vectoriser-ngram-(1,2).pickle', 'wb') as file:
    pickle.dump(vectoriser, file)

with open('Sentiment-LR.pickle', 'wb') as file:
    pickle.dump(LRmodel, file)

with open('Sentiment-BNB.pickle', 'wb') as file:
    pickle.dump(BNBmodel, file)
To deploy the model for Sentiment Prediction, we need to import the Vectoriser and LR Model using Pickle.
The vectoriser transforms new text into a matrix of TF-IDF features, and the model predicts the sentiment of that transformed data. Note that any text submitted for prediction must first go through the same preprocessing.
def load_models():
    '''
    Replace '..path/' with the path to the saved models.
    '''
    # Load the vectoriser.
    file = open('..path/vectoriser-ngram-(1,2).pickle', 'rb')
    vectoriser = pickle.load(file)
    file.close()

    # Load the LR model (saved above as 'Sentiment-LR.pickle').
    file = open('..path/Sentiment-LR.pickle', 'rb')
    LRmodel = pickle.load(file)
    file.close()

    return vectoriser, LRmodel
def predict(vectoriser, model, text):
    # Predict the sentiment.
    textdata = vectoriser.transform(preprocess(text))
    sentiment = model.predict(textdata)

    # Make a list of (text, sentiment) pairs.
    data = []
    for tweet, pred in zip(text, sentiment):
        data.append((tweet, pred))

    # Convert the list into a pandas DataFrame.
    df = pd.DataFrame(data, columns=['text', 'sentiment'])
    df = df.replace([0, 1], ["Negative", "Positive"])
    return df
if __name__ == "__main__":
    # Loading the models.
    # vectoriser, LRmodel = load_models()

    # Text to classify should be in a list.
    text = ["I hate twitter",
            "May the Force be with you.",
            "Mr. Stark, I don't feel so good"]

    df = predict(vectoriser, LRmodel, text)
    print(df.head())
                              text sentiment
0                   I hate twitter  Negative
1       May the Force be with you.  Positive
2  Mr. Stark, I don't feel so good  Negative