Sentiment Analysis of Reviews for Amazon Products
Posted on Tue January 08 2019 in Project Details
In this project, I will be analysing reviews given to Amazon products. The dataset contains around 41000 reviews of various Amazon products such as the Fire tablet and the Fire Stick. Along with each review, there is also the rating given by the user and whether the user recommends the product or not.
Aim: Use the textual reviews of Amazon products to predict ratings, while experimenting with various parts of textual analysis.
I will be using the text column in the dataset to predict the rating given by the user, whether it is positive or negative. This is of course under the assumption that the review given by the user matches the rating given. Hopefully, the reviews here are not something like this one!
So, before I start, the idea is to change the rating from numbers to positive/negative (or use the doRecommend flag) and predict this label using the reviews for each product.
Dataset: https://www.kaggle.com/datafiniti/consumer-reviews-of-amazon-products/kernels
So, let's start!
Import all libraries that will be used
# data analysis and wrangling
import pandas as pd
import numpy as np
import random as rnd
# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import re
import string
import nltk
from wordcloud import WordCloud, STOPWORDS
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix, roc_curve, auc, recall_score, precision_score, f1_score, accuracy_score
reviews= pd.read_csv('1429_1.csv')
reviews.columns = ['id', 'name', 'asins', 'brand', 'categories', 'keys', 'manufacturer', 'date', 'dateAdded', 'dateSeen', 'didPurchase', 'doRecommend', 'id', 'numHelpful', 'rating', 'sourceURLs', 'text', 'title', 'userCity', 'userProvince', 'username']
reviews.head()
Drop these columns as they will have no bearing on the analysis later.
reviews.drop(['id', 'dateSeen', 'sourceURLs', 'userCity', 'userProvince', 'username'], axis = 1, inplace= True)
reviews[['rating', 'title', 'text' ]].isnull().sum()
Drop the one review without text, as it is of no use to the model.
reviews_nna= reviews[pd.notnull(reviews['text'])]
reviews_nna[['rating', 'title', 'text' ]].isnull().sum()
The top 5 most reviewed items in the dataset
Looks like the Fire tablet is way ahead of the rest here.
reviews_nna['name'].value_counts().nlargest(5)
Create a new dataframe with the important columns. I am taking asins just to distinguish between the products.
sentiment_df= reviews_nna[[ 'asins', 'doRecommend', 'rating', 'title', 'text']]
sentiment_df.head()
Create a plot that shows the count of reviews across ratings and whether the product is recommended by the reviewer.
sns.catplot(x= 'rating', col= 'doRecommend', data= sentiment_df, kind= 'count')
sentiment_df[sentiment_df['doRecommend']== False]['rating'].value_counts()
Interesting! Looks like some people rate a product highly but do not recommend it.
Let's peek at the reviews that these products have been given.
pd.set_option('display.max_colwidth', -1)
sentiment_df[(sentiment_df['doRecommend']== False) & (sentiment_df['rating']== 5.0)][['doRecommend', 'rating', 'text']].head(10)
Taking a look at just 10 of these, you can tell that they seem like false flags. The reviews are indeed positive, so they match the ratings, BUT the recommendation does not match.
Now, let's look at the products recommended and their split by ratings.
sentiment_df[sentiment_df['doRecommend']== True]['rating'].value_counts()
And now the text for these recommended products with bad ratings.
pd.set_option('display.max_colwidth', -1)
sentiment_df[(sentiment_df['doRecommend']== True) & (sentiment_df['rating']== 1.0)][['doRecommend', 'rating', 'text']].head(10)
Well, this makes things a bit confusing. The texts here either don't match the doRecommend flag or they don't match the rating.
So, going just by this observation, it makes sense to use the rating as the target variable instead of the doRecommend flag, since the latter would probably give us a huge number of contradictory labels. There will still be a bunch of false flags, just like in the reviews above, but that is something we cannot really do anything about, unless I read through each one of the 34000 reviews and label them myself...
Stemming
In natural language writing, a word appears in several forms for grammatical reasons. For example, the sentence- I am going to school- can also be written as I had gone to the school or I have been going to the school depending on the context. For processing in models, it makes sense to convert these variations into a single format. This is where stemming and lemmatization come in.
Stemming and lemmatization are used to reduce a word to its root form or a common base form. So am, are, is convert to be; going, gone convert to go; and so on. The difference between stemming and lemmatization is the output of this conversion. Lemmatization reduces a word to its equivalent dictionary word, which is not necessarily the case with stemming. Amuse stems to amus, whereas its lemma is amuse.
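As a quick side illustration of that difference (not part of the main pipeline below), here is a minimal sketch comparing NLTK's Porter stemmer with its WordNet lemmatizer; it assumes the wordnet corpus has been downloaded via nltk.download('wordnet').
from nltk.stem import PorterStemmer, WordNetLemmatizer
print(PorterStemmer().stem('amuse'))                     # 'amus'- not a dictionary word
print(WordNetLemmatizer().lemmatize('amused', pos='v'))  # 'amuse'- the dictionary lemma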
In this part, I check the two stemming methods that are available- Porter and Snowball. I pass a review to the functions created below to generate the stemmed equivalents and compare the results. Based on the outputs, Snowball stemming seems to generate more readable output, while also cleaning up the text (lower-casing and removing special characters), so I use Snowball stemming from here on.
from nltk import SnowballStemmer
from nltk import PorterStemmer
from nltk import sent_tokenize, word_tokenize
stopwords = nltk.corpus.stopwords.words('english')
ss = SnowballStemmer('english')
ps= PorterStemmer()
def sentencePorterStem(sentence):
    token_words= word_tokenize(sentence)
    stem_sentence=[]
    for word in token_words:
        if word not in stopwords:
            stem_sentence.append(ps.stem(word))
            stem_sentence.append(' ')
    return ''.join(stem_sentence)
def sentenceSnowballStem(sentence):
    token_words= word_tokenize(sentence)
    stem_sentence=[]
    for word in token_words:
        if word not in stopwords:
            stem_sentence.append(ss.stem(word))
            stem_sentence.append(' ')
    return ''.join(stem_sentence)
sen= str(sentiment_df['text'][3])
ps_sen= sentencePorterStem(sen)
ss_sen= sentenceSnowballStem(sen)
print(sen)
print('Porter Stem- '+ ps_sen)
print('Snowball Stem- '+ ss_sen)
sentiment_df['text_stem']= sentiment_df['text'].apply(sentenceSnowballStem)
Vectorization
Now, with the stemmed reviews, we need to create vectorized versions of the text. Vectorization is the process of converting sentences into a vector, or array, of numbers.
There are two methods we can go with, count vectorization and TF-IDF vectorization.
- Count Vectorization
In the simplest of terms, it is creating an array whose values are the number of occurrences of each word in a sentence. Consider the two sentences- I am a human and I enjoy doing human things (two very human-like sentences).
After reducing both sentences to their base forms (am becomes be, doing becomes do), counting the occurrences of each word in the first sentence gives something like this-
I- 1, be- 1, a- 1, human- 1.
Now, for the second sentence-
I- 1, enjoy- 1, do- 1, human- 1, things- 1.
Now, creating a list of all words in the two sentences- {I, be, a, enjoy, do, human, things}.
Creating an array where the words are the columns and the sentences are the rows, we get something like this-
array([[1, 1, 1, 0, 0, 1, 0],
       [1, 0, 0, 1, 1, 1, 1]])
This is the count vectorizer equivalent of the two sentences.
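To see the same thing in code, here is a minimal sketch that runs sklearn's CountVectorizer over those two toy sentences (already reduced to their base forms). Two assumptions to note: the default token_pattern ignores single-character tokens such as I and a, so it is overridden here, and sklearn lower-cases the terms and assigns the column indices alphabetically, so the column order differs from the hand-built example above.
from sklearn.feature_extraction.text import CountVectorizer
toy_sentences= ['I be a human', 'I enjoy do human things']
toy_vectorizer= CountVectorizer(token_pattern=r'(?u)\b\w+\b')  # keep single-character tokens
toy_counts= toy_vectorizer.fit_transform(toy_sentences)
print(toy_vectorizer.vocabulary_)   # each term mapped to its column index
print(toy_counts.toarray())         # one row of counts per sentence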
- TF-IDF Vectorization
It is a combination of two concepts- TF (Term Frequency) and IDF (Inverse Document Frequency).
Term frequency is, as the name suggests, the frequency of a word occurring in a document: the number of occurrences of the word in the document divided by the total number of words in the document.
Inverse document frequency is the log (base 10) of the total number of documents divided by the number of documents that the word appears in. The idea behind this is to find words that are more important compared to others.
So, if a word occurs in all the documents, it is not really an important word.
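As a tiny worked example with made-up numbers (not taken from this dataset): suppose the word tablet appears 2 times in a 10-word review, and in 5 out of 100 reviews overall. Note that sklearn's TfidfVectorizer actually uses a smoothed natural-log variant and normalises each row, so its exact values will differ, but the idea is the same.
import math
tf= 2 / 10                # term frequency: occurrences of the word / total words in the review
idf= math.log10(100 / 5)  # inverse document frequency, log base 10
print(tf * idf)           # ~0.26, the TF-IDF weight of 'tablet' in that review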
In this part, I generate the TF-IDF version of the whole review dataset to understand the method and see the output it produces. Since TF-IDF and count vectorization use similar concepts and produce a similar output format, I only inspect the TF-IDF version here.
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
sbstem_vectorizer= TfidfVectorizer()
textfeatures= sbstem_vectorizer.fit_transform(sentiment_df['text_stem'])
Check the list of feature names generated by the vectorizer. The vocabulary_ dictionary maps each term to its feature index.
from itertools import islice
def take(n, iterable):
"Return first n items of the iterable as a list"
return list(islice(iterable, n))
print(take(20, sbstem_vectorizer.vocabulary_.items()))
pd.DataFrame(textfeatures.toarray()).head(15)
Change the dataframe column names to the terms to make sense of the dataframe.
# use get_feature_names() so the column labels line up with the feature indices
# (vocabulary_ is a term-to-index dict whose key order does not match the column order)
text_vect_df= pd.DataFrame(textfeatures.toarray(), columns= sbstem_vectorizer.get_feature_names())
text_vect_df.head(15)
#text_vect_df['product'].unique()
Basic sentiment analysis using NLTK.
Using 3 sentences of my own, I use the NLTK library to identify which sentences are positive, negative or neutral. This can be done with the 'compound' value in the polarity score.
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()
pos_sen= 'I am very happy. I absolutely love it. Great job!'
print('positive - '+ str(sia.polarity_scores(pos_sen)))
neg_sen= 'It is so disgusting. I am very angry. I will murder him.'
print('negative - '+ str(sia.polarity_scores(neg_sen)))
neutral_sen= 'I am writing python in jupyter notebook.'
print('neutral - '+ str(sia.polarity_scores(neutral_sen)))
dataset=sentiment_df[['text_stem', 'rating']]
dataset['rating'] = dataset['rating'].apply(lambda x:'Positive' if x>=4 else 'Negative')
def SentimentCoeff(sentence):
    score = sia.polarity_scores(sentence)
    return score['compound']
dataset['sentiment_coeff']= dataset['text_stem'].apply(SentimentCoeff)
dataset.head()
sns.boxenplot(x= 'rating', y= 'sentiment_coeff', data= dataset)
So, after using the sentiment analyzer available in NLTK, the positive reviews look as expected: they have a higher sentiment coefficient, whereas the coefficient for negative reviews sits closer to neutral. The way these reviews were categorised may also play a part here: reviews with 2 or 3 stars are now labelled negative, even though the actual review text might not be all that negative.
Preparing for the model
Let's start with the data preparation for the logistic regression model, splitting the data into training and testing sets with a 70-30 split.
x= dataset['text_stem']
y= dataset['rating']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)
I use both the TF-IDF and count vector formats as inputs to the logistic regression model to check which version performs better, and find the optimal parameters using GridSearchCV.
tfidf_vectorizer= TfidfVectorizer()
x_train_features_tfidf= tfidf_vectorizer.fit_transform(x_train)
x_test_features_tfidf= tfidf_vectorizer.transform(x_test)
count_vectorizer= CountVectorizer()
x_train_features_count= count_vectorizer.fit_transform(x_train)
x_test_features_count= count_vectorizer.transform(x_test)
logreg_params={'penalty': ['l1', 'l2'], 'C': [0.01, 0.1, 1, 10, 100]}
grid_logreg= GridSearchCV(LogisticRegression(solver= 'liblinear'), logreg_params, cv=5)
grid_logreg.fit(x_train_features_count, y_train)
logreg= grid_logreg.best_estimator_
print('Best Penalty:', grid_logreg.best_estimator_.get_params()['penalty'])
print('Best C:', grid_logreg.best_estimator_.get_params()['C'])
c_value= grid_logreg.best_estimator_.get_params()['C']
pen_value= grid_logreg.best_estimator_.get_params()['penalty']
logreg=LogisticRegression(C= c_value, penalty= pen_value, solver= 'liblinear')
logreg.fit(x_train_features_count, y_train)
y_pred= logreg.predict(x_test_features_count)
F1 score and Confusion Matrix
In two-class classification algorithms, model evaluation is done by calculating metrics such as precision, recall and F1 score.
Starting with the basics, in a two-class supervised classification algorithm, each data point is assigned to either the correct class or the incorrect class. A point assigned to the incorrect class is an error, and depending on the actual and predicted class it is either a Type-I error or a Type-II error.
The diagram gives a pretty good picture of the classification metrics. Type-I error is also known as False Positive, while Type-II error is also known as False Negative.
Precision is the ratio of True Positives to all predicted positives, that is True Positives plus False Positives, while Recall is the ratio of True Positives to all actual positives, that is True Positives plus False Negatives. Accuracy is the fraction of all predictions that are correct: True Positives plus True Negatives, divided by the total.
F1 Score is the harmonic mean of Precision and Recall: 2 * (Precision * Recall) / (Precision + Recall). The harmonic mean weights both metrics equally, so a model cannot look good by sacrificing one for the other.
Confusion matrix is a plot similar to the diagram above, with the counts in each predicted category.
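Just to tie those definitions to numbers, here is a quick sketch using a made-up confusion matrix (the counts below are illustrative, not from this model):
tp, fp, fn, tn = 80, 10, 20, 90  # hypothetical counts
precision= tp / (tp + fp)                   # 0.89- of everything predicted positive, how much was right
recall= tp / (tp + fn)                      # 0.80- of everything actually positive, how much was found
f1= 2 * precision * recall / (precision + recall)
accuracy= (tp + tn) / (tp + fp + fn + tn)   # 0.85
print(precision, recall, f1, accuracy)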
print('Recall Score: {:.2f}'.format(recall_score(y_test, y_pred, pos_label= 'Positive')))
print('Precision Score: {:.2f}'.format(precision_score(y_test, y_pred, pos_label= 'Positive')))
print('F1 Score: {:.2f}'.format(f1_score(y_test, y_pred, pos_label= 'Positive')))
print('Accuracy Score: {:.2f}'.format(accuracy_score(y_test, y_pred)))
Plotting confusion matrix of the predicted and actual values.
cnf_matrix = confusion_matrix(y_test, y_pred)
df_cnf_matrix= pd.DataFrame(cnf_matrix)
sns.heatmap(df_cnf_matrix, annot=True, fmt='g', cmap="Blues")
plt.ylabel('True label')
plt.xlabel('Predicted label')
logreg_params={'penalty': ['l1', 'l2'], 'C': [0.01, 0.1, 1, 10, 100]}
#Using the liblinear solver, which supports both the l1 and l2 penalties
grid_logreg= GridSearchCV(LogisticRegression(solver= 'liblinear'), logreg_params, cv=5)
grid_logreg.fit(x_train_features_tfidf, y_train)
logreg= grid_logreg.best_estimator_
print('Best Penalty:', grid_logreg.best_estimator_.get_params()['penalty'])
print('Best C:', grid_logreg.best_estimator_.get_params()['C'])
c_value_tf= grid_logreg.best_estimator_.get_params()['C']
pen_value_tf= grid_logreg.best_estimator_.get_params()['penalty']
logreg_tf=LogisticRegression(C= c_value_tf, penalty= pen_value_tf, solver= 'liblinear')
logreg_tf.fit(x_train_features_tfidf, y_train)
y_pred_tf= logreg_tf.predict(x_test_features_tfidf)
print('Recall Score: {:.2f}'.format(recall_score(y_test, y_pred_tf, pos_label= 'Positive')))
print('Precision Score: {:.2f}'.format(precision_score(y_test, y_pred_tf, pos_label= 'Positive')))
print('F1 Score: {:.2f}'.format(f1_score(y_test, y_pred_tf, pos_label= 'Positive')))
print('Accuracy Score: {:.2f}'.format(accuracy_score(y_test, y_pred_tf)))
cnf_matrix = confusion_matrix(y_test, y_pred_tf)
df_cnf_matrix= pd.DataFrame(cnf_matrix)
sns.heatmap(df_cnf_matrix, annot=True, fmt='g', cmap="Blues")
plt.ylabel('True label')
plt.xlabel('Predicted label')
Overall, it seems that the two versions give similar results. The F1 scores are the same, and the other metrics are close as well. So, for this dataset, either vectorization of the text works well.
Final Words
So, we walked through the various parts of a data science project: data cleaning, exploration, visualization, and data manipulation for building a model. We also looked at some of the techniques used for text analytics, namely vectorization and sentiment analysis.
Looking back, I would like to explore this dataset further using similar machine learning techniques, especially on the modelling side. There are several other classification algorithms, such as Random Forest and Naive Bayes, which I would like to read up on and possibly apply to the dataset to compare the different algorithms. Neural network approaches could be explored as well.