Amazon_sentiment_analysis

A complete sentiment analysis project based on Amazon reviews. Machine learning models are trained to detect whether a review expresses a positive or a negative sentiment.

The project is done in 6 parts:

Scraping reviews from Amazon to create a dataset
Cleaning of the dataset and preparation for model training
Exploratory analysis of the scraped reviews
Machine learning model creation and training
Model testing against new reviews
Analysis of the results

Business case

A company needs to know what its customers think about its products in order to assess their level of satisfaction. The Internet gives customers a place to express themselves, positively but also negatively. If the number of reviews is small, customer satisfaction can be analyzed manually. But what if that is not the case and we have hundreds or thousands of reviews?

That is where machine learning comes in.

The idea is to classify reviews based on their general sentiment: negative or positive. For this, we train Machine Learning (ML) algorithms using supervised learning and Natural Language Processing (NLP) techniques. We first create a dataset of reviews for which we already know the general sentiment (4 or 5 stars: positive, 1 or 2 stars: negative) and then train our ML models against these scores. We can then generalize from these results and assess the sentiment of a review for which we don't have a score!
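
The labeling rule described above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the function name `rating_to_label` and the sample reviews are hypothetical, and dropping 3-star reviews is an assumption, since the text does not say how they are handled.

```python
from typing import Optional


def rating_to_label(stars: int) -> Optional[str]:
    """Map a star rating to a sentiment label (4-5: positive, 1-2: negative)."""
    if stars >= 4:
        return "positive"
    if stars <= 2:
        return "negative"
    # Assumption: 3-star reviews are treated as neutral and left unlabeled.
    return None


# Made-up sample reviews, for illustration only.
reviews = [
    {"text": "Great laptop, fast and quiet.", "stars": 5},
    {"text": "Stopped working after a week.", "stars": 1},
    {"text": "It is okay for the price.", "stars": 3},
]

# Keep only the reviews that receive a label.
labeled = [
    (review["text"], rating_to_label(review["stars"]))
    for review in reviews
    if rating_to_label(review["stars"]) is not None
]
```

With this rule, only the first two sample reviews end up in `labeled`; the 3-star review is left out of the training data.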

For this example, let's say that we are an online seller of electronic equipment and that we want to see what our customers say about us on social media, on our website, and so on: any platform on the Internet where the customer cannot leave a rating, which prevents us from directly assessing the satisfaction level.

Reviews scraping

The first part is to create a dataset for which we already have labels (ratings).

Here we scraped the first 100 pages of items in the computers category of Amazon. For each of these items, we fetched the first 10 reviews for each star rating, which gives us a dataset of almost 20,000 different reviews.

More details here.

Data cleaning

Once we have a large dataset, the next part is to clean it. When working with NLP, it is important to have normalized data. The main steps are as follows:

Tokenize the text
Remove punctuation and other special characters
Normalize the case (every word needs to be in lower case)
Remove stop words
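
The four steps above could look roughly like the following. This is a sketch under stated assumptions, not the project's cleaning code: `clean_review` and `STOP_WORDS` are hypothetical names, tokenization is a plain whitespace split, and the stop-word list is a tiny illustrative one (a real project would use a fuller list, for example the one shipped with an NLP library).

```python
import re
from typing import List

# Assumption: a tiny illustrative stop-word list, not the one used by the project.
STOP_WORDS = {"a", "an", "and", "the", "is", "it", "this", "to", "of", "i", "was"}


def clean_review(text: str) -> List[str]:
    """Turn a raw review into a list of normalized tokens."""
    # 1. Tokenize the text (simple whitespace split).
    tokens = text.split()
    # 2. Remove punctuation and other special characters from each token.
    tokens = [re.sub(r"[^A-Za-z0-9]", "", token) for token in tokens]
    # 3. Normalize the case, dropping tokens that became empty in step 2.
    tokens = [token.lower() for token in tokens if token]
    # 4. Remove stop words.
    return [token for token in tokens if token not in STOP_WORDS]


cleaned = clean_review("This laptop is GREAT, and the battery lasts!")
# cleaned == ["laptop", "great", "battery", "lasts"]
```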

More details here.

Data exploration

This part is about understanding our dataset and identifying potential trends in it, in order to choose the best parameters for our ML classifiers.

More details here.

Models creation and testing

We train 6 different ML classifiers with scikit-learn:

logistic regression
decision tree
random forest
passive aggressive
support vector
naive bayes

Based on the preliminary results, we can then test our best model against new data, to check how it would behave in a real scenario.
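
As a rough illustration of this step, the sketch below trains the six classifier families with scikit-learn, picks the one with the best validation accuracy, and applies it to unseen text. Several things here are assumptions rather than the project's actual setup: the TF-IDF features, the hyperparameters, the choice of `LinearSVC` for the support vector model and `MultinomialNB` for naive Bayes, and the toy data, which is made up and far too small for meaningful scores.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, PassiveAggressiveClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Made-up toy data, for illustration only.
train_texts = [
    "great laptop fast and reliable", "excellent screen love it",
    "works perfectly very happy", "amazing battery great value",
    "terrible quality broke quickly", "awful keyboard very disappointed",
    "stopped working waste of money", "poor battery bad screen",
]
train_labels = ["positive"] * 4 + ["negative"] * 4
valid_texts = ["very happy great screen", "bad quality waste of money"]
valid_labels = ["positive", "negative"]

# One estimator per classifier family listed above (hyperparameters are assumptions).
classifiers = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "passive aggressive": PassiveAggressiveClassifier(random_state=0),
    "support vector": LinearSVC(),
    "naive bayes": MultinomialNB(),
}

models = {}
scores = {}
for name, classifier in classifiers.items():
    # Each model is a pipeline: TF-IDF features followed by the classifier.
    model = make_pipeline(TfidfVectorizer(), classifier)
    model.fit(train_texts, train_labels)
    models[name] = model
    scores[name] = model.score(valid_texts, valid_labels)  # validation accuracy

# Test the best model against new, unlabeled reviews.
best_name = max(scores, key=scores.get)
new_reviews = ["excellent laptop, fast delivery", "terrible screen, broke quickly"]
predictions = [str(label) for label in models[best_name].predict(new_reviews)]
```

Wrapping the vectorizer and the classifier in a single pipeline means new reviews go through exactly the same feature extraction as the training data.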

More details here.

Results analysis

How did each of the 6 models fare? Which one performed best?
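
One simple way to answer these questions is to compare each model's predictions with the true labels. The sketch below is only an illustration with made-up predictions; the helper names and the model names `model_a` and `model_b` are hypothetical, and the project's own analysis may rely on different metrics.

```python
def accuracy(y_true, y_pred):
    """Fraction of reviews whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall(y_true, y_pred, positive="positive"):
    """Precision and recall for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels and predictions, for illustration only.
y_true = ["positive", "negative", "positive", "negative"]
predictions = {
    "model_a": ["negative", "negative", "negative", "negative"],
    "model_b": ["positive", "positive", "positive", "negative"],
}

scores = {name: accuracy(y_true, y_pred) for name, y_pred in predictions.items()}
best = max(scores, key=scores.get)
```

Accuracy alone can hide a model that always predicts the same class, which is why precision and recall are computed separately here.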

See the results here.

Usage

Each command below is run separately. You can also take the data from this repository directly, without having to run the program. For example, you can skip the scraping and the cleaning by taking the cleaned data in data_cleaning\data.

The only command that must be run is the one that creates the models, as the resulting file is too big to host on GitHub.

To launch the initial scraping: python main.py scraping

To clean the results: python main.py cleaning

To view the data analysis: python main.py data_exploration

To create the classifiers: python main.py create_models

To test one of the models: python main.py test_model

To view the results analysis: python main.py analyze_results
