Stock-Sentiment

Sentiment Analysis of TESLA stocks using Recurrent Neural Network

Project Description

This project aims at classifying tweets into positive, neutral, and negative sentiment and building a classification model to predict the sentiment of tweets.

Data Cleaning and Preprocessing

TESLA tweets are scrapped from Twitter using snscrape module. Data consists of TESLA related tweets from 21-04-2021 to 21-05-2021. Data is in .csv format and has 59,025 records with three columns (date, content, and username). Data has no missing values.

remove_url(text) : Removes URLs from tweet text

remove_mentions(text) : Removes mentions from tweets

remove_hashtags(text) : Removes hastags from tweets

remove_whitespace(text) : Removes extra whitespaces from tweet text

The above modules uses regular expressions to carry out the related tasks.

Duplicates were dropped.

Labelling tweets with sentiment score

VADER sentitment library is used to attach a sentiment score (compound score) to every single tweet. VADER uses an inbuilt dictionary of words (keys) and the corresponding sentiment score (values). This dictionary can easily be updated depending on the domain we are working. For example here we added words like bear and bull (which are commonly used in stock market) along with corresponding sentiment score.

VADER takes into account emojis, punctuations, and capitalization to generate the compund score and hence we do not remove emojis and punctuations at this point. Also note we did not convert tweet to lower case yet.

compound score < 0 ----> Negative Sentiment

compound scaore = 0 ----> Neutral Sentiment

comopund score > 0 -----> Positive Sentiment

Using the above criterion tweets are labelled 1, 0, and 2 for negative, neutral, and positive sentiment respectively. For this another feature 'sentiment' is created.

A simple group by sentiment score shows dataset is higly imbalanced.

1 ---> 11467 records

0 ---> 20609 records

2 ---> 26949 records

Compound scores were resampled day wise and mean compound score for everyday was computed. A simple line graph was plotted to see how average sentiment varies everyday.

Data Cleaning for Classification Task

Apart from removing URLs, mentions, and hastags we need to remove punctuations, special characters, and emoticons from tweet before we go for classification task.

Every character in the tweet was changed to lowercase and then stop words using nltk library, punctuations, special characters, and emoticons were removed.

Each word was lemmatized (changed to its root form) using Text blob lemmatizer.

Balancing the dataset

Since accuarcy was considered as the evaluation metric it is important to balance the dataset first.

Dataset was balanced by taking a random sample of 11466 records from data with class 0 and 2. This makes all three classes to have 11466 records. Now dataset has total of 34398 records.

Tokenizing the tweets

Tokenizer with 5000 words were used and fitted on the cleaned tweets to obtain corresponding text vector. Text vector was padded with zeros so that each text vector has same length.

Training the model

Dataset was split into training set (0.90) and validation set (0.10).

A Sequential model was trained with an Embedding layer of 5000 words, dropout layer with 0.80, LSTM layer with tanh activation function and the last output layer with 3 nodes and sigmoid activation function.

Categorical cross entropy was used as the loss function and adam as the optimizer.

With batch size of 150 records model was fitted for 15 epochs with Early stopping (for regularization).

After 6 epochs traing was stopped with 0.89 % training accuracy and 0.85 % vaildation accuarcy.

Evaluating different models

Different Sequential models were trained to comapare the results

Sequential model with Embedding layer of 5000 words and 0.80 dropout

Training accuracy: 0.89

Validation accuracy: 0.85

Sequential model with Embedding layer of 3000 words

Training accuracy: 0.88

Validation accuracy: 0.84

Sequential model with Embedding layer of 3000 words and L2 regularization

Training accuracy: 0.88

Validation accuracy: 0.83

Sequential model with Embedding layer of 5000 words

Training accuracy: 0.90

Validation accuracy: 0.84

Clerly first model performed the best ( low bias and low variance) and the last model performed the worst (high varaince model)

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
README.md		README.md
TESLA-Sentiment-Analyzer-3-Classes.ipynb		TESLA-Sentiment-Analyzer-3-Classes.ipynb
TESLA-tweets-2021-05-21-to-2021-05-20.csv		TESLA-tweets-2021-05-21-to-2021-05-20.csv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Stock-Sentiment

Project Description

Data Cleaning and Preprocessing

Labelling tweets with sentiment score

Data Cleaning for Classification Task

Balancing the dataset

Tokenizing the tweets

Training the model

Evaluating different models

About

Releases

Packages

Languages

Pradyumn-30/Stock-Sentiment

Folders and files

Latest commit

History

Repository files navigation

Stock-Sentiment

Project Description

Data Cleaning and Preprocessing

Labelling tweets with sentiment score

Data Cleaning for Classification Task

Balancing the dataset

Tokenizing the tweets

Training the model

Evaluating different models

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages