This project aims to analyze Twitter sentiment data using Apache Spark for data processing and R for visualization. The goal is to understand public sentiment on various entities by analyzing tweets and generating insightful visualizations.
- Abstract
- Project Structure
- Environment Setup
- Running the Project
- File Descriptions
- Output Descriptions
- Purpose
- Contributing
- License
Twitter-Sentiment-Analysis/
├── data/
│   ├── tweets.csv
│   └── outputs/
├── src/
│   └── main/
│       └── java/
│           └── com/
│               └── example/
│                   └── TwitterSentimentAnalysis.java
├── R/
│   └── visualization.R
├── resources/
│   └── log4j.properties
├── pom.xml
└── README.md
- Java 8 or higher
- Apache Maven
- Apache Hadoop
- Apache Spark
- R and RStudio
- Required R packages: ggplot2, dplyr, wordcloud, RColorBrewer, gridExtra, grid, png
- Install Hadoop:
  - Follow the official Hadoop installation guide.
- Install Spark:
  - Follow the official Spark installation guide.
git clone https://github.com/adivishnu-a/Twitter-Sentiment-Analysis.git
cd Twitter-Sentiment-Analysis
- Open IntelliJ IDEA.
- Open the cloned repository as a Maven project.
- Run the TwitterSentimentAnalysis class located in TwitterSentimentAnalysis.java.
Rscript R/visualization.R
- Description: tweets.csv contains the tweet data, without a header row.
- Columns:
  - TweetID: Unique identifier for each tweet.
  - Entity: The subject or entity being discussed in the tweet.
  - Sentiment: The sentiment expressed in the tweet (e.g., Positive, Negative, Neutral).
  - TweetContent: The actual text content of the tweet.
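Because the file has no header row, the columns have to be identified by position. The following is a minimal sketch of reading one row, not the project's actual loading code (the project uses Spark for that). The class name TweetRowParser, the nested Tweet type, and the sample row are hypothetical, and the sketch assumes the first three fields contain no commas and that the file does not use CSV quoting.

```java
import java.util.Optional;

class TweetRowParser {
    /** One row of tweets.csv, in the column order listed above. */
    static final class Tweet {
        final String tweetId;
        final String entity;
        final String sentiment;
        final String content;

        Tweet(String tweetId, String entity, String sentiment, String content) {
            this.tweetId = tweetId;
            this.entity = entity;
            this.sentiment = sentiment;
            this.content = content;
        }
    }

    // Simplifying assumption: no quoted fields, and only the tweet text
    // (the last column) may contain commas. A real CSV reader covers more cases.
    static Optional<Tweet> parse(String line) {
        // A limit of 4 keeps any commas in the tweet text inside the last field.
        String[] parts = line.split(",", 4);
        if (parts.length < 4) {
            return Optional.empty(); // malformed or truncated row
        }
        return Optional.of(new Tweet(parts[0].trim(), parts[1].trim(),
                                     parts[2].trim(), parts[3].trim()));
    }

    public static void main(String[] args) {
        // Made-up sample row, for illustration only.
        parse("1,ExampleCorp,Positive,great phone, great price")
            .ifPresent(t -> System.out.println(t.entity + " / " + t.sentiment + " / " + t.content));
    }
}
```

Returning an Optional lets the caller skip malformed rows instead of failing on the first bad line.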
- Description: TwitterSentimentAnalysis.java is the main Java file for data processing with Apache Spark.
- Key Functions:
  - Data Loading and Cleaning: Loads tweets from tweets.csv and removes duplicates.
  - Sentiment Analysis: Calculates the percentage of each sentiment.
  - Entity Analysis: Identifies the top entities by tweet count.
  - Additional Insights: Calculates average tweet length by sentiment, top words in positive and negative tweets, sentiment distribution by entity, and sentiment distribution for top entities.
  - Output: Saves the results to CSV files in the outputs directory.
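The first three steps above (deduplication, sentiment percentages, top entities) can be illustrated with plain Java collections. This is a sketch of the logic only, not the project's Spark code: the class SentimentSummary and its method names are hypothetical, each row is assumed to be a list of the four columns in file order, and alphabetical tie-breaking is an assumption made here to keep the result deterministic.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class SentimentSummary {
    // Each row is assumed to be [tweetId, entity, sentiment, content].

    /** Drops exact duplicate rows, keeping the first occurrence of each. */
    static List<List<String>> dedupe(List<List<String>> rows) {
        return new ArrayList<>(new LinkedHashSet<>(rows));
    }

    /** Percentage of rows carrying each sentiment label. */
    static Map<String, Double> sentimentPercentages(List<List<String>> rows) {
        Map<String, Long> counts = rows.stream()
            .collect(Collectors.groupingBy(r -> r.get(2), TreeMap::new, Collectors.counting()));
        Map<String, Double> percentages = new TreeMap<>();
        counts.forEach((label, n) -> percentages.put(label, 100.0 * n / rows.size()));
        return percentages;
    }

    /** The n entities with the most tweets, ties broken alphabetically. */
    static List<String> topEntities(List<List<String>> rows, int n) {
        Map<String, Long> counts = rows.stream()
            .collect(Collectors.groupingBy(r -> r.get(1), TreeMap::new, Collectors.counting()));
        return counts.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed()
                    .thenComparing(Map.Entry.<String, Long>comparingByKey()))
            .limit(n)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }
}
```

Deduplicating before counting matters here: a repeated row would otherwise inflate both the sentiment percentages and the entity counts.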
- Description: visualization.R is the R script that generates visualizations from the CSV files.
- Key Functions:
  - Visualization: Generates plots and word clouds from the CSV files.
  - PDF Report: Creates a PDF report with all the plots and tables, each on a separate page, displaying only the top 10 rows of each table with captions.
- Description: log4j.properties sets the logging levels for Spark and Hadoop.
- Description: pom.xml is the Maven configuration file that manages project dependencies.
The Java code processes the tweet data and generates the following CSV files in the outputs directory:
- sentiment_percentage: Contains the percentage of each sentiment.
- top_entities: Contains the top entities by tweet count.
- avg_tweet_length_by_sentiment: Contains the average tweet length by sentiment.
- top_positive_words: Contains the top positive words.
- top_negative_words: Contains the top negative words.
- sentiment_by_entity: Contains the sentiment distribution by entity.
- sentiment_for_top_entities: Contains the sentiment distribution for top entities.
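To make two of these outputs concrete, the sketch below computes an average tweet length per sentiment and a top-words list using plain Java streams. It is an illustration under stated assumptions rather than the project's Spark implementation: the class TweetTextStats and its methods are hypothetical, "length" is taken to mean character count, words are lowercased runs of ASCII letters, and no stop-word filtering is applied.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class TweetTextStats {
    /** Mean character length of the tweet text per sentiment; rows are {sentiment, content}. */
    static Map<String, Double> avgLengthBySentiment(List<String[]> rows) {
        return rows.stream().collect(Collectors.groupingBy(
            (String[] r) -> r[0],
            TreeMap::new,
            Collectors.averagingInt((String[] r) -> r[1].length())));
    }

    /** The n most frequent words across the given texts, ties broken alphabetically. */
    static List<String> topWords(List<String> texts, int n) {
        Map<String, Long> counts = texts.stream()
            // Assumption: a word is a run of ASCII letters after lowercasing.
            .flatMap(t -> Arrays.stream(t.toLowerCase(Locale.ROOT).split("[^a-z]+")))
            .filter(w -> !w.isEmpty()) // split can yield a leading empty token
            .collect(Collectors.groupingBy(w -> w, TreeMap::new, Collectors.counting()));
        return counts.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed()
                    .thenComparing(Map.Entry.<String, Long>comparingByKey()))
            .limit(n)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }
}
```

To mirror top_positive_words or top_negative_words, topWords would be called on the text of only the positive or only the negative tweets.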
The R script generates the following visualizations and saves them in the outputs directory:
- sentiment_percentage_distribution.png: Pie chart of sentiment percentage distribution.
- top_entities.png: Bar chart of the top entities by tweet count.
- avg_tweet_length_by_sentiment.png: Bar chart of the average tweet length by sentiment.
- top_positive_words.png: Bar chart of the top positive words.
- top_negative_words.png: Bar chart of the top negative words.
- sentiment_by_entity.png: Bar chart of the sentiment distribution by entity.
- sentiment_for_top_entities.png: Bar chart of the sentiment distribution for top entities.
- positive_wordcloud.png: Word cloud of positive words.
- negative_wordcloud.png: Word cloud of negative words.
- RPlots.pdf: PDF report containing all the plots and tables, each on a separate page, displaying only the top 10 rows of each table with captions.
The purpose of this project is to provide insights into public sentiment on various entities using Twitter data. This can be useful for businesses, politicians, and organizations to make informed decisions based on public opinion.
Contributions are welcome! If you have any suggestions, bug reports, or feature requests, please open an issue or submit a pull request.
This project is licensed under the MIT License. See the LICENSE file for more details.