This project analyzes user feedback and reviews on recipes to uncover sentiment trends and user engagement patterns. By leveraging Apache Spark, natural language processing (NLP), and machine learning models, the study aims to improve recipe recommendations and enhance customer satisfaction.
- Source: UCI Machine Learning Repository
- File Size: 5.8 MB
- Format: CSV
- Instances: 18,182
- Features: 15
- Missing Values: Present in the text column
- Recipe Details: Name, ranking, unique ID
- User Information: ID, reputation score
- Review Data: Comment ID, timestamp, replies, thumbs up/down, star rating (1-5 scale, 0 if not rated)
- Text Data: Full review comment for NLP-based sentiment analysis
The dataset, sourced from an online platform, represents real-world user feedback on recipes. This data is useful for sentiment analysis, user behavior modeling, and recommendation systems. By understanding sentiment and engagement patterns, platforms can improve user experience and boost interaction.
- Initialize Spark Session – Set up a distributed environment for large-scale processing.
- Load Dataset – Read and explore the CSV file using
pyspark.sql. - Data Type Casting – Convert numeric fields to appropriate data types (
IntegerType,TimestampType). - Data Cleaning – Handle missing values in text fields and remove redundant data.
- Correlation Analysis – Analyze relationships between user reputation, thumbs up, and ratings.
- Feature Engineering – Extract additional time-based features (year, month, day) for trend analysis.
- Top 20 Recipes by Rating
- Top 20 Recipes by Thumbs Up
- Review Length vs. Rating
- Sentiment Trends Over Time
- Tool Used:
TextBlobfor sentiment polarity scoring. - Processing Steps:
- Tokenization, stopword removal, and lemmatization applied to review text.
- Polarity scores assigned to each review to classify sentiment as positive, neutral, or negative.
- Positive reviews correlate with higher engagement (thumbs up).
- Long reviews tend to show stronger sentiment (either highly positive or negative).
- Noun and Adjective Analysis: Identified frequent descriptive words in reviews.
- Word Cloud Visualization: Generated insights into common themes in positive and negative reviews.
Develop sentiment classification models using textual and engagement data.
| Model | Precision | Recall | F1 Score |
|---|---|---|---|
| Random Forest | 73.5% | 85.7% | 79.1% |
| Logistic Regression | 73.5% | 85.7% | 79.1% |
| Decision Tree | 81.8% | 85.9% | 80.0% |
- Random Forest provided robust generalization but had slightly lower precision.
- Logistic Regression acted as a simple baseline model.
- Decision Tree outperformed others, achieving the highest precision and recall.
- Text Feature Extraction: TF-IDF vectorization of review text.
- User Engagement Features: Integration of thumbs up, thumbs down, and star ratings.
- Model Training and Evaluation: Used
SparkMLpipelines for training, testing, and hyperparameter tuning. - Cross-Validation: Optimized model parameters for improved accuracy.
- Achieved the highest precision (81.8%), meaning fewer false positive classifications.
- Maintained strong recall and F1-score, balancing accuracy and interpretability.
- Promote High-Rated Recipes: Recipes with high ratings and engagement should be prioritized.
- Flag Low-Rated Recipes: Negative reviews can be analyzed for potential improvements.
- Seasonal Rating Trends: Identifying fluctuations in user sentiment over time can inform content strategies.
- Keyword Analysis in Negative Reviews: Helps in identifying common complaints and areas for improvement.
- Advanced NLP Models: Implement BERT or LSTM for better contextual understanding.
- Real-Time Sentiment Analysis: Deploy models for live content moderation and recommendation systems.
- User Segmentation: Tailor recipe recommendations based on user preferences and past interactions.