The aim of this project was to predict house prices from attributes such as square footage, number of bedrooms, and number of bathrooms. We implemented and compared several supervised regression models for this task. Model performance was evaluated with Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and R-squared, which guided the choice of the final model and its parameters.
The dataset includes various features relevant to predicting house prices.
- Loaded the raw data from JSON files and merged data into a single dataframe.
- Handled missing values through appropriate methods.
- Split the data into training and test sets.
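The loading, cleaning, and splitting steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook code: the helper names `load_listings` and `clean_and_split`, the `price` target column, the assumption that each JSON file holds a list of flat records, and the median imputation are all hypothetical choices made for the example.

```python
import json
from pathlib import Path

import pandas as pd
from sklearn.model_selection import train_test_split


def load_listings(data_dir):
    """Read every *.json file in data_dir and merge the records into one dataframe."""
    records = []
    for path in sorted(Path(data_dir).glob("*.json")):
        with open(path, encoding="utf-8") as fh:
            records.extend(json.load(fh))  # assumed: each file is a list of flat dicts
    return pd.DataFrame(records)


def clean_and_split(df, target="price", test_size=0.2, seed=42):
    """Drop rows without a target, split, then impute numeric gaps."""
    df = df.dropna(subset=[target])
    train, test = train_test_split(df, test_size=test_size, random_state=seed)
    train, test = train.copy(), test.copy()
    # Medians are learned from the training rows only, so nothing from the
    # test set influences the imputed values.
    numeric_cols = train.select_dtypes("number").columns.difference([target])
    medians = train[numeric_cols].median()
    train[numeric_cols] = train[numeric_cols].fillna(medians)
    test[numeric_cols] = test[numeric_cols].fillna(medians)
    return train, test
```

Splitting before imputing is a design choice in this sketch: it keeps test-set statistics out of the training data.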
- EDA steps are documented in notebooks/1 - EDA.ipynb.
- Analyzed the distribution of key features.
- Examined relationships between features and the target variable (house prices).
- Identified potential outliers and anomalies.
- Model selection steps are documented in notebooks/2 - model_selection.ipynb.
- Decision Tree
- Random Forest
- XGBoost
- Other relevant regression models
- Split the training data into training and validation sets.
- Trained multiple models on the training set.
- Evaluated models using metrics such as Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and R-squared.
- Selected the best-performing models for further tuning.
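As a rough illustration of the train, validate, and compare loop described above, the sketch below fits two of the listed model families on synthetic data and scores them with MAE, RMSE, and R-squared. The synthetic dataset, the `evaluate` helper, and the rule of picking the lowest RMSE are assumptions made for the example; the project's actual data and selection criteria live in the notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def evaluate(model, X_train, y_train, X_val, y_val):
    """Fit a model on the training split and score it on the validation split."""
    model.fit(X_train, y_train)
    pred = model.predict(X_val)
    return {
        "MAE": mean_absolute_error(y_val, pred),
        # RMSE is the square root of the mean squared error.
        "RMSE": float(np.sqrt(mean_squared_error(y_val, pred))),
        "R2": r2_score(y_val, pred),
    }


# Synthetic stand-in for the house price data (assumption for this sketch).
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

candidates = {
    "decision_tree": DecisionTreeRegressor(random_state=0),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}
scores = {
    name: evaluate(model, X_train, y_train, X_val, y_val)
    for name, model in candidates.items()
}
best_name = min(scores, key=lambda name: scores[name]["RMSE"])
```

An `xgboost.XGBRegressor` could be added to `candidates` in the same way, since it follows the scikit-learn estimator interface.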
- Tuning steps are documented in notebooks/3 - tuning_pipeline.ipynb.
- Hyperparameter tuning using techniques such as Grid Search and Random Search.
- Validated model performance using cross-validation.
- Finalized the best model based on performance metrics.
- Compared different tuned models to determine the best performer.
- Selected the best model for feature selection.
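A grid search with cross-validation, as mentioned above, might look like the following sketch. The parameter grid, the synthetic data, and the choice of a random forest are illustrative assumptions, not the values used in the notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data (assumption for this sketch).
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# Hypothetical grid: 2 x 2 = 4 parameter combinations.
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_root_mean_squared_error",  # scikit-learn maximizes, so RMSE is negated
    cv=5,  # 5-fold cross-validation for every combination
)
search.fit(X, y)

best_params = search.best_params_
best_rmse = -search.best_score_  # flip the sign back to a plain RMSE
```

For larger search spaces, `RandomizedSearchCV` takes a similar set of arguments plus `n_iter`, and samples combinations instead of trying all of them.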
- Feature selection steps are documented in notebooks/3 - tuning_pipeline.ipynb.
- Analyzed the top 10 important features of the best model.
- Visualized the top features.
- Chose the best model to move on to the final pipeline.
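One common way to rank features for a tree-based model is its impurity-based importance scores. The sketch below assumes a random forest and synthetic data with placeholder feature names; the README does not state which importance measure the notebook uses.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with placeholder feature names (assumption for this sketch).
X, y = make_regression(n_samples=300, n_features=15, n_informative=5, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# feature_importances_ holds one impurity-based score per input column.
importances = pd.Series(model.feature_importances_, index=feature_names)
top10 = importances.nlargest(10)  # sorted from most to least important
```

With matplotlib installed, `top10.sort_values().plot.barh()` draws the ranking as a horizontal bar chart.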
- Created a scalable pipeline for model training and deployment.
- Automated the data preprocessing, model training, and evaluation process.
- Saved the final model/pipeline for deployment.
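Bundling preprocessing and the model into one object keeps training and prediction consistent and makes the result easy to save. The sketch below is one way to do this with scikit-learn's `Pipeline` and `joblib`; the imputation step, the model choice, and the file name `house_price_pipeline.joblib` are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Synthetic stand-in data with some injected missing values (assumption).
X, y = make_regression(n_samples=200, n_features=6, noise=5.0, random_state=0)
X[::10, 0] = np.nan

pipeline = Pipeline(
    steps=[
        ("impute", SimpleImputer(strategy="median")),  # preprocessing step
        ("model", RandomForestRegressor(n_estimators=50, random_state=0)),
    ]
)
pipeline.fit(X, y)

# Persist the fitted pipeline, then reload it as a deployment step would.
model_path = os.path.join(tempfile.mkdtemp(), "house_price_pipeline.joblib")
joblib.dump(pipeline, model_path)
restored = joblib.load(model_path)
```

Because the imputer is stored inside the pipeline, the reloaded object can be given raw rows with missing values and applies the same preprocessing before predicting.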
- Root Mean Squared Error (RMSE): 44555.6
- Mean Absolute Error (MAE): 7110.6
- R² Score: 0.993517
This repository provides the complete workflow for predicting house prices, from data cleaning to deploying the final model. Each step is documented in detail to ensure reproducibility and ease of understanding.
- Collaborating with group members using GitHub: our lack of experience with GitHub meant more communication between team members was needed, which slowed the work down
- Encoding tags drawn from a very large list of possible values
- Target encoding of city, understanding how to prevent data leakage
- Hyperparameter tuning and understanding which parameters to change
- Target encoding of city using appropriate pandas methods
- Refine hyperparameters
- Refine pipeline
- More EDA and interactive plots for better visualization
- Gain more exposure to teamwork on GitHub
- Use references from existing projects related to our project and improve our models
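For the target encoding of city mentioned above, one leakage-aware approach is out-of-fold encoding: each training row is encoded with city means computed from the other folds, so a row's own price never feeds its own feature. The sketch below is a minimal pandas version under assumed column names (`city`, `price`); the function `target_encode_oof` is hypothetical and not part of the repository.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def target_encode_oof(train, test, col, target, n_splits=5, seed=0):
    """Return out-of-fold encodings for train and full-train encodings for test."""
    global_mean = train[target].mean()
    encoded = pd.Series(np.nan, index=train.index, dtype=float)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in folds.split(train):
        # Category means come only from rows outside the fold being encoded.
        fold_means = train.iloc[fit_idx].groupby(col)[target].mean()
        encoded.iloc[enc_idx] = train.iloc[enc_idx][col].map(fold_means).to_numpy()
    # Categories unseen in the fitting rows fall back to the global mean.
    train_enc = encoded.fillna(global_mean)
    full_means = train.groupby(col)[target].mean()
    test_enc = test[col].map(full_means).fillna(global_mean)
    return train_enc, test_enc


train = pd.DataFrame(
    {
        "city": ["A", "A", "A", "B", "B", "B", "C", "C", "C", "C"],
        "price": [100, 110, 120, 200, 210, 220, 300, 310, 320, 330],
    }
)
test = pd.DataFrame({"city": ["A", "B", "D"]})
train_enc, test_enc = target_encode_oof(train, test, col="city", target="price")
```

Here the test rows are encoded with means from the whole training set, and the unseen city `D` receives the global training mean.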
