Data Science Midterm Project

Project/Goals

The aim of this project was to predict house prices from attributes such as square footage, number of bedrooms, and number of bathrooms. We implemented and compared several supervised regression models, evaluating them with Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and R-squared to determine the most suitable model and hyperparameters for this prediction task.

Process

Data Cleaning

Source of Data

The dataset includes various features relevant to predicting house prices.

Loaded the raw data from JSON files and merged it into a single dataframe.
Handled missing values through appropriate methods.
Split the data into training and test sets.
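
The cleaning steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the notebooks: the directory layout, the `price` target column, the median imputation, and the helper names `load_records` and `clean_and_split` are all hypothetical.

```python
import json
from pathlib import Path

import pandas as pd
from sklearn.model_selection import train_test_split


def load_records(data_dir):
    """Read every JSON file in data_dir and stack the records into one dataframe.

    Assumes each file holds a JSON list of flat records (hypothetical layout).
    """
    frames = []
    for path in sorted(Path(data_dir).glob("*.json")):
        with open(path) as f:
            frames.append(pd.DataFrame(json.load(f)))
    return pd.concat(frames, ignore_index=True)


def clean_and_split(df, target="price", test_size=0.2, seed=42):
    """Drop rows without a target, split, then impute with training medians."""
    df = df.dropna(subset=[target])
    X = df.drop(columns=[target])
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed
    )
    # Medians come from the training split only, so no information from
    # the test split leaks into preprocessing.
    medians = X_train.median(numeric_only=True)
    return X_train.fillna(medians), X_test.fillna(medians), y_train, y_test
```

Splitting before imputing is a deliberate choice in this sketch: statistics computed on the full dataset would let the test set influence the training features.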

Exploratory Data Analysis (EDA)

Notebooks

EDA steps are documented in notebooks/1 - EDA.ipynb.

Steps

Analyzed the distribution of key features.
Examined relationships between features and the target variable (house prices).
Identified potential outliers and anomalies.
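
As a rough illustration of these EDA steps, the sketch below summarizes distributions, checks correlation with the target, and flags outliers with the interquartile-range (IQR) rule. The IQR rule, the helper name `iqr_outliers`, and the toy data are assumptions; the notebook may use different checks.

```python
import pandas as pd


def iqr_outliers(series, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Toy data for illustration only; not taken from the project dataset.
df = pd.DataFrame({
    "sqft": [900, 1100, 1250, 1400, 1600, 1800, 2000],
    "price": [150_000, 180_000, 210_000, 230_000, 260_000, 300_000, 2_500_000],
})

print(df.describe())                      # distribution of each feature
print(df.corr()["price"].sort_values())   # linear relationship with the target
print(df[iqr_outliers(df["price"])])      # rows flagged as potential outliers
```

In this toy frame only the last row, with a price far above the rest, is flagged.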

Model Selection

Notebooks

Model selection steps are documented in notebooks/2 - model_selection.ipynb.

Models Tested:

Decision Tree
Random Forest
XGBoost
Other relevant regression models

Steps

Split the training data into training and validation sets.
Trained multiple models on the training set.
Evaluated models using metrics such as Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and R-squared.
Selected the best-performing models for further tuning.
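
A minimal sketch of this comparison loop, assuming scikit-learn and synthetic data in place of the housing features. XGBoost is left out so the example depends only on scikit-learn; an `xgboost.XGBRegressor` could be added to the `models` dictionary in the same way. The `evaluate` helper is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the housing features; not the project data.
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "decision_tree": DecisionTreeRegressor(random_state=0),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}


def evaluate(model, X_tr, y_tr, X_va, y_va):
    """Fit on the training split and score on the validation split."""
    model.fit(X_tr, y_tr)
    pred = model.predict(X_va)
    return {
        "mae": mean_absolute_error(y_va, pred),
        "rmse": float(np.sqrt(mean_squared_error(y_va, pred))),
        "r2": r2_score(y_va, pred),
    }


results = {
    name: evaluate(model, X_train, y_train, X_val, y_val)
    for name, model in models.items()
}
# Print models from lowest to highest validation RMSE.
for name, scores in sorted(results.items(), key=lambda kv: kv[1]["rmse"]):
    print(name, scores)
```

RMSE is computed as the square root of the mean squared error, so it is in the same units as the target and is never smaller than MAE.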

Model Tuning

Notebooks

Tuning steps are documented in notebooks/3 - tuning_pipeline.ipynb.

Steps:

Tuned hyperparameters using techniques such as Grid Search and Random Search.
Validated model performance using cross-validation.
Compared the tuned models on the performance metrics to determine the best performer.
Carried the best model forward to feature selection.
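
The tuning steps might look roughly like the sketch below, which runs a cross-validated grid search over a decision tree. The parameter grid, the 5-fold setting, and the synthetic data are assumptions chosen for illustration, not the values used in the notebook.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the training data.
X, y = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=0)

# Hypothetical grid; the values actually searched are not stated in this README.
param_grid = {
    "max_depth": [3, 5, 10, None],
    "min_samples_leaf": [1, 5, 10],
}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    cv=5,                                   # 5-fold cross-validation
    scoring="neg_root_mean_squared_error",  # scikit-learn maximizes scores
)
search.fit(X, y)

print(search.best_params_)
print(-search.best_score_)  # mean cross-validated RMSE of the best setting
```

For Random Search, scikit-learn's `RandomizedSearchCV` takes `param_distributions` and `n_iter` instead of an exhaustive grid, which is usually cheaper when the search space is large.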

Feature Selection

Notebooks

Feature selection steps are documented in notebooks/3 - tuning_pipeline.ipynb.

Actions Taken:

Analyzed the 10 most important features of the best model.
Visualized the top features.
Chose the best model to move on to the final pipeline.
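
One way to extract the top features from a tree-based model is through its `feature_importances_` attribute. The sketch below assumes a scikit-learn decision tree and synthetic data with placeholder feature names; it is not the notebook's code.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data with placeholder feature names.
X, y = make_regression(
    n_samples=300, n_features=15, n_informative=5, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

model = DecisionTreeRegressor(max_depth=6, random_state=0).fit(X, y)

# Impurity-based importances; scikit-learn normalizes them to sum to 1.
importances = pd.Series(model.feature_importances_, index=feature_names)
top10 = importances.sort_values(ascending=False).head(10)
print(top10)

# With matplotlib installed, a horizontal bar chart is one way to plot this:
# top10.sort_values().plot.barh()
```

Impurity-based importances can favor high-cardinality features, so they are best read as a rough ranking rather than an exact measure of effect.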

MLOps Pipeline

Steps:

Created a scalable pipeline for model training and deployment.
Automated the data preprocessing, model training, and evaluation process.
Saved the final model/pipeline for deployment.
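
A minimal sketch of such a pipeline, assuming scikit-learn's `Pipeline` and `ColumnTransformer` plus `joblib` for persistence. The column names, the toy training data, and the file name are hypothetical, and the model is written to a temporary directory rather than the repository's `models` folder.

```python
import tempfile
from pathlib import Path

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

numeric = ["sqft", "beds", "baths"]   # hypothetical column names
categorical = ["type"]
features = numeric + categorical

preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", DecisionTreeRegressor(random_state=0)),
])

# Toy training data for illustration only.
train = pd.DataFrame({
    "sqft": [900, 1200, None, 2000, 1500, 1100],
    "beds": [2, 3, 3, 4, 3, 2],
    "baths": [1, 2, 2, 3, 2, 1],
    "type": ["condo", "house", "house", "house", "condo", "condo"],
    "price": [150_000, 220_000, 240_000, 400_000, 280_000, 170_000],
})
pipeline.fit(train[features], train["price"])

# Persist preprocessing and model together, then reload them as one object.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "house_price_pipeline.joblib"
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

new_home = pd.DataFrame(
    [{"sqft": 1300, "beds": 3, "baths": 2, "type": "townhouse"}]
)
print(restored.predict(new_home))
```

Saving the whole pipeline instead of the bare model means the same imputation and encoding are applied at prediction time, including for categories that were not seen during training.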

Results

Best Performing Model: Decision Tree Regressor

Root Mean Squared Error (RMSE): 44555.6
Mean Absolute Error (MAE): 7110.6
R² Score: 0.993517

This repository provides the complete workflow for predicting house prices, from data cleaning to deploying the final model. Each step is documented in detail to ensure reproducibility and ease of understanding.

Challenges

Collaborating with group members using GitHub: our limited experience with GitHub meant more communication between team members was needed, which slowed the work
Encoding tags from a very large list
Target encoding of city and understanding how to prevent data leakage
Hyperparameter tuning and understanding what to change
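
To make the data-leakage concern concrete, here is a minimal pandas sketch of target encoding in which the per-city means are learned from the training rows only and then mapped onto unseen rows. The helper names and toy data are hypothetical, and this is one common approach rather than the method used in the project.

```python
import pandas as pd


def fit_target_encoding(train, column, target):
    """Learn per-category target means from the training rows only."""
    means = train.groupby(column)[target].mean()
    fallback = train[target].mean()  # used for categories unseen in training
    return means, fallback


def apply_target_encoding(df, column, means, fallback):
    """Map categories to the learned means; unseen categories get the fallback."""
    return df[column].map(means).fillna(fallback)


# Toy data for illustration only.
train = pd.DataFrame({
    "city": ["A", "A", "B", "B", "C"],
    "price": [100.0, 200.0, 300.0, 500.0, 250.0],
})
test = pd.DataFrame({"city": ["A", "B", "D"]})

means, fallback = fit_target_encoding(train, "city", "price")
test["city_encoded"] = apply_target_encoding(test, "city", means, fallback)
print(test)
```

Because the test prices never enter the encoding, the test score is not inflated by it. Encoding the training rows with their own targets can still overfit, which is why out-of-fold or smoothed variants are often used on the training split.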

Future Goals

Target encoding of city using appropriate Pandas methods
Refine hyperparameters
Refine pipeline
More EDA and interactive plots for better visualization
Gain more experience with teamwork on GitHub
Draw on existing projects related to ours to improve our models
