17 changes: 17 additions & 0 deletions Evaluation/Dockerfile
@@ -0,0 +1,17 @@
# Use an official Python runtime as a parent image
FROM python:3.9-slim

# Set the working directory in the container
WORKDIR /app

# Copy the current directory contents into the container at /app
COPY . /app

# Install any needed packages specified in requirements.txt
RUN pip install --no-cache-dir -r requirements.txt

# Expose the port Streamlit runs on
EXPOSE 8501

# Command to run the Streamlit app
CMD ["streamlit", "run", "streamlit_app.py", "--server.port=8501", "--server.address=0.0.0.0"]
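
For local testing, a typical build-and-run sequence might look like the following sketch (the `eval-app` image tag is illustrative, not defined anywhere in this PR):

```bash
# Build the image, using the Evaluation directory (which holds this Dockerfile) as context
docker build -t eval-app Evaluation/

# Run the container, publishing the Streamlit port exposed above
docker run -p 8501:8501 eval-app
```

Once the container is up, the app should be reachable at http://localhost:8501.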
43 changes: 43 additions & 0 deletions Evaluation/LR3 datasets/Test Real estate.csv
@@ -0,0 +1,43 @@
No,X1 transaction date,X2 house age,X3 distance to the nearest MRT station,X4 number of convenience stores,X5 latitude,X6 longitude,Y house price of unit area
359,2013.167,1.1,193.5845,6,24.96571,121.54089,45.1
351,2013.0,13.2,492.2313,5,24.96515,121.53737,42.3
374,2013.083,0.0,274.0144,1,24.9748,121.53059,52.2
400,2012.917,12.7,170.1289,1,24.97371,121.52984,37.3
370,2012.667,20.2,2185.128,3,24.96322,121.51237,22.8
73,2013.583,32.5,424.5442,8,24.97587,121.53913,36.3
263,2012.917,15.9,289.3248,5,24.98203,121.54348,53.0
141,2013.25,16.2,289.3248,5,24.98203,121.54348,51.4
94,2012.917,31.9,1146.329,0,24.9492,121.53076,16.1
71,2013.583,6.6,90.45606,9,24.97433,121.5431,59.0
119,2013.5,25.3,1583.722,3,24.96622,121.51709,30.6
132,2013.5,4.0,2147.376,3,24.96299,121.51284,30.7
337,2012.833,5.1,1867.233,2,24.98407,121.51748,35.6
56,2012.833,31.7,1160.632,0,24.94968,121.53009,13.7
127,2013.083,38.6,804.6897,4,24.97838,121.53477,62.9
377,2013.417,14.7,1717.193,2,24.96447,121.51649,30.5
57,2013.417,33.6,371.2495,8,24.97254,121.54059,41.9
292,2012.833,3.4,56.47425,7,24.95744,121.53711,54.4
366,2012.917,17.3,2261.432,4,24.96182,121.51222,29.5
85,2013.083,15.1,383.2805,7,24.96735,121.54464,43.7
117,2013.0,30.9,6396.283,1,24.94375,121.47883,12.2
10,2013.417,17.9,1783.18,3,24.96731,121.51486,22.1
375,2013.25,5.4,390.5684,5,24.97937,121.54245,49.5
138,2013.5,13.6,319.0708,6,24.96495,121.54277,47.4
321,2012.75,13.5,4197.349,0,24.93885,121.50383,18.6
403,2012.833,12.7,187.4823,1,24.97388,121.52981,28.5
232,2012.833,16.2,4074.736,0,24.94235,121.50357,14.7
91,2012.833,0.0,274.0144,1,24.9748,121.53059,45.4
95,2012.917,40.9,167.5989,5,24.9663,121.54026,41.0
174,2013.083,41.3,401.8807,4,24.98326,121.5446,35.1
31,2013.5,25.9,4519.69,0,24.94826,121.49587,22.1
142,2013.333,5.1,1559.827,3,24.97213,121.51627,28.9
105,2012.667,32.7,392.4459,6,24.96398,121.5425,30.5
80,2013.0,18.0,1414.837,1,24.95182,121.54887,26.5
34,2013.25,16.5,323.655,6,24.97841,121.54281,49.3
291,2013.083,37.7,490.3446,0,24.97217,121.53471,37.0
287,2012.917,5.9,90.45606,9,24.97433,121.5431,56.3
410,2013.0,13.7,4082.015,0,24.94155,121.50381,15.4
223,2013.583,30.6,431.1114,10,24.98123,121.53743,48.5
362,2013.083,41.4,281.205,8,24.97345,121.54093,63.3
16,2013.583,35.7,579.2083,2,24.9824,121.54619,50.5
312,2013.167,21.3,537.7971,4,24.97425,121.53814,42.2
373 changes: 373 additions & 0 deletions Evaluation/LR3 datasets/Train Real estate.csv

Large diffs are not rendered by default.

23 changes: 23 additions & 0 deletions Evaluation/README.md
@@ -0,0 +1,23 @@
# Data Science Evaluation Application

This repository contains a Streamlit-based web application developed for a data science evaluation. The application covers two primary tasks:

1. **Real Estate Price Prediction**: A linear regression model is used to predict real estate prices based on features such as transaction date, house age, distance to the nearest MRT station, number of convenience stores, latitude, and longitude.

2. **Time Series Analysis of Household Power Consumption**: A time series model is used to analyze and forecast household power consumption data.

## Summary

The application is designed to allow users to upload their datasets, perform exploratory data analysis (EDA) with interactive visualizations, build and evaluate models, and compare the models' predictions with actual values.

### Key Features:
- **Upload Dataset**: Users can upload their datasets, which are then processed by the application for analysis.
- **EDA and Visualization**: The application provides interactive visualizations using Altair to help users gain insights from their data.
- **Model Building and Evaluation**: For the real estate task, a linear regression model is built and evaluated. For the time series task, a suitable time series model is selected and evaluated.

### Installation and Usage

1. **Clone the Repository**:
```bash
git clone https://github.com/your-username/Data-Science-Evaluation.git
cd Data-Science-Evaluation
```
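
2. **Install Dependencies** (assuming a standard pip workflow; the path follows this PR's layout):
```bash
pip install -r Evaluation/requirements.txt
```

3. **Run the App**: the Dockerfile's CMD references `streamlit_app.py`, which is not among the files added in this diff; to launch the regression demo added here, point Streamlit at `Evaluation/lr3.py` instead:
```bash
streamlit run Evaluation/lr3.py
```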
Binary file added Evaluation/TS1 datasets/TS1 (1).zip
Binary file not shown.
297 changes: 297 additions & 0 deletions Evaluation/lr3.py
@@ -0,0 +1,297 @@
import warnings
warnings.filterwarnings("ignore")

import streamlit as st
import pandas as pd
import numpy as np
import altair as alt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.pipeline import Pipeline
import joblib


def main():
# Title and Description
st.title("Linear Regression for Real Estate Price Prediction")
st.write("""
Welcome to the Real Estate Price Prediction application. This tool is designed to help you explore real estate data, perform in-depth data analysis, and build predictive models to forecast house prices. In this evaluation task, we will walk through the process of data cleaning, exploratory data analysis (EDA), model building, and making predictions with detailed explanations and insights.
""")

# Data Upload Section
st.header("1. Upload Your Dataset")
st.write("""
In this section, you'll upload your training and test datasets. The application will automatically clean and prepare the data for analysis.
""")
train_file = st.file_uploader("Upload the training dataset (CSV)", type=["csv"])
test_file = st.file_uploader("Upload the test dataset (CSV)", type=["csv"])

if train_file is not None and test_file is not None:
# Read the datasets
train_data = pd.read_csv(train_file)
test_data = pd.read_csv(test_file)

# Display the first few rows of the data
st.subheader("Training Data Overview")
st.write("""
Below is a preview of the training data that will be used to build the model. We will analyze the features, clean the data, and prepare it for modeling.
""")
st.write(train_data.head())

st.subheader("Test Data Overview")
st.write("""
Below is a preview of the test data that will be used to evaluate the model's performance. We will ensure this data is consistent with the training data.
""")
st.write(test_data.head())

# Display the column names
st.subheader("Column Names in Training Data")
st.write(train_data.columns)

st.subheader("Column Names in Test Data")
st.write(test_data.columns)

# Identify numeric and non-numeric columns
numeric_columns = train_data.select_dtypes(include=[np.number]).columns.tolist()
non_numeric_columns = train_data.select_dtypes(exclude=[np.number]).columns.tolist()

st.subheader("Numeric Columns in Training Data")
st.write(numeric_columns)

st.subheader("Non-Numeric Columns in Training Data")
st.write(non_numeric_columns)

# Data Cleaning and Feature Engineering
st.header("2. Data Cleaning and Feature Engineering")
st.write("""
In this section, we undertake several crucial steps to prepare the data for effective modeling. Proper data cleaning and feature engineering are fundamental to building a robust predictive model. Here's what we'll do:
- **Handle Missing Values**: Missing data can introduce bias or inaccuracies in the model. We fill missing values with the mean of the respective columns to maintain the integrity of the dataset.
- **Drop Non-Numeric Columns**: Non-numeric columns are excluded from the analysis at this stage to focus on features that directly contribute to the numerical prediction of house prices. This simplification ensures that the model can be trained efficiently.
- **Scale Features**: Scaling standardizes the range of the numeric predictors. Plain least squares is insensitive to feature scale, but the Ridge and Lasso penalties used later are not, so putting all features on a common scale lets regularization treat them evenly.
""")

# Handle non-numeric columns (For now, we'll drop them)
if non_numeric_columns:
st.write(f"**Dropped Non-Numeric Columns:** {non_numeric_columns}")
train_data = train_data.drop(columns=non_numeric_columns)
test_data = test_data.drop(columns=non_numeric_columns)

        # Handle missing values (impute both sets with the training-set means so
        # the test data is treated consistently with the data the model is fit on)
        train_data.fillna(train_data.mean(), inplace=True)
        test_data.fillna(train_data.mean(), inplace=True)

        # Feature Scaling: scale only the predictors. Leaving the target
        # 'Y house price of unit area' in its original units keeps RMSE and the
        # model's predictions directly interpretable as prices.
        feature_columns = [col for col in numeric_columns if col != 'Y house price of unit area']
        scaler = StandardScaler()
        train_data[feature_columns] = scaler.fit_transform(train_data[feature_columns])
        test_data[feature_columns] = scaler.transform(test_data[feature_columns])

st.write("""
**Data Cleaning and Feature Engineering completed.** The following steps have been successfully applied:
- Missing values have been handled to ensure no gaps in the data.
- Non-numeric columns have been dropped, allowing us to focus on the numerical aspects of the dataset.
- All features have been scaled, ensuring they are on a common scale, which is crucial for the accuracy and performance of our regression model.

The data is now pre-processed and ready for the next stage: Exploratory Data Analysis (EDA). This preparation sets a solid foundation for building a reliable and accurate predictive model.
""")

# Exploratory Data Analysis (EDA)
st.header("3. Exploratory Data Analysis (EDA)")
st.write("""
In this section, we explore the relationships and distributions within the dataset. Understanding these patterns helps in making informed decisions during model building.
""")

# Interactive Correlation Heatmap
st.subheader("Correlation Heatmap")
st.write("""
The correlation heatmap below shows the relationships between numeric features. A high correlation (close to 1 or -1) between features can indicate multicollinearity, which we need to address in the modeling stage.
""")
        corr_matrix = train_data[numeric_columns].corr().stack().reset_index()
corr_matrix.columns = ['Feature 1', 'Feature 2', 'Correlation']

heatmap = alt.Chart(corr_matrix).mark_rect().encode(
x='Feature 1:O',
y='Feature 2:O',
color=alt.Color('Correlation:Q', scale=alt.Scale(scheme='blueorange')),
tooltip=['Feature 1', 'Feature 2', 'Correlation']
).properties(
width=600,
height=600
)
st.altair_chart(heatmap, use_container_width=True)

# Interactive Scatter Plots
st.subheader("Pairwise Scatter Plots")
st.write("""
These scatter plots illustrate the relationships between each feature and the target variable, 'Y house price of unit area'. Analyzing these plots helps us understand which features are most influential in predicting house prices.
""")
        for feature in feature_columns:  # each predictor against the target, skipping the target itself
scatter_plot = alt.Chart(train_data).mark_circle(size=60).encode(
x=alt.X(feature, scale=alt.Scale(zero=False)),
y=alt.Y('Y house price of unit area', scale=alt.Scale(zero=False)),
tooltip=[feature, 'Y house price of unit area']
).interactive().properties(
title=f'Scatter plot of {feature} vs Y house price of unit area',
width=600,
height=400
)
st.altair_chart(scatter_plot, use_container_width=True)

# Interactive Histogram
st.subheader("Distribution of Target Variable")
st.write("""
The histogram below shows the distribution of the target variable, 'Y house price of unit area'. This analysis helps us understand the range and skewness of house prices in the dataset.
""")
hist = alt.Chart(train_data).mark_bar().encode(
alt.X('Y house price of unit area:Q', bin=True),
y='count()',
tooltip=['count()']
).properties(
title='Distribution of Y house price of unit area',
width=600,
height=400
).interactive()
st.altair_chart(hist, use_container_width=True)

# Interactive Box Plots
st.subheader("Box Plots of Numeric Features")
st.write("""
The box plots below help in identifying the spread and outliers in the data. Outliers can sometimes distort model predictions and might need to be treated separately.
""")
        for feature in feature_columns:
box_plot = alt.Chart(train_data).mark_boxplot().encode(
x=alt.X('Y house price of unit area:Q'),
y=alt.Y(feature + ':Q'),
tooltip=[feature, 'Y house price of unit area']
).properties(
title=f'Box plot of Y house price of unit area by {feature}',
width=600,
height=400
)
st.altair_chart(box_plot, use_container_width=True)

# Model Building
st.header("4. Model Building")
st.write("""
In this section, we build and evaluate different linear models: standard Linear Regression, Lasso Regression, and Ridge Regression. These models are chosen to explore how regularization techniques (Lasso and Ridge) affect the model's performance.
""")

X_train = train_data.drop(columns=['Y house price of unit area'])
y_train = train_data['Y house price of unit area']

X_test = test_data.drop(columns=['Y house price of unit area'])
y_test = test_data['Y house price of unit area']

X_train_split, X_val, y_train_split, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
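        # The 20% held-out split serves as a validation set for model selection;
        # the uploaded test set is reserved for the final evaluation below.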

# Create Polynomial Features
poly = PolynomialFeatures(degree=2, include_bias=False)
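        # With degree=2, each feature vector is augmented with squares and pairwise
        # products (include_bias=False drops the constant term), letting the linear
        # models below capture simple non-linear effects and feature interactions.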

# Models to Evaluate
models = {
"Linear Regression": LinearRegression(),
"Ridge Regression": Ridge(),
"Lasso Regression": Lasso()
}

# Evaluate each model
results = {}
for name, model in models.items():
pipeline = Pipeline([
('poly_features', poly),
('regression', model)
])

pipeline.fit(X_train_split, y_train_split)
y_val_pred = pipeline.predict(X_val)
mse = mean_squared_error(y_val, y_val_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_val, y_val_pred)

results[name] = {
"RMSE": rmse,
"R-squared": r2
}

# Display the results
st.subheader("Model Evaluation Results")
for model_name, metrics in results.items():
st.write(f"**{model_name}:**")
st.write(f"- Validation RMSE: {metrics['RMSE']:.4f}")
st.write(f"- Validation R-squared: {metrics['R-squared']:.4f}")

# Choose the best model based on RMSE
best_model_name = min(results, key=lambda k: results[k]["RMSE"])
best_model = models[best_model_name]
pipeline = Pipeline([
('poly_features', poly),
('regression', best_model)
])
pipeline.fit(X_train, y_train)

# Save the best model
joblib.dump(pipeline, 'trained_linear_model.pkl')
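        # The persisted pipeline can be restored later with joblib.load('trained_linear_model.pkl').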

# Prediction on the test set
y_test_pred = pipeline.predict(X_test)
test_mse = mean_squared_error(y_test, y_test_pred)
test_rmse = np.sqrt(test_mse)
test_r2 = r2_score(y_test, y_test_pred)

st.write(f"**Test Results for {best_model_name}:**")
st.write(f"- Test RMSE: {test_rmse:.4f}")
st.write(f"- Test R-squared: {test_r2:.4f}")

# Actual vs Predicted
st.header("5. Actual vs Predicted")
st.write("""
The following chart compares the actual house prices to the predicted prices on the test set. This comparison allows us to visually assess the accuracy of our model. A perfect model would have all points lying on the 45-degree line, indicating that the predicted values match the actual values exactly.
""")

actual_vs_predicted = pd.DataFrame({
'Actual': y_test,
'Predicted': y_test_pred
})

scatter_actual_vs_predicted = alt.Chart(actual_vs_predicted).mark_circle(size=60).encode(
x=alt.X('Actual', scale=alt.Scale(zero=False)),
y=alt.Y('Predicted', scale=alt.Scale(zero=False)),
tooltip=['Actual', 'Predicted']
).interactive().properties(
title='Actual vs Predicted House Prices',
width=600,
height=400
)

st.altair_chart(scatter_actual_vs_predicted, use_container_width=True)

st.write("""
The scatter plot above shows the relationship between actual and predicted house prices. The closer the points are to the diagonal line, the better the model's predictions. Deviations from this line indicate discrepancies between the actual and predicted values, which can be further analyzed to improve model performance.
""")

# Prediction Section
st.header("6. Predict House Price")
st.write("""
Use the inputs below to predict the house price per unit area based on the trained model. This feature allows you to experiment with different inputs and see how the model responds.
""")

        house_age = st.number_input("House Age", min_value=0.0, max_value=100.0)
        distance_to_mrt = st.number_input("Distance to MRT Station", min_value=0.0)
        convenience_stores = st.number_input("Number of Convenience Stores", min_value=0)
        latitude = st.number_input("Latitude")
        longitude = st.number_input("Longitude")

        if st.button("Predict House Price"):
            # Start from the training-set means (scaler.mean_, in raw units) so that
            # features the form does not collect ('No', 'X1 transaction date') get sensible defaults.
            input_row = pd.DataFrame([scaler.mean_], columns=feature_columns)
            input_row['X2 house age'] = house_age
            input_row['X3 distance to the nearest MRT station'] = distance_to_mrt
            input_row['X4 number of convenience stores'] = convenience_stores
            input_row['X5 latitude'] = latitude
            input_row['X6 longitude'] = longitude
            # Only the scaler is applied here; the pipeline performs the polynomial expansion itself.
            scaled_row = pd.DataFrame(scaler.transform(input_row), columns=feature_columns)
            prediction = pipeline.predict(scaled_row)
            st.write(f"**Predicted House Price per Unit Area:** {prediction[0]:.2f}")

st.write("""
This section allows you to predict house prices using the model trained earlier. By inputting the relevant features (house age, distance to the nearest MRT station, number of convenience stores nearby, latitude, and longitude), the model will estimate the price per unit area of the house.
""")

if __name__ == "__main__":
main()
8 changes: 8 additions & 0 deletions Evaluation/requirements.txt
@@ -0,0 +1,8 @@
streamlit
pandas
numpy
matplotlib
seaborn
altair
scikit-learn
joblib