17 changes: 17 additions & 0 deletions Evaluation/Dockerfile
@@ -0,0 +1,17 @@
# Use an official Python runtime as a parent image
FROM python:3.9-slim

# Set the working directory in the container
WORKDIR /app

# Copy the current directory contents into the container at /app
COPY . /app

# Install any needed packages specified in requirements.txt
RUN pip install --no-cache-dir -r requirements.txt

# Expose the port Streamlit runs on
EXPOSE 8501

# Command to run the Streamlit app
CMD ["streamlit", "run", "streamlit_app.py", "--server.port=8501", "--server.address=0.0.0.0"]
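
For local testing, a typical build-and-run sequence might look like the following sketch (the `eval-app` image tag is illustrative, not defined anywhere in this PR):

```bash
# Build the image, using the Evaluation directory (which holds this Dockerfile) as context
docker build -t eval-app Evaluation/

# Run the container, publishing the Streamlit port exposed above
docker run -p 8501:8501 eval-app
```

Once the container is up, the app should be reachable at http://localhost:8501.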
43 changes: 43 additions & 0 deletions Evaluation/LR3 datasets/Test Real estate.csv
@@ -0,0 +1,43 @@
No,X1 transaction date,X2 house age,X3 distance to the nearest MRT station,X4 number of convenience stores,X5 latitude,X6 longitude,Y house price of unit area
359,2013.167,1.1,193.5845,6,24.96571,121.54089,45.1
351,2013.0,13.2,492.2313,5,24.96515,121.53737,42.3
374,2013.083,0.0,274.0144,1,24.9748,121.53059,52.2
400,2012.917,12.7,170.1289,1,24.97371,121.52984,37.3
370,2012.667,20.2,2185.128,3,24.96322,121.51237,22.8
73,2013.583,32.5,424.5442,8,24.97587,121.53913,36.3
263,2012.917,15.9,289.3248,5,24.98203,121.54348,53.0
141,2013.25,16.2,289.3248,5,24.98203,121.54348,51.4
94,2012.917,31.9,1146.329,0,24.9492,121.53076,16.1
71,2013.583,6.6,90.45606,9,24.97433,121.5431,59.0
119,2013.5,25.3,1583.722,3,24.96622,121.51709,30.6
132,2013.5,4.0,2147.376,3,24.96299,121.51284,30.7
337,2012.833,5.1,1867.233,2,24.98407,121.51748,35.6
56,2012.833,31.7,1160.632,0,24.94968,121.53009,13.7
127,2013.083,38.6,804.6897,4,24.97838,121.53477,62.9
377,2013.417,14.7,1717.193,2,24.96447,121.51649,30.5
57,2013.417,33.6,371.2495,8,24.97254,121.54059,41.9
292,2012.833,3.4,56.47425,7,24.95744,121.53711,54.4
366,2012.917,17.3,2261.432,4,24.96182,121.51222,29.5
85,2013.083,15.1,383.2805,7,24.96735,121.54464,43.7
117,2013.0,30.9,6396.283,1,24.94375,121.47883,12.2
10,2013.417,17.9,1783.18,3,24.96731,121.51486,22.1
375,2013.25,5.4,390.5684,5,24.97937,121.54245,49.5
138,2013.5,13.6,319.0708,6,24.96495,121.54277,47.4
321,2012.75,13.5,4197.349,0,24.93885,121.50383,18.6
403,2012.833,12.7,187.4823,1,24.97388,121.52981,28.5
232,2012.833,16.2,4074.736,0,24.94235,121.50357,14.7
91,2012.833,0.0,274.0144,1,24.9748,121.53059,45.4
95,2012.917,40.9,167.5989,5,24.9663,121.54026,41.0
174,2013.083,41.3,401.8807,4,24.98326,121.5446,35.1
31,2013.5,25.9,4519.69,0,24.94826,121.49587,22.1
142,2013.333,5.1,1559.827,3,24.97213,121.51627,28.9
105,2012.667,32.7,392.4459,6,24.96398,121.5425,30.5
80,2013.0,18.0,1414.837,1,24.95182,121.54887,26.5
34,2013.25,16.5,323.655,6,24.97841,121.54281,49.3
291,2013.083,37.7,490.3446,0,24.97217,121.53471,37.0
287,2012.917,5.9,90.45606,9,24.97433,121.5431,56.3
410,2013.0,13.7,4082.015,0,24.94155,121.50381,15.4
223,2013.583,30.6,431.1114,10,24.98123,121.53743,48.5
362,2013.083,41.4,281.205,8,24.97345,121.54093,63.3
16,2013.583,35.7,579.2083,2,24.9824,121.54619,50.5
312,2013.167,21.3,537.7971,4,24.97425,121.53814,42.2
373 changes: 373 additions & 0 deletions Evaluation/LR3 datasets/Train Real estate.csv

Large diffs are not rendered by default.

23 changes: 23 additions & 0 deletions Evaluation/README.md
@@ -0,0 +1,23 @@
# Data Science Evaluation Application

This repository contains a Streamlit-based web application developed for a data science evaluation. The application covers two primary tasks:

1. **Real Estate Price Prediction**: A linear regression model is used to predict real estate prices based on features such as transaction date, house age, distance to the nearest MRT station, number of convenience stores, latitude, and longitude.

2. **Time Series Analysis of Household Power Consumption**: A time series model is used to analyze and forecast household power consumption data.

## Summary

The application is designed to allow users to upload their datasets, perform exploratory data analysis (EDA) with interactive visualizations, build and evaluate models, and compare the models' predictions with actual values.

### Key Features:
- **Upload Dataset**: Users can upload their datasets, which are then processed by the application for analysis.
- **EDA and Visualization**: The application provides interactive visualizations using Altair to help users gain insights from their data.
- **Model Building and Evaluation**: For the real estate task, a linear regression model is built and evaluated. For the time series task, a suitable time series model is selected and evaluated.

### Installation and Usage

1. **Clone the Repository**:
```bash
git clone https://github.com/your-username/Data-Science-Evaluation.git
cd Data-Science-Evaluation
```
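
2. **Install Dependencies** (assuming a standard pip workflow; the path follows this PR's layout):
```bash
pip install -r Evaluation/requirements.txt
```

3. **Run the App**: the Dockerfile's CMD references `streamlit_app.py`, which is not among the files added in this diff; to launch the regression demo added here, point Streamlit at `Evaluation/lr3.py` instead:
```bash
streamlit run Evaluation/lr3.py
```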
Binary file added Evaluation/TS1 datasets/TS1 (1).zip
Binary file not shown.
297 changes: 297 additions & 0 deletions Evaluation/lr3.py
@@ -0,0 +1,297 @@
import warnings
warnings.filterwarnings("ignore")

import streamlit as st
import pandas as pd
import numpy as np
import altair as alt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.pipeline import Pipeline
import joblib


def main():
# Title and Description
st.title("Linear Regression for Real Estate Price Prediction")
st.write("""
Welcome to the Real Estate Price Prediction application. This tool is designed to help you explore real estate data, perform in-depth data analysis, and build predictive models to forecast house prices. In this evaluation task, we will walk through the process of data cleaning, exploratory data analysis (EDA), model building, and making predictions with detailed explanations and insights.
""")

# Data Upload Section
st.header("1. Upload Your Dataset")
st.write("""
In this section, you'll upload your training and test datasets. The application will automatically clean and prepare the data for analysis.
""")
train_file = st.file_uploader("Upload the training dataset (CSV)", type=["csv"])
test_file = st.file_uploader("Upload the test dataset (CSV)", type=["csv"])

if train_file is not None and test_file is not None:
# Read the datasets
train_data = pd.read_csv(train_file)
test_data = pd.read_csv(test_file)

# Display the first few rows of the data
st.subheader("Training Data Overview")
st.write("""
Below is a preview of the training data that will be used to build the model. We will analyze the features, clean the data, and prepare it for modeling.
""")
st.write(train_data.head())

st.subheader("Test Data Overview")
st.write("""
Below is a preview of the test data that will be used to evaluate the model's performance. We will ensure this data is consistent with the training data.
""")
st.write(test_data.head())

# Display the column names
st.subheader("Column Names in Training Data")
st.write(train_data.columns)

st.subheader("Column Names in Test Data")
st.write(test_data.columns)

# Identify numeric and non-numeric columns
numeric_columns = train_data.select_dtypes(include=[np.number]).columns.tolist()
non_numeric_columns = train_data.select_dtypes(exclude=[np.number]).columns.tolist()

st.subheader("Numeric Columns in Training Data")
st.write(numeric_columns)

st.subheader("Non-Numeric Columns in Training Data")
st.write(non_numeric_columns)

# Data Cleaning and Feature Engineering
st.header("2. Data Cleaning and Feature Engineering")
st.write("""
In this section, we undertake several crucial steps to prepare the data for effective modeling. Proper data cleaning and feature engineering are fundamental to building a robust predictive model. Here's what we'll do:
- **Handle Missing Values**: Missing data can introduce bias or inaccuracies in the model. We fill missing values with the mean of the respective columns to maintain the integrity of the dataset.
- **Drop Non-Numeric Columns**: Non-numeric columns are excluded from the analysis at this stage to focus on features that directly contribute to the numerical prediction of house prices. This simplification ensures that the model can be trained efficiently.
- **Scale Features**: Scaling standardizes the range of the numeric predictors. Plain least squares is insensitive to feature scale, but the Ridge and Lasso penalties used later are not, so putting all features on a common scale lets regularization treat them evenly.
""")

# Handle non-numeric columns (For now, we'll drop them)
if non_numeric_columns:
st.write(f"**Dropped Non-Numeric Columns:** {non_numeric_columns}")
train_data = train_data.drop(columns=non_numeric_columns)
test_data = test_data.drop(columns=non_numeric_columns)

        # Handle missing values (impute both sets with the training-set means so
        # the test data is treated consistently with the data the model is fit on)
        train_data.fillna(train_data.mean(), inplace=True)
        test_data.fillna(train_data.mean(), inplace=True)

        # Feature Scaling: scale only the predictors. Leaving the target
        # 'Y house price of unit area' in its original units keeps RMSE and the
        # model's predictions directly interpretable as prices.
        feature_columns = [col for col in numeric_columns if col != 'Y house price of unit area']
        scaler = StandardScaler()
        train_data[feature_columns] = scaler.fit_transform(train_data[feature_columns])
        test_data[feature_columns] = scaler.transform(test_data[feature_columns])

st.write("""
**Data Cleaning and Feature Engineering completed.** The following steps have been successfully applied:
- Missing values have been handled to ensure no gaps in the data.
- Non-numeric columns have been dropped, allowing us to focus on the numerical aspects of the dataset.
- All features have been scaled, ensuring they are on a common scale, which is crucial for the accuracy and performance of our regression model.

The data is now pre-processed and ready for the next stage: Exploratory Data Analysis (EDA). This preparation sets a solid foundation for building a reliable and accurate predictive model.
""")

# Exploratory Data Analysis (EDA)
st.header("3. Exploratory Data Analysis (EDA)")
st.write("""
In this section, we explore the relationships and distributions within the dataset. Understanding these patterns helps in making informed decisions during model building.
""")

# Interactive Correlation Heatmap
st.subheader("Correlation Heatmap")
st.write("""
The correlation heatmap below shows the relationships between numeric features. A high correlation (close to 1 or -1) between features can indicate multicollinearity, which we need to address in the modeling stage.
""")
        corr_matrix = train_data[numeric_columns].corr().stack().reset_index()
corr_matrix.columns = ['Feature 1', 'Feature 2', 'Correlation']

heatmap = alt.Chart(corr_matrix).mark_rect().encode(
x='Feature 1:O',
y='Feature 2:O',
color=alt.Color('Correlation:Q', scale=alt.Scale(scheme='blueorange')),
tooltip=['Feature 1', 'Feature 2', 'Correlation']
).properties(
width=600,
height=600
)
st.altair_chart(heatmap, use_container_width=True)

# Interactive Scatter Plots
st.subheader("Pairwise Scatter Plots")
st.write("""
These scatter plots illustrate the relationships between each feature and the target variable, 'Y house price of unit area'. Analyzing these plots helps us understand which features are most influential in predicting house prices.
""")
        for feature in feature_columns:  # each predictor against the target, skipping the target itself
scatter_plot = alt.Chart(train_data).mark_circle(size=60).encode(
x=alt.X(feature, scale=alt.Scale(zero=False)),
y=alt.Y('Y house price of unit area', scale=alt.Scale(zero=False)),
tooltip=[feature, 'Y house price of unit area']
).interactive().properties(
title=f'Scatter plot of {feature} vs Y house price of unit area',
width=600,
height=400
)
st.altair_chart(scatter_plot, use_container_width=True)

# Interactive Histogram
st.subheader("Distribution of Target Variable")
st.write("""
The histogram below shows the distribution of the target variable, 'Y house price of unit area'. This analysis helps us understand the range and skewness of house prices in the dataset.
""")
hist = alt.Chart(train_data).mark_bar().encode(
alt.X('Y house price of unit area:Q', bin=True),
y='count()',
tooltip=['count()']
).properties(
title='Distribution of Y house price of unit area',
width=600,
height=400
).interactive()
st.altair_chart(hist, use_container_width=True)

# Interactive Box Plots
st.subheader("Box Plots of Numeric Features")
st.write("""
The box plots below help in identifying the spread and outliers in the data. Outliers can sometimes distort model predictions and might need to be treated separately.
""")
        for feature in feature_columns:
box_plot = alt.Chart(train_data).mark_boxplot().encode(
x=alt.X('Y house price of unit area:Q'),
y=alt.Y(feature + ':Q'),
tooltip=[feature, 'Y house price of unit area']
).properties(
title=f'Box plot of Y house price of unit area by {feature}',
width=600,
height=400
)
st.altair_chart(box_plot, use_container_width=True)

# Model Building
st.header("4. Model Building")
st.write("""
In this section, we build and evaluate different linear models: standard Linear Regression, Lasso Regression, and Ridge Regression. These models are chosen to explore how regularization techniques (Lasso and Ridge) affect the model's performance.
""")

X_train = train_data.drop(columns=['Y house price of unit area'])
y_train = train_data['Y house price of unit area']

X_test = test_data.drop(columns=['Y house price of unit area'])
y_test = test_data['Y house price of unit area']

X_train_split, X_val, y_train_split, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
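        # The 20% held-out split serves as a validation set for model selection;
        # the uploaded test set is reserved for the final evaluation below.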

# Create Polynomial Features
poly = PolynomialFeatures(degree=2, include_bias=False)
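        # With degree=2, each feature vector is augmented with squares and pairwise
        # products (include_bias=False drops the constant term), letting the linear
        # models below capture simple non-linear effects and feature interactions.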

# Models to Evaluate
models = {
"Linear Regression": LinearRegression(),
"Ridge Regression": Ridge(),
"Lasso Regression": Lasso()
}

# Evaluate each model
results = {}
for name, model in models.items():
pipeline = Pipeline([
('poly_features', poly),
('regression', model)
])

pipeline.fit(X_train_split, y_train_split)
y_val_pred = pipeline.predict(X_val)
mse = mean_squared_error(y_val, y_val_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_val, y_val_pred)

results[name] = {
"RMSE": rmse,
"R-squared": r2
}

# Display the results
st.subheader("Model Evaluation Results")
for model_name, metrics in results.items():
st.write(f"**{model_name}:**")
st.write(f"- Validation RMSE: {metrics['RMSE']:.4f}")
st.write(f"- Validation R-squared: {metrics['R-squared']:.4f}")

# Choose the best model based on RMSE
best_model_name = min(results, key=lambda k: results[k]["RMSE"])
best_model = models[best_model_name]
pipeline = Pipeline([
('poly_features', poly),
('regression', best_model)
])
pipeline.fit(X_train, y_train)

# Save the best model
joblib.dump(pipeline, 'trained_linear_model.pkl')
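        # The persisted pipeline can be restored later with joblib.load('trained_linear_model.pkl').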

# Prediction on the test set
y_test_pred = pipeline.predict(X_test)
test_mse = mean_squared_error(y_test, y_test_pred)
test_rmse = np.sqrt(test_mse)
test_r2 = r2_score(y_test, y_test_pred)

st.write(f"**Test Results for {best_model_name}:**")
st.write(f"- Test RMSE: {test_rmse:.4f}")
st.write(f"- Test R-squared: {test_r2:.4f}")

# Actual vs Predicted
st.header("5. Actual vs Predicted")
st.write("""
The following chart compares the actual house prices to the predicted prices on the test set. This comparison allows us to visually assess the accuracy of our model. A perfect model would have all points lying on the 45-degree line, indicating that the predicted values match the actual values exactly.
""")

actual_vs_predicted = pd.DataFrame({
'Actual': y_test,
'Predicted': y_test_pred
})

scatter_actual_vs_predicted = alt.Chart(actual_vs_predicted).mark_circle(size=60).encode(
x=alt.X('Actual', scale=alt.Scale(zero=False)),
y=alt.Y('Predicted', scale=alt.Scale(zero=False)),
tooltip=['Actual', 'Predicted']
).interactive().properties(
title='Actual vs Predicted House Prices',
width=600,
height=400
)

st.altair_chart(scatter_actual_vs_predicted, use_container_width=True)

st.write("""
The scatter plot above shows the relationship between actual and predicted house prices. The closer the points are to the diagonal line, the better the model's predictions. Deviations from this line indicate discrepancies between the actual and predicted values, which can be further analyzed to improve model performance.
""")

# Prediction Section
st.header("6. Predict House Price")
st.write("""
Use the inputs below to predict the house price per unit area based on the trained model. This feature allows you to experiment with different inputs and see how the model responds.
""")

        house_age = st.number_input("House Age", min_value=0.0, max_value=100.0)
        distance_to_mrt = st.number_input("Distance to MRT Station", min_value=0.0)
        convenience_stores = st.number_input("Number of Convenience Stores", min_value=0)
        latitude = st.number_input("Latitude")
        longitude = st.number_input("Longitude")

        if st.button("Predict House Price"):
            # Start from the training-set means (scaler.mean_, in raw units) so that
            # features the form does not collect ('No', 'X1 transaction date') get sensible defaults.
            input_row = pd.DataFrame([scaler.mean_], columns=feature_columns)
            input_row['X2 house age'] = house_age
            input_row['X3 distance to the nearest MRT station'] = distance_to_mrt
            input_row['X4 number of convenience stores'] = convenience_stores
            input_row['X5 latitude'] = latitude
            input_row['X6 longitude'] = longitude
            # Only the scaler is applied here; the pipeline performs the polynomial expansion itself.
            scaled_row = pd.DataFrame(scaler.transform(input_row), columns=feature_columns)
            prediction = pipeline.predict(scaled_row)
            st.write(f"**Predicted House Price per Unit Area:** {prediction[0]:.2f}")

st.write("""
This section allows you to predict house prices using the model trained earlier. By inputting the relevant features (house age, distance to the nearest MRT station, number of convenience stores nearby, latitude, and longitude), the model will estimate the price per unit area of the house.
""")

if __name__ == "__main__":
main()
8 changes: 8 additions & 0 deletions Evaluation/requirements.txt
@@ -0,0 +1,8 @@
streamlit
pandas
numpy
matplotlib
seaborn
altair
scikit-learn
joblib