Merge pull request #418 from SimranShaikh20/data_analysis

Data analysis: added EDA enhancements to the Streamlit app.py file
recodehive · Oct 27, 2024 · 7ed6137
2 parents 0d70e89 + e40388f
commit 7ed6137
Showing 3 changed files with 786 additions and 3 deletions.
diff --git a/OpenSourceEda.ipynb b/OpenSourceEda.ipynb
diff --git a/opensource_analysis/README b/opensource_analysis/README
@@ -18,3 +18,82 @@ streamlit run app.py
 
 ## Access the App
 Open the URL http://localhost:8501 in your web browser to access the Streamlit app
+
+
+# Survey Data EDA and Machine Learning App
+
+This repository contains an application built using **Streamlit** to explore and analyze survey data from developers. The app performs **Exploratory Data Analysis (EDA)** and includes visualizations of key data features. Additionally, it can be extended to support **machine learning** tasks like prediction.
+
+## Table of Contents
+- [Overview](#overview)
+- [Installation](#installation)
+- [Dataset](#dataset)
+- [Features](#features)
+  - [1. Data Loading](#1-data-loading)
+  - [2. Basic Information](#2-basic-information)
+  - [3. Categorical Value Counts](#3-categorical-value-counts)
+  - [4. Visualizations](#4-visualizations)
+  - [5. Correlation Heatmap](#5-correlation-heatmap)
+  - [6. Cumulative Distribution](#6-cumulative-distribution)
+- [Usage](#usage)
+- [Future Improvements](#future-improvements)
+- [License](#license)
+
+## Overview
+This project provides an **interactive web-based application** that allows users to explore a dataset of developer survey results. The app is built using **Streamlit** and includes several exploratory data analysis features, such as visualizing distributions of different variables (e.g., salary, job satisfaction, age). It also displays the relationship between various factors, such as job satisfaction and company size, and can be extended to machine learning tasks.
+
+## Dataset
+The dataset used in this project is a sample from the **2018 Developer Survey Results**. It contains various columns such as:
+- `Country`: Respondent's country.
+- `Employment`: Employment status of the respondent.
+- `ConvertedSalary`: Salary converted into USD.
+- `DevType`: Developer types (e.g., web developer, data scientist).
+- `LanguageWorkedWith`: Programming languages the respondent has worked with.
+- `CompanySize`: Size of the company the respondent works for.
+- `JobSatisfaction`: Job satisfaction rating on a scale.
+- `CareerSatisfaction`: Career satisfaction rating.
+
+## Features
+
+### 1. Data Loading
+The application loads the dataset (a CSV file) and fills in missing values where necessary. If the file is not found, the app displays an error message.
+
+### 2. Basic Information
+Displays essential information about the dataset, including:
+- General structure of the data (`df.info()`).
+- Descriptive statistics (`df.describe()`).
+
+### 3. Categorical Value Counts
+For the categorical columns (`Country`, `Employment`, `DevType`, `LanguageWorkedWith`), the app shows the distribution of values using value counts and percentages.
+
+### 4. Visualizations
+The app provides the following visualizations to explore the data:
+- **Salary Distribution**: A histogram with kernel density estimation (KDE) to visualize salary distribution.
+- **Job Satisfaction Analysis**: Bar charts for `JobSatisfaction` and `CareerSatisfaction`.
+- **Programming Languages**: The top 10 most-used programming languages among respondents.
+- **Job Satisfaction by Company Size**: A box plot showing the relationship between company size and job satisfaction.
+- **Age Distribution**: A histogram with KDE to show the age distribution of respondents.
+- **Country Distribution**: A line plot showing the top 10 countries by the number of respondents.
+- **Employment Status**: A pie chart showing the employment status distribution.
+- **Database Usage**: A bar chart of the top 10 databases used by respondents.
+- **Job Satisfaction by Gender**: A bar chart comparing job satisfaction across genders.
+
+### 5. Correlation Heatmap
+Displays a heatmap showing the correlation between numerical variables in the dataset.
+
+### 6. Cumulative Distribution
+Provides an **Empirical Cumulative Distribution Function (ECDF)** plot for the first numerical column in the dataset.
+
+## Usage
+After launching the app:
+1. The app loads the dataset and displays key information and visualizations on the home page.
+2. Navigate through the sections to explore different parts of the dataset interactively.
+3. The app is designed to be modular, allowing for future extensions, such as adding machine learning models for prediction tasks.
+
+## Future Improvements
+- Implement a **machine learning model** to predict job satisfaction or salary based on features like `Country`, `Employment`, `DevType`, etc.
+- Enhance the **EDA** with more detailed visualizations and insights.
+- Allow users to **upload their own dataset** for customized analysis.
+
+## License
+This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for more details.
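Note on the README above: sections 3 and 4 describe value counts with percentages and "top 10" charts for semicolon-separated columns such as `LanguageWorkedWith`. The following is a minimal sketch of both steps in pandas, not the code from this PR. The helper names `value_counts_with_percentages` and `top_multi_valued` and the sample rows are hypothetical and exist only for illustration.

```python
import pandas as pd


def value_counts_with_percentages(series: pd.Series) -> pd.DataFrame:
    """Count each distinct value (missing values included) and its share in percent."""
    counts = series.value_counts(dropna=False)
    percentages = (counts / counts.sum() * 100).round(1)
    return pd.DataFrame({"count": counts, "percent": percentages})


def top_multi_valued(series: pd.Series, sep: str = ";", n: int = 10) -> pd.Series:
    """Split a delimiter-separated column into one row per item, then count items."""
    items = series.dropna().str.split(sep).explode().str.strip()
    return items.value_counts().head(n)


# Hypothetical sample rows; the real app reads the survey CSV instead.
sample = pd.DataFrame(
    {
        "Employment": ["Full-time", "Full-time", "Part-time", None],
        "LanguageWorkedWith": ["Python;SQL", "Python;JavaScript", "SQL", None],
    }
)

summary = value_counts_with_percentages(sample["Employment"])
top_languages = top_multi_valued(sample["LanguageWorkedWith"], n=2)
```

In a Streamlit page, `summary` could be passed to `st.dataframe` and `top_languages` plotted with `.plot(kind='bar', ax=ax)`, as the diff below does for its own counts.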
diff --git a/opensource_analysis/app.py b/opensource_analysis/app.py
@@ -64,8 +64,8 @@
 
         # Evaluate the model
         y_pred = model.predict(X_test)
-        classification_rep = classification_report(y_test, y_pred)
-        roc_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
+        classification_rep = classification_report(y_test, y_pred, zero_division=1)
+        roc_auc = roc_auc_score(pd.get_dummies(y_test).values[:, 1], model.predict_proba(X_test)[:, 1])
 
         # Get feature importance
         importances = model.named_steps['classifier'].feature_importances_
@@ -94,7 +94,7 @@
 
         # Plot ROC Curve
         st.header('ROC Curve')
-        y_test_binary = y_test.map({'No': 0, 'Yes': 1})
+        y_test_binary = pd.get_dummies(y_test).values[:, 1]  # Convert to binary
         fpr, tpr, _ = roc_curve(y_test_binary, model.predict_proba(X_test)[:, 1])
         roc_auc = auc(fpr, tpr)
         fig, ax = plt.subplots()
@@ -151,5 +151,113 @@
             except Exception as e:
                 st.error(f"An error occurred during prediction: {e}")
 
+        # ================== EDA Enhancements ==================
+        st.header('Enhanced Exploratory Data Analysis (EDA)')
+
+        # Load full dataset for EDA
+        eda_data = pd.read_csv(file_path)
+
+        # Salary Analysis
+        st.subheader('Salary Distribution')
+        eda_data['ConvertedSalary'] = pd.to_numeric(eda_data['ConvertedSalary'], errors='coerce')
+        fig, ax = plt.subplots()
+        sns.histplot(eda_data['ConvertedSalary'].dropna(), kde=True, ax=ax)
+        ax.set_title('Distribution of Salaries')
+        ax.set_xlabel('Salary (USD)')
+        st.pyplot(fig)
+
+        # Job Satisfaction Analysis
+        satisfaction_cols = ['JobSatisfaction', 'CareerSatisfaction']
+        for col in satisfaction_cols:
+            st.subheader(f'Distribution of {col}')
+            fig, ax = plt.subplots()
+            eda_data[col].value_counts().plot(kind='bar', ax=ax)
+            ax.set_title(f'Distribution of {col}')
+            ax.set_xlabel('Satisfaction Level')
+            ax.set_ylabel('Count')
+            st.pyplot(fig)
+
+        # Programming Languages Analysis
+        st.subheader('Top 10 Programming Languages')
+        languages = eda_data['LanguageWorkedWith'].str.split(';', expand=True).stack()
+        fig, ax = plt.subplots()
+        languages.value_counts().head(10).plot(kind='bar', ax=ax)
+        ax.set_title('Top 10 Programming Languages')
+        ax.set_xlabel('Language')
+        ax.set_ylabel('Count')
+        st.pyplot(fig)
+
+        # Job Satisfaction by Company Size
+        st.subheader('Job Satisfaction by Company Size')
+        fig, ax = plt.subplots()
+        sns.boxplot(x='CompanySize', y='JobSatisfaction', data=eda_data, ax=ax)
+        ax.set_title('Job Satisfaction by Company Size')
+        ax.set_xlabel('Company Size')
+        ax.set_ylabel('Job Satisfaction')
+        st.pyplot(fig)
+
+        # Age Distribution
+        st.subheader('Age Distribution of Respondents')
+        fig, ax = plt.subplots()
+        sns.histplot(eda_data['Age'], kde=True, ax=ax)
+        ax.set_title('Age Distribution of Respondents')
+        ax.set_xlabel('Age')
+        st.pyplot(fig)
+
+        # Top 10 Countries of Respondents
+        st.subheader('Top 10 Countries of Respondents')
+        country_counts = eda_data['Country'].value_counts().head(10)
+        fig, ax = plt.subplots()
+        ax.plot(country_counts.index, country_counts.values, marker='o')
+        ax.set_title('Top 10 Countries of Respondents')
+        ax.set_xlabel('Country')
+        ax.set_ylabel('Number of Respondents')
+        st.pyplot(fig)
+
+        # Employment Status Distribution
+        st.header("Employment Status Distribution")
+        employment_counts = eda_data['Employment'].value_counts()
+        fig, ax = plt.subplots()
+        ax.pie(employment_counts.values, labels=employment_counts.index, autopct='%1.1f%%')
+        ax.set_title('Employment Status Distribution')
+        ax.axis('equal')
+        st.pyplot(fig)
+
+        # Databases Used
+        st.header("Top 10 Databases Used")
+        databases = eda_data['DatabaseWorkedWith'].str.split(';', expand=True).stack()
+        db_counts = databases.value_counts().head(10)
+        fig, ax = plt.subplots()
+        db_counts.plot(kind='barh', ax=ax)
+        ax.set_xlabel('Number of Users')
+        ax.set_ylabel('Database')
+        st.pyplot(fig)
+
+        # Job Satisfaction by Gender
+        st.header("Job Satisfaction by Gender")
+        job_sat_gender = pd.crosstab(eda_data['JobSatisfaction'], eda_data['Gender'])
+        fig, ax = plt.subplots()
+        job_sat_gender.plot(kind='bar', ax=ax)
+        ax.set_title('Job Satisfaction by Gender')
+        ax.set_xlabel('Job Satisfaction Level')
+        st.pyplot(fig)
+
+        # Correlation Heatmap
+        st.header("Correlation Heatmap of Numeric Variables")
+        numeric_columns = eda_data.select_dtypes(include=['int64', 'float64']).columns
+        fig, ax = plt.subplots()
+        sns.heatmap(eda_data[numeric_columns].corr(), annot=True, cmap='coolwarm', ax=ax)
+        ax.set_title('Correlation Heatmap of Numeric Variables')
+        st.pyplot(fig)
+
+        # Cumulative Distribution
+        st.header(f"Cumulative Distribution of {numeric_columns[0]}")
+        fig, ax = plt.subplots()
+        sns.ecdfplot(data=eda_data, x=numeric_columns[0], ax=ax)
+        ax.set_title(f'Cumulative Distribution of {numeric_columns[0]}')
+        ax.set_xlabel(numeric_columns[0])
+        ax.set_ylabel('Cumulative Proportion')
+        st.pyplot(fig)
+
     except Exception as e:
         st.error(f"An error occurred while loading data: {e}")
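Note on the ROC changes in the app.py diff: `pd.get_dummies(y_test).values[:, 1]` selects the second label in sorted order, which lines up with `predict_proba(...)[:, 1]` only while both classes are present in `y_test`. If a test split contains a single class, the `[:, 1]` index fails. The sketch below shows one way to make the positive class explicit. It assumes the labels are `'No'` and `'Yes'`, as in the mapping this PR removed. The helper names `binarize_labels` and `positive_class_column` are hypothetical and are not part of the PR.

```python
import pandas as pd


def binarize_labels(y: pd.Series, positive_label: str = "Yes") -> pd.Series:
    """Return 1 where y equals positive_label and 0 elsewhere.

    Raises ValueError when only one class is present, because ROC AUC
    is undefined in that case.
    """
    y_binary = (y == positive_label).astype(int)
    if y_binary.nunique() < 2:
        raise ValueError("ROC AUC is undefined when only one class is present")
    return y_binary


def positive_class_column(classes, positive_label: str = "Yes") -> int:
    """Index of positive_label within a fitted classifier's classes_ array."""
    return list(classes).index(positive_label)


# Hypothetical labels standing in for y_test.
y_test = pd.Series(["No", "Yes", "Yes", "No"])
y_test_binary = binarize_labels(y_test)

# With a fitted scikit-learn classifier this would be its classes_ attribute.
column = positive_class_column(["No", "Yes"])
```

With the pipeline from the diff, the scores would then be `model.predict_proba(X_test)[:, column]`, with `classes` taken from `model.named_steps['classifier'].classes_`, so the label vector and the probability column are tied to the same class by name instead of by position.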