Merge pull request #418 from SimranShaikh20/data_analysis
Data analysis added eda enhancement to streamlit app.py file
sanjay-kv authored Oct 27, 2024
2 parents 0d70e89 + e40388f commit 7ed6137
Showing 3 changed files with 786 additions and 3 deletions.
596 changes: 596 additions & 0 deletions OpenSourceEda.ipynb

Large diffs are not rendered by default.

79 changes: 79 additions & 0 deletions opensource_analysis/README
@@ -18,3 +18,82 @@ streamlit run app.py

## Access the App
Open the URL http://localhost:8501 in your web browser to access the Streamlit app


# Survey Data EDA and Machine Learning App

This repository contains an application built using **Streamlit** to explore and analyze survey data from developers. The app performs **Exploratory Data Analysis (EDA)** and includes visualizations of key data features. Additionally, it can be extended to support **machine learning** tasks like prediction.

## Table of Contents
- [Overview](#overview)
- [Installation](#installation)
- [Dataset](#dataset)
- [Features](#features)
  - [1. Data Loading](#1-data-loading)
  - [2. Basic Information](#2-basic-information)
  - [3. Categorical Value Counts](#3-categorical-value-counts)
  - [4. Visualizations](#4-visualizations)
  - [5. Correlation Heatmap](#5-correlation-heatmap)
  - [6. Cumulative Distribution](#6-cumulative-distribution)
- [Usage](#usage)
- [Future Improvements](#future-improvements)
- [License](#license)

## Overview
This project provides an **interactive web-based application** that allows users to explore a dataset of developer survey results. The app is built using **Streamlit** and includes several exploratory data analysis features, such as visualizing distributions of different variables (e.g., salary, job satisfaction, age). It also displays the relationship between various factors, such as job satisfaction and company size, and can be extended to machine learning tasks.

## Dataset
The dataset used in this project is a sample from the **2018 Developer Survey Results**. It contains various columns such as:
- `Country`: Respondent's country.
- `Employment`: Employment status of the respondent.
- `ConvertedSalary`: Salary converted into USD.
- `DevType`: Developer types (e.g., web developer, data scientist).
- `LanguageWorkedWith`: Programming languages the respondent has worked with.
- `CompanySize`: Size of the company the respondent works for.
- `JobSatisfaction`: Job satisfaction rating on a scale.
- `CareerSatisfaction`: Career satisfaction rating.

## Features

### 1. Data Loading
The application loads the dataset from a CSV file and fills in missing values where necessary. If the file is not found, the app displays an error message.
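
A minimal sketch of this loading step, assuming a hypothetical helper `load_survey` and one possible fill strategy (median for numeric columns, the placeholder `"Unknown"` for everything else). The README does not say which strategy the app actually uses, so treat both as illustrative. In the app, the `None` branch would correspond to showing the error message.

```python
import io
from typing import Optional

import pandas as pd


def load_survey(source) -> Optional[pd.DataFrame]:
    """Read a survey CSV and fill gaps; return None if the file is missing.

    `load_survey` and the fill rules below are assumptions for illustration.
    """
    try:
        df = pd.read_csv(source)
    except FileNotFoundError:
        return None

    # Numeric columns: replace missing values with the column median.
    numeric_cols = df.select_dtypes(include="number").columns
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

    # Remaining columns: replace missing values with a placeholder label.
    other_cols = df.columns.difference(numeric_cols)
    df[other_cols] = df[other_cols].fillna("Unknown")
    return df


# Tiny in-memory stand-in for the real CSV file.
sample_csv = "Country,ConvertedSalary\nIndia,1000\n,3000\nGermany,\n"
df = load_survey(io.StringIO(sample_csv))
print(df)
```

Returning `None` instead of raising keeps the caller in control of how the failure is reported.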

### 2. Basic Information
Displays essential information about the dataset, including:
- General structure of the data (`df.info()`).
- Descriptive statistics (`df.describe()`).

### 3. Categorical Value Counts
For the categorical columns (`Country`, `Employment`, `DevType`, `LanguageWorkedWith`), the app shows the distribution of values using value counts and percentages.
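
As a rough illustration of counts alongside percentages, the sketch below combines both into one table with pandas. The helper name `value_summary` and the sample values are hypothetical and not taken from the app.

```python
import pandas as pd


def value_summary(series: pd.Series) -> pd.DataFrame:
    """Return absolute counts and percentages for each distinct value."""
    counts = series.value_counts()
    percent = (series.value_counts(normalize=True) * 100).round(1)
    return pd.DataFrame({"count": counts, "percent": percent})


# Made-up sample standing in for the `Employment` column.
employment = pd.Series(["Full-time", "Full-time", "Part-time", "Full-time"])
summary = value_summary(employment)
print(summary)
```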

### 4. Visualizations
The app provides the following visualizations to explore the data:
- **Salary Distribution**: A histogram with kernel density estimation (KDE) to visualize salary distribution.
- **Job Satisfaction Analysis**: Bar charts for `JobSatisfaction` and `CareerSatisfaction`.
- **Programming Languages**: The top 10 most-used programming languages among respondents.
- **Job Satisfaction by Company Size**: A box plot showing the relationship between company size and job satisfaction.
- **Age Distribution**: A histogram with KDE to show the age distribution of respondents.
- **Country Distribution**: A line plot showing the top 10 countries by the number of respondents.
- **Employment Status**: A pie chart showing the employment status distribution.
- **Database Usage**: A bar chart of the top 10 databases used by respondents.
- **Job Satisfaction by Gender**: A bar chart comparing job satisfaction across genders.
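
Columns such as `LanguageWorkedWith` hold several semicolon-separated values per respondent, so they have to be split before counting. The sketch below shows one way to do that with `str.split` followed by `explode`; this is an alternative to the `split(..., expand=True).stack()` form used in `app.py`, and the sample responses are made up.

```python
import pandas as pd

# Made-up responses; each entry lists languages separated by ';'.
responses = pd.Series(["Python;SQL", "JavaScript;Python", None, "SQL;Python;C"])

# Drop missing answers, split each answer into a list, then give every
# language its own row so that value_counts can tally them.
languages = responses.dropna().str.split(";").explode().str.strip()
top = languages.value_counts().head(2)
print(top)
```

The same pattern applies to any other multi-valued column, for example `DatabaseWorkedWith`.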

### 5. Correlation Heatmap
Displays a heatmap showing the correlation between numerical variables in the dataset.
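
The matrix behind such a heatmap can be sketched as follows. The data is invented, and `include="number"` is a slightly broader selector than the explicit `int64`/`float64` list in `app.py`; the plotting call itself is omitted here.

```python
import pandas as pd

# Invented data; only the two numeric columns take part in the correlation.
df = pd.DataFrame(
    {
        "YearsOfExperience": [1, 2, 3, 4],  # hypothetical column name
        "ConvertedSalary": [10.0, 20.0, 30.0, 40.0],
        "Country": ["A", "B", "C", "D"],
    }
)

# Keep numeric columns only, then compute pairwise Pearson correlations.
corr = df.select_dtypes(include="number").corr()
print(corr)
```

Passing `corr` to a heatmap function then colours each cell by its correlation coefficient.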

### 6. Cumulative Distribution
Provides an **Empirical Cumulative Distribution Function (ECDF)** plot for the first numerical column in the dataset.
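
For readers unfamiliar with the ECDF: at each observed value it gives the fraction of observations less than or equal to that value. The app draws it with seaborn; the sketch below computes the same quantity by hand with NumPy, using a hypothetical helper `ecdf` and made-up data.

```python
import numpy as np


def ecdf(values):
    """Return sorted values and the cumulative proportion at each one."""
    x = np.sort(np.asarray(values, dtype=float))
    x = x[~np.isnan(x)]  # ignore missing observations
    y = np.arange(1, x.size + 1) / x.size
    return x, y


x, y = ecdf([3, 1, float("nan"), 2])
print(x, y)
```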

## Usage
After launching the app:
1. The app loads the dataset and displays key information and visualizations on the home page.
2. Navigate through the sections to explore different parts of the dataset interactively.
3. The app is designed to be modular, allowing for future extensions, such as adding machine learning models for prediction tasks.

## Future Improvements
- Implement a **machine learning model** to predict job satisfaction or salary based on features like `Country`, `Employment`, `DevType`, etc.
- Enhance the **EDA** with more detailed visualizations and insights.
- Allow users to **upload their own dataset** for customized analysis.

## License
This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for more details.
114 changes: 111 additions & 3 deletions opensource_analysis/app.py
@@ -64,8 +64,8 @@

# Evaluate the model
y_pred = model.predict(X_test)
-classification_rep = classification_report(y_test, y_pred)
-roc_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
+classification_rep = classification_report(y_test, y_pred, zero_division=1)
+roc_auc = roc_auc_score(pd.get_dummies(y_test).values[:, 1], model.predict_proba(X_test)[:, 1])

# Get feature importance
importances = model.named_steps['classifier'].feature_importances_
@@ -94,7 +94,7 @@

# Plot ROC Curve
st.header('ROC Curve')
-y_test_binary = y_test.map({'No': 0, 'Yes': 1})
+y_test_binary = pd.get_dummies(y_test).values[:, 1] # Convert to binary
fpr, tpr, _ = roc_curve(y_test_binary, model.predict_proba(X_test)[:, 1])
roc_auc = auc(fpr, tpr)
fig, ax = plt.subplots()
@@ -151,5 +151,113 @@
except Exception as e:
    st.error(f"An error occurred during prediction: {e}")

# ================== EDA Enhancements ==================
st.header('Enhanced Exploratory Data Analysis (EDA)')

# Load full dataset for EDA
eda_data = pd.read_csv(file_path)

# Salary Analysis
st.subheader('Salary Distribution')
eda_data['ConvertedSalary'] = pd.to_numeric(eda_data['ConvertedSalary'], errors='coerce')
fig, ax = plt.subplots()
sns.histplot(eda_data['ConvertedSalary'].dropna(), kde=True, ax=ax)
ax.set_title('Distribution of Salaries')
ax.set_xlabel('Salary (USD)')
st.pyplot(fig)

# Job Satisfaction Analysis
satisfaction_cols = ['JobSatisfaction', 'CareerSatisfaction']
for col in satisfaction_cols:
    st.subheader(f'Distribution of {col}')
    fig, ax = plt.subplots()
    eda_data[col].value_counts().plot(kind='bar', ax=ax)
    ax.set_title(f'Distribution of {col}')
    ax.set_xlabel('Satisfaction Level')
    ax.set_ylabel('Count')
    st.pyplot(fig)

# Programming Languages Analysis
st.subheader('Top 10 Programming Languages')
languages = eda_data['LanguageWorkedWith'].str.split(';', expand=True).stack()
fig, ax = plt.subplots()
languages.value_counts().head(10).plot(kind='bar', ax=ax)
ax.set_title('Top 10 Programming Languages')
ax.set_xlabel('Language')
ax.set_ylabel('Count')
st.pyplot(fig)

# Job Satisfaction by Company Size
st.subheader('Job Satisfaction by Company Size')
fig, ax = plt.subplots()
sns.boxplot(x='CompanySize', y='JobSatisfaction', data=eda_data, ax=ax)
ax.set_title('Job Satisfaction by Company Size')
ax.set_xlabel('Company Size')
ax.set_ylabel('Job Satisfaction')
st.pyplot(fig)

# Age Distribution
st.subheader('Age Distribution of Respondents')
fig, ax = plt.subplots()
sns.histplot(eda_data['Age'], kde=True, ax=ax)
ax.set_title('Age Distribution of Respondents')
ax.set_xlabel('Age')
st.pyplot(fig)

# Top 10 Countries of Respondents
st.subheader('Top 10 Countries of Respondents')
country_counts = eda_data['Country'].value_counts().head(10)
fig, ax = plt.subplots()
ax.plot(country_counts.index, country_counts.values, marker='o')
ax.set_title('Top 10 Countries of Respondents')
ax.set_xlabel('Country')
ax.set_ylabel('Number of Respondents')
st.pyplot(fig)

# Employment Status Distribution
st.header("Employment Status Distribution")
employment_counts = eda_data['Employment'].value_counts()
fig, ax = plt.subplots()
ax.pie(employment_counts.values, labels=employment_counts.index, autopct='%1.1f%%')
ax.set_title('Employment Status Distribution')
ax.axis('equal')
st.pyplot(fig)

# Databases Used
st.header("Top 10 Databases Used")
databases = eda_data['DatabaseWorkedWith'].str.split(';', expand=True).stack()
db_counts = databases.value_counts().head(10)
fig, ax = plt.subplots()
db_counts.plot(kind='barh', ax=ax)
ax.set_xlabel('Number of Users')
ax.set_ylabel('Database')
st.pyplot(fig)

# Job Satisfaction by Gender
st.header("Job Satisfaction by Gender")
job_sat_gender = pd.crosstab(eda_data['JobSatisfaction'], eda_data['Gender'])
fig, ax = plt.subplots()
job_sat_gender.plot(kind='bar', ax=ax)
ax.set_title('Job Satisfaction by Gender')
ax.set_xlabel('Job Satisfaction Level')
st.pyplot(fig)

# Correlation Heatmap
st.header("Correlation Heatmap of Numeric Variables")
numeric_columns = eda_data.select_dtypes(include=['int64', 'float64']).columns
fig, ax = plt.subplots()
sns.heatmap(eda_data[numeric_columns].corr(), annot=True, cmap='coolwarm', ax=ax)
ax.set_title('Correlation Heatmap of Numeric Variables')
st.pyplot(fig)

# Cumulative Distribution
st.header(f"Cumulative Distribution of {numeric_columns[0]}")
fig, ax = plt.subplots()
sns.ecdfplot(data=eda_data, x=numeric_columns[0], ax=ax)
ax.set_title(f'Cumulative Distribution of {numeric_columns[0]}')
ax.set_xlabel(numeric_columns[0])
ax.set_ylabel('Cumulative Proportion')
st.pyplot(fig)

except Exception as e:
    st.error(f"An error occurred while loading data: {e}")
