This repository contains Jupyter notebooks demonstrating various feature engineering and exploratory data analysis (EDA) techniques commonly used in machine learning projects.
Feature engineering is a crucial step in the machine learning pipeline that involves transforming raw data into features that better represent the underlying problem to predictive models, resulting in improved model accuracy on unseen data.
- Python 3.7 or higher
- pip (Python package installer)
- Jupyter Notebook or JupyterLab
- Clone the repository
git clone https://github.com/Abhinavexists/Feature_Engineering.git
- Navigate to the project directory
cd Feature_Engineering
- Create a virtual environment (recommended)
# Windows
python -m venv venv
venv\Scripts\activate
# Linux/Mac
python3 -m venv venv
source venv/bin/activate
- Install required packages
pip install -r requirements.txt
Feature_Engineering/
├── Exploratory_data_analysis/
│ ├── Data_Cleaning.ipynb
│ ├── Flight_price.ipynb
│ ├── flight_price.xlsx
│ ├── googleplaystore.csv
│ ├── googleplaystorecleaned.csv
│ ├── Red_Wine.ipynb
│ └── winequality-red.csv
├── Handle_missing_values.ipynb
├── Handling_and_Outliers.ipynb
├── Handling_Imbalanced.ipynb
├── Label_Encoding.ipynb
├── LICENSE
├── OHE_Encoding.ipynb
├── Ordinal_Encoding.ipynb
├── README.md
├── SMOTE.ipynb
└── Target_guided_Ordinal.ipynb
- Data cleaning and preprocessing techniques
- Exploratory data analysis on various datasets:
- Flight price prediction
- Google Play Store apps
- Red wine quality analysis
- Handling Missing Values
- Various imputation methods
- Deletion strategies
- Handling Outliers
- Detection and treatment methods
- IQR method implementation
- Handling Imbalanced Datasets
- SMOTE (Synthetic Minority Over-sampling Technique)
- Upsampling and downsampling methods
- Label Encoding
- One-Hot Encoding (OHE)
- Ordinal Encoding
- Target-guided Ordinal Encoding
The repository includes a comprehensive example of feature engineering for flight price prediction, including:
- Date and time feature extraction
- Duration calculation
- Handling categorical variables
- Route analysis
- Stop information processing
Features processed include:
- Airline
- Source/Destination cities
- Departure/Arrival times
- Flight duration
- Number of stops
- Additional flight information
This dataset focuses on Portuguese "Vinho Verde" red wine variants, analyzing physicochemical properties to predict wine quality. The analysis includes:
- Fixed acidity
- Volatile acidity
- Citric acid
- Residual sugar
- Chlorides
- Free sulfur dioxide
- Total sulfur dioxide
- Density
- pH
- Sulphates
- Alcohol
- Quality (score between 0 and 10)
- Correlation analysis between wine properties
- Distribution analysis of wine qualities
- Feature relationships visualization
- Outlier detection
- Quality prediction modeling
# Split date into components
df[['Date','Month','Year']] = df['Date_of_Journey'].str.split('/',expand=True)
df[['Date', 'Month', 'Year']] = df[['Date', 'Month', 'Year']].astype(int)
# Convert duration to minutes
df['Duration_Time'] = (df['Duration'].str.extract('(?:(\d+)h)?\s*(?:(\d+)m)?')
.fillna(0)
.astype(int)
.apply(lambda x:x[0]*60 + x[1], axis=1))
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
transformed = encoder.fit_transform(df[['Airline','Source','Destination']]).toarray()
Feel free to contribute to this repository by:
- Forking the repository
- Creating a new branch for your feature
- Submitting a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
Abhinav Kumar Singh
- GitHub: @Abhinavexists