
Feature Engineering

  • Feature engineering is the process of using domain knowledge to extract or create features from raw data that make machine learning algorithms work better.
  • It is a crucial step in the data preprocessing pipeline, as the quality and relevance of features directly impact the performance of predictive models.

Theoretical Foundations

Understanding Features

  • Features are the input variables used by machine learning models to make predictions.
  • Each feature represents a specific aspect of the data.
| Type of Feature | Detail |
| --- | --- |
| Numerical | Continuous values (e.g. height, weight) or discrete values (e.g. counts). |
| Categorical | Non-numerical values that represent categories (e.g. color, brand). |
| Ordinal | Categorical variables with a clear ordering (e.g. high school < bachelor < master). |
| Binary | Variables that can take on one of two possible values (e.g. yes/no, true/false). |
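
A minimal sketch (with hypothetical column names) of how these feature types might look in a pandas DataFrame:

import pandas as pd

# Hypothetical dataset illustrating the four feature types
df = pd.DataFrame({
    'height_cm': [170.2, 165.5, 180.1],                  # Numerical (continuous)
    'num_purchases': [3, 7, 1],                          # Numerical (discrete)
    'color': ['red', 'blue', 'green'],                   # Categorical (nominal)
    'education': ['high school', 'bachelor', 'master'],  # Ordinal
    'is_subscribed': [True, False, True]                 # Binary
})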

Feature Representation

  • The way features are represented can greatly affect a model’s ability to learn.
  • Different algorithms require different types of feature representations.
| Models | Detail |
| --- | --- |
| Linear | Perform well with linearly separable data. |
| Tree | Naturally handle non-linear relationships, but can still benefit from well-engineered features. |

Curse of Dimensionality

  • As the number of features increases, the volume of the feature space grows rapidly, leading to data sparsity.
  • In high-dimensional space, data points become far apart and less similar to one another, making it difficult for algorithms to generalize well.
  • Effective feature engineering can mitigate this by reducing dimensionality.
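
As an illustration of the claim above (a minimal sketch, using uniformly random points): the ratio between the nearest and farthest pairwise distances drifts toward 1 as the dimension grows, so "similarity" becomes less informative.

import numpy as np

rng = np.random.default_rng(42)

for d in [2, 10, 100, 1000]:
    X = rng.random((100, d))                      # 100 random points in d dimensions
    dists = np.linalg.norm(X[0] - X[1:], axis=1)  # distances from the first point to all others
    print(d, dists.min() / dists.max())           # ratio approaches 1 as d grows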

Feature Importance

  • Understanding which features contribute most to the model’s predictions can guide feature selection and engineering efforts.
  • Techniques like feature importance scores from tree-based models or recursive feature elimination can aid this process.
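
A minimal sketch of both techniques, assuming a feature matrix X and a target y are already defined (the choice of 5 features to keep is arbitrary):

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Feature importance scores from a tree-based model
forest = RandomForestClassifier(random_state=42)
forest.fit(X, y)
print(forest.feature_importances_)

# Recursive feature elimination with a linear model
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5)
rfe.fit(X, y)
print(rfe.support_)  # boolean mask of the selected features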

Concepts Covered

Data Standardization

  • Data standardization involves scaling your data to have a mean of zero and a standard deviation of one.
  • This process is particularly useful when features have different units and scales.

Example

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_standardized = scaler.fit_transform(X)

Data Normalization

  • Normalization scales the features to a range between 0 and 1.
  • This technique is beneficial for algorithms that rely on distance measurements, like k-NN.

Example

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_normalized = scaler.fit_transform(X)

Encoding Categorical Data

  • Categorical data needs to be converted into numerical format for most machine learning algorithms.
  • Common techniques include one-hot encoding and label encoding.

Example

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()
X_encoded = encoder.fit_transform(X_categorical).toarray()
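
Label encoding assigns an integer to each category. In scikit-learn, LabelEncoder is intended for target labels, while OrdinalEncoder plays the same role for feature columns (y_categorical below is a hypothetical 1-D target array):

from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Integer-encode a 1-D target array
y_encoded = LabelEncoder().fit_transform(y_categorical)

# Integer-encode feature columns (expects a 2-D array)
X_ordinal = OrdinalEncoder().fit_transform(X_categorical)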

Sklearn ColumnTransformer

  • The ColumnTransformer allows you to apply different preprocessing steps to different columns of your dataset in a concise way.

Example

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(), categorical_features)
    ]
)

X_transformed = preprocessor.fit_transform(X)

Sklearn Pipeline

  • The Pipeline class enables you to streamline the preprocessing and modeling steps into a single object, ensuring that all steps are applied consistently.

Example

from sklearn.linear_model import LogisticRegression  # any estimator can go here
from sklearn.pipeline import Pipeline

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', LogisticRegression())
])

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

Handling Mixed Variables

  • When your dataset contains both numerical and categorical variables, it's important to apply the appropriate preprocessing to each type.
  • Use a ColumnTransformer, as shown above; a fuller sketch that also handles missing values follows below.
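
A minimal sketch, assuming numeric_features and categorical_features are lists of column names in X: each group gets its own imputation step before scaling or encoding, all inside one ColumnTransformer.

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Separate preprocessing chain for each variable type
numeric_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_pipeline, numeric_features),
    ('cat', categorical_pipeline, categorical_features)
])

X_transformed = preprocessor.fit_transform(X)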

Missing Categorical Data

  • Handling missing data in categorical variables can be done by replacing them with the most frequent category or using advanced techniques like KNN imputation.

Example

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='most_frequent')
X_imputed = imputer.fit_transform(X_categorical)

KNNImputer

  • The KNNImputer uses the k-nearest neighbors algorithm to impute missing values, considering the values of similar data points.

Example

from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=5)
X_imputed = imputer.fit_transform(X)

SimpleImputer

  • The SimpleImputer is a straightforward way to handle missing values using different strategies (mean, median, most frequent, constant).

Example

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X_numeric)

Outlier Detection

  • Outliers can significantly impact the performance of machine learning models. Several techniques can be employed for outlier detection.

Using IQR

  • The interquartile range (IQR) method computes the range between the first quartile (Q1) and the third quartile (Q3), and flags values more than 1.5 × IQR below Q1 or above Q3 as outliers.

Example

import numpy as np

Q1 = np.percentile(X, 25)
Q3 = np.percentile(X, 75)
IQR = Q3 - Q1
outliers = (X < (Q1 - 1.5 * IQR)) | (X > (Q3 + 1.5 * IQR))

Using Z-Score

  • Z-score measures how many standard deviations an element is from the mean. A common threshold is ±3.

Example

X_mean = X.mean()
X_std = X.std()

X_zscore = (X - X_mean) / X_std

outliers = (X_zscore > 3) | (X_zscore < -3)

Using Winsorization

  • Winsorization involves capping extreme values to reduce the impact of outliers.

Example

import numpy as np

upper_limit = np.percentile(X, 95)
lower_limit = np.percentile(X, 5)

# Cap values outside the 5th-95th percentile range
X_winsorized = np.clip(X, lower_limit, upper_limit)

Function Transformer

  • The FunctionTransformer allows you to apply any custom function to your data as part of a pipeline.

Example

from sklearn.preprocessing import FunctionTransformer

def custom_function(X):
    return X ** 2

transformer = FunctionTransformer(func=custom_function)
X_transformed = transformer.fit_transform(X)

Power Transformer

  • The PowerTransformer can help stabilize variance and make the data more Gaussian-like.
  • This is useful for improving the performance of models that assume normally distributed data.

Example

from sklearn.preprocessing import PowerTransformer

transformer = PowerTransformer()
X_transformed = transformer.fit_transform(X)

Imbalanced Data

  • Imbalanced data refers to a situation where the distribution of classes within a dataset is not uniform.
  • This is particularly common in classification problems where one class significantly outnumbers the others.

Example

# Under Sampling
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler()
X_rus, y_rus = rus.fit_resample(X_train, y_train)

# Over Sampling
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler()
X_ros, y_ros = ros.fit_resample(X_train, y_train)

# Synthetic Minority Over-sampling Technique
from imblearn.over_sampling import SMOTE
smote = SMOTE()
X_smote, y_smote = smote.fit_resample(X_train, y_train)

Principal Component Analysis

  • Principal component analysis (PCA) reduces the dimensionality of large datasets by projecting them onto a small number of principal components that retain most of the original information.
  • It does this by transforming potentially correlated variables into a smaller set of uncorrelated variables called principal components.

Example

# Importing PCA
from sklearn.decomposition import PCA

# Creating PCA Object for 2 Features
pca = PCA(n_components=2)

# Fit and Transform on Training Data
X_train_pca = pca.fit_transform(X_train_scaled)

# Transforming Testing Data
X_test_pca = pca.transform(X_test_scaled)
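
To verify how much of the original information is retained, the fitted PCA object exposes explained_variance_ratio_:

# Fraction of the total variance captured by each principal component
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())  # total variance retained by the 2 components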

Getting Started

  • Clone this repository to your local machine using the following command:

    git clone https://github.com/TheMrityunjayPathak/Feature-Engineering.git
  • Install Jupyter Notebook:

    pip install notebook
  • Launch Jupyter Notebook:

    jupyter notebook
  • Open the desired notebook from the repository in your Jupyter Environment and start coding!

Contributing

  • Contributions are welcome!

  • If you'd like to contribute to this repository, feel free to submit a pull request.

License

  • This repository is licensed under the MIT License.

  • You are free to use, modify, and distribute the code in this repository.