Encoding categorical data #33

Open
wants to merge 2 commits into
base: main
165 changes: 165 additions & 0 deletions K Means Clustering/.ipynb_checkpoints/KMeansCluster-checkpoint.ipynb

Large diffs are not rendered by default.

165 changes: 165 additions & 0 deletions K Means Clustering/KMeansCluster.ipynb

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions README.md
@@ -90,6 +90,7 @@ To learn data science, the CRISP-DM is a good approach:
### Data Transformation
- [01 - Scaling Numerical Data](tutorials/scale_numerical_data.ipynb)
- [02 - Encoding Categorical Data](tutorials/encode_categorial_data.ipynb)
- [03 - One-hot encoding](encode%20categorical%20data/Howtodealwithcategoricaldata.ipynb)

### 💿 Datasets (for exploration)

@@ -0,0 +1,168 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "f84263f6",
"metadata": {},
"source": [
"# What is categorical data?\n",
"Categorical data represents categories or labels rather than measured quantities, so its values cannot be meaningfully treated as numbers. It is often used to group items into discrete classes."
]
},
{
"cell_type": "markdown",
"id": "ee58394a",
"metadata": {},
"source": [
"# Why care about encoding it?\n",
"Many machine learning algorithms operate only on numbers, so non-numeric categorical data must be converted into a numerical format before they can process it and make predictions. Numerical data is also often processed more efficiently than categorical data."
]
},
{
"cell_type": "markdown",
"id": "d2f5c77d",
"metadata": {},
"source": [
"# Technique for encoding categorical data\n",
"One-hot encoding:"
]
},
{
"cell_type": "markdown",
"id": "50757c99",
"metadata": {},
"source": [
"One-hot encoding is a popular technique for handling categorical data, especially when the categories have no inherent order. In this method, each category is mapped to a vector of 1s and 0s, where a 1 marks the presence of that category and a 0 marks its absence."
]
},
{
"cell_type": "markdown",
"id": "2aa91180",
"metadata": {},
"source": [
"# Implementation in Python"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "2cd25240",
"metadata": {},
"outputs": [],
"source": [
"# Import the necessary libraries\n",
"import pandas as pd\n",
"import seaborn as sns"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "4c0ef183",
"metadata": {},
"outputs": [],
"source": [
"# Load the dataset (we use the 'titanic' example dataset available through the seaborn library)\n",
"titanic = sns.load_dataset('titanic')"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "ad1581b4",
"metadata": {},
"outputs": [],
"source": [
"# Select a few columns for demonstration purposes.\n",
"titanic = titanic[['sex', 'embark_town', 'alone', 'survived']]"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "75f13a51",
"metadata": {},
"outputs": [],
"source": [
"# Perform one-hot encoding; dtype=int keeps the indicators as 0/1 (recent pandas versions default to booleans)\n",
"titanic_encoded = pd.get_dummies(titanic, columns=['sex', 'embark_town', 'alone'], dtype=int)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "81febb44",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" sex_female sex_male embark_town_Cherbourg embark_town_Queenstown \\\n",
"0 0 1 0 0 \n",
"1 1 0 1 0 \n",
"2 1 0 0 0 \n",
"3 1 0 0 0 \n",
"4 0 1 0 0 \n",
"\n",
" embark_town_Southampton alone_False alone_True \n",
"0 1 1 0 \n",
"1 0 1 0 \n",
"2 1 0 1 \n",
"3 1 1 0 \n",
"4 1 0 1 \n"
]
}
],
"source": [
"print(titanic_encoded.head())"
]
},
{
"cell_type": "markdown",
"id": "b76cba16",
"metadata": {},
"source": [
"As you can see, `get_dummies` creates a new column for each unique value in the 'sex', 'embark_town', and 'alone' columns. Each row has a 1 in the column for its category and a 0 in the others."
]
},
{
"cell_type": "markdown",
"id": "3a6e78c1",
"metadata": {},
"source": [
"# NOTE:\n",
"One-hot encoding can significantly increase the dimensionality of the dataset if a categorical variable has many unique values, because each unique value becomes its own column. This increases memory and computational requirements and can degrade the performance of the model."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ab4e1026",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
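
The mapping described in the notebook above (each category becomes a vector of 1s and 0s) can be sketched without any library. This is only an illustration of the idea, not how `pd.get_dummies` is implemented; the helper name `one_hot_encode` and the sample values are assumptions made for this sketch.

```python
def one_hot_encode(values):
    """Map each value to a list of 0/1 flags, one flag per distinct category."""
    # Sort the distinct categories so the column order is deterministic.
    categories = sorted(set(values))
    # For every value, put a 1 in the position of its category and 0 elsewhere.
    encoded = [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]
    return categories, encoded


# Hypothetical sample values, chosen only for illustration.
towns = ["Southampton", "Cherbourg", "Southampton", "Queenstown"]
categories, encoded = one_hot_encode(towns)

print(categories)
for town, row in zip(towns, encoded):
    print(town, row)
```

Each encoded row contains exactly one 1, which is where the name "one-hot" comes from.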
160 changes: 160 additions & 0 deletions encode categorical data/Howtodealwithcategoricaldata.ipynb
@@ -0,0 +1,160 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "ec277721",
"metadata": {},
"source": [
"# What is categorical data?\n",
"Categorical data represents categories or labels rather than measured quantities, so its values cannot be meaningfully treated as numbers. It is often used to group items into discrete classes."
]
},
{
"cell_type": "markdown",
"id": "7c4879f7",
"metadata": {},
"source": [
"# Why care about encoding it?\n",
"Many machine learning algorithms operate only on numbers, so non-numeric categorical data must be converted into a numerical format before they can process it and make predictions. Numerical data is also often processed more efficiently than categorical data."
]
},
{
"cell_type": "markdown",
"id": "32c48a2f",
"metadata": {},
"source": [
"# Technique for encoding categorical data\n",
"One-hot encoding:"
]
},
{
"cell_type": "markdown",
"id": "ef4c94a9",
"metadata": {},
"source": [
"One-hot encoding is a popular technique for handling categorical data, especially when the categories have no inherent order. In this method, each category is mapped to a vector of 1s and 0s, where a 1 marks the presence of that category and a 0 marks its absence."
]
},
{
"cell_type": "markdown",
"id": "b02b995a",
"metadata": {},
"source": [
"# Implementation in Python"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "afb8066b",
"metadata": {},
"outputs": [],
"source": [
"# Import the necessary libraries\n",
"import pandas as pd\n",
"import seaborn as sns"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "10429444",
"metadata": {},
"outputs": [],
"source": [
"# Load the dataset (we use the 'titanic' example dataset available through the seaborn library)\n",
"titanic = sns.load_dataset('titanic')"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "02e6bc1d",
"metadata": {},
"outputs": [],
"source": [
"# Select a few columns for demonstration purposes.\n",
"titanic = titanic[['sex', 'embark_town', 'alone', 'survived']]"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "859d3d1e",
"metadata": {},
"outputs": [],
"source": [
"# Perform one-hot encoding; dtype=int keeps the indicators as 0/1 (recent pandas versions default to booleans)\n",
"titanic_encoded = pd.get_dummies(titanic, columns=['sex', 'embark_town', 'alone'], dtype=int)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "04a413f2",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" sex_female sex_male embark_town_Cherbourg embark_town_Queenstown \\\n",
"0 0 1 0 0 \n",
"1 1 0 1 0 \n",
"2 1 0 0 0 \n",
"3 1 0 0 0 \n",
"4 0 1 0 0 \n",
"\n",
" embark_town_Southampton alone_False alone_True \n",
"0 1 1 0 \n",
"1 0 1 0 \n",
"2 1 0 1 \n",
"3 1 1 0 \n",
"4 1 0 1 \n"
]
}
],
"source": [
"print(titanic_encoded.head())"
]
},
{
"cell_type": "markdown",
"id": "bb0075ca",
"metadata": {},
"source": [
"As you can see, `get_dummies` creates a new column for each unique value in the 'sex', 'embark_town', and 'alone' columns. Each row has a 1 in the column for its category and a 0 in the others."
]
},
{
"cell_type": "markdown",
"id": "620cd592",
"metadata": {},
"source": [
"# NOTE:\n",
"One-hot encoding can significantly increase the dimensionality of the dataset if a categorical variable has many unique values, because each unique value becomes its own column. This increases memory and computational requirements and can degrade the performance of the model."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
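
The dimensionality note at the end of the notebook can be made concrete with a rough count: one-hot encoding adds one column per distinct value of every encoded field. The sketch below uses only the standard library; the helper `one_hot_width`, the field names, and the generated records are hypothetical and exist only to show the arithmetic.

```python
def one_hot_width(rows, columns):
    """Count the columns that one-hot encoding the given fields would create."""
    # Each field contributes one output column per distinct value it takes.
    return sum(len({row[column] for row in rows}) for column in columns)


# Hypothetical records: 'sex' has 2 distinct values, 'ticket_id' is unique per row.
rows = [
    {"sex": "male" if i % 2 else "female", "ticket_id": f"T{i:03d}"}
    for i in range(100)
]

low_cardinality = one_hot_width(rows, ["sex"])
high_cardinality = one_hot_width(rows, ["ticket_id"])
combined = one_hot_width(rows, ["sex", "ticket_id"])

print(low_cardinality, high_cardinality, combined)
```

A field such as the identifier here produces as many columns as there are rows, which is usually a sign that one-hot encoding is the wrong choice for that field.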