Data-Centric-AI-Community · Axikop · Nov 15, 2023 · Nov 17, 2023
diff --git a/K Means Clustering/.ipynb_checkpoints/KMeansCluster-checkpoint.ipynb b/K Means Clustering/.ipynb_checkpoints/KMeansCluster-checkpoint.ipynb
diff --git a/K Means Clustering/KMeansCluster.ipynb b/K Means Clustering/KMeansCluster.ipynb
diff --git a/README.md b/README.md
@@ -90,6 +90,7 @@ To learn data science, the CRISP-DM is a good approach:
 ### Data Transformation
 - [01 - Scaling Numerical Data](tutorials/scale_numerical_data.ipynb)
 - [02 - Encoding Categorical Data](tutorials/encode_categorial_data.ipynb)
+- [03 - One-Hot Encoding](encode%20categorical%20data/Howtodealwithcategoricaldata.ipynb)
 
 ### 💿 Datasets (for exploration)
 

diff --git a/encode categorical data/.ipynb_checkpoints/Howtodealwithcategoricaldata-checkpoint.ipynb b/encode categorical data/.ipynb_checkpoints/Howtodealwithcategoricaldata-checkpoint.ipynb
@@ -0,0 +1,168 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "f84263f6",
+   "metadata": {},
+   "source": [
+    "# What is categorical data?\n",
+    "Categorical data represents categories or labels rather than quantities measured on a numerical scale. It is often used to group items into discrete classes."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ee58394a",
+   "metadata": {},
+   "source": [
+    "# Why care about encoding it?\n",
+    "Many machine learning algorithms accept only numerical input, so non-numeric categorical data must be converted into a numerical format before they can process it and make predictions. Numerical data is also often processed more efficiently than categorical data."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d2f5c77d",
+   "metadata": {},
+   "source": [
+    "# Technique for encoding categorical data\n",
+    "ONE HOT ENCODING:"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "50757c99",
+   "metadata": {},
+   "source": [
+    "One-hot encoding is a popular technique for handling categorical data, especially when the categories have no inherent order. In this method, each category is mapped to a vector of 1s and 0s, where a 1 marks the category that is present and a 0 marks each category that is absent."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2aa91180",
+   "metadata": {},
+   "source": [
+    "# Implementation in Python"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "2cd25240",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Import the necessary libraries\n",
+    "import pandas as pd\n",
+    "import seaborn as sns"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "4c0ef183",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Load the dataset (we use the built-in 'titanic' dataset from the seaborn library)\n",
+    "titanic = sns.load_dataset('titanic')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "ad1581b4",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Select a few columns for demonstration purposes\n",
+    "titanic = titanic[['sex', 'embark_town', 'alone', 'survived']]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "75f13a51",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Perform one-hot encoding; dtype=int gives 0/1 columns (pandas 2.x defaults to True/False)\n",
+    "titanic_encoded = pd.get_dummies(titanic, columns=['sex', 'embark_town', 'alone'], dtype=int)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "81febb44",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "   sex_female  sex_male  embark_town_Cherbourg  embark_town_Queenstown  \\\n",
+      "0           0         1                      0                       0   \n",
+      "1           1         0                      1                       0   \n",
+      "2           1         0                      0                       0   \n",
+      "3           1         0                      0                       0   \n",
+      "4           0         1                      0                       0   \n",
+      "\n",
+      "   embark_town_Southampton  alone_False  alone_True  \n",
+      "0                        1            1           0  \n",
+      "1                        0            1           0  \n",
+      "2                        1            0           1  \n",
+      "3                        1            1           0  \n",
+      "4                        1            0           1  \n"
+     ]
+    }
+   ],
+   "source": [
+    "print(titanic_encoded.drop(columns='survived').head())  # show only the one-hot columns"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b76cba16",
+   "metadata": {},
+   "source": [
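+    "As you can see, `pd.get_dummies` creates a new column for each unique value in the 'sex', 'embark_town', and 'alone' columns. Each row has a 1 in the column for its category and a 0 in the others. A row whose value is missing gets a 0 in every column for that variable, because `pd.get_dummies` ignores missing values by default."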
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3a6e78c1",
+   "metadata": {},
+   "source": [
+    "# NOTE:\n",
+    "One-hot encoding adds one column per unique value, so a categorical variable with many unique values can significantly increase the dimensionality of the dataset. This raises memory and computational requirements and can degrade the performance of the model."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ab4e1026",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.4"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/encode categorical data/Howtodealwithcategoricaldata.ipynb b/encode categorical data/Howtodealwithcategoricaldata.ipynb
@@ -0,0 +1,160 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "ec277721",
+   "metadata": {},
+   "source": [
+    "# What is categorical data?\n",
+    "Categorical data represents categories or labels rather than quantities measured on a numerical scale. It is often used to group items into discrete classes."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7c4879f7",
+   "metadata": {},
+   "source": [
+    "# Why care about encoding it?\n",
+    "Many machine learning algorithms accept only numerical input, so non-numeric categorical data must be converted into a numerical format before they can process it and make predictions. Numerical data is also often processed more efficiently than categorical data."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "32c48a2f",
+   "metadata": {},
+   "source": [
+    "# Technique for encoding categorical data\n",
+    "ONE HOT ENCODING:"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ef4c94a9",
+   "metadata": {},
+   "source": [
+    "One-hot encoding is a popular technique for handling categorical data, especially when the categories have no inherent order. In this method, each category is mapped to a vector of 1s and 0s, where a 1 marks the category that is present and a 0 marks each category that is absent."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b02b995a",
+   "metadata": {},
+   "source": [
+    "# Implementation in Python"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "afb8066b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Import the necessary libraries\n",
+    "import pandas as pd\n",
+    "import seaborn as sns"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "10429444",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Load the dataset (we use the built-in 'titanic' dataset from the seaborn library)\n",
+    "titanic = sns.load_dataset('titanic')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "02e6bc1d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Select a few columns for demonstration purposes\n",
+    "titanic = titanic[['sex', 'embark_town', 'alone', 'survived']]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "859d3d1e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Perform one-hot encoding; dtype=int gives 0/1 columns (pandas 2.x defaults to True/False)\n",
+    "titanic_encoded = pd.get_dummies(titanic, columns=['sex', 'embark_town', 'alone'], dtype=int)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "04a413f2",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "   sex_female  sex_male  embark_town_Cherbourg  embark_town_Queenstown  \\\n",
+      "0           0         1                      0                       0   \n",
+      "1           1         0                      1                       0   \n",
+      "2           1         0                      0                       0   \n",
+      "3           1         0                      0                       0   \n",
+      "4           0         1                      0                       0   \n",
+      "\n",
+      "   embark_town_Southampton  alone_False  alone_True  \n",
+      "0                        1            1           0  \n",
+      "1                        0            1           0  \n",
+      "2                        1            0           1  \n",
+      "3                        1            1           0  \n",
+      "4                        1            0           1  \n"
+     ]
+    }
+   ],
+   "source": [
+    "print(titanic_encoded.drop(columns='survived').head())  # show only the one-hot columns"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "bb0075ca",
+   "metadata": {},
+   "source": [
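+    "As you can see, `pd.get_dummies` creates a new column for each unique value in the 'sex', 'embark_town', and 'alone' columns. Each row has a 1 in the column for its category and a 0 in the others. A row whose value is missing gets a 0 in every column for that variable, because `pd.get_dummies` ignores missing values by default."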
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "620cd592",
+   "metadata": {},
+   "source": [
+    "# NOTE:\n",
+    "One-hot encoding adds one column per unique value, so a categorical variable with many unique values can significantly increase the dimensionality of the dataset. This raises memory and computational requirements and can degrade the performance of the model."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.4"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
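
The dimensionality effect described in the notebook's NOTE cell can be illustrated with a minimal sketch that is separate from the notebook in this diff. The toy DataFrame and its `ticket_id` column are hypothetical and chosen only so that every value is unique; `pd.get_dummies` is the only call taken from the notebook.

```python
import pandas as pd

# Hypothetical toy data (not from the notebook): one low-cardinality and
# one high-cardinality categorical column.
df = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "ticket_id": ["A1", "B2", "C3", "D4"],  # every value is unique
})

# dtype=int requests 0/1 columns instead of the boolean default in pandas 2.x.
encoded = pd.get_dummies(df, columns=["sex", "ticket_id"], dtype=int)

print(df.shape)                  # shape before encoding
print(encoded.shape)             # one new column per unique value
print(encoded.columns.tolist())  # names follow the <column>_<value> pattern
```

Here the two-value `sex` column becomes two columns, while `ticket_id` becomes one column per row, which is the growth the NOTE warns about for variables with many unique values.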