Adapt Core API tutorials to Khiops V11 #8

Merged
merged 1 commit into from
Mar 31, 2025
55 changes: 30 additions & 25 deletions Core Basics 1 - Train, Evaluate and Deploy a Classifier.ipynb
@@ -32,7 +32,7 @@
" print(\"\")\n",
"\n",
"\n",
-"# If there are any issues you may Khiops status with the following command\n",
+"# If there are any issues, you may print Khiops status with the following command:\n",
"# kh.get_runner().print_status()"
]
},
@@ -43,12 +43,12 @@
"## Training a Classifier\n",
"We'll train a classifier for the `Iris` dataset. This is a classical dataset containing the data of different plants belonging to the genus _Iris_. It contains 150 records, 50 for each of three variants of _Iris_: _Setosa_, _Virginica_ and _Versicolor_. The records for each sample contain the length and width of its petal and sepal. The standard task for this dataset is to construct a classifier for the type of _Iris_ taking as inputs the length and width characteristics.\n",
"\n",
-"Now to train a classifier with Khiops we use two types of files:\n",
+"Now to train a classifier with Khiops, we use two types of files:\n",
"- A plain-text delimited data file (for example a `csv` file)\n",
"- A _dictionary_ file which describes the schema of the above data table (`.kdic` file extension)\n",
"\n",
"\n",
-"Let's save into variables the locations of these files for the `Iris` dataset and then take a look at their contents:"
+"Let's save, into variables, the locations of these files for the `Iris` dataset and then take a look at their contents:"
]
},
{
@@ -70,7 +70,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"Note that the _Iris_ variant information is in the column `Class`. Now let's specify directory to save our results:"
+"Note that the _Iris_ variant information is in the column `Class`. Now let's specify the path to the analysis report file."
]
},
{
@@ -79,17 +79,18 @@
"metadata": {},
"outputs": [],
"source": [
-"iris_results_dir = os.path.join(\"exercises\", \"Iris\")\n",
-"print(f\"Iris results directory: {iris_results_dir}\")"
+"analysis_report_file_path_Iris = os.path.join(\"exercises\", \"Iris\", \"AnalysisReport.khj\")\n",
+"\n",
+"print(f\"Iris analysis report file path: {analysis_report_file_path_Iris}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We are now ready to train the classifier with the Khiops function `train_predictor`. This method returns a tuple containing the location of two files:\n",
-"- the modeling report (`AllReports.khj`): A JSON file containing information such as the informativeness of each variable, those selected for the model and performance metrics.\n",
-"- model's _dictionary_ file (`Modeling.kdic`): This file is an enriched version of the initial dictionary file that contains the model. It can be used to make predictions on new data."
+"- the modeling report (`AnalysisReport.khj`): A JSON file containing information such as the informativeness of each variable, those selected for the model and performance metrics. It is saved into `analysis_report_file_path_Iris` variable that we just defined.\n",
+"- model's _dictionary_ file (`AnalysisReport.model.kdic`): This file is an enriched version of the initial dictionary file that contains the model. It can be used to make predictions on new data."
]
},
{
@@ -103,7 +104,7 @@
" dictionary_name=\"Iris\",\n",
" data_table_path=iris_data_file,\n",
" target_variable=\"Class\",\n",
-" results_dir=iris_results_dir,\n",
+" analysis_report_file_path=analysis_report_file_path_Iris,\n",
" max_trees=0, # by default Khiops constructs 10 decision tree variables\n",
")\n",
"print(f\"Iris report file: {iris_report}\")\n",
@@ -114,7 +115,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"You can verify that the result files were created in `iris_results_dir`. In the next sections, we'll use the file at `iris_report` to assess the models' performances and the file at `iris_model_kdic` to deploy it. Now we can see the report with the Khiops Visualization app:"
+"Note that `iris_report` (the first element of the tuple returned by train_predictor) is identical to `analysis_report_file_path_Iris`. \n",
+"\n",
+"In the next sections, we'll use the file at `iris_report` to assess the models' performances and the file at `iris_model_kdic` to deploy it. Now we can have a look at the report with the Khiops Visualization app:"
]
},
{
@@ -133,9 +136,9 @@
"source": [
"### Exercise\n",
"\n",
-"We'll repeat the examples on this notebook with the `Adult` dataset. It contains characteristics of the adult population in USA such as age, gender and education and its task is to predict the variable `class`, which indicates if the individual earns `more` or `less` than 50,000 dollars.\n",
+"We'll repeat the previous steps on the `Adult` dataset. This dataset contains characteristics of the adult population in USA such as age, gender and education and its task is to predict the variable `class`, which indicates if the individual earns `more` or `less` than 50,000 dollars.\n",
"\n",
-"Let's start by putting into variables the paths for the `Adult` dataset:"
+"Let's start by putting, into variables, the paths for the `Adult` dataset:"
]
},
{
@@ -173,7 +176,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"We now save the results directory for this exercise:"
+"We now specify the path to the analysis report file for this exercise:"
]
},
{
@@ -182,16 +185,19 @@
"metadata": {},
"outputs": [],
"source": [
-"adult_results_dir = os.path.join(\"exercises\", \"Adult\")\n",
-"print(f\"Adult results directory: {adult_results_dir}\")"
+"analysis_report_file_path_Adult = os.path.join(\n",
+" \"exercises\", \"Adult\", \"AnalysisReport.khj\"\n",
+")\n",
+"\n",
+"print(f\"Adult analysis report file path: {analysis_report_file_path_Adult}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Train a classifier for the `Adult` database\n",
-"Note the name of the target variable is `class` (**in lower case!**). Do not forget to set `max_trees=0`. Save the resulting file locations into the variables `adult_report` and `adult_model_kdic` and print them"
+"Note the name of the target variable is `class` (**in lower case!**). Do not forget to set `max_trees=0`. Save the resulting file locations into the variables `adult_report` and `adult_model_kdic` and print them."
]
},
{
@@ -207,7 +213,7 @@
" dictionary_name=\"Adult\",\n",
" data_table_path=adult_data_file,\n",
" target_variable=\"class\",\n",
-" results_dir=adult_results_dir,\n",
+" analysis_report_file_path=analysis_report_file_path_Adult,\n",
" max_trees=0,\n",
")\n",
"print(f\"Adult report file: {adult_report}\")\n",
@@ -239,7 +245,7 @@
"source": [
"## Accessing a Classifiers' Basic Evaluation Metrics\n",
"\n",
-"We access the classifier's evaluation metrics by loading file at `iris_report` file with the Khiops function `read_analysis_results_file`:"
+"We access the classifier's evaluation metrics by loading the file at `iris_report` with the Khiops function `read_analysis_results_file`:"
]
},
{
@@ -292,7 +298,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"These objects are of class `PredictorPerformance` and have `accuracy` and `auc` attributes for these metrics:"
+"These objects are of class `PredictorPerformance`. They have access to `accuracy` and `auc` attributes:"
]
},
{
@@ -376,7 +382,7 @@
"metadata": {},
"source": [
"## Deploying a Classifier\n",
-"We are going to deploy the `Iris` classifier we have just trained on the same dataset (normally we would do this on new data). We saved the model in the file `iris_model_kdic`. This file is usually large and incomprehensible, so you should know what you are doing before editing it. Just this time let's take a quick look at its contents:"
+"We are going to deploy the `Iris` classifier we have just trained on the same dataset (normally we would do this on new data). We saved the model in the file `iris_model_kdic`. This file is usually large and incomprehensible, so you should know what you are doing before editing it. Let's take a quick look at its contents:"
]
},
{
@@ -392,12 +398,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"Note that the modeling dictionary contains 5 used variables:\n",
-"- `Class` : The original target of the dataset\n",
+"Note that the modeling dictionary contains 4 used variables:\n",
"- `PredictedClass` : The class with the highest probability according to the model\n",
"- `ProbClassIris-setosa`, `ProbClassIris-versicolor`, `ProbClassIris-virginica`: The probabilities of each class according to the model\n",
"\n",
-"These will be the columns of the output table when deploying the model:"
+"These will be the columns of the table obtained after deploying the model. This table will be saved at `iris_deployment_file`."
]
},
{
@@ -406,7 +411,7 @@
"metadata": {},
"outputs": [],
"source": [
-"iris_deployment_file = os.path.join(iris_results_dir, \"iris_deployment.txt\")\n",
+"iris_deployment_file = os.path.join(\"exercises\", \"Iris\", \"iris_deployment.txt\")\n",
"kh.deploy_model(\n",
" iris_model_kdic,\n",
" dictionary_name=\"SNB_Iris\",\n",
@@ -434,7 +439,7 @@
},
"outputs": [],
"source": [
-"adult_deployment_file = os.path.join(adult_results_dir, \"adult_deployment.txt\")\n",
+"adult_deployment_file = os.path.join(\"exercises\", \"Adult\", \"adult_deployment.txt\")\n",
"kh.deploy_model(\n",
" adult_model_kdic,\n",
" dictionary_name=\"SNB_Adult\",\n",
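Reviewer note: the recurring change in the notebook above is that `train_predictor` no longer receives a results directory (`results_dir`) but the full path of the report file (`analysis_report_file_path`). The sketch below only illustrates how the old and new path conventions relate for the `Iris` example. The helper names `analysis_report_path` and `model_dictionary_path` are hypothetical, not part of Khiops, and placing the model dictionary next to the report is an assumption based on the `AnalysisReport.model.kdic` name quoted in the notebook text.

```python
import os


def analysis_report_path(results_dir, report_name="AnalysisReport.khj"):
    """Hypothetical helper: build a V11-style report path from a V10-style results directory."""
    return os.path.join(results_dir, report_name)


def model_dictionary_path(report_path):
    """Hypothetical helper: derive the model dictionary path from the report path.

    Assumption: the dictionary sits next to the report and replaces the
    `.khj` extension with `.model.kdic`, as the notebook text suggests.
    """
    root, _ = os.path.splitext(report_path)
    return root + ".model.kdic"


# The directory previously stored in `iris_results_dir`
iris_report_path = analysis_report_path(os.path.join("exercises", "Iris"))
iris_model_path = model_dictionary_path(iris_report_path)

print(f"Iris analysis report file path: {iris_report_path}")
print(f"Iris model dictionary file path (assumed): {iris_model_path}")
```

In the notebook itself, the authoritative locations are the two values returned by `train_predictor`; the derived model path here is only a cross-check.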
@@ -78,9 +78,9 @@
"```\n",
"The `HeadlineId` variable is special because it is a _key_ that links a particular headline to its words (a 1:n relation).\n",
"\n",
-"*Note: There are other methods more appropriate for this text-mining problem. This multi-table setup is only for pedagogical purporses.*\n",
+"*Note: There are other methods more appropriate for this text-mining problem. This multi-table setup is only used for pedagogical purporses.*\n",
"\n",
-"To train a classifier with Khiops in this multi-table setup, this schema must be codified in the dictionary file. Let's check the contents of the `HeadlineSarcasm` dictionary file:"
+"To train a classifier with Khiops in this multi-table setup, this schema must be coded in a dictionary file. Let's check the contents of the `HeadlineSarcasm` dictionary file:"
]
},
{
@@ -101,11 +101,11 @@
"metadata": {},
"source": [
"As in the single-table case the `.kdic`file describes the schema for both tables, but note the following differences:\n",
-"- The dictionary for the table `Headline` is prefixed by the `Root` keyword to indicate that is the main one.\n",
-"- For both tables, their dictionary names are followed by `(HeadlineId)` to indicate that `HeadlineId` is the key of these tables.\n",
-"- The schema for the main table contains an extra special variable defined with the statement `Table(Words) HeadlineWords`. This is, in addition to sharing the same key variable, is necessary to indicate the `1:n` relationship between the main and secondary table.\n",
+"- The dictionary for the table `Headline` is prefixed by the `Root` keyword. It is here optional and simply tags the main dictionary `Headline` representing the statistical instances.\n",
+"- For both tables, dictionary names are followed by `(HeadlineId)` to indicate that `HeadlineId` is their key.\n",
+"- The schema of the main table contains an extra special variable defined with the statement `Table(Words) HeadlineWords`. This is, in addition to sharing the same key variable, necessary to indicate the `1:n` relationship between the main and secondary table.\n",
"\n",
-"Now let's store the location main and secondary tables and peek their contents:"
+"Now let's store the location of the main and secondary tables and peek their contents:"
]
},
{
@@ -117,7 +117,7 @@
"sarcasm_headlines_file = os.path.join(\"data\", \"HeadlineSarcasm\", \"Headlines.txt\")\n",
"sarcasm_words_file = os.path.join(\"data\", \"HeadlineSarcasm\", \"HeadlineWords.txt\")\n",
"\n",
-"print(f\"HeadlineSarcasm main table file: {sarcasm_headlines_file}\")\n",
+"print(f\"HeadlineSarcasm main table file location: {sarcasm_headlines_file}\")\n",
"print(\"\")\n",
"peek(sarcasm_headlines_file, n=3)\n",
"\n",
@@ -133,20 +133,20 @@
"The call to the `train_predictor` will be very similar to the single-table case but there are some differences. \n",
"\n",
"The first is that we must pass the path of the extra secondary data table. This is done with the `additional_data_tables` parameter that is a Python dictionary containing key-value pairs for each table. More precisely:\n",
-"- keys describe *data paths* of secondary tables. In this case only ``Headline`HeadlineWords``\n",
-"- values describe the *file paths* of secondary tables. In this case only the file path we stored in `sarcasm_words_file`\n",
+"- keys describe *data paths* of secondary tables. In this case only, it is ``HeadlineWords``\n",
+"- values describe the *file paths* of secondary tables. In this case only, it is the file path we stored in `sarcasm_words_file`\n",
"\n",
-"*Note: For understanding what data paths are see the \"Multi-Table Tasks\" section of the Khiops `core.api` documentation*\n",
+"*Note: To understand what data paths are, please check the \"Multi-Table Tasks\" section of the Khiops `core.api` documentation*\n",
"\n",
-"Secondly, we specify how many features/aggregates Khiops will create with its multi-table AutoML mode. For the `HeadlineSarcasm` dataset Khiops can create features such as:\n",
+"Secondly, we must specify how many features/aggregates Khiops will create (at most) with its multi-table AutoML mode. For the `HeadlineSarcasm` dataset Khiops can create features such as:\n",
"- *Number of different words in the headline* \n",
"- *Most common word in the headline before the third one*\n",
"- *Number of times the word 'the' appears*\n",
"- ...\n",
"\n",
"It will then evaluate, select and combine the created features to build a classifier. We'll ask to create `1000` of these features (the default is `100`).\n",
"\n",
-"With these considerations, let's setup the some extra variables and train the classifier:"
+"With these considerations, let's now train the classifier:"
]
},
{
@@ -155,15 +155,17 @@
"metadata": {},
"outputs": [],
"source": [
-"sarcasm_results_dir = os.path.join(\"exercises\", \"HeadlineSarcasm\")\n",
+"analysis_report_file_path_Sarcasm = os.path.join(\n",
+" \"exercises\", \"HeadlineSarcasm\", \"AnalysisReport.khj\"\n",
+")\n",
"\n",
"sarcasm_report, sarcasm_model_kdic = kh.train_predictor(\n",
" sarcasm_kdic,\n",
" dictionary_name=\"Headline\", # This must be the main/root dictionary\n",
" data_table_path=sarcasm_headlines_file, # This must be the data file for the main table\n",
" target_variable=\"IsSarcasm\",\n",
-" results_dir=sarcasm_results_dir,\n",
-" additional_data_tables={\"Headline`HeadlineWords\": sarcasm_words_file},\n",
+" analysis_report_file_path=analysis_report_file_path_Sarcasm,\n",
+" additional_data_tables={\"HeadlineWords\": sarcasm_words_file},\n",
" max_constructed_variables=1000, # by default Khiops constructs 100 variables for AutoML multi-table\n",
" max_trees=0, # by default Khiops constructs 10 decision tree variables\n",
")\n",
@@ -192,7 +194,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"*Note: In the multi-table case, the input tables must be sorted by their key column in lexicographical order. To do this you may use the Khiops `sort_data_table` function or your favorite software. The examples of this tutorial have their tables pre-sorted.*"
+"*Note: In the multi-table case, the input tables must be sorted by their key column in lexicographical order. To do this, you may use the Khiops `sort_data_table` function. The examples of this tutorial have their tables pre-sorted.*"
]
},
{
@@ -201,7 +203,7 @@
"source": [
"### Exercise time!\n",
"\n",
-"Repeat the previous steps with the `AccidentsSummary` dataset. It describes the characteristics of traffic accidents that happened in France in 2018. It has two tables with the following schema:\n",
+"Repeat the previous steps with the `AccidentsSummary` dataset. This dataset describes the characteristics of traffic accidents that happened in France in 2018. It has two tables with the following schema:\n",
"```\n",
"+---------------+\n",
"|Accidents |\n",
@@ -220,7 +222,7 @@
" +---1:n--->|... |\n",
" +---------------+\n",
"```\n",
-"So for each accident we have its characteristics (such as `Gravity` or `Light` conditions) and those of each involved vehicle (its `Direction` or `PassengerNumber`). The main task for this dataset is to predict the variable `Gravity` that has two possible values:`Lethal` and `NonLethal`.\n",
+"For each accident, we have its characteristics (such as `Gravity` or `Light` conditions) and those of each involved vehicle (its `Direction` or `PassengerNumber`). The main task for this dataset is to predict the variable `Gravity` that has two possible values:`Lethal` and `NonLethal`.\n",
"\n",
"We first save the paths of the `AccidentsSummary` dictionary file and data table files into variables:"
]
@@ -275,7 +277,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"We now save the results directory for this exercise:"
+"We now define the path of the modeling report for this exercise:"
]
},
{
@@ -284,8 +286,9 @@
"metadata": {},
"outputs": [],
"source": [
-"accidents_results_dir = os.path.join(\"exercises\", \"AccidentSummary\")\n",
-"print(f\"AccidentsSummary exercise results directory: {accidents_results_dir}\")"
+"analysis_report_file_path_Accidents = os.path.join(\n",
+" \"exercises\", \"AccidentSummary\", \"AnalysisReport.khj\"\n",
+")"
]
},
{
@@ -297,7 +300,7 @@
"\n",
"Do not forget:\n",
"- The target variable is `Gravity`\n",
-"- The key for the `additional_data_tables` parameter is ``Accident`Vehicles`` and its value that of `vehicles_data_file`\n",
+"- The key for the `additional_data_tables` parameter is ``Vehicles`` and its value that of `vehicles_data_file`\n",
"- Set `max_trees=0`"
]
},
@@ -314,8 +317,8 @@
" dictionary_name=\"Accident\",\n",
" data_table_path=accidents_data_file,\n",
" target_variable=\"Gravity\",\n",
-" results_dir=accidents_results_dir,\n",
-" additional_data_tables={\"Accident`Vehicles\": vehicles_data_file},\n",
+" analysis_report_file_path=analysis_report_file_path_Accidents,\n",
+" additional_data_tables={\"Vehicles\": vehicles_data_file},\n",
" max_constructed_variables=1000,\n",
" max_trees=0,\n",
")\n",
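Reviewer note: the second recurring change in this notebook is the key format of `additional_data_tables`. The old keys started with the main dictionary name followed by a backtick (``Headline`HeadlineWords``, ``Accident`Vehicles``); the new keys keep only the table variable name (`HeadlineWords`, `Vehicles`). A minimal sketch of that rewrite follows. The function `strip_root_dictionary` is hypothetical, not part of Khiops, and it only covers the two-level data paths that appear in this diff; deeper schemas are out of scope and should be checked against the Khiops data path documentation.

```python
def strip_root_dictionary(old_data_path, separator="`"):
    """Hypothetical helper: drop the leading root dictionary name from an old-style data path.

    Paths without the separator are returned unchanged, so applying the
    helper to an already migrated key is harmless.
    """
    _, found, remainder = old_data_path.partition(separator)
    return remainder if found else old_data_path


def migrate_additional_data_tables(old_tables):
    """Rewrite every key of an `additional_data_tables` mapping, keeping the file paths."""
    return {strip_root_dictionary(key): path for key, path in old_tables.items()}


# Placeholder file names, used only to illustrate the key rewrite
old_tables = {
    "Headline`HeadlineWords": "HeadlineWords.txt",
    "Accident`Vehicles": "Vehicles.txt",
}
new_tables = migrate_additional_data_tables(old_tables)
print(new_tables)
```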