Commit d23b991 — Adapt Core API tutorials to Khiops V11
1 parent: 2d39c8e

4 files changed: +94 −67 lines

Core Basics 1 - Train, Evaluate and Deploy a Classifier.ipynb

Lines changed: 35 additions & 25 deletions
@@ -32,7 +32,7 @@
 " print(\"\")\n",
 "\n",
 "\n",
-"# If there are any issues you may Khiops status with the following command\n",
+"# If there are any issues, you may print Khiops status with the following command:\n",
 "# kh.get_runner().print_status()"
 ]
 },
@@ -43,12 +43,12 @@
 "## Training a Classifier\n",
 "We'll train a classifier for the `Iris` dataset. This is a classical dataset containing the data of different plants belonging to the genus _Iris_. It contains 150 records, 50 for each of three variants of _Iris_: _Setosa_, _Virginica_ and _Versicolor_. The records for each sample contain the length and width of its petal and sepal. The standard task for this dataset is to construct a classifier for the type of _Iris_ taking as inputs the length and width characteristics.\n",
 "\n",
-"Now to train a classifier with Khiops we use two types of files:\n",
+"Now to train a classifier with Khiops, we use two types of files:\n",
 "- A plain-text delimited data file (for example a `csv` file)\n",
 "- A _dictionary_ file which describes the schema of the above data table (`.kdic` file extension)\n",
 "\n",
 "\n",
-"Let's save into variables the locations of these files for the `Iris` dataset and then take a look at their contents:"
+"Let's save the locations of these files for the `Iris` dataset into variables and then take a look at their contents:"
 ]
 },
 {
@@ -70,7 +70,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Note that the _Iris_ variant information is in the column `Class`. Now let's specify directory to save our results:"
+"Note that the _Iris_ variant information is in the column `Class`. Now let's specify the path of the analysis report file:"
 ]
 },
 {
@@ -79,17 +79,18 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"iris_results_dir = os.path.join(\"exercises\", \"Iris\")\n",
-"print(f\"Iris results directory: {iris_results_dir}\")"
+"analysis_report_file_path_Iris = os.path.join(\"exercises\", \"Iris\", \"AnalysisReport.khj\")\n",
+"\n",
+"print(f\"Iris analysis report file path: {analysis_report_file_path_Iris}\")"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
 "We are now ready to train the classifier with the Khiops function `train_predictor`. This method returns a tuple containing the location of two files:\n",
-"- the modeling report (`AllReports.khj`): A JSON file containing information such as the informativeness of each variable, those selected for the model and performance metrics.\n",
-"- model's _dictionary_ file (`Modeling.kdic`): This file is an enriched version of the initial dictionary file that contains the model. It can be used to make predictions on new data."
+"- the modeling report (`AnalysisReport.khj`): A JSON file containing information such as the informativeness of each variable, those selected for the model and performance metrics. It is saved at the path stored in the `analysis_report_file_path_Iris` variable that we just defined.\n",
+"- the model's _dictionary_ file (`AnalysisReport.model.kdic`): This file is an enriched version of the initial dictionary file that contains the model. It can be used to make predictions on new data."
 ]
 },
 {
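Since V11 takes a report file path instead of a results directory, preparing that path can be sketched with the standard library alone. The `exercises/Iris` layout below is the one used throughout the tutorial; creating the parent directory up front is only a precaution, as the tutorial does not state whether `train_predictor` creates it itself:

```python
import os

# Assumed layout from the tutorial: the report is written under exercises/Iris
analysis_report_file_path_Iris = os.path.join("exercises", "Iris", "AnalysisReport.khj")

# Precaution on a fresh checkout; whether train_predictor would create this
# directory itself is an assumption we do not rely on
os.makedirs(os.path.dirname(analysis_report_file_path_Iris), exist_ok=True)

print(f"Iris analysis report file path: {analysis_report_file_path_Iris}")
```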
@@ -103,7 +104,7 @@
 " dictionary_name=\"Iris\",\n",
 " data_table_path=iris_data_file,\n",
 " target_variable=\"Class\",\n",
-" results_dir=iris_results_dir,\n",
+" analysis_report_file_path=analysis_report_file_path_Iris,\n",
 " max_trees=0, # by default Khiops constructs 10 decision tree variables\n",
 ")\n",
 "print(f\"Iris report file: {iris_report}\")\n",
@@ -114,7 +115,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"You can verify that the result files were created in `iris_results_dir`. In the next sections, we'll use the file at `iris_report` to assess the models' performances and the file at `iris_model_kdic` to deploy it. Now we can see the report with the Khiops Visualization app:"
+"In the next sections, we'll use the file at `iris_report` to assess the model's performance and the file at `iris_model_kdic` to deploy it. Now we can have a look at the report with the Khiops Visualization app:"
 ]
 },
 {
@@ -133,9 +134,9 @@
 "source": [
 "### Exercise\n",
 "\n",
-"We'll repeat the examples on this notebook with the `Adult` dataset. It contains characteristics of the adult population in USA such as age, gender and education and its task is to predict the variable `class`, which indicates if the individual earns `more` or `less` than 50,000 dollars.\n",
+"We'll repeat the previous steps on the `Adult` dataset. This dataset contains characteristics of the adult population in the USA, such as age, gender and education. The task is to predict the variable `class`, which indicates whether the individual earns `more` or `less` than 50,000 dollars.\n",
 "\n",
-"Let's start by putting into variables the paths for the `Adult` dataset:"
+"Let's start by putting the paths for the `Adult` dataset into variables:"
 ]
 },
 {
@@ -173,7 +174,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"We now save the results directory for this exercise:"
+"We now specify the path of the analysis report file for this exercise:"
 ]
 },
 {
@@ -182,16 +183,19 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"adult_results_dir = os.path.join(\"exercises\", \"Adult\")\n",
-"print(f\"Adult results directory: {adult_results_dir}\")"
+"analysis_report_file_path_Adult = os.path.join(\n",
+" \"exercises\", \"Adult\", \"AnalysisReport.khj\"\n",
+")\n",
+"\n",
+"print(f\"Adult analysis report file path: {analysis_report_file_path_Adult}\")"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
 "#### Train a classifier for the `Adult` database\n",
-"Note the name of the target variable is `class` (**in lower case!**). Do not forget to set `max_trees=0`. Save the resulting file locations into the variables `adult_report` and `adult_model_kdic` and print them"
+"Note that the name of the target variable is `class` (**in lower case!**). Do not forget to set `max_trees=0`. Save the resulting file locations into the variables `adult_report` and `adult_model_kdic` and print them."
 ]
 },
 {
@@ -207,7 +211,7 @@
 " dictionary_name=\"Adult\",\n",
 " data_table_path=adult_data_file,\n",
 " target_variable=\"class\",\n",
-" results_dir=adult_results_dir,\n",
+" analysis_report_file_path=analysis_report_file_path_Adult,\n",
 " max_trees=0,\n",
 ")\n",
 "print(f\"Adult report file: {adult_report}\")\n",
@@ -239,7 +243,7 @@
 "source": [
 "## Accessing a Classifier's Basic Evaluation Metrics\n",
 "\n",
-"We access the classifier's evaluation metrics by loading file at `iris_report` file with the Khiops function `read_analysis_results_file`:"
+"We access the classifier's evaluation metrics by loading the file at `iris_report` with the Khiops function `read_analysis_results_file`:"
 ]
 },
 {
@@ -292,7 +296,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"These objects are of class `PredictorPerformance` and have `accuracy` and `auc` attributes for these metrics:"
+"These objects are of class `PredictorPerformance`. They expose `accuracy` and `auc` attributes for these metrics:"
 ]
 },
 {
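Because the `.khj` report is plain JSON, it can also be inspected without the Khiops API, for example with the standard `json` module. The sketch below builds a synthetic stand-in report; the key names (`evaluationReport`, `accuracy`, `auc`) are illustrative assumptions, not the documented `.khj` schema:

```python
import json
import os
import tempfile

# Synthetic stand-in for an AnalysisReport.khj file; the key layout below is
# an assumption for illustration, not the real Khiops report schema
fake_report = {"evaluationReport": {"accuracy": 0.96, "auc": 0.99}}

report_path = os.path.join(tempfile.mkdtemp(), "AnalysisReport.khj")
with open(report_path, "w") as f:
    json.dump(fake_report, f)

# A .khj report loads like any other JSON document
with open(report_path) as f:
    report = json.load(f)

metrics = report["evaluationReport"]
print(metrics["accuracy"], metrics["auc"])
```

For real reports, `read_analysis_results_file` remains the supported route, since it returns typed objects such as `PredictorPerformance` instead of raw dictionaries.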
@@ -376,7 +380,7 @@
 "metadata": {},
 "source": [
 "## Deploying a Classifier\n",
-"We are going to deploy the `Iris` classifier we have just trained on the same dataset (normally we would do this on new data). We saved the model in the file `iris_model_kdic`. This file is usually large and incomprehensible, so you should know what you are doing before editing it. Just this time let's take a quick look at its contents:"
+"We are going to deploy the `Iris` classifier we have just trained on the same dataset (normally we would do this on new data). We saved the model in the file `iris_model_kdic`. This file is usually large and incomprehensible, so you should know what you are doing before editing it. Let's take a quick look at its contents:"
 ]
 },
 {
@@ -392,12 +396,11 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Note that the modeling dictionary contains 5 used variables:\n",
-"- `Class` : The original target of the dataset\n",
+"Note that the modeling dictionary contains 4 used variables:\n",
 "- `PredictedClass` : The class with the highest probability according to the model\n",
 "- `ProbClassIris-setosa`, `ProbClassIris-versicolor`, `ProbClassIris-virginica`: The probabilities of each class according to the model\n",
 "\n",
-"These will be the columns of the output table when deploying the model:"
+"These will be the columns of the table obtained after deploying the model. This table will be saved at `iris_deployment_file`."
 ]
 },
 {
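The relation between the probability columns and `PredictedClass` can be illustrated in plain Python: the predicted class is simply the one with the highest probability. The column names come from the tutorial; the probability values below are made up:

```python
# Hypothetical deployed probabilities for one Iris record (values made up);
# PredictedClass is the class whose probability column is highest
probs = {
    "ProbClassIris-setosa": 0.02,
    "ProbClassIris-versicolor": 0.91,
    "ProbClassIris-virginica": 0.07,
}

# Strip the "ProbClass" prefix of the winning column to recover the class label
predicted_class = max(probs, key=probs.get).removeprefix("ProbClass")
print(predicted_class)  # Iris-versicolor
```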
@@ -406,7 +409,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"iris_deployment_file = os.path.join(iris_results_dir, \"iris_deployment.txt\")\n",
+"iris_deployment_file = os.path.join(\"exercises\", \"Iris\", \"iris_deployment.txt\")\n",
 "kh.deploy_model(\n",
 " iris_model_kdic,\n",
 " dictionary_name=\"SNB_Iris\",\n",
@@ -434,7 +437,7 @@
 },
 "outputs": [],
 "source": [
-"adult_deployment_file = os.path.join(adult_results_dir, \"adult_deployment.txt\")\n",
+"adult_deployment_file = os.path.join(\"exercises\", \"Adult\", \"adult_deployment.txt\")\n",
 "kh.deploy_model(\n",
 " adult_model_kdic,\n",
 " dictionary_name=\"SNB_Adult\",\n",
@@ -443,6 +446,13 @@
 ")\n",
 "peek(adult_deployment_file)"
 ]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": []
 }
 ],
 "metadata": {

Core Basics 2 - Train a Classifier on a Star Multi-Table Dataset.ipynb

Lines changed: 27 additions & 24 deletions
@@ -78,9 +78,9 @@
 "```\n",
 "The `HeadlineId` variable is special because it is a _key_ that links a particular headline to its words (a 1:n relation).\n",
 "\n",
-"*Note: There are other methods more appropriate for this text-mining problem. This multi-table setup is only for pedagogical purporses.*\n",
+"*Note: There are other methods more appropriate for this text-mining problem. This multi-table setup is only used for pedagogical purposes.*\n",
 "\n",
-"To train a classifier with Khiops in this multi-table setup, this schema must be codified in the dictionary file. Let's check the contents of the `HeadlineSarcasm` dictionary file:"
+"To train a classifier with Khiops in this multi-table setup, this schema must be coded in a dictionary file. Let's check the contents of the `HeadlineSarcasm` dictionary file:"
 ]
 },
 {
@@ -101,11 +101,11 @@
 "metadata": {},
 "source": [
 "As in the single-table case, the `.kdic` file describes the schema for both tables, but note the following differences:\n",
-"- The dictionary for the table `Headline` is prefixed by the `Root` keyword to indicate that is the main one.\n",
-"- For both tables, their dictionary names are followed by `(HeadlineId)` to indicate that `HeadlineId` is the key of these tables.\n",
-"- The schema for the main table contains an extra special variable defined with the statement `Table(Words) HeadlineWords`. This is, in addition to sharing the same key variable, is necessary to indicate the `1:n` relationship between the main and secondary table.\n",
+"- The dictionary for the table `Headline` is prefixed by the `Root` keyword to indicate that it is the main one.\n",
+"- For both tables, dictionary names are followed by `(HeadlineId)` to indicate that `HeadlineId` is their key.\n",
+"- The schema of the main table contains an extra special variable defined with the statement `Table(Words) HeadlineWords`. This is, in addition to sharing the same key variable, necessary to indicate the `1:n` relationship between the main and secondary table.\n",
 "\n",
-"Now let's store the location main and secondary tables and peek their contents:"
+"Now let's store the locations of the main and secondary tables and peek their contents:"
 ]
 },
 {
@@ -117,7 +117,7 @@
 "sarcasm_headlines_file = os.path.join(\"data\", \"HeadlineSarcasm\", \"Headlines.txt\")\n",
 "sarcasm_words_file = os.path.join(\"data\", \"HeadlineSarcasm\", \"HeadlineWords.txt\")\n",
 "\n",
-"print(f\"HeadlineSarcasm main table file: {sarcasm_headlines_file}\")\n",
+"print(f\"HeadlineSarcasm main table file location: {sarcasm_headlines_file}\")\n",
 "print(\"\")\n",
 "peek(sarcasm_headlines_file, n=3)\n",
 "\n",
@@ -133,20 +133,20 @@
 "The call to `train_predictor` will be very similar to the single-table case, but there are some differences.\n",
 "\n",
 "The first is that we must pass the path of the extra secondary data table. This is done with the `additional_data_tables` parameter, which is a Python dictionary containing key-value pairs for each table. More precisely:\n",
-"- keys describe *data paths* of secondary tables. In this case only ``Headline`HeadlineWords``\n",
-"- values describe the *file paths* of secondary tables. In this case only the file path we stored in `sarcasm_words_file`\n",
+"- keys describe *data paths* of secondary tables. In this case, it is only ``HeadlineWords``\n",
+"- values describe the *file paths* of secondary tables. In this case, it is only the file path we stored in `sarcasm_words_file`\n",
 "\n",
-"*Note: For understanding what data paths are see the \"Multi-Table Tasks\" section of the Khiops `core.api` documentation*\n",
+"*Note: To understand what data paths are, please check the \"Multi-Table Tasks\" section of the Khiops `core.api` documentation*\n",
 "\n",
-"Secondly, we specify how many features/aggregates Khiops will create with its multi-table AutoML mode. For the `HeadlineSarcasm` dataset Khiops can create features such as:\n",
+"Secondly, we must specify how many features/aggregates Khiops will create (at most) with its multi-table AutoML mode. For the `HeadlineSarcasm` dataset Khiops can create features such as:\n",
 "- *Number of different words in the headline*\n",
 "- *Most common word in the headline before the third one*\n",
 "- *Number of times the word 'the' appears*\n",
 "- ...\n",
 "\n",
 "It will then evaluate, select and combine the created features to build a classifier. We'll ask it to create `1000` of these features (the default is `100`).\n",
 "\n",
-"With these considerations, let's setup the some extra variables and train the classifier:"
+"With these considerations, let's set up some extra variables and train the classifier:"
 ]
 },
 {
155155
"metadata": {},
156156
"outputs": [],
157157
"source": [
158-
"sarcasm_results_dir = os.path.join(\"exercises\", \"HeadlineSarcasm\")\n",
158+
"analysis_report_file_path_Sarcasm = os.path.join(\n",
159+
" \"exercises\", \"HeadlineSarcasm\", \"AnalysisReport.khj\"\n",
160+
")\n",
159161
"\n",
160162
"sarcasm_report, sarcasm_model_kdic = kh.train_predictor(\n",
161163
" sarcasm_kdic,\n",
162164
" dictionary_name=\"Headline\", # This must be the main/root dictionary\n",
163165
" data_table_path=sarcasm_headlines_file, # This must be the data file for the main table\n",
164166
" target_variable=\"IsSarcasm\",\n",
165-
" results_dir=sarcasm_results_dir,\n",
166-
" additional_data_tables={\"Headline`HeadlineWords\": sarcasm_words_file},\n",
167+
" analysis_report_file_path=analysis_report_file_path_Sarcasm,\n",
168+
" additional_data_tables={\"HeadlineWords\": sarcasm_words_file},\n",
167169
" max_constructed_variables=1000, # by default Khiops constructs 100 variables for AutoML multi-table\n",
168170
" max_trees=0, # by default Khiops constructs 10 decision tree variables\n",
169171
")\n",
@@ -192,7 +194,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"*Note: In the multi-table case, the input tables must be sorted by their key column in lexicographical order. To do this you may use the Khiops `sort_data_table` function or your favorite software. The examples of this tutorial have their tables pre-sorted.*"
+"*Note: In the multi-table case, the input tables must be sorted by their key column in lexicographical order. To do this, you may use the Khiops `sort_data_table` function or your favorite software. The examples of this tutorial have their tables pre-sorted.*"
 ]
 },
 {
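What the sorting requirement means can be sketched with the standard library: sorting a tab-separated table by its key column as strings, not as numbers. The file content and column names below are made up for the example; for real Khiops tables, `sort_data_table` is the supported route:

```python
import csv
import io

# A tiny made-up secondary table, keyed by HeadlineId (not the real data)
raw = "HeadlineId\tWord\nh10\tcat\nh2\tdog\nh1\tsun\n"

rows = list(csv.reader(io.StringIO(raw), delimiter="\t"))
header, body = rows[0], rows[1:]

# Lexicographical (string) order on the key column: note "h10" sorts before "h2"
body.sort(key=lambda row: row[0])

sorted_keys = [row[0] for row in body]
print(sorted_keys)  # ['h1', 'h10', 'h2']
```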
@@ -201,7 +203,7 @@
 "source": [
 "### Exercise time!\n",
 "\n",
-"Repeat the previous steps with the `AccidentsSummary` dataset. It describes the characteristics of traffic accidents that happened in France in 2018. It has two tables with the following schema:\n",
+"Repeat the previous steps with the `AccidentsSummary` dataset. This dataset describes the characteristics of traffic accidents that happened in France in 2018. It has two tables with the following schema:\n",
 "```\n",
 "+---------------+\n",
 "|Accidents |\n",
@@ -220,7 +222,7 @@
 " +---1:n--->|... |\n",
 " +---------------+\n",
 "```\n",
-"So for each accident we have its characteristics (such as `Gravity` or `Light` conditions) and those of each involved vehicle (its `Direction` or `PassengerNumber`). The main task for this dataset is to predict the variable `Gravity` that has two possible values:`Lethal` and `NonLethal`.\n",
+"For each accident, we have its characteristics (such as `Gravity` or `Light` conditions) and those of each involved vehicle (its `Direction` or `PassengerNumber`). The main task for this dataset is to predict the variable `Gravity`, which has two possible values: `Lethal` and `NonLethal`.\n",
 "\n",
 "We first save the paths of the `AccidentsSummary` dictionary file and data table files into variables:"
 ]
@@ -275,7 +277,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"We now save the results directory for this exercise:"
+"We now define the path of the modeling report for this exercise:"
 ]
 },
 {
@@ -284,8 +286,9 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"accidents_results_dir = os.path.join(\"exercises\", \"AccidentSummary\")\n",
-"print(f\"AccidentsSummary exercise results directory: {accidents_results_dir}\")"
+"analysis_report_file_path_Accidents = os.path.join(\n",
+" \"exercises\", \"AccidentSummary\", \"AnalysisReport.khj\"\n",
+")"
 ]
 },
 {
@@ -297,7 +300,7 @@
 "\n",
 "Do not forget:\n",
 "- The target variable is `Gravity`\n",
-"- The key for the `additional_data_tables` parameter is ``Accident`Vehicles`` and its value that of `vehicles_data_file`\n",
+"- The key for the `additional_data_tables` parameter is ``Vehicles`` and its value is that of `vehicles_data_file`\n",
 "- Set `max_trees=0`"
 ]
 },
 {
@@ -314,8 +317,8 @@
 " dictionary_name=\"Accident\",\n",
 " data_table_path=accidents_data_file,\n",
 " target_variable=\"Gravity\",\n",
-" results_dir=accidents_results_dir,\n",
-" additional_data_tables={\"Accident`Vehicles\": vehicles_data_file},\n",
+" analysis_report_file_path=analysis_report_file_path_Accidents,\n",
+" additional_data_tables={\"Vehicles\": vehicles_data_file},\n",
 " max_constructed_variables=1000,\n",
 " max_trees=0,\n",
 ")\n",

0 commit comments