|
78 | 78 | "```\n",
|
79 | 79 | "The `HeadlineId` variable is special because it is a _key_ that links a particular headline to its words (a 1:n relation).\n",
|
80 | 80 | "\n",
|
81 |
| - "*Note: There are other methods more appropriate for this text-mining problem. This multi-table setup is only for pedagogical purporses.*\n", |
| 81 | + "*Note: There are other methods more appropriate for this text-mining problem. This multi-table setup is only used for pedagogical purposes.*\n", |
82 | 82 | "\n",
|
83 |
| - "To train a classifier with Khiops in this multi-table setup, this schema must be codified in the dictionary file. Let's check the contents of the `HeadlineSarcasm` dictionary file:" |
| 83 | + "To train a classifier with Khiops in this multi-table setup, this schema must be coded in a dictionary file. Let's check the contents of the `HeadlineSarcasm` dictionary file:" |
84 | 84 | ]
|
85 | 85 | },
|
86 | 86 | {
|
|
101 | 101 | "metadata": {},
|
102 | 102 | "source": [
|
103 | 103 | "As in the single-table case, the `.kdic` file describes the schema for both tables, but note the following differences:\n",
|
104 |
| - "- The dictionary for the table `Headline` is prefixed by the `Root` keyword to indicate that is the main one.\n", |
105 |
| - "- For both tables, their dictionary names are followed by `(HeadlineId)` to indicate that `HeadlineId` is the key of these tables.\n", |
106 |
| - "- The schema for the main table contains an extra special variable defined with the statement `Table(Words) HeadlineWords`. This is, in addition to sharing the same key variable, is necessary to indicate the `1:n` relationship between the main and secondary table.\n", |
| 104 | + "- The dictionary for the table `Headline` is prefixed by the `Root` keyword to indicate that it is the main one.\n", |
| 105 | + "- For both tables, dictionary names are followed by `(HeadlineId)` to indicate that `HeadlineId` is their key.\n", |
| 106 | + "- The schema of the main table contains an extra special variable defined with the statement `Table(Words) HeadlineWords`. In addition to sharing the same key variable, this statement is necessary to indicate the `1:n` relationship between the main and secondary tables.\n", |
107 | 107 | "\n",
|
108 |
| - "Now let's store the location main and secondary tables and peek their contents:" |
| 108 | + "Now let's store the locations of the main and secondary tables and peek at their contents:" |
109 | 109 | ]
|
110 | 110 | },
|
111 | 111 | {
|
|
117 | 117 | "sarcasm_headlines_file = os.path.join(\"data\", \"HeadlineSarcasm\", \"Headlines.txt\")\n",
|
118 | 118 | "sarcasm_words_file = os.path.join(\"data\", \"HeadlineSarcasm\", \"HeadlineWords.txt\")\n",
|
119 | 119 | "\n",
|
120 |
| - "print(f\"HeadlineSarcasm main table file: {sarcasm_headlines_file}\")\n", |
| 120 | + "print(f\"HeadlineSarcasm main table file location: {sarcasm_headlines_file}\")\n", |
121 | 121 | "print(\"\")\n",
|
122 | 122 | "peek(sarcasm_headlines_file, n=3)\n",
|
123 | 123 | "\n",
|
|
133 | 133 | "The call to `train_predictor` is very similar to the single-table case, but there are some differences. \n",
|
134 | 134 | "\n",
|
135 | 135 | "The first is that we must pass the path of the extra secondary data table. This is done with the `additional_data_tables` parameter that is a Python dictionary containing key-value pairs for each table. More precisely:\n",
|
136 |
| - "- keys describe *data paths* of secondary tables. In this case only ``Headline`HeadlineWords``\n", |
137 |
| - "- values describe the *file paths* of secondary tables. In this case only the file path we stored in `sarcasm_words_file`\n", |
| 136 | + "- keys describe *data paths* of secondary tables. In this case there is only one: ``HeadlineWords``\n", |
| 137 | + "- values describe the *file paths* of secondary tables. In this case there is only one: the file path we stored in `sarcasm_words_file`\n", |
138 | 138 | "\n",
|
139 |
| - "*Note: For understanding what data paths are see the \"Multi-Table Tasks\" section of the Khiops `core.api` documentation*\n", |
| 139 | + "*Note: To understand what data paths are, please check the \"Multi-Table Tasks\" section of the Khiops `core.api` documentation*\n", |
140 | 140 | "\n",
|
141 |
| - "Secondly, we specify how many features/aggregates Khiops will create with its multi-table AutoML mode. For the `HeadlineSarcasm` dataset Khiops can create features such as:\n", |
| 141 | + "Secondly, we must specify how many features/aggregates Khiops will create (at most) with its multi-table AutoML mode. For the `HeadlineSarcasm` dataset Khiops can create features such as:\n", |
142 | 142 | "- *Number of different words in the headline* \n",
|
143 | 143 | "- *Most common word in the headline before the third one*\n",
|
144 | 144 | "- *Number of times the word 'the' appears*\n",
|
145 | 145 | "- ...\n",
|
146 | 146 | "\n",
|
147 | 147 | "It will then evaluate, select and combine the created features to build a classifier. We'll ask to create `1000` of these features (the default is `100`).\n",
|
148 | 148 | "\n",
|
149 |
| - "With these considerations, let's setup the some extra variables and train the classifier:" |
| 149 | + "With these considerations, let's set up some extra variables and train the classifier:" |
150 | 150 | ]
|
151 | 151 | },
|
152 | 152 | {
|
|
155 | 155 | "metadata": {},
|
156 | 156 | "outputs": [],
|
157 | 157 | "source": [
|
158 |
| - "sarcasm_results_dir = os.path.join(\"exercises\", \"HeadlineSarcasm\")\n", |
| 158 | + "analysis_report_file_path_Sarcasm = os.path.join(\n", |
| 159 | + " \"exercises\", \"HeadlineSarcasm\", \"AnalysisReport.khj\"\n", |
| 160 | + ")\n", |
159 | 161 | "\n",
|
160 | 162 | "sarcasm_report, sarcasm_model_kdic = kh.train_predictor(\n",
|
161 | 163 | " sarcasm_kdic,\n",
|
162 | 164 | " dictionary_name=\"Headline\", # This must be the main/root dictionary\n",
|
163 | 165 | " data_table_path=sarcasm_headlines_file, # This must be the data file for the main table\n",
|
164 | 166 | " target_variable=\"IsSarcasm\",\n",
|
165 |
| - " results_dir=sarcasm_results_dir,\n", |
166 |
| - " additional_data_tables={\"Headline`HeadlineWords\": sarcasm_words_file},\n", |
| 167 | + " analysis_report_file_path=analysis_report_file_path_Sarcasm,\n", |
| 168 | + " additional_data_tables={\"HeadlineWords\": sarcasm_words_file},\n", |
167 | 169 | " max_constructed_variables=1000, # by default Khiops constructs 100 variables for AutoML multi-table\n",
|
168 | 170 | " max_trees=0, # by default Khiops constructs 10 decision tree variables\n",
|
169 | 171 | ")\n",
|
|
192 | 194 | "cell_type": "markdown",
|
193 | 195 | "metadata": {},
|
194 | 196 | "source": [
|
195 |
| - "*Note: In the multi-table case, the input tables must be sorted by their key column in lexicographical order. To do this you may use the Khiops `sort_data_table` function or your favorite software. The examples of this tutorial have their tables pre-sorted.*" |
| 197 | + "*Note: In the multi-table case, the input tables must be sorted by their key column in lexicographical order. To do this, you may use the Khiops `sort_data_table` function or your favorite software. The examples of this tutorial have their tables pre-sorted.*" |
196 | 198 | ]
|
197 | 199 | },
|
198 | 200 | {
|
|
201 | 203 | "source": [
|
202 | 204 | "### Exercise time!\n",
|
203 | 205 | "\n",
|
204 |
| - "Repeat the previous steps with the `AccidentsSummary` dataset. It describes the characteristics of traffic accidents that happened in France in 2018. It has two tables with the following schema:\n", |
| 206 | + "Repeat the previous steps with the `AccidentsSummary` dataset. This dataset describes the characteristics of traffic accidents that happened in France in 2018. It has two tables with the following schema:\n", |
205 | 207 | "```\n",
|
206 | 208 | "+---------------+\n",
|
207 | 209 | "|Accidents |\n",
|
|
220 | 222 | " +---1:n--->|... |\n",
|
221 | 223 | " +---------------+\n",
|
222 | 224 | "```\n",
|
223 |
| - "So for each accident we have its characteristics (such as `Gravity` or `Light` conditions) and those of each involved vehicle (its `Direction` or `PassengerNumber`). The main task for this dataset is to predict the variable `Gravity` that has two possible values:`Lethal` and `NonLethal`.\n", |
| 225 | + "For each accident, we have its characteristics (such as `Gravity` or `Light` conditions) and those of each involved vehicle (its `Direction` or `PassengerNumber`). The main task for this dataset is to predict the variable `Gravity`, which has two possible values: `Lethal` and `NonLethal`.\n", |
224 | 226 | "\n",
|
225 | 227 | "We first save the paths of the `AccidentsSummary` dictionary file and data table files into variables:"
|
226 | 228 | ]
|
|
275 | 277 | "cell_type": "markdown",
|
276 | 278 | "metadata": {},
|
277 | 279 | "source": [
|
278 |
| - "We now save the results directory for this exercise:" |
| 280 | + "We now define the path of the modeling report for this exercise:" |
279 | 281 | ]
|
280 | 282 | },
|
281 | 283 | {
|
|
284 | 286 | "metadata": {},
|
285 | 287 | "outputs": [],
|
286 | 288 | "source": [
|
287 |
| - "accidents_results_dir = os.path.join(\"exercises\", \"AccidentSummary\")\n", |
288 |
| - "print(f\"AccidentsSummary exercise results directory: {accidents_results_dir}\")" |
| 289 | + "analysis_report_file_path_Accidents = os.path.join(\n", |
| 290 | + " \"exercises\", \"AccidentSummary\", \"AnalysisReport.khj\"\n", |
| 291 | + ")" |
289 | 292 | ]
|
290 | 293 | },
|
291 | 294 | {
|
|
297 | 300 | "\n",
|
298 | 301 | "Do not forget:\n",
|
299 | 302 | "- The target variable is `Gravity`\n",
|
300 |
| - "- The key for the `additional_data_tables` parameter is ``Accident`Vehicles`` and its value that of `vehicles_data_file`\n", |
| 303 | + "- The key for the `additional_data_tables` parameter is ``Vehicles`` and its value is that of `vehicles_data_file`\n", |
301 | 304 | "- Set `max_trees=0`"
|
302 | 305 | ]
|
303 | 306 | },
|
|
314 | 317 | " dictionary_name=\"Accident\",\n",
|
315 | 318 | " data_table_path=accidents_data_file,\n",
|
316 | 319 | " target_variable=\"Gravity\",\n",
|
317 |
| - " results_dir=accidents_results_dir,\n", |
318 |
| - " additional_data_tables={\"Accident`Vehicles\": vehicles_data_file},\n", |
| 320 | + " analysis_report_file_path=analysis_report_file_path_Accidents,\n", |
| 321 | + " additional_data_tables={\"Vehicles\": vehicles_data_file},\n", |
319 | 322 | " max_constructed_variables=1000,\n",
|
320 | 323 | " max_trees=0,\n",
|
321 | 324 | ")\n",
|
|