docs/pages/usage.md
Lines changed: 6 additions & 4 deletions
@@ -208,10 +208,12 @@ A full list of parameters for the individual data types can be found below:
##### Creating training / testing and validation splits with `coderdata`
-
Using the `Dataset.train_test_validate()` function, the dataset can be split into training, testing and validation sets. The function will return a `Split` object (a Python `@dataclass`) that contains three `Dataset` objects, which can be accessed via the attributes `Split.train`, `Split.test` and `Split.validate`.
+
`coderdata` provides two functions to generate dataset splits: `Dataset.split_train_other()` for a "two-way" split (useful if no validation set is needed in the machine learning workflow) and `Dataset.split_train_test_validate()` for a "three-way" split. Both functions return `@dataclass` objects that contain either `.train` and `.other` (`.split_train_other()`) or `.train`, `.test` and `.validate` (`.split_train_test_validate()`) attributes, each of which references a `Dataset` object.
+
+
Example uses of `.split_train_test_validate()` follow below. Note that both splitting functions share the same arguments, with only `ratio` differing: `.split_train_test_validate()` expects a tuple with 3 elements, whereas `.split_train_other()` expects a 2-element tuple.
```python
-
>>> split = beataml.train_test_validate()
+
>>> split = beataml.split_train_test_validate()
>>> split.train.experiments.shape
(187020, 8)
>>> split.test.experiments.shape
@@ -227,15 +229,15 @@ By default the returned splits will be `mixed-set` (drugs and cancer samples can
- `drug-blind`: Splits according to drug association. Any sample associated with a drug will be unique to one of the splits. For example, samples associated with drug A will only be present in the train split, but never in test or validate.
- `cancer-blind`: Splits according to cancer association. Equivalent to drug-blind, except cancer types will be unique to splits.
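To illustrate the idea behind a `drug-blind` split, here is a minimal, library-independent sketch. The helper `drug_blind_split` and the toy `(sample_id, drug_id)` records are hypothetical and are not part of `coderdata`; the actual implementation may assign groups differently. The sketch only shows the defining property: all records of one drug end up in exactly one split.

```python
import random
from collections import defaultdict


def drug_blind_split(records, ratio=(8, 1, 1), seed=0):
    """Assign every record of a given drug to exactly one split.

    `records` is a list of (sample_id, drug_id) pairs.
    Hypothetical helper for illustration, not the coderdata implementation.
    """
    by_drug = defaultdict(list)
    for sample_id, drug_id in records:
        by_drug[drug_id].append((sample_id, drug_id))

    # Shuffle the drugs (not the individual records) reproducibly.
    drugs = sorted(by_drug)
    random.Random(seed).shuffle(drugs)

    # Cut the shuffled drug list at the cumulative ratio boundaries.
    total = sum(ratio)
    cut_train = len(drugs) * ratio[0] // total
    cut_test = len(drugs) * (ratio[0] + ratio[1]) // total
    groups = {
        "train": drugs[:cut_train],
        "test": drugs[cut_train:cut_test],
        "validate": drugs[cut_test:],
    }
    return {name: [r for d in ds for r in by_drug[d]] for name, ds in groups.items()}


# Toy data: 100 records spread evenly over 10 drugs.
records = [(f"s{i}", f"drug{i % 10}") for i in range(100)]
splits = drug_blind_split(records, ratio=(6, 2, 2), seed=1)
drugs_in = {name: {d for _, d in rows} for name, rows in splits.items()}
```

Note that the ratio in this sketch is applied to the number of drugs, so the resulting sample counts only match the ratio when drugs have similar numbers of records. A `cancer-blind` split would follow the same pattern with the cancer type as the grouping key.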
-
`ratio` can be used to adjust the split ratios using a 3-item tuple containing integers. For example, `ratio=(5, 3, 2)` would result in a split where train, test and validate contain roughly 50%, 30% and 20% of the original data respectively.
+
`ratio` can be used to adjust the split ratios using a 3-item tuple containing integers (2 items for `.split_train_other()`). For example, `ratio=(5, 3, 2)` would result in a split where train, test and validate contain roughly 50%, 30% and 20% of the original data respectively.
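The arithmetic behind integer ratios can be sketched as follows. The helper `ratio_to_fractions` is hypothetical and only illustrates how a tuple of integers maps to approximate split fractions; it is not a `coderdata` function.

```python
def ratio_to_fractions(ratio):
    """Normalise integer split ratios to fractions that sum to 1.

    Hypothetical helper for illustration only.
    """
    total = sum(ratio)
    return tuple(part / total for part in ratio)


# Three-way ratio as used for a train/test/validate split.
three_way = ratio_to_fractions((5, 3, 2))   # (0.5, 0.3, 0.2)
# Two-way ratio as used for a train/other split.
two_way = ratio_to_fractions((8, 2))        # (0.8, 0.2)
```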
`random_state` defines a seed value for the random number generator. Defining a `random_state` will guarantee reproducibility, as two runs with the same `random_state` will result in the same splits.
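The effect of a fixed seed can be demonstrated with Python's standard `random` module. This is a generic sketch of seeded shuffling, assuming nothing about how `coderdata` uses `random_state` internally; `shuffled_ids` is a hypothetical helper.

```python
import random


def shuffled_ids(ids, random_state=None):
    """Return a shuffled copy of `ids`; the same seed gives the same order."""
    rng = random.Random(random_state)  # private generator, global state untouched
    out = list(ids)
    rng.shuffle(out)
    return out


ids = list(range(20))
run_a = shuffled_ids(ids, random_state=42)
run_b = shuffled_ids(ids, random_state=42)
# run_a and run_b are identical because both runs used the same seed.
```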
`stratify_by` defines whether the training, testing and validation sets should be stratified. Stratification tries to maintain a similar distribution of feature classes across the different splits. For example, assuming a drug response value threshold that defines positive and negative classes (e.g. reduced vs. no change in cancer cell viability), the splitting algorithm could attempt to assign the same amount of positive class instances as negative class instances to each split. Stratification is performed by `drug_response_value`. Any value other than `None` indicates stratification and defines which `drug_response_value` should be used as the basis for the stratification. `None` indicates that no stratification should be performed. Which type of stratification is performed can be further customized with keyword arguments (`thresh`, `num_classes`, `quantiles`).
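A threshold-based stratification can be sketched in plain Python as shown below. This is an illustration of the general technique only: the helper `stratified_split`, the rule that values at or above `thresh` count as the positive class, and the toy response values are all assumptions and do not describe the `coderdata` implementation.

```python
import random


def stratified_split(values, thresh, ratio=(8, 1, 1), seed=0):
    """Split indices so each split keeps a similar share of both classes.

    Hypothetical helper: values >= thresh form the positive class (assumption).
    """
    rng = random.Random(seed)
    classes = {True: [], False: []}
    for idx, value in enumerate(values):
        classes[value >= thresh].append(idx)

    total = sum(ratio)
    splits = {"train": [], "test": [], "validate": []}
    # Apply the same ratio to each class separately, then merge.
    for members in classes.values():
        rng.shuffle(members)
        cut_train = len(members) * ratio[0] // total
        cut_test = len(members) * (ratio[0] + ratio[1]) // total
        splits["train"] += members[:cut_train]
        splits["test"] += members[cut_train:cut_test]
        splits["validate"] += members[cut_test:]
    return splits


values = [i / 100 for i in range(100)]  # toy response values in [0, 1)
splits = stratified_split(values, thresh=0.8, ratio=(7, 2, 1), seed=3)
# Share of positive-class instances in each split.
positive_share = {
    name: sum(values[i] >= 0.8 for i in idx) / len(idx)
    for name, idx in splits.items()
}
```

In this toy example 20% of all values are positive, and each of the three splits preserves that share.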
An example call to create a 70/20/10 drug-blind split that is stratified by `fit_auc` could look like this: