Commit 596e1e7

updated documentation
1 parent 0b628ef commit 596e1e7

1 file changed

+6
-4
lines changed

docs/pages/usage.md

Lines changed: 6 additions & 4 deletions
@@ -208,10 +208,12 @@ A full list of parameters for the individual data types can be found below:

 ##### Creating training / testing and validation splits with `coderdata`

-Using the `Dataset.train_test_validate()` function, the dataset can be split into training, testing and validation sets. The function returns a `Split` object (a Python `@dataclass`) that contains three `Dataset` objects, which can be accessed as the attributes `Split.train`, `Split.test` and `Split.validate`.
+`coderdata` provides two functions to generate dataset splits: `Dataset.split_train_other()` for a "two-way" split (useful if no validation step is needed in the machine learning workflow) and `Dataset.split_train_test_validate()` for a "three-way" split. Both functions return `@dataclass` objects that contain either `.train` and `.other` (`.split_train_other()`) or `.train`, `.test` and `.validate` (`.split_train_test_validate()`) attributes, each of which references a `Dataset` object.
+
+Example uses of `.split_train_test_validate()` follow below. Note that both splitting functions share the same arguments; only `ratio` differs, in that `.split_train_test_validate()` expects a tuple with 3 elements whereas `.split_train_other()` expects a 2-element tuple.

 ```python
->>> split = beataml.train_test_validate()
+>>> split = beataml.split_train_test_validate()
 >>> split.train.experiments.shape
 (187020, 8)
 >>> split.test.experiments.shape
@@ -227,15 +229,15 @@ By default the returned splits will be `mixed-set` (drugs and cancer samples can
 - `drug-blind`: Splits according to drug association. Any sample associated with a drug will be unique to one of the splits. For example, samples associated with drug A will be present only in the train split, never in test or validate.
 - `cancer-blind`: Splits according to cancer association. Equivalent to drug-blind, except cancer types will be unique to splits.

-`ratio` can be used to adjust the split ratios using a 3-item tuple of integers. For example, `ratio=(5, 3, 2)` would result in a split where train, test and validate contain roughly 50%, 30% and 20% of the original data, respectively.
+`ratio` can be used to adjust the split ratios using a 3-item tuple of integers (2 items for `.split_train_other()`). For example, `ratio=(5, 3, 2)` would result in a split where train, test and validate contain roughly 50%, 30% and 20% of the original data, respectively.

 `random_state` defines a seed value for the random number generator. Setting a `random_state` guarantees reproducibility, as two runs with the same `random_state` will result in the same splits.

 `stratify_by` defines whether the training, testing and validation sets should be stratified. Stratification tries to maintain a similar distribution of feature classes across the different splits. For example, assuming a drug response value threshold that defines positive and negative classes (e.g. reduced vs. no change in cancer cell viability), the splitting algorithm could attempt to assign the same amount of positive class instances as negative class instances to each split. Stratification is performed by `drug_response_value`. Any value other than `None` indicates stratification and defines which `drug_response_value` should be used as the basis for the stratification. `None` indicates that no stratification should be performed. The type of stratification can be further customized with keyword arguments (`thresh`, `num_classes`, `quantiles`).

 An example call to create a 70/20/10 drug-blind split that is stratified by `fit_auc` could look like this:
 ```python
->>> split = beataml.train_test_validate(
+>>> split = beataml.split_train_test_validate(
 ... split_type='drug-blind',
 ... ratio=[7,2,1],
 ... random_state=42,
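
To illustrate the `ratio`, `random_state` and `drug-blind` semantics documented in this change, here is a minimal sketch in plain Python. It does not use `coderdata`: the helper names `ratio_to_fractions` and `drug_blind_split` and the toy records are hypothetical, and the library's actual splitting algorithm may well differ.

```python
import random
from collections import defaultdict


def ratio_to_fractions(ratio):
    """Normalize integer ratios, e.g. (5, 3, 2) -> [0.5, 0.3, 0.2]."""
    total = sum(ratio)
    if total <= 0 or any(r < 0 for r in ratio):
        raise ValueError("ratio entries must be non-negative and sum to > 0")
    return [r / total for r in ratio]


def drug_blind_split(records, ratio, random_state=None):
    """Hypothetical drug-blind split: every drug lands in exactly one split.

    `records` is a list of (sample_id, drug_id) pairs. Drugs are shuffled
    with a seeded RNG and assigned to splits by drug count, so the number
    of records per split only roughly follows `ratio`.
    """
    fractions = ratio_to_fractions(ratio)

    # Group records by drug so a drug can be moved as one unit.
    by_drug = defaultdict(list)
    for sample_id, drug_id in records:
        by_drug[drug_id].append((sample_id, drug_id))

    drugs = sorted(by_drug)  # deterministic order before shuffling
    random.Random(random_state).shuffle(drugs)

    # Cumulative cut points over the shuffled drug list.
    n = len(drugs)
    bounds, acc = [0], 0.0
    for fraction in fractions[:-1]:
        acc += fraction
        bounds.append(round(acc * n))
    bounds.append(n)

    return [
        [rec for drug in drugs[lo:hi] for rec in by_drug[drug]]
        for lo, hi in zip(bounds, bounds[1:])
    ]


# Toy data, made up for illustration: 10 drugs with 3 samples each.
records = [(f"s{i}", f"drug{i % 10}") for i in range(30)]
train, test, validate = drug_blind_split(records, ratio=(5, 3, 2), random_state=42)
```

With this toy data, `train`, `test` and `validate` hold 15, 9 and 6 records, no drug appears in more than one of them, and repeating the call with the same `random_state` yields the same partition.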
