
@ndiamant
Contributor

ml4ht.data.data_source.DataIndex generalizes the idea of a sample id by replacing an integer id with a dictionary of values to select data with.
The simplest DataIndex is something like {"sample_id": 2}, but you might also want to include something like the dates for your modalities: {"sample_id": 2, "ecg_date": 01-01-2000, "af_date": 02-01-2000}.
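As a rough illustration of the two index shapes above, here is a minimal sketch that treats a DataIndex as a plain dictionary. The `DataIndex` alias, the use of `datetime.date` values, the MM-DD-YYYY reading of the dates, and the `days_between` helper are all assumptions made for this example, not part of ml4ht.

```python
from datetime import date
from typing import Any, Dict

# Assumption: a DataIndex behaves like a plain dict; this alias is illustrative only.
DataIndex = Dict[str, Any]

# The simplest index: just a sample id.
simple_index: DataIndex = {"sample_id": 2}

# An index that also pins down which ECG and which AF record to use.
# Assumption: the dates are read as MM-DD-YYYY and stored as date objects.
dated_index: DataIndex = {
    "sample_id": 2,
    "ecg_date": date(2000, 1, 1),
    "af_date": date(2000, 2, 1),
}


def days_between(index: DataIndex) -> int:
    """Hypothetical helper: number of days from the ECG date to the AF date."""
    return (index["af_date"] - index["ecg_date"]).days
```

Keeping the dates inside the index means every consumer selects the same records for a sample without needing a separate lookup.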

ml4ht.data.data_source.DataSource generalizes the data-getting side of ml4h TensorMaps and DataDescription.get_raw_data.
A DataSource returns two dictionaries: one of model inputs and one of model outputs. For example, ECGHD5Source might return {"ecg": np.array(...), "ecg_age": [12]}, {"AF": [0, 1]}.
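The input/output contract described above could look roughly like the sketch below. `ToyECGSource` is a hypothetical stand-in written for this example; it is not the ml4ht class, it reads from an in-memory dictionary rather than HD5 files, and it uses plain lists in place of NumPy arrays. The call signature is likewise an assumption.

```python
from typing import Any, Dict, List, Tuple

DataIndex = Dict[str, Any]          # assumed to be a plain dict
Tensors = Dict[str, List[float]]    # lists stand in for arrays in this sketch


class ToyECGSource:
    """Hypothetical data source: maps a DataIndex to (inputs, outputs)."""

    def __init__(self, records: Dict[int, Dict[str, Any]]):
        # records maps sample_id -> stored values for that sample
        self.records = records

    def __call__(self, index: DataIndex) -> Tuple[Tensors, Tensors]:
        record = self.records[index["sample_id"]]
        inputs = {"ecg": record["ecg"], "ecg_age": [record["age"]]}
        outputs = {"AF": record["af_one_hot"]}
        return inputs, outputs


source = ToyECGSource(
    {2: {"ecg": [0.1, 0.2, 0.3], "age": 12, "af_one_hot": [0, 1]}}
)
inputs, outputs = source({"sample_id": 2})
```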

To train using multiple DataSources, you can use ml4ht.data.data_source.TrainingDataset, which integrates with PyTorch's DataLoader for multiprocessing capabilities.
If you want to skip errors, or change the indices each epoch, use ml4ht.data.data_source.TrainingIterableDataset.
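To make the two dataset styles more concrete, the sketch below merges several sources behind the `__len__`/`__getitem__` protocol that PyTorch map-style datasets follow, then wraps it in a generator that skips samples whose lookup fails. Every name here (`ToyTrainingDataset`, `iterate_skipping_errors`, `ecg_source`, `af_source`) is hypothetical, and the choice of which exceptions to skip is an assumption; the actual ml4ht classes may behave differently.

```python
from typing import Any, Callable, Dict, Iterator, List, Tuple

DataIndex = Dict[str, Any]
Tensors = Dict[str, Any]
Source = Callable[[DataIndex], Tuple[Tensors, Tensors]]


class ToyTrainingDataset:
    """Hypothetical map-style dataset that merges the dicts of several sources."""

    def __init__(self, sources: List[Source], indices: List[DataIndex]):
        self.sources = sources
        self.indices = indices

    def __len__(self) -> int:
        return len(self.indices)

    def __getitem__(self, i: int) -> Tuple[Tensors, Tensors]:
        inputs: Tensors = {}
        outputs: Tensors = {}
        for source in self.sources:
            x, y = source(self.indices[i])
            inputs.update(x)
            outputs.update(y)
        return inputs, outputs


def iterate_skipping_errors(dataset: ToyTrainingDataset) -> Iterator[Tuple[Tensors, Tensors]]:
    """Yield samples in order, skipping any index whose data cannot be loaded."""
    for i in range(len(dataset)):
        try:
            yield dataset[i]
        except (KeyError, ValueError):
            continue  # assumption: missing or malformed data is safe to skip


def ecg_source(index: DataIndex) -> Tuple[Tensors, Tensors]:
    ecgs = {1: [0.1, 0.2], 2: [0.3, 0.4]}
    return {"ecg": ecgs[index["sample_id"]]}, {}


def af_source(index: DataIndex) -> Tuple[Tensors, Tensors]:
    labels = {1: [1, 0], 2: [0, 1]}
    return {}, {"AF": labels[index["sample_id"]]}


dataset = ToyTrainingDataset(
    [ecg_source, af_source],
    [{"sample_id": 1}, {"sample_id": 3}, {"sample_id": 2}],  # id 3 has no data
)
good_samples = list(iterate_skipping_errors(dataset))
```

A map-style dataset must return something for every position, so a failed lookup surfaces as an exception; the iterable form can simply move on to the next index, which is why error skipping fits there more naturally.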
