Refactoring `tsvtool` #258

14thibea · 2021-12-13T16:56:57Z


Hi everyone,

The goal of the current sprint session is to improve how the tsvtools are handled, as they are still very Alzheimer-oriented.
For example, we definitely want to avoid hard-coded default values tied to this particular context (see Issue #253).
Moreover, the structure created by the tsvtools is quite heavy, and it would be nice to have everything concatenated in a single TSV file. In the following sections, we detail the issues with each tsvtool independently.

This is the updated version, written after the meeting of the ClinicaDL team.

Independent functionalities

`getlabels`

Context generalization

getlabels is very Alzheimer-oriented: it looks for the labels corresponding to stable AD, stable CN, pMCI, sMCI and non-regressive MCI, which are only relevant in the context of Alzheimer's disease.

We can choose to leave the command as is and tell users that, if they want labels linked to another context, they have to build them themselves.
Otherwise, we can let the user define a progression pattern (CN --> MCI --> AD) and then output general labels composed of two parts:

- the first part indicates the stability of the label:
  - s (stable) if it remains identical across all sessions in a given time,
  - p (progressive) if it progresses to the following state in a given time (e.g. MCI --> AD),
  - r (regressive) if it regresses to the previous state in a given time (e.g. MCI --> CN),
  - uk (unknown) if there are not enough sessions to assess the reliability of the label but no changes were spotted,
  - us (unstable) otherwise (multiple conversions / regressions).
- the second part is the diagnosis itself.

For example, rAD corresponds to Alzheimer's disease regressing to MCI or CN, whereas ukCN corresponds to CN participants who always remained CN but do not have enough sessions to assess their stability. This framework is a bit complex and non-exhaustive, so maybe we can simply keep the current behaviour.
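To make the two-part labels concrete, here is a minimal sketch of how they could be derived for one participant. The function name `stability_label` and the `min_sessions` parameter are hypothetical, and the "given time" window is assumed to have been applied beforehand (only sessions inside it are passed in). A single change in either direction is treated as progressive or regressive regardless of how many states it skips, which is one possible reading of the rAD example.

```python
from typing import Sequence


def stability_label(
    diagnoses: Sequence[str],
    progression: Sequence[str] = ("CN", "MCI", "AD"),
    min_sessions: int = 2,  # hypothetical threshold for "enough sessions"
) -> str:
    """Return a two-part label (e.g. 'sMCI', 'pMCI') for one participant.

    `diagnoses` lists the diagnosis at each session of the time window,
    in chronological order. The baseline diagnosis is the second part.
    """
    rank = {name: index for index, name in enumerate(progression)}
    baseline = diagnoses[0]
    # One entry per change of state: positive = progression, negative = regression.
    changes = [
        rank[after] - rank[before]
        for before, after in zip(diagnoses, diagnoses[1:])
        if before != after
    ]
    if not changes:
        prefix = "s" if len(diagnoses) >= min_sessions else "uk"
    elif len(changes) == 1:
        prefix = "p" if changes[0] > 0 else "r"
    else:
        prefix = "us"  # multiple conversions / regressions
    return prefix + baseline
```

With this sketch, any ordered progression pattern can be passed through `progression`, so nothing in the logic is specific to Alzheimer's disease.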

Structure simplification

getlabels outputs one TSV file per label. Instead, we could have a single TSV file in which columns hold the value of each label.

Example of the single TSV file that getlabels would produce:

participant_id	session_id	group	subgroup	age	sex	...
sub-CLNC0001	ses-M00	MCI	sMCI	72	M	...
sub-CLNC0002	ses-M00	MCI	pMCI	65	F	...
sub-CLNC0002	ses-M06	MCI	pMCI	66	F	...
sub-CLNC0003	ses-M00	AD	AD	89	F	...
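As an illustration, the content of the former per-label files can be recovered from the single file by filtering on a column. This is only a sketch using the standard library; the helper `select` is hypothetical and the elided columns (`...`) are left out.

```python
import csv
import io

# Same rows as the example above, without the elided columns.
LABELS_TSV = (
    "participant_id\tsession_id\tgroup\tsubgroup\tage\tsex\n"
    "sub-CLNC0001\tses-M00\tMCI\tsMCI\t72\tM\n"
    "sub-CLNC0002\tses-M00\tMCI\tpMCI\t65\tF\n"
    "sub-CLNC0002\tses-M06\tMCI\tpMCI\t66\tF\n"
    "sub-CLNC0003\tses-M00\tAD\tAD\t89\tF\n"
)


def select(rows, column, value):
    """Keep the rows whose `column` equals `value`."""
    return [row for row in rows if row[column] == value]


rows = list(csv.DictReader(io.StringIO(LABELS_TSV), delimiter="\t"))
mci_rows = select(rows, "group", "MCI")       # former MCI file
pmci_rows = select(rows, "subgroup", "pMCI")  # former pMCI file
```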

`split` and `kfold`

Context generalization

split includes a mechanism to avoid data leakage between subgroups. Indeed, getlabels creates one TSV file for MCI participants and two other TSV files for sMCI and pMCI, whose participants are also part of the MCI group.
With the new system, the group and subgroup columns make it possible to account for the inclusion of one group in another, and this is no longer MCI-dependent.

Structure simplification

If we work with a single TSV file, as suggested in the getlabels structure simplification, we would have to stratify the splits by label to ensure that the labels are correctly distributed across sets.

The procedure would then not depend on the nature of the split (SingleSplit, Kfold), but on the nature of the set extracted (test or validation).
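Stratifying by label could look like the sketch below. It works at the participant level, so all sessions of a participant end up in the same set. The function name, the `test_fraction` and `seed` parameters are hypothetical, and only one label per participant is assumed.

```python
import random
from collections import defaultdict


def stratified_test_split(labels, test_fraction=0.2, seed=0):
    """Pick test participants so that each label keeps about the same proportion.

    `labels` maps participant_id -> label (one label per participant).
    Returns the set of participant ids selected for the test set.
    """
    by_label = defaultdict(list)
    for participant, label in sorted(labels.items()):
        by_label[label].append(participant)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    test = set()
    for members in by_label.values():
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.update(members[:n_test])
    return test
```

The same routine could be reused for the validation set by first dropping the participants already listed in test.tsv.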

- tsvtool test creates a new TSV file test.tsv containing the keys (participant_id, session_id) included in the test set.
- tsvtool validation creates a new TSV file named according to the procedure performed (for example kfold-3).
  - It excludes the participants already selected in test.tsv if such a TSV file exists.
  - For each key, it states which set the key belongs to in each split, according to the following structure (example for a 2-fold validation):

participant_id	session_id	split_index	split_type
sub-CLNC0001	ses-M00	0	train
sub-CLNC0001	ses-M00	1	validation
sub-CLNC0002	ses-M00	0	train
sub-CLNC0002	ses-M00	1	validation
sub-CLNC0002	ses-M06	0	train
sub-CLNC0002	ses-M06	1	N/A
sub-CLNC0003	ses-M00	0	validation
sub-CLNC0003	ses-M00	1	train

Note that we only want the baseline sessions in the validation set, so some sessions may not belong to any set (N/A in the example above).
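The long format above could be generated along these lines. This is a sketch only: participants are assigned to folds in round-robin order rather than stratified, to keep the example short, so the resulting assignment differs from the table above. The function name and the `baseline` parameter are hypothetical.

```python
def kfold_rows(sessions, n_splits, baseline="ses-M00"):
    """Build (participant_id, session_id, split_index, split_type) rows.

    `sessions` is a list of (participant_id, session_id) keys.
    Only baseline sessions are used for validation; the other sessions of a
    held-out participant are marked "N/A" for that split.
    """
    participants = sorted({participant for participant, _ in sessions})
    rows = []
    for participant, session in sessions:
        # Round-robin assignment of participants to folds (not stratified).
        held_out = participants.index(participant) % n_splits
        for split_index in range(n_splits):
            if split_index != held_out:
                split_type = "train"
            elif session == baseline:
                split_type = "validation"
            else:
                split_type = "N/A"
            rows.append((participant, session, split_index, split_type))
    return rows


sessions = [
    ("sub-CLNC0001", "ses-M00"),
    ("sub-CLNC0002", "ses-M00"),
    ("sub-CLNC0002", "ses-M06"),
    ("sub-CLNC0003", "ses-M00"),
]
rows = kfold_rows(sessions, n_splits=2)
```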

For the predict function, it would be possible to give a directory and ClinicaDL would automatically understand that it has to use the indices of test.tsv and metadata of the getlabels TSV file.

To ease the use of this new structure, a new tool should be implemented to reconstruct the TSV file with all the metadata corresponding to one split or to the test set.
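Such a reconstruction tool essentially joins the split file with the getlabels file on the (participant_id, session_id) key. A minimal sketch, assuming both files follow the structures shown earlier (the helper names `read_tsv` and `rebuild_split` are hypothetical):

```python
import csv
import io


def read_tsv(text):
    """Parse TSV text into a list of dicts (one per row)."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def rebuild_split(label_rows, split_rows, split_index, split_type):
    """Return the metadata rows belonging to one set of one split."""
    keys = {
        (row["participant_id"], row["session_id"])
        for row in split_rows
        if row["split_index"] == str(split_index) and row["split_type"] == split_type
    }
    return [
        row for row in label_rows
        if (row["participant_id"], row["session_id"]) in keys
    ]


labels = read_tsv(
    "participant_id\tsession_id\tgroup\tsubgroup\n"
    "sub-CLNC0001\tses-M00\tMCI\tsMCI\n"
    "sub-CLNC0002\tses-M00\tMCI\tpMCI\n"
    "sub-CLNC0002\tses-M06\tMCI\tpMCI\n"
    "sub-CLNC0003\tses-M00\tAD\tAD\n"
)
splits = read_tsv(
    "participant_id\tsession_id\tsplit_index\tsplit_type\n"
    "sub-CLNC0001\tses-M00\t0\ttrain\n"
    "sub-CLNC0001\tses-M00\t1\tvalidation\n"
    "sub-CLNC0002\tses-M00\t0\ttrain\n"
    "sub-CLNC0002\tses-M00\t1\tvalidation\n"
    "sub-CLNC0002\tses-M06\t0\ttrain\n"
    "sub-CLNC0002\tses-M06\t1\tN/A\n"
    "sub-CLNC0003\tses-M00\t0\tvalidation\n"
    "sub-CLNC0003\tses-M00\t1\ttrain\n"
)
train_0 = rebuild_split(labels, splits, 0, "train")
validation_1 = rebuild_split(labels, splits, 1, "validation")
```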

`analysis`

Context generalization

analysis must become diagnosis-independent (it is currently meant to work on AD, CN, MCI...).
It was also designed for specific dementia scores (MMSE and CDR), which are handled separately because they can be treated as either continuous or discrete.
Instead, the user could give the set of discrete or continuous variables they want to include in the analysis.
We could also rely on pandas to automatically detect which columns hold continuous values and which hold discrete values.
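One possible heuristic for the automatic detection is sketched below: a column is considered continuous if it is numeric and has more distinct values than a threshold, and discrete otherwise. The function name `split_columns` and the `max_categories` threshold are hypothetical, and the heuristic can misclassify scores with few observed values, so letting the user override the result would still be useful.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def split_columns(df, max_categories=10):
    """Sort column names into (continuous, discrete) using a simple heuristic."""
    continuous, discrete = [], []
    for name in df.columns:
        column = df[name]
        # Numeric columns with many distinct values are treated as continuous.
        if is_numeric_dtype(column) and column.nunique() > max_categories:
            continuous.append(name)
        else:
            discrete.append(name)
    return continuous, discrete


df = pd.DataFrame(
    {
        "age": [60.0 + i for i in range(12)],
        "sex": ["M", "F"] * 6,
        "CDR": [0.0, 0.5, 1.0] * 4,
    }
)
continuous, discrete = split_columns(df)
```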

BONUS: the table_to_latex function could automatically generate your population table ready to be copy-pasted in your LaTeX article!

Structure simplification

N/A

`restrict`

Specific to the AD-DL analysis. It could be removed entirely.

Global structure

Before beginning this procedure, the user must have a BIDS directory.

Current version

The steps to perform before clinicadl train are currently the following:

1. clinica iotools merge-tsv summarizes the BIDS in one TSV file.
   - input: BIDS directory
   - output: TSV file

2. clinica iotools missing-mods lists which modalities are available in the BIDS.
   - input: BIDS directory
   - output: directory

3. clinicadl tsvtool getlabels extracts Alzheimer's disease labels according to a modality in the BIDS. This step is optional, as users may have their own labels.
   - input: merge-tsv TSV file + missing-mods directory
   - output: directory containing a series of TSV files

4. clinicadl tsvtool split | kfold separates the test set, and then defines the train / validation scheme.
   - input: getlabels directory
   - output: directory containing a series of TSV files

5. Independently, the extract command must be run to define the mode (image, patch, slice or roi) and its parameters, as well as the parameters of the desired preprocessing pipeline (t1-linear, pet-linear or custom).

Proposition

The steps that depend on Clinica should already be integrated in the CAPS when the preprocessing pipelines are run, so steps (1) and (2) could disappear.
(3) may be performed or not depending on the needs of the user (if not, the user must provide an equivalent TSV file).
(4) could be renamed prepare-experiment and would make explicit how the data are distributed between the train / validation and test sets.
(5) could be renamed prepare-data and would prepare the neuroimaging data.
