Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] layout redone #134

Closed
wants to merge 1 commit into from
Closed

[WIP] layout redone #134

wants to merge 1 commit into from

Conversation

nbeliy
Copy link
Collaborator

@nbeliy nbeliy commented Feb 9, 2021

Main changes:

  • No assumptions on modalities are used
  • modalities are in its own modalities substructure
  • Additional fields in modality structure (corresponding to files):
    • basename -- filename without suffix
    • gz -- true if file is compressed (ends with .gz)
    • tab -- true if file is tabular
    • intended -- array of files that uses current file
    • metafie -- full path to corresponding json file
    • entity -- substructure containing parsed entities of file
  • tabular files are treated as data files

Things to do:

  • Adapt validator and integrate it in layout
  • Adapt querry
  • Create depends field, that contains paths to all files needed for given one

@Remi-Gau
Copy link
Collaborator

Remi-Gau commented Feb 9, 2021

hey @nbeliy

The "No assumptions on modalities are used" duplicates the "schema less" approach of this #124

Also I would STRONGLY avoid storing the name of the json sidecars in the layout because the metadata for a given file can be spread across several json files spread through the different levels of the BIDS folder hierarchy (inheritance principle).

@nbeliy
Copy link
Collaborator Author

nbeliy commented Feb 9, 2021

The "No assumptions on modalities are used" duplicates the "schema less" approach of this #124

In shema-less approach, you need to separate modality and entity fields from predefined it. You can't forbid user to use modality filename and completely broke the layout. Same with modality.

I still don't see the point to run all parse_dwi, parse_func instead of just loading the layout and run validator on top.

@Remi-Gau
Copy link
Collaborator

Remi-Gau commented Feb 9, 2021

The "No assumptions on modalities are used" duplicates the "schema less" approach of this #124

In shema-less approach, you need to separate modality and entity fields from predefined it. You can't forbid user to use modality filename and completely broke the layout. Same with modality.

Wait I am not sure what you mean.

This is what the behavior of the schema-less use of layout function

use_schema = false();

BIDS = bids.layout(fullfile(pth_bids_example, ...
                              'ds000001-fmriprep'), use_schema);

This can then return all modalities including those that do not exist in the BIDS specification (ex: figures)

bids.query(BIDS, 'modalities')

ans =

  1×3 cell array

    'anat'    'figures'    'func'

Entities that do not exist the BIDS schema at the moment are listed in the layout.

disp(BIDS.subjects(1).func(1))

ans = 

  struct with fields:

    filename: 'sub-10_task-balloonanalogrisktask_run-1_AROMAnoiseICs.csv'
      suffix: 'AROMAnoiseICs'
         ext: '.csv'
         sub: '10'
        task: 'balloonanalogrisktask'
         run: '1'
        desc: ''
     content: []
        meta: []
        from: ''
        mode: ''
          to: ''
         res: ''
       space: ''
        hemi: ''

I am trying to see what is the test case you mentioned that would break the layout but I think I don't see what you mean, could you give me a concrete example?

@nbeliy
Copy link
Collaborator Author

nbeliy commented Feb 9, 2021

Imagine that someone will have file sub-10_task-balloonanalogrisktask_run-1_AROMAnoiseICs_ext-test.csv.
Then the test entity will override the ext field

@Remi-Gau
Copy link
Collaborator

Remi-Gau commented Feb 9, 2021

Imagine that someone will have file sub-10_task-balloonanalogrisktask_run-1_AROMAnoiseICs_ext-test.csv.
Then the test entity will override the ext field

Ha yeah in that case the file would actually just be skipped.

But then that changes the basic assumption we've had from the beginning of this discussion that files should at least follow this pattern: sub-<>_(<key>-<value>)*_<suffix>.<extension>.

I am not so keen in going in that direction: I think this is the minimal amount of file formatting that there should be in the subject folders.

@nbeliy
Copy link
Collaborator Author

nbeliy commented Feb 9, 2021

I meant:
sub-10_task-balloonanalogrisktask_run-1_ext-test_AROMAnoiseICs.csv

If one of entities (or modalities) coincides with field name -- it will pass the regexp, but overwrite the corresponding field, which will make it unsafe. I agree it will be a rare occasion.

A possible fix (without creating sub-structure), is to make fields starting wth '_' (which will never appear in entity/modality name).

@Remi-Gau
Copy link
Collaborator

Remi-Gau commented Feb 9, 2021

I meant:
sub-10_task-balloonanalogrisktask_run-1_ext-test_AROMAnoiseICs.csv

If one of entities (or modalities) coincides with field name -- it will pass the regexp, but overwrite the corresponding field, which will make it unsafe. I agree it will be a rare occasion.

Aaaahhhh I see. Sorry. It took me a while.

A possible fix (without creating sub-structure), is to make fields starting wth '_' (which will never appear in entity/modality name).

If I understand you correctly, then I am afraid matlab won't be happy about this:

A = struct('_test', 1)
Error using struct
Invalid field name "_test"

@nbeliy
Copy link
Collaborator Author

nbeliy commented Feb 9, 2021

Then with '_' at the end?:

p.filename_ = ...

@Remi-Gau
Copy link
Collaborator

Remi-Gau commented Feb 9, 2021

Then with '_' at the end?:

p.filename_ = ...

Yuck. No.

Let's go with sub-structures for entities : I am always afraid of deeply nested sub-structures but sometimes you can't avoid it.

Just to make sure we are on the same page, you are suggesting something like this?

disp(BIDS.subjects(1).func(1))

ans = 

  struct with fields:

      filename: 'sub-10_task-balloonanalogrisktask_run-1_AROMAnoiseICs.tsv'   
      suffix: 'AROMAnoiseICs'
      ext: '.tsv'
      sub: '10'
      entities: [1×1 struct]
      basename: 'sub-10_task-balloonanalogrisktask_run-1' 
      gz: 0  
      tab: 1
      intended_for: []
      dependency : []
      metafile: 'sub-10_task-balloonanalogrisktask_run-1_AROMAnoiseICs.json'   

Entity substructure

disp(BIDS.subjects(1).func(1).entities)     

ans = 

  struct with fields: 

     task: 'balloonanalogrisktask'
     run: '1'

sprintf(['^%s.*' pattern '$'], ...
subject.name));

pattern);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are still going to need to have something like this here because some of the "files" for meg are actually folders.

  if strcmp(modality, 'meg') && ~isempty(d)
    for i = 1:size(d, 1)
      file_list{end + 1, 1} = d(i, :);
    end
  end

@nbeliy
Copy link
Collaborator Author

nbeliy commented Feb 10, 2021

Let's go with sub-structures for entities : I am always afraid of deeply nested sub-structures but sometimes you can't avoid it.

Another possibility, probably simpler, is to disallow entities and modalities with the same name as fields (with a warning).
Realistically entity namemed filename or suffix do not have much sense.

I agree to drop this pull request (in any case I can't prvide a quality matlab code), and move to your implementation for derivatives.

@Remi-Gau
Copy link
Collaborator

Let's go with sub-structures for entities : I am always afraid of deeply nested sub-structures but sometimes you can't avoid it.

Another possibility, probably simpler, is to disallow entities and modalities with the same name as fields (with a warning).
Realistically entity namemed filename or suffix do not have much sense.

Well you just never know: so I went ahead and adapted the other PR to align better with your suggestion. Also adapted the query along the way.

I was also thinking that to get the 2 steps process you were suggesting to give the same outcome as the other PR would imply going:

  • through all the files to add them to the layout
    -and then going through the whole layout to remove from the layout the files that don't match the schema.

So I think I prefer only indexing stuff once even if it means carrying around the schema during parsing.

But you are correct that some of the code parse_X will most likely not be needed in the long run. Seeing your code made realize there is some further refactoring I can do actually.

@Remi-Gau
Copy link
Collaborator

Quick question.

What was the idea or the usecase for adding those 2 fields?

  • gz -- true if file is compressed (ends with .gz)
  • tab -- true if file is tabular

@nbeliy
Copy link
Collaborator Author

nbeliy commented Feb 10, 2021

What was the idea or the usecase for adding those 2 fields?

These are just small 'usefull for later' switches, based on object-oriented approach. In my mind the layout is used as structure passed to several bids-related tools or even a pipeline. And for such tools it may be usefull to know if file is compressed, or it is tabular file.

Practice example: the Christop's prepare_derivatives function will decompress compressed data (for SPM). And it is easier and more robust to just check if file.gz than test if it ends in .gz.

The tab is in the same spirit -- tabular data files should be treated differently from image (at least the sidecar json files have different info and structure), so test if given data file is tabular is useful.

The matlab-bids is more oriented to query, and in context of flat lists of files these switches not really important.

@Remi-Gau
Copy link
Collaborator

These are just small 'usefull for later' switches, based on object-oriented approach. In my mind the layout is used as structure passed to several bids-related tools or even a pipeline. And for such tools it may be usefull to know if file is compressed, or it is tabular file.

OK that 's what I though but I just wanted to make sure.

Practice example: the Christop's prepare_derivatives function will decompress compressed data (for SPM). And it is easier and more robust to just check if file.gz than test if it ends in .gz.

yup sort of ran into the same problem, so it could make sense to have it either in the structure or have query be able to filter files based on that their zipped status.

PS: @ChristophePhillips I used SPM for years and learned only recently that it can handle zipped file fine: bids-standard/bids-specification#136 (comment)

The tab is in the same spirit -- tabular data files should be treated differently from image (at least the sidecar json files have different info and structure), so test if given data file is tabular is useful.

yup, same here support either in the structure or via query seems to make sense.

@Remi-Gau
Copy link
Collaborator

@all-contributors please add @nbeliy for code, ideas

@allcontributors
Copy link
Contributor

@Remi-Gau

I've put up a pull request to add @nbeliy! 🎉

@Remi-Gau
Copy link
Collaborator

Will close this one and wrap the other PR so it can at least live "happily" in the dev branch.

@Remi-Gau Remi-Gau closed this Feb 15, 2021
@nbeliy nbeliy deleted the crcLayout branch November 14, 2022 15:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants