Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Name should be a required field on DRSObject #385

Open
mattions opened this issue May 25, 2022 · 5 comments
Open

Name should be a required field on DRSObject #385

mattions opened this issue May 25, 2022 · 5 comments

Comments

@mattions
Copy link

mattions commented May 25, 2022

Hi all,

I'm opening this issue to discuss the change for the field name on the DRSObject from optional to required.

As per version 1.2 the field name is optional on the DRSObject, while it is required on the ContentsObject.
We suggest to make the name required also on the DRSObject.

In our experience, the name of the file carry a certain type of information that are extremely useful to the final user. who uses this information to operate on the data pointed by the DRSObject. The information usually are:

  1. information related to the content of the file, identifying for example the sample_id or other type of "sequencing" metadata which the user can utilize to distinguish the files and operate on them correctly.

  2. Extension of the file, which provides an implicit assertion on what is the content of the file, and gives the user the ability to choose the right tool to continue the analysis (for example a .bam vs .vcf vs .dcm informs the user what type of files they have, and what would be the next step that makes sense.) While the info on point 1 could be addressed via some ancillary metadata import, the information about the extension are usually not covered.

The main reason to not have the name as required it is only if the name is overloaded with controlled type metadata, which should be accessed only by authorized party. However this problem will not be present anymore as soon the #381 gets merged in, given that the OPTION call would be used to assess the authorization of the files, in case there is the need to enforce the authorization directly from the first GET.

@mattions mattions changed the title DRS Name should be a required field Name should be a required field on DRSObject May 25, 2022
@Avantol13
Copy link

One of the main goals of DRS is to abstract the locations of files with a persistent identifier, such that a change in location or file name doesn't break existing references (e.g. we're enabling research reproducibility in a world of moving data). That being said, we need to be really careful about requiring and relying on a name separate from the explicit locations - because we're beginning to crack that abstraction a bit.

  1. I think it's dangerous for anyone to rely on file names for clinical metadata. Files change and move and the intention of the name is not to identify that kind of information - we cannot guarantee as part of DRS that every server will follow certain file name conventions.
  2. Agree on the use case here, but this should be pulled from the URLs, not another field like name - we're introducing a synchronization issue here between the real location and file name in the URLs (assuming a cloud bucket path) and this name field

I think a file name hint for download clients is fine, if that's all name is. But I don't think it should be required (clients should default to naming based on the DRS ID) and I'm very nervous about relying on it for information like clinical metadata and file extensions.

@ianfore
Copy link

ianfore commented Jun 3, 2022

Agree with Alex on the problem of putting data in the file name. And it's not just clinical metadata that's being put there.
This notebook illustrates twhat gets into filenames and the detective work that has to be done to work with them. There are about seven different pieces of information being encoded in the file names in this example.
https://github.com/ianfore/cdatest/blob/main/v2tests/09CO022%20Proteomic%20Files%20on%20v2.ipynb

Data Connect exists alongside DRS to handle this problem. There are plenty of examples we could use to illustrate this.

See also this issue which is intended to deal with a specific and important part of what Michele raises.
#319
I just updated that with new developments that came out of TASC.

@mattions
Copy link
Author

hi @Avantol13 and @ianfore , thanks for the feedback.

I see two main aspects raised:

  1. clinical level information on the name:

The PR #381, which has already received positive feedback and will be merged soon will resolve this problem: a user will be able to perform the get Content API call only if she has the proper permission, which means she can see the name and the clinical level.

  1. Synchronization between access_url and the name in the DRS_record

Using the access_url as the way to guess the name of the file is problematic:

  • there is no requirements to have the access_url on the DRS response on the first call.
  • A Drs record can have more than one access_url values, so it's not clear which one to pick from a DRS client point of view.
  • there is no guarantee that the end of the url is the actual filename (note that we have seen cases where these do differs, and it is not at all a filename at the end.)
  • the access_url can contain characters that are not allowed in a filename field, which means you can't simply get it from there.

As for the suggesting on using the drs_id, that would work form a functional point of view, however it means that any import that let you import DRS will provide just very difficult to work with data

For example, this is from the DRS Server in production from the INCLUDE Project:
If I import with the name, the files look like this to the user, so

  • 0012022d-c855-4087-9dde-c522f0632024.kallisto.abundance.tsv.gz
  • 0012022d-c855-4087-9dde-c522f0632024.rsem.genes.results.gz
  • 0012022d-c855-4087-9dde-c522f0632024.rsem.isoforms.results.gz
  • HTP0001B_MSD.tsv.gz
  • HTP0003A_MSD.tsv.gz
  • HTP0005A_MSD.tsv.gz
  • ...

If instead I do import via just the id I have:

  • 135d4f94-d9cf-4a57-a830-c61fd94a7607
  • 6c189b67-b022-49d3-bd0c-a049814d46c8
  • aa06c73f-b1c9-4c65-923f-e4783f261540
  • 733f0956-dac4-4ad9-b787-4e186c6530b0
  • 5a2c8168-dbbf-4c46-8609-8b265f25aecd
  • 0b2b0a33-0a52-41cb-a9b5-6e4446fd485f

Does this make sense to you?

@Avantol13
Copy link

I understand what you're saying but all the things I previously said still stand. Requiring an additional name is breaking the critical abstraction that DRS provides and potentially sets a dangerous precedent for trusting this field rather that the actual physical location, which may change. File names could get changed/corrected/moved and adding another required place to update (other than the access_urls) creates the synchronization issue I brought up. The intention of DRS is to provide a persistent identifier for a data object, not a persistent file name.

I sympathize with the user experience problem of file naming, but clients should just be flexible in their logic for naming things. First, try to get a filename from the access_url (the real source of truth). Pick a URL at random. Escape characters if you have to. Then fallback to the ID.

The reality is that users should not be using the filename as a consistent and reliable source of data. The intention is that there is some other external source of information that makes these files useful by linking such data to the DRS ID, like @ianfore 's mention of Data Connect.

I could perhaps get behind a filename_hint field, that a client could prefer over the ID - but even that should not be required in the spec.

@mattions
Copy link
Author

Hi @Avantol13 ,

I think it's totally ok if the storage location changes, and the drs_uri will stay the same.

I do not understand the change in the filename instead. The DRS standard has the idea that once, an object is indexing, the content should not change.

Do you have use-cases where the drs object stays the same, but the filename changes?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants