Make the relationship between Sample and Investigation many-to-many #231

RKrahl · 2020-04-23T10:17:46Z

We have the Sample class in our schema to represent "a sample to be used in an investigation". It has a many-to-one relationship with Investigation, i.e. each sample must be related to one and only one investigation.

While this is certainly suitable for most situations, it may be problematic in some cases: if one individual sample has been the subject of more than one investigation, there is currently no way to properly represent this in our schema. I'd like to point out that we occasionally also have rather prominent samples, such as archaeological artifacts or pieces of art, and it is not improbable that the same item might be investigated more than once by independent groups using different techniques to answer unrelated questions.

I therefore suggest adding the following class to the schema:

`InvestigationSample`

Represents a many-to-many relationship between an investigation and a sample that has been used in that investigation

Constraint: investigation, sample

Relationships:

Card	Class	Field
1,1	Investigation	investigation
1,1	Sample	sample

As a result, the samples relationship in Investigation would be changed to have the type InvestigationSample rather than Sample. The Sample class would need to be modified as follows:

`Sample`

A sample to be used in one or more investigations

Constraint: facility, name

Relationships:

Card	Class	Field
1,1	Facility	facility
0,*	Dataset	datasets
0,1	SampleType	type
0,*	SampleParameter	parameters
0,*	InvestigationSample	investigationSamples

Other fields:

Field	Type	Description
name	String[255] NOT NULL
pid	String[255]	A persistent identifier attributed to this sample

I.e. the many-to-one relation with Investigation would be replaced by a one-to-many relation with InvestigationSample. I also suggest adding a relationship with Facility, just for the sake of consistency with the current schema, where most things are related directly or indirectly to the facility on top. The uniqueness constraint would need to be adapted as well. The attributes of Sample would remain unchanged.
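To make the proposed structure concrete, here is a minimal in-memory sketch of the link class. It is not ICAT code: the `LinkTable` class and its methods are hypothetical names chosen for illustration, and the set of pairs merely stands in for the uniqueness constraint "investigation, sample".

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Investigation:
    name: str


@dataclass(frozen=True)
class Sample:
    facility: str
    name: str


@dataclass
class LinkTable:
    """Hypothetical stand-in for the InvestigationSample table."""

    # Each entry is an (investigation, sample) pair. Using a set means
    # the same pair cannot be stored twice, mirroring the constraint.
    links: set = field(default_factory=set)

    def link(self, investigation, sample):
        self.links.add((investigation, sample))

    def investigations_of(self, sample):
        # All investigations in which the given sample was used.
        return sorted(i.name for i, s in self.links if s == sample)

    def samples_of(self, investigation):
        # All samples used in the given investigation.
        return sorted(s.name for i, s in self.links if i == investigation)
```

With this shape, one sample can be linked to any number of investigations and vice versa, while linking the same pair twice has no effect.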

kevinphippsstfc · 2020-06-23T09:47:46Z

I agree with your proposed changes except in the area of constraints. I don't think a Sample needs to have any constraints. Regarding a new constraint on Facility, I agree that most other ICAT entities have this but this is (mostly!) because that makes sense. With an archaeological sample, for example, wouldn't it make sense that the sample might be taken to a number of facilities for analysis and is therefore not tied to any of them? I can see that this is unlikely to ever happen in a single ICAT, but nonetheless the data model should be right and we should allow for that possibility. That just leaves name as a constraint and there are likely to be duplicates of the names that people give to their samples (which is fine) so that constraint is not required either.

RKrahl · 2020-06-23T11:37:38Z

Interesting point!

In fact, I suggested this constraint not because it is particularly sensible, but for lack of a better solution. With the relation to investigation changed to many-to-many, we can't have investigation in the constraint any more, so I removed it from the existing constraint and left the rest as it was.

In general, it is beneficial to have a constraint in all entity types, because it allows one to reliably refer to an object in ICAT based on attribute values. That makes many things easier, including serializing and deserializing ICAT content.
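As a rough illustration of why a constraint helps with serialization, the sketch below derives a reference key from the constraint attributes only. The `CONSTRAINTS` table and the `unique_key` function are hypothetical and not part of any ICAT component; the key format is an arbitrary choice and does not escape separator characters inside values.

```python
# Hypothetical constraint definitions, copied from the proposal above.
CONSTRAINTS = {
    "Sample": ("facility", "name"),
    "InvestigationSample": ("investigation", "sample"),
}


def unique_key(entity_type, attrs):
    """Build a reference key from the attributes named in the constraint."""
    fields = CONSTRAINTS[entity_type]
    # Without a value for every constraint attribute the object cannot
    # be referenced reliably, so refuse to build a key.
    missing = [f for f in fields if attrs.get(f) is None]
    if missing:
        raise ValueError(f"{entity_type} lacks constraint attributes: {missing}")
    parts = [f"{f}-{attrs[f]}" for f in fields]
    return f"{entity_type}_" + "_".join(parts)
```

An entity type without any constraint has no such key, so a dump file would have to fall back on the database id, which is not stable across ICAT instances.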

Regarding the facility, that is a strange thing anyway. The schema requires one facility object in each ICAT, but it does not serve any particular purpose. I assume there is no production ICAT having more than one facility object. Most entity classes directly or indirectly depend on the facility, and those that relate directly to it also have it in the constraint. So I also kept the facility here, just for the sake of consistency with the remaining schema.

But I do appreciate your argument. And I admit that I'm not very happy with that constraint either.

RKrahl · 2020-06-23T12:14:31Z

Maybe an alternative solution would be to take the opportunity to make a clean cut: set the constraint to be just pid. After all, reliably identifying things, even across repositories, is what PIDs were invented for. We might just start to use them seriously.

The only drawback I can see is that we would need to change the pid attribute to be NOT NULL and force every site to set some unique value. Of course it would be best to register real PIDs such as an IGSN in the case of samples. But if the site prefers not to do that, any unique value will do. An upgrade script could just set something like `local:<id>` (e.g. reusing the id attribute). That would work for the time being and could gradually be replaced by real PIDs later on.
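A minimal sketch of that fallback, assuming the samples are available as plain dicts with `id` and `pid` keys (a real upgrade would run against the database instead; the function name is made up for illustration):

```python
def assign_fallback_pids(samples):
    """Give every sample without a pid the value 'local:<id>'."""
    for sample in samples:
        # Treat both a missing value (None) and an empty string as "no pid".
        if not sample.get("pid"):
            sample["pid"] = f"local:{sample['id']}"
    return samples
```

Because the id attribute is already unique, the generated values are unique among themselves. Samples that already carry a pid are left untouched.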

stuartpullinger · 2020-06-23T13:10:11Z

I agree with @kevinphippsstfc. I was also thinking along the same lines as @RKrahl about using the pid. This seems sensible, and falling back on the id attribute is a good idea.

kevinphippsstfc · 2020-06-23T13:28:08Z

I agree that if you are using PIDs or want to start using PIDs for Samples then this is a sensible approach. I can also see that getting an upgrade script to assign the ID attribute to the PID will work.

However, what happens for creating Samples once the upgrade has been done? Every piece of software creating samples is forced to implement something to create this (unique) field. This can of course be done and could be as simple as just calling a library method to generate a UUID, but it is not as simple as doing nothing, where your software continues to work because PID is an optional field that you don't bother setting. Or am I missing something obvious?
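For reference, the "library method" route could look roughly like the sketch below. The `new_sample` helper is hypothetical, and the attribute dict only stands in for whatever object the creating software actually builds.

```python
import uuid


def new_sample(name, pid=None):
    """Return the attributes for a new sample, generating a pid if none is given."""
    if pid is None:
        # uuid4() yields a random UUID, so no coordination between
        # the different pieces of software creating samples is needed.
        pid = str(uuid.uuid4())
    return {"name": name, "pid": pid}
```

It is small, but every client creating samples would still have to add something like it.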

RKrahl · 2020-06-23T13:34:43Z

No, I don't think you are missing anything. That is indeed a drawback of using this constraint. The benefit is that we have a way to reliably refer to sample objects in the ICAT later on.

RKrahl · 2020-06-23T13:55:54Z

… in other words: either way you will break some piece of software. Either you break the software creating samples that relies upon not having to bother setting the pid attribute, or you break software that relies upon samples having a constraint that can be used to reliably identify them. I'd argue that the former case is easier to work around than the latter.

dfq16044 · 2020-06-25T15:54:47Z

We may have the following case:

- some samples may have a defined PID
- others may not have any PID. If we need to fill the PID column, then we will need to create an 'artificial' PID.

Then how do you distinguish between the two?
If we can set PID to null then the second case will be quite easy to identify as it is empty.

RKrahl · 2020-06-26T08:26:42Z

> Then how do you distinguish between the two?
> If we can set PID to null then the second case will be quite easy to identify as it is empty.

It would make sense in any case to prefix a PID with a scheme, e.g. to set values like `doi:10.1234/abcd`, `handle:20.1000/100`, `igsn:ESNFZDVHICBD`, or `uuid:d4d712a3-7444-4e4c-a14d-edcb833bdaa0`. That would make it much easier to resolve them, even for official, well-defined PIDs. If you do this systematically for all pid values, it does not create any ambiguity: you can easily separate the scheme prefix from the proper pid value with

pid_scheme, pid_value = sample.pid.split(':', maxsplit=1)

If you do this, you only need to define a dedicated scheme prefix for local, 'artificial' PIDs, such as local:4711 and you are done.
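Put together, distinguishing the two cases could be sketched as follows. The helper names and the `LOCAL_SCHEME` constant are assumptions for illustration, not an existing API.

```python
LOCAL_SCHEME = "local"  # assumed name of the site-local scheme


def split_pid(pid):
    """Split 'scheme:value' at the first colon."""
    scheme, sep, value = pid.partition(":")
    # A pid without a colon, or with an empty scheme or value,
    # does not follow the prefix convention.
    if not sep or not scheme or not value:
        raise ValueError(f"pid without scheme prefix: {pid!r}")
    return scheme, value


def is_local(pid):
    """True for 'artificial' PIDs such as local:4711."""
    return split_pid(pid)[0] == LOCAL_SCHEME
```

Only the first colon is significant, so values that themselves contain a colon survive the split unchanged.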

dfq16044 · 2020-06-26T08:49:20Z

In this case, would it make sense to have a field called PIDScheme or similar, for example like the IdentifierScheme used in DataCite (for person PIDs and organization PIDs)? This would perhaps make it clearer that we need to specify which scheme we are using.

RKrahl · 2020-06-26T09:03:22Z

> In this case, would it make sense to have a field called PIDScheme or similar, for example like the IdentifierScheme used in DataCite (for person PIDs and organization PIDs)? This would perhaps make it clearer that we need to specify which scheme we are using.

I don't think so. It would make things more complicated and I don't see an advantage over my suggestion to use a prefix.

dfq16044 · 2020-06-26T09:16:43Z

Well, in my experience free-text fields are very problematic in production systems. Having a way to specify how this field should be used would help. For example, the PID may be misspelled, and then it becomes very difficult to find the issue.
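One conceivable way to catch such mistakes at ingest time, even with a single prefixed field, is to validate against a list of known schemes. The sketch below is only an assumption about how that might look: the `CHECKS` table is hypothetical and the per-scheme patterns are deliberately loose, not official syntax definitions.

```python
import re
import uuid


def _is_uuid(value):
    # uuid.UUID raises ValueError for strings that are not valid UUIDs.
    try:
        uuid.UUID(value)
    except ValueError:
        return False
    return True


# Assumed set of accepted schemes, each with a loose plausibility check.
CHECKS = {
    "doi": lambda v: re.fullmatch(r"10\.\d{4,9}/\S+", v) is not None,
    "uuid": _is_uuid,
    "local": lambda v: v.isdigit(),
}


def validate_pid(pid):
    """Return (scheme, value) or raise ValueError for a suspicious pid."""
    scheme, sep, value = pid.partition(":")
    check = CHECKS.get(scheme) if sep else None
    if check is None:
        raise ValueError(f"unknown or missing pid scheme in {pid!r}")
    if not check(value):
        raise ValueError(f"malformed {scheme} value in {pid!r}")
    return scheme, value
```

A misspelled prefix such as `dio:` would be rejected immediately rather than stored, though a typo inside an otherwise well-formed value would still go unnoticed.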

dfq16044 · 2020-07-03T08:11:13Z

I think this schema change will need more internal discussion within DLS.
We are currently reviewing the sample information workflow within DLS, and we were already thinking about defining a PID for samples to be used as a reference across our local systems.
My concern about PIDs is that, by definition, a PID is a persistent identifier and not a unique identifier. To give an example:
a user may have an OrcidID, a ResearcherID and a ScopusID that all point to the same user.
Likewise, we may have multiple PIDs referring to the same sample: a local PID and a PID from a specific sample database.
While we may have control over the local PID, this is more difficult for external PIDs. As mentioned in our discussion yesterday, we may have multiple experiments within DLS with the same sample, but locally it might have different PIDs (for example, in the UAS a sample might be assigned to the proposal). Later, a user decides to use an external sample database and we want to be able to link those samples to that PID. This will result in merging records, but we will lose the reference to the original local PID.

RKrahl · 2024-04-02T09:34:04Z

Update: the current proposal as implemented in #294 is:

Add the following class:

`InvestigationSample`

Represents a many-to-many relationship between an investigation and a sample that has been used in that investigation

Constraint: investigation, sample

Relationships:

Card	Class	Field
1,1	Investigation	investigation
1,1	Sample	sample

Change Sample to:

`Sample`

A sample to be used in one or more investigations

Constraint: pid

Relationships:

Card	Class	Field
0,*	Dataset	datasets
0,1	SampleType	type
0,*	SampleParameter	parameters
0,*	InvestigationSample	investigationSamples

Other fields:

Field	Type	Description
name	String[255] NOT NULL
pid	String[255] NOT NULL	A persistent identifier attributed to this sample

Obviously: change the one-to-many relationship in Investigation from Sample to InvestigationSample.

RKrahl · 2024-04-25T14:36:06Z

As discussed in the ICAT Schema Discussion on April 2nd and in the collaboration meeting today, this change should be in a version 7.0 release that we aim to make in the second half of this year.

RKrahl added enhancement schema this involves changes to the ICAT schema labels Apr 23, 2020

RKrahl mentioned this issue May 12, 2020

Add support for the ICAT schema extensions expected in icat.server 5.0 icatproject/python-icat#73

Closed

EmilJunker mentioned this issue Jul 6, 2020

Add many-to-many relationship InvestigationSample to the schema #240

Closed

VKTB mentioned this issue Feb 1, 2022

Validation error for Sample pid field when ICAT value is None ral-facilities/datagateway-api#314

Closed

VKTB mentioned this issue May 9, 2022

Document.pid mapping adds the prefix pid ral-facilities/datagateway-api#356

Closed

RKrahl mentioned this issue May 24, 2022

Drop SQL upgrade scripts #291

Merged

EmilJunker linked a pull request Jul 18, 2022 that will close this issue

Add many-to-many relationship InvestigationSample to the schema #294

Open

RKrahl mentioned this issue Jan 19, 2024

Add a way to feed environment information into the XSLT in IngestReader icatproject/python-icat#148

Closed

RKrahl mentioned this issue Jan 31, 2024

Inject an additional element with environment information into the input data in IngestReader icatproject/python-icat#149

Merged

RKrahl added this to the 7.0.0 milestone Apr 25, 2024
