Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor validation code and support LinkML rules #420

Merged
merged 7 commits into from
Dec 12, 2023
Merged

Conversation

pkalita-lbl
Copy link
Collaborator

Fixes #370

Summary of changes

  1. These changes add a new Validator class. This class is designed to accept a schema (the same JavaScript object representation of a LinkML schema that the DataHarmonizer class currently accepts) and a target class. It provides a method to validate rows of data regardless of where the data came from (it doesn't know anything about Handsontable).

    Generally speaking the way it works is that, as needed, it translates the target class in the schema into a series of validation functions and caches those functions. For example, while validating the first row of a dataset it might build, cache, and execute a function that knows how to validate a string according to the my_id slot. Then when validating the second row it only has to look up the cached function for the my_id slot and execute it against the value in the second row.

    Having this separated out into a standalone class makes it easier to unit test (see tests/Validator.test.js) and extend. In addition to what was previously validated, the Validator class also validates rules on LinkML classes.

  2. In the main DataHarmonizer class, a Validator instance is created in the constructor and whenever the schema changes. When the template changes the Validator instance's target class is updated. The getInvalidCells method now simply defers to the Validator.validate method.

  3. Previously, the DataHarmonizer.getInvalidCells method did a mixture of repairs and validation. The repair operations have been moved to a separate doPreValidationRepairs method and the DataHarmonizer.validate method is resposible for calling that method as well as getInvalidCells.

@ddooley I know this is kind of a big and complex set of changes, so I'd appreciate any testing you can do on this branch. Please let me know if you find anything concerning! Also happy to provide any clarification on the implementation details if needed.

This class directly uses the in-memory representation of the LinkML schema to perform validation on rows of data. The data is represented as an array of arrays (which is Handsontable's native format). An array of column headers is also required at validation time in order to map list elements to slots.
@ddooley
Copy link
Collaborator

ddooley commented Dec 5, 2023

Great step forward. I will look at this in the next few days (just back from vacation). I like distinction between cleanup and validation passes.

@ddooley
Copy link
Collaborator

ddooley commented Dec 6, 2023

Aside from some testbed stuff we've now discussed, there was one piece of behaviour I wasn't quite sure of - the PRESENT / ABSENT test . Both a_big_number and a_small_number seem to be getting set to required when there's no value in present_or_absent_string, though logic seems to indicate that only a_small_number would be marked required in that case?

@ddooley
Copy link
Collaborator

ddooley commented Dec 6, 2023

Also, a click on row # column cell seems to trigger a javascript crash.

@ddooley
Copy link
Collaborator

ddooley commented Dec 6, 2023

I think the crash is related to some code expecting column header to have a 'title' field, but in the minimal LinkML template code for validation test we don't necessarily have that for each slot, so should be a simple fix to test for that.

@ddooley
Copy link
Collaborator

ddooley commented Dec 6, 2023

FYI, I see I accidentally deleted right click cell context menu for 'remove_row', 'row_above', 'row_below', so main branch will get this update, which should merge smoothly with your pull request.

@pkalita-lbl
Copy link
Collaborator Author

Also, a click on row # column cell seems to trigger a javascript crash.

Oh I actually see the same thing happen when running the master branch with the CanCOGeN Covid-19 template. So I don't think it's related to these changes. I can create a new issue for that and maybe address that in a separate PR?

Both a_big_number and a_small_number seem to be getting set to required when there's no value in present_or_absent_string, though logic seems to indicate that only a_small_number would be marked required in that case?

I have to admit I didn't really intend for the schema in Validator.test.js to be used as a whole -- you can see various test cases pulling out specific slots or groups of slots to test. So it's possible that I took a few shortcuts that make it not entirely self-consistent because of that. But I'll take a look at it.

@ddooley
Copy link
Collaborator

ddooley commented Dec 6, 2023

Re. : "So I don't think it's related to these changes." - indeed error predates the validation code. I'll fix.

@pkalita-lbl
Copy link
Collaborator Author

I've updated the validator test schema so that it doesn't repurpose slots between rules. Hopefully that makes your testing easier!

@ddooley
Copy link
Collaborator

ddooley commented Dec 7, 2023

re. "Oh I actually see the same thing happen when running the master" I've now fixed this on master.

@ddooley
Copy link
Collaborator

ddooley commented Dec 10, 2023

Small q: is non_overlapping_intervals" supposed to have ranges 0-20, 10-30 ? The ranges overlap.

not_the_future: {
name: 'not_the_future',
range: 'date',
maximum_value: '{today}',
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I see this works here, but for some reason its not working on the "sample collection date" in the Cancogen Covid19 schema.json template (which has an older todo=... way of specifying the min/max. which will be phased out in favour of minimum_value and maximum_value when LinkML allows dates there.)

As a clearer test, I tried adding maximum_value:{today} to another field, the "sample received date" field, in the slot_definition, and in class attribute, and even in slot_usage, but none worked.

Is it perhaps because range of those fields includes two separate sources - one being "date" and the other being a null value menu?

@pkalita-lbl
Copy link
Collaborator Author

pkalita-lbl commented Dec 11, 2023

Small q: is non_overlapping_intervals" supposed to have ranges 0-20, 10-30 ? The ranges overlap.

Oh haha yeah they're supposed to overlap. I don't know why I named it that. Will update.

Edit: I just realized why I named it that. Yes, those ranges overlap, but because it uses exactly_one_of a value in the overlap is not allowed. The name is still a bit confusing, so I'll come up with something better.

Is it perhaps because range of those fields includes two separate sources - one being "date" and the other being a null value menu?

Yeah, this is where using the any_of construct can be a little tricky. When you added min/max values to the sample collection date slot did you do something like this (abridged for clarity)?

"sample collection date": {
  "minimum_value": "2022-01-01",
  "maximum_value": "{today}",
  "any_of": [
    {
      "range": "date"
    },
    {
      "range": "null value menu"
    }
  ]
}

There's a bit of ambiguity here. Is it implied that the min/max constraints should trickle down into both of the any_of conditions? If min/max do, what other slot definition attributes also trickle down? What if the "inherited" attributes don't really make sense for one of the conditions (like this case where min/max on range: "null value menu" doesn't make much sense)? I've chosen to not open that can of worms by not passing anything from the main slot definitions into the any_of definitions. So if you try this instead it should work:

"sample collection date": {
  "any_of": [
    {
      "range": "date"
      "minimum_value": "2022-01-01",
      "maximum_value": "{today}",
    },
    {
      "range": "null value menu"
    }
  ]
}

This feels like clearer modelling to me anyway.

@ddooley
Copy link
Collaborator

ddooley commented Dec 11, 2023

Great, that's the best solution. I didn't know that was possible with LinkML. I think its good to merge!

@pkalita-lbl
Copy link
Collaborator Author

Awesome! Thanks for all the testing and ideas. I'm going to merge now, but if you run into any unexpected issues or need help modifying any existing schemas to accommodate this please let me know!

@pkalita-lbl pkalita-lbl merged commit d1e4778 into master Dec 12, 2023
1 check passed
@pkalita-lbl pkalita-lbl deleted the issue-370 branch December 12, 2023 17:41
@ddooley
Copy link
Collaborator

ddooley commented Dec 12, 2023

What do you think the probability of getting minimum_value and maximum_value to be range dependent in LinkML, rather than just integer? That becomes the one remaining priority.

I guess for more complex conditionals like "{sequencing date} minimum_value={collection date}" that these end up being stand-alone rules, rather than with some kind of shortcut like "minimum_value = {collection date}" on the range of {sequencing date} ?

@pkalita-lbl
Copy link
Collaborator Author

pkalita-lbl commented Dec 13, 2023

What do you think the probability of getting minimum_value and maximum_value to be range dependent in LinkML, rather than just integer? That becomes the one remaining priority.

The probability is pretty much 100%. There is an open issue about it: linkml/linkml#1384. What I can't say is exactly when that'll happen. It would involve a change to the LinkML metamodel and those don't tend to happen very quickly.

But in the meantime you're not blocked, right? I think you should be able to continue to use the existing todos workaround because those get translated into minimum_value and maximum_value when the template is loaded:

if (new_field.datatype == 'xsd:date' && new_field.todos) {
// Have to feed any min/max date comparison back into min max value fields
for (const test of new_field.todos) {
if (test.substr(0, 2) == '>=') {
new_field.minimum_value = test.substr(2);
}
if (test.substr(0, 2) == '<=') {
new_field.maximum_value = test.substr(2);
}
}
}

EDIT: Ah I just realized that translation of the todos is not happening at the right time for the Validator to pick it up. I'll make another update to make sure the existing todos continue to work.

I guess for more complex conditionals like "{sequencing date} minimum_value={collection date}" that these end up being stand-alone rules, rather than with some kind of shortcut like "minimum_value = {collection date}" on the range of {sequencing date} ?

Uh oh, somehow I missed that was a feature of the previous validation routine. I don't believe there's any way to specify a rule like that in LinkML. I may need to go back and add that to the new Validator class. Is there an existing schema that shows how you're currently expressing that?

@ddooley
Copy link
Collaborator

ddooley commented Dec 14, 2023

So at moment in our 10+ templates, there's only a few instances of our using "{today}", and a few instances of using the following to get one field's date minimum set in response to another's entry.

"sequencing date": {
  ...
  "todos": [
    ">={sample collection date}"
  ],

If a rule can be expressed to do the same as above then we can just switch our current templates to use that kind of rule. re. fields with 2 ranges, all our min and max tests pertain to the first range list datatype, so I think it would be smooth for the script to add the generally stated min/max if any, to it, in the case where there is more than 1 range. I guess this can be done browser side.

The "{today}" syntax is pretty special, since it doesn't point to another field. if your retrofitting via the "todos" annotation can make that work great, otherwise I'll take a crack at it.

@pkalita-lbl
Copy link
Collaborator Author

If a rule can be expressed to do the same as above then we can just switch our current templates to use that kind of rule

No, it cannot. There's nothing in LinkML currently that can express "the value of this slot need to be greater/lesser than the value of another slot".

I will update the Validator class to handle slot todos convention that is currently in your schemas (both the {today} special value and referencing other slots).

@pkalita-lbl
Copy link
Collaborator Author

PR to support existing conventions here: #426

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DataHarmonizer should honor at least a subset of the LinkML rule language
2 participants