-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor validation code and support LinkML rules #420
Conversation
This class directly uses the in-memory representation of the LinkML schema to perform validation on rows of data. The data is represented as an array of arrays (which is Handsontable's native format). An array of column headers is also required at validation time in order to map list elements to slots.
…exible caching for validation functions
…to Validator class
Great step forward. I will look at this in the next few days (just back from vacation). I like distinction between cleanup and validation passes. |
Aside from some testbed stuff we've now discussed, there was one piece of behaviour I wasn't quite sure of - the PRESENT / ABSENT test . Both a_big_number and a_small_number seem to be getting set to required when there's no value in present_or_absent_string, though logic seems to indicate that only a_small_number would be marked required in that case? |
Also, a click on row # column cell seems to trigger a javascript crash. |
I think the crash is related to some code expecting column header to have a 'title' field, but in the minimal LinkML template code for validation test we don't necessarily have that for each slot, so should be a simple fix to test for that. |
FYI, I see I accidentally deleted right click cell context menu for 'remove_row', 'row_above', 'row_below', so main branch will get this update, which should merge smoothly with your pull request. |
Oh I actually see the same thing happen when running the
I have to admit I didn't really intend for the schema in |
Re. : "So I don't think it's related to these changes." - indeed error predates the validation code. I'll fix. |
I've updated the validator test schema so that it doesn't repurpose slots between rules. Hopefully that makes your testing easier! |
re. "Oh I actually see the same thing happen when running the master" I've now fixed this on master. |
Small q: is non_overlapping_intervals" supposed to have ranges 0-20, 10-30 ? The ranges overlap. |
not_the_future: { | ||
name: 'not_the_future', | ||
range: 'date', | ||
maximum_value: '{today}', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I see this works here, but for some reason its not working on the "sample collection date" in the Cancogen Covid19 schema.json template (which has an older todo=... way of specifying the min/max. which will be phased out in favour of minimum_value and maximum_value when LinkML allows dates there.)
As a clearer test, I tried adding maximum_value:{today} to another field, the "sample received date" field, in the slot_definition, and in class attribute, and even in slot_usage, but none worked.
Is it perhaps because range of those fields includes two separate sources - one being "date" and the other being a null value menu?
Edit: I just realized why I named it that. Yes, those ranges overlap, but because it uses
Yeah, this is where using the "sample collection date": {
"minimum_value": "2022-01-01",
"maximum_value": "{today}",
"any_of": [
{
"range": "date"
},
{
"range": "null value menu"
}
]
} There's a bit of ambiguity here. Is it implied that the min/max constraints should trickle down into both of the "sample collection date": {
"any_of": [
{
"range": "date"
"minimum_value": "2022-01-01",
"maximum_value": "{today}",
},
{
"range": "null value menu"
}
]
} This feels like clearer modelling to me anyway. |
Great, that's the best solution. I didn't know that was possible with LinkML. I think its good to merge! |
Awesome! Thanks for all the testing and ideas. I'm going to merge now, but if you run into any unexpected issues or need help modifying any existing schemas to accommodate this please let me know! |
What do you think the probability of getting minimum_value and maximum_value to be range dependent in LinkML, rather than just integer? That becomes the one remaining priority. I guess for more complex conditionals like "{sequencing date} minimum_value={collection date}" that these end up being stand-alone rules, rather than with some kind of shortcut like "minimum_value = {collection date}" on the range of {sequencing date} ? |
The probability is pretty much 100%. There is an open issue about it: linkml/linkml#1384. What I can't say is exactly when that'll happen. It would involve a change to the LinkML metamodel and those don't tend to happen very quickly. But in the meantime you're not blocked, right? I think you should be able to continue to use the existing DataHarmonizer/lib/DataHarmonizer.js Lines 362 to 372 in d1e4778
EDIT: Ah I just realized that translation of the
Uh oh, somehow I missed that was a feature of the previous validation routine. I don't believe there's any way to specify a rule like that in LinkML. I may need to go back and add that to the new |
So at moment in our 10+ templates, there's only a few instances of our using "{today}", and a few instances of using the following to get one field's date minimum set in response to another's entry.
If a rule can be expressed to do the same as above then we can just switch our current templates to use that kind of rule. re. fields with 2 ranges, all our min and max tests pertain to the first range list datatype, so I think it would be smooth for the script to add the generally stated min/max if any, to it, in the case where there is more than 1 range. I guess this can be done browser side. The "{today}" syntax is pretty special, since it doesn't point to another field. if your retrofitting via the "todos" annotation can make that work great, otherwise I'll take a crack at it. |
No, it cannot. There's nothing in LinkML currently that can express "the value of this slot need to be greater/lesser than the value of another slot". I will update the |
PR to support existing conventions here: #426 |
Fixes #370
Summary of changes
These changes add a new
Validator
class. This class is designed to accept a schema (the same JavaScript object representation of a LinkML schema that theDataHarmonizer
class currently accepts) and a target class. It provides a method to validate rows of data regardless of where the data came from (it doesn't know anything about Handsontable).Generally speaking the way it works is that, as needed, it translates the target class in the schema into a series of validation functions and caches those functions. For example, while validating the first row of a dataset it might build, cache, and execute a function that knows how to validate a string according to the
my_id
slot. Then when validating the second row it only has to look up the cached function for themy_id
slot and execute it against the value in the second row.Having this separated out into a standalone class makes it easier to unit test (see
tests/Validator.test.js
) and extend. In addition to what was previously validated, theValidator
class also validatesrules
on LinkML classes.In the main
DataHarmonizer
class, aValidator
instance is created in the constructor and whenever the schema changes. When the template changes theValidator
instance's target class is updated. ThegetInvalidCells
method now simply defers to theValidator.validate
method.Previously, the
DataHarmonizer.getInvalidCells
method did a mixture of repairs and validation. The repair operations have been moved to a separatedoPreValidationRepairs
method and theDataHarmonizer.validate
method is resposible for calling that method as well asgetInvalidCells
.@ddooley I know this is kind of a big and complex set of changes, so I'd appreciate any testing you can do on this branch. Please let me know if you find anything concerning! Also happy to provide any clarification on the implementation details if needed.