validator for schema specification files? #2069
oh and one more note! I am developing a similar validator for a submission to JOSS (a markdown
And one more validation relevant question - does anyone have suggestions / best practices for version control of the scripts associated with Google Sheets? I just found the script tab where you can write functions to export / validate (had a mind blown / amazing moment just now) and I want to make sure the code is version controlled. If there isn't an easy way, it would probably need to be done manually (and bundle the entire Google Spreadsheet / scripts together in some way).
I am not sure I understand what you would like to validate. Some suggestions which might be helpful:
I don't want to read a book or a standard, I want to:
This would be added to continuous integration for the repo where you submit specifications like the one linked above. The closest thing I can think of is validating an XML against an XSD. But here we are working with YAML derivatives and Python (I definitely don't want to introduce a dependency on RDFUnit); I want a clean solution in Python with minimal additional dependencies.
I figured since scientific programming is strongly Python, and these standards aren't anything new, this almost must be a solved problem! But maybe I'm wrong about that.
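For what it's worth, one common pattern in this space (my suggestion, not something raised in the thread) is to load the YAML with `pyyaml` and validate the resulting dict against a JSON Schema using the third-party `jsonschema` package - pure Python, minimal dependencies. A minimal sketch, with made-up schema fields:

```python
# Sketch: validate YAML (loaded to a dict) against a JSON Schema.
# Requires the third-party packages pyyaml and jsonschema;
# the schema fields below are illustrative, not a real specification.
import yaml
from jsonschema import validate, ValidationError

schema = {
    "type": "object",
    "required": ["name", "version"],
    "properties": {
        "name": {"type": "string"},
        "version": {"type": "number"},
    },
}

doc = yaml.safe_load("""
name: Container
version: 1.1
""")

try:
    validate(instance=doc, schema=schema)
    status = "valid"
except ValidationError as err:
    status = "invalid: %s" % err.message

print(status)  # "valid" for the document above
```

Removing the `name` field (or changing `version` to a string) would flip the status to `invalid` with a human-readable message from `jsonschema`.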
Two additional suggestions: pySHACL is implemented in Python, but using it still requires some understanding of SHACL. And it is about validating RDF, not YAML. And there is this: Swagger 2.0 and OpenAPI 3.0 parser and validator for Node and browsers, which can also be used for YAML - but it is not implemented in Python. Hopefully others have better suggestions...
Thanks @akuckartz! I'll take a look at pySHACL, but I'm trying to stay away from RDF. There's nothing wrong with RDF, but I've noticed that RDF/SPARQL has much higher barriers to entry for the scientific community than data structures like json (and even yaml) paired with Python. Because of this, and because bioschemas is generating these specifications in yaml / front matter with yaml, I think it would be wise to stay consistent, and simple. I'm also a bit tentative to bring in a dependency on node.js - the bugs introduced (and excessive amount of code) are astounding! But still these are good things to know about (I didn't find them in my searching) so thank you kindly for showing me :o) Do others have suggestions? Something that meets the criteria discussed above?
Here is a quick (WIP) example of what I think would be a good
okey doke, I haven't added any real criteria yet, but I have a working dummy example. Let's say we have this criteria file:

```yaml
version: 1
checks:
  pass:
    - name: Dummy criteria that always returns warning
      level: log
      function: openschemas.main.validate.criteria.base.dummy
  fail:
    - name: Dummy criteria that always returns log
      level: error
      function: openschemas.main.validate.criteria.base.dummy
      kwargs:
        passing: False
```

The checks are under "checks" and each entry is an individual check (e.g., there is a check called "fail" that always returns False and then triggers a fail/exit, and one called "pass" that always returns True so we move on). The user can name these, of course, whatever they like. Within each check there is a name (more like a description; this mirrors CircleCI), a level, a function, and (optionally) kwargs. And just for clarity, here is the incredibly silly "dummy" function. It serves no purpose but to print something and then return the status indicated by the "passing" variable:

```python
from random import choice

def dummy(spec, passing=True):
    '''dummy can be used for testing; it returns the status given as an argument

       Parameters
       ==========
       spec: the input spec, in json format (dict)
       passing: boolean to return True or False (default is True)
    '''
    msg = "not True"
    if passing:
        msg = "True"
    messages = ['Roses are red, violets are blue, here is a test, it is %s' % msg,
                "If I were a rich man, well then I wouldn't be a dinosaur.",
                "Sweet dreams are made of cheese, who am I to diss a brie?"]
    message = choice(messages)
    print(message)
    return passing
```

Actually, all of the fields are optional - even if you mess up the function there is a fallback that will tell you that you did this. If you don't provide a name, a robot name is given. If you don't provide a level, it uses warning. Then we can call it from the command line like this (note that I'm providing a custom criteria file and input specification file):

```
$ openschemas validate --criteria openschemas/main/validate/criteria/dummy.yml --infile ../specifications/_specifications/Container.html
Found ../specifications/_specifications/Container.html, valid name
Found openschemas/main/validate/criteria/dummy.yml, valid name
[criteria:dummy.yml]
Found /home/vanessa/Documents/Dropbox/Code/openschemas/specifications/_specifications/Container.html, valid name
[group:pass] ----- <start
If I were a rich man, well then I wouldn't be a dinosaur.
[check:Dummy criteria that always returns warning]
test:function openschemas.main.validate.criteria.base.dummy
test:result pass
test:level LOG
LOG openschemas.main.validate.criteria.base.dummy
[group:pass] ----- end>
[group:fail] ----- <start
Sweet dreams are made of cheese, who am I to diss a brie?
[check:Dummy criteria that always returns log]
test:function openschemas.main.validate.criteria.base.dummy
test:result fail
test:level ERROR
ERROR openschemas.main.validate.criteria.base.dummy
```

The funny print statements come from the functions themselves. They can do whatever they like - they are given the data structure of the specification and the custom list of kwargs (if defined) and then can go to town. You can also call the function from within Python:

```python
from openschemas.main import Client
validator = Client.validator(infile="Container.yml")
validator.validate_criteria(criteria="dummy.yml")
```

I need to take a look at pySHACL and the associated content, and probably get some sleep :) But next I'm going to write some actual criteria and see how they do! This is a cool little library because it could be used very generally (for other things too :O) )
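To make the mechanics concrete, here is a minimal sketch of how a criteria runner like the one described could work: resolve the dotted `function` path with `importlib`, call it with the spec and any kwargs, and record pass/fail. This is my own simplification, not the actual openschemas implementation; the criteria dict mirrors the dummy.yml layout above.

```python
# Minimal sketch of a criteria runner (an assumption about the design,
# not the real openschemas code). Checks name a dotted function path,
# which is resolved at runtime and called with the spec plus kwargs.
import importlib

def load_function(path):
    """Resolve a dotted path like 'module.sub.func' to a callable."""
    module_name, func_name = path.rsplit(".", 1)
    return getattr(importlib.import_module(module_name), func_name)

def run_checks(criteria, spec):
    """Run each named group of checks against the spec, return results."""
    results = {}
    for group, checks in criteria.get("checks", {}).items():
        for check in checks:
            func = load_function(check["function"])
            passed = func(spec, **check.get("kwargs", {}))
            results[group] = passed
            level = check.get("level", "warning")  # warning is the fallback
            print("[check:%s] %s (%s)" % (check.get("name", "unnamed"),
                                          "pass" if passed else "fail", level))
    return results

# Example using a stdlib stand-in check (operator.truth returns bool(spec));
# a real criteria file would point at functions in an installed package.
criteria = {"checks": {"pass": [{"name": "always true",
                                 "function": "operator.truth"}]}}
results = run_checks(criteria, spec=True)
```

Running this prints `[check:always true] pass (warning)` and returns `{"pass": True}`.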
@vsoch I have been watching your enthusiastic (and apparently sleep-depriving) approach to this evolve in this thread with interest. Putting aside for the moment the practical consideration of which technology you use to capture validation constraints and drive software with them (yaml vs SHACL, etc.), I am wondering what those constraints are, how and by whom they will be defined, and how limited they are to individual specific use cases.

For example, for the Schema.org vocabulary itself there are exceedingly few constraints: no properties are 'required'; any property can be repeated; a property can have one or more expected types as a value, but in addition could also have a Text, URL, or Role as a value; and an entity can be described as any combination of one or more types (for example, something could be validly described as a Motorcycle and a CreativeWork and a TouristAttraction). Alternatively, Google's Structured Data Testing Tool applies more stringent constraints (e.g. a LocalBusiness must have an image property value) to validate supplied data for suitability for their services such as rich display, voice search, etc.

As Schema.org is fundamentally an RDF vocabulary, and some constraints may be more semantic than the simple "a Book must have a name", SHACL is potentially a good match for defining what many might call 'profiles': use-specific constraints on the generic vocabulary. Producing library-focused data - use a bibliographic SHACL profile; producing vehicle data - use an automotive SHACL profile; etc. I am no SHACL expert, I am only channelling my understanding of the thoughts of others.

I understand your comment about communities being more used to, and therefore open to, yaml. However surely that is only an issue for those defining a set of validation constraints - I would hope that most of a community would just apply them and consume human-readable errors, or hopefully the lack of them. Just my 2 cents from the sidelines.....
@RichardWallis the criteria for a general specification I expect to be very minimal - mostly that the fields are provided (and in the correct locations) for rendering in the specifications repository. Since these specifications will drive labeling of data types (e.g., a container recipe) for programmatic discovery, the more important and detailed criteria I would expect to be provided with the individual specifications. I can talk about containers because that's more a domain of expertise for me. If I am designing entire software around containers (and specifically functions to interact with them), my software will interact with the specification and then the metadata described by it. I need to have absolute certainty that changes to the specification, or a contribution of a new entity for the software to munch on, don't break my software, and the only way to do that is with continuous integration testing that is constantly run over all levels (specification, data, and software) to catch the bugs before they break anything. In this light, it would be more akin to Google's Structured Data Tool, but geared for developers / scientists, because it will come down to a GitHub repository you fork with a CircleCI script, or even just testing that happens without you doing anything when you open a PR to a repository. It has to be easy enough to use and customize that a researcher who doesn't know a thing about RDF or ontologies (and maybe doesn't care) can come along and still easily write specifications and run tests. With the current technology out there, if you've been around it a long time, or are very familiar with RDF, you may not realize how high a barrier to entry it is.
The approach I'm taking gives the developer (who is likely a researcher) freedom to write tests that can be seamlessly plugged into testing the data structure (produced with a Google Drive export) without knowing how to do much more than write a bulleted list (yaml) and then write functions in Python. As I've said before, Python is huge in the scientific community. In developing for the audience the tooling is intended for, this is the approach I believe will have the strongest, quickest impact. Also, I hope it doesn't go unnoticed that some prominent tools (TravisCI, CircleCI and other CI, Jekyll to render static sites, docker-compose, kubernetes, etc.) use yaml. I think RDF might be the best choice if we are exclusively querying graphs, but for these tooly things, for the reasons above, I believe in yaml/json over it. I hope this clears up my thinking for why yaml is a most reasonable start. I want to empower non-ontology/RDF experts to create / contribute tools for specifications to drive development in the space.
hey everyone (and @RichardWallis)! I finished up the first go at the specification default criteria, and added the start of docs (please excuse their brevity and incompleteness, I will make these much better: https://openschemas.github.io/openschemas-python/html/usage.html). Funnily enough, I found a lot of invalid things (mostly missing fields) when I validated my Container.html spec. Next I'm going to add the validator (now served by the openschemas Docker container) to the specifications repository, and go through what would be a reasonable back and forth between specification maintainer and PR submitter for a new spec. I haven't started writing the CircleCI recipes yet (where this would all happen) but I'll post another update when I have something useful to look at! Here is the example running locally for my Container.html:

```
# This is cd-ing into the specifications Github repo I have locally
$ cd ../specifications/_specifications/
```

It didn't pass because I had a required field as an empty string... whomp whomp! 😆 I think I will also build an additional Docker container with entrypoints just for the openschemas console script - the schema-builder could be used because it has openbases installed inside, but it's a different console script so the user wouldn't find it easily, which isn't great. I think I probably will.
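The empty-string case is worth spelling out, because a naive "is the field present?" check passes it. A sketch of the kind of check function that catches it (the name and signature here are my assumptions, following the `dummy(spec, **kwargs)` convention above, not the actual openschemas criteria):

```python
# Hypothetical check in the dummy(spec, **kwargs) style: required fields
# must be present AND non-empty, so "field: ''" fails rather than passes.
def check_required(spec, fields=None):
    """Return True only if every required field exists and is non-empty."""
    fields = fields or []
    missing = [f for f in fields if not str(spec.get(f, "")).strip()]
    if missing:
        print("missing or empty: %s" % ", ".join(missing))
    return not missing

spec = {"name": "Container", "description": ""}
result = check_required(spec, fields=["name", "description"])
# description is present but empty, so result is False
```

Wired into a criteria file, this would be one entry with `fields` passed via `kwargs`.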
Glancing at this I also need to fine tune the level - right now the tests each have an associated level, but it's misleading because it will print
okey doke, validation is live in the specifications repository - you can (maybe?) see it working for the Container specification here --> https://circleci.com/gh/openschemas/specifications/6

The main categories of things tested are:
I'm definitely not an experienced ontologist, so if I'm missing something please open an issue to discuss so we can add it. Speaking of issues - I've created issues in the respective repositories for the other things discussed in this thread, so I'm going to close here! If you have feedback or want to otherwise help improve the validator, that would be fantastic! For reference, here are the issues that are still open that were created for our discussion:
Although it's lower priority, I definitely want to get generation of json-ld and RDF into the workflow here - it wasn't something I did off the bat because I didn't see it in the bioschemas specifications Jekyll template. If anyone has insights on that (insert bright orange airplane pointers directed toward the issue board)... Thanks again! Closing.
@vsoch we are working on We are beginners obviously, with json-ld. I'd like to have a simple tool to check the Json that each hosted Pull Request us. I'm sure by now you know if it exists or not and can point me in the right direction ;) Thanks for your help! I also like to document things for myself, as it might help somebody else or my future self ;)
hey @pierreozoux, cool! I'm a beginner too, I can't say I distinguish it from json other than it has attributes that start with @. I've finished up schemaorg python --> https://vsoch.github.io/2018/schemaorg/ and am very interested in the validation bit - I created a simple recipe.yml to run one locally, but the idea of an API is super awesome! Would you be interested for schemaorg python to add an endpoint for your API so it works programmatically, easily, with Python? Meaning that it wraps your endpoints and can help drive the validation. If you can tell me how it works, I'd be glad to implement this for you and write up docs etc. It's much more fun to work together, and validation is an important (and overlooked) thing generally.
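For illustration, here is roughly what such a thin wrapper around a remote validation endpoint could look like in stdlib Python. The URL and response shape are entirely hypothetical placeholders, not the real API being discussed:

```python
# Hypothetical sketch of wrapping a remote validation endpoint.
# The URL and the JSON reply format are placeholder assumptions.
import json
from urllib import request

def build_payload(spec):
    """Serialize a spec dict to JSON bytes for POSTing."""
    return json.dumps(spec, sort_keys=True).encode("utf-8")

def validate_remote(spec, url="https://example.org/api/validate"):
    """POST a spec to the endpoint and return the parsed JSON reply."""
    req = request.Request(url, data=build_payload(spec),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as response:
        return json.loads(response.read().decode("utf-8"))

# Building the payload works offline; validate_remote needs a live endpoint.
payload = build_payload({"@type": "SoftwareSourceCode", "name": "schemaorg"})
```

The schemaorg python side would then just expose `validate_remote` (or something like it) behind its existing client interface.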
Great that you want to work together ;)
Cool! What exactly do I do? tal.libreho.st is a Discourse forum, is that where you mean?
You can join us on Matrix :)
I'll make sure to drop by! What time zone are you in? I got eaten up with TODOs today, but if I know the best time when you guys are around, I can jump in then. If it's somewhere in Europe that might be okay too, because I have a (non-existent) sleep schedule, heh.
We are mainly in Europe. (I'm in Germany). Drop by the chat, there might be some people at any time, or come to the gitlab: |
hey schemaorg!
I'm writing a validator class for specifications produced with map2model in openschemas-python (WIP is here) and my gut is saying that validation of yaml (or actually, since it loads to json, json) has to be a solved problem. I am looking around and I've found a few (sort of close) resources, but none of them offer an actual Python client or similar:
I'm wondering if there is something simple, maybe like this? it's a bit of a weird question because it's validating the structure that would be used to validate other structures, but this should have a valid format too :)
If there really is absolutely nothing, I can come up with a simple one. I was going to create a `criteria.yml` that defines simple fields that should be present, along with a level (e.g., warning, error, info) corresponding with Python logger levels, and then have the criteria read in and checked against some input (either front matter from a Jekyll template, or yml/yaml). The library will provide a default, but of course the user could use some custom `criteria.yml` too! And at some point we would want json too, although that's not needed now since the sites render from front matter. Thanks in advance for your help!
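To make the "level corresponding with Python logger levels" idea concrete, a tiny sketch of how those criteria levels might map onto the `logging` module (the level names and the warning fallback are my assumptions, not a decided design):

```python
# Sketch: map criteria levels (info / warning / error) from a criteria.yml
# onto Python logging levels; falling back to warning is an assumption.
import logging

LEVELS = {
    "info": logging.INFO,
    "warning": logging.WARNING,
    "error": logging.ERROR,
}

def resolve_level(name, default="warning"):
    """Look up a criteria level by name, case-insensitively."""
    return LEVELS.get(str(name).lower(), LEVELS[default])

print(resolve_level("ERROR"))   # logging.ERROR
print(resolve_level("bogus"))   # falls back to logging.WARNING
```

A failed check could then call `logging.log(resolve_level(check_level), message)` and only `error` would need to trigger a non-zero exit.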