
validator for schema specification files? #2069

Closed · vsoch opened this issue Sep 23, 2018 · 21 comments

vsoch commented Sep 23, 2018

hey schemaorg!

I'm writing a validator class for specifications produced with map2model in openschemas-python (WIP is here), and my gut says that validation of yaml (or actually, since it loads to json, json) has to be a solved problem. I've been looking around and found a few (sort of close) resources, but none of them offer an actual python client or similar:

  • validator is a jar that was referenced in this issue - not exactly what I'm looking for.
  • Rich Results seems to be a Google thing for website metadata, but it's a web GUI without any code, and this doesn't help me much (and note I got there from here).
  • the Structured Data Testing Tool is also very pretty, but it helps with the next level up (testing a web page against a data schema), and there is no code I can see that helps with validation.
  • Yandex seems to have validators, again behind the mysterious pretty UI. I found them on Github but have no idea what is what.
  • Anything on this list here?

I'm wondering if there is something simple, maybe like this? It's a bit of a weird question, because I'm validating the structure that would be used to validate other structures, but that structure should have a valid format too :)

If there really is absolutely nothing, I can come up with a simple one. I was going to create a criteria.yml that defines simple fields that should be present, a level (e.g., warning, error, info) corresponding with python logger levels, and then have the criteria read in and checked against some input (either front matter from a jekyll template, or yml/yaml). The library will provide a default, but of course the user could use a custom criteria.yml too! And at some point we would want json as well, although that's not needed now since the sites render from front matter.

Thanks in advance for your help!


vsoch commented Sep 23, 2018

oh and one more note! I am developing a similar validator for a submission to JOSS (a markdown paper.md file) and it comes down to the same need! See this issue --> openbases/openbases-python#12. It would be fantastic to be able to use the same strategy / tool in both places (and again, I'm happy to move forward with the idea I outlined in that issue). Importantly, there shouldn't be extra or complicated dependencies beyond the libraries you'd expect to interact with the data structures.


vsoch commented Sep 23, 2018

And one more validation-relevant question - does anyone have suggestions / best practices for version control of the scripts associated with Google Sheets? I just found the script tab where you can write functions to export / validate (had a mind blown / amazing moment just now) and I want to make sure the code is version controlled. If there isn't an easy way, it would probably need to be done manually (bundling the entire Google Spreadsheet / scripts together in some way).

@akuckartz

I am not sure I understand what you would like to validate. Some suggestions which might be helpful:


vsoch commented Sep 23, 2018

I don't want to read a book or a standard; I want to:

  • define a set of criteria for what constitutes valid
  • run the criteria against a specification (e.g., something you would see here, like this one that bioschemas is maintaining as yaml front matter)
  • and then it says "Great job, you meet the criteria!" or "sorry, missing fields / not structured correctly," etc.

This would be added to continuous integration for the repo where you submit specifications like the one linked above. The closest thing I can think of is validating XML against an XSD. But here we are working with yaml derivatives, and Python (I definitely don't want to introduce a dependency on RDFUnit); I want a clean solution in Python with minimal additional dependencies.
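The nearest off-the-shelf analogue in Python is probably the jsonschema package paired with PyYAML. A minimal sketch, with the schema and file name as placeholders rather than anything from this thread:

import yaml
from jsonschema import validate, ValidationError

# A minimal JSON Schema describing required fields and their types
schema = {
    "type": "object",
    "required": ["name", "version", "description"],
    "properties": {
        "name": {"type": "string"},
        "version": {"type": "string"},
        "description": {"type": "string"},
    },
}

with open("Container.yml") as handle:
    spec = yaml.safe_load(handle)  # yaml loads to plain dicts/lists, i.e. json-shaped data

try:
    validate(instance=spec, schema=schema)
    print("Great job, you meet the criteria!")
except ValidationError as err:
    print("sorry, missing fields / not structured correctly: %s" % err.message)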


vsoch commented Sep 23, 2018

I figured that since scientific programming is strongly Python, and these standards aren't anything new, this must almost certainly be a solved problem! But maybe I'm wrong about that.


akuckartz commented Sep 23, 2018

"I want a clean solution in Python with minimal additional dependencies."

I wonder if that is possible.

Two additional suggestions: pySHACL is implemented in Python, but using it still requires some understanding of SHACL. And it is about validating RDF, not YAML.
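For reference, a minimal sketch of what pySHACL usage looks like (the file names are placeholders); note that it operates on RDF graphs, not YAML:

from rdflib import Graph
from pyshacl import validate

# Both the data and the shapes are RDF graphs (here parsed from Turtle files)
data_graph = Graph().parse("spec.ttl", format="turtle")
shapes_graph = Graph().parse("shapes.ttl", format="turtle")

# Returns a conformance flag, a report graph, and a human-readable report
conforms, report_graph, report_text = validate(data_graph, shacl_graph=shapes_graph)
print(conforms)
print(report_text)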

And there is this: a Swagger 2.0 and OpenAPI 3.0 parser and validator for Node and browsers, which can also be used for YAML - but it is not implemented in Python.

Hopefully others have better suggestions...


vsoch commented Sep 23, 2018

Thanks @akuckartz! I'll take a look at pySHACL, but I'm trying to stay away from RDF. There's nothing wrong with RDF, but I've noticed that RDF/sparql has much higher barriers to entry for the scientific community than data structures like json (and even yaml) paired with Python. Given that, and given that bioschemas is generating these specifications as yaml front matter, I think it would be wise to stay consistent, and simple. I'm also a bit hesitant to bring in a dependency on node.js - the bugs introduced (and excessive amount of code) are astounding! Still, these are good things to know about (I didn't find them in my searching), so thank you kindly for showing me :o)

Do others have suggestions? Something that meets the criteria discussed above?


vsoch commented Sep 24, 2018

Here is a quick (WIP) example of what I think would be a good criteria.yml to start - really simple, with groupings of criteria, each criterion having a function, level, and name to run over a particular specification (loaded, or provided as a file - the python library will handle both use cases). I'll keep working on this and then show you a working example soon!


vsoch commented Sep 24, 2018

okey doke, I haven't added any real criteria yet, but I have a working dummy example. Let's say we have this criteria file:

version: 1
checks:
    pass:
      name: Dummy criteria that always returns warning
      level: log
      function: openschemas.main.validate.criteria.base.dummy
    fail:
      name: Dummy criteria that always returns log
      level: error
      function: openschemas.main.validate.criteria.base.dummy
      kwargs:
        passing: False

The checks are under "checks" and each entry is an individual check (e.g., there is a check called "fail" that always returns False and then triggers a fail/exit, and one called "pass" that always returns True and we move on). The user can, of course, name these whatever they like. Within each check there is a name (more like a description; this mirrors CircleCI), a level, a function, and (optionally) kwargs.
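To make the dotted function path concrete, here is a minimal sketch (my own, not the library's code) of how a runner could resolve and call each check; the helper names resolve and run_checks are hypothetical:

import importlib
import yaml

def resolve(dotted_path):
    '''Turn a dotted path like "package.module.function" into the callable it names.'''
    module_path, _, func_name = dotted_path.rpartition(".")
    return getattr(importlib.import_module(module_path), func_name)

def run_checks(criteria_file, spec):
    '''Run every check in a criteria file against an already-loaded spec (dict).'''
    with open(criteria_file) as handle:
        criteria = yaml.safe_load(handle)
    for name, check in criteria["checks"].items():
        function = resolve(check["function"])
        passed = function(spec, **check.get("kwargs", {}))
        level = check.get("level", "warning")   # level defaults to warning
        print("[check:%s] %s" % (name, "pass" if passed else level))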

And just for clarity, here is the incredibly silly "dummy" function. It serves no purpose except to print something and then return the status indicated by the "passing" variable:

from random import choice

def dummy(spec, passing=True):
    '''dummy can be used for testing; it returns the status given as an argument

       Parameters
       ==========
       spec: the input spec, in json format (dict)
       passing: boolean to return True or False (default is True)
    '''
    msg = "not True"
    if passing:
        msg = "True"

    messages = ['Roses are red, violets are blue, here is a test, it is %s' % msg,
                "If I were a rich man, well then I wouldn't be a dinosaur.",
                "Sweet dreams are made of cheese, who am I to diss a brie?"]

    # choose one of the silly messages at random, then report the requested status
    message = choice(messages)
    print(message)
    return passing

Actually, all of them are optional - even if you mess up the function, there is a fallback that will tell you so. If you don't provide a name, a robot name is generated. If you don't provide a level, it defaults to warning. Then we can call it from the command line like this (note that I'm providing a custom criteria file and an input specification file):

$ openschemas validate --criteria openschemas/main/validate/criteria/dummy.yml --infile ../specifications/_specifications/Container.html
Found ../specifications/_specifications/Container.html, valid name
Found openschemas/main/validate/criteria/dummy.yml, valid name
[criteria:dummy.yml]
Found /home/vanessa/Documents/Dropbox/Code/openschemas/specifications/_specifications/Container.html, valid name
[group:pass] ----- <start
If I were a rich man, well then I wouldn't be a dinosaur.
[check:Dummy criteria that always returns warning]
 test:function openschemas.main.validate.criteria.base.dummy
 test:result pass
 test:level LOG
LOG openschemas.main.validate.criteria.base.dummy
[group:pass] ----- end>
[group:fail] ----- <start
Sweet dreams are made of cheese, who am I to diss a brie?
[check:Dummy criteria that always returns log]
 test:function openschemas.main.validate.criteria.base.dummy
 test:result fail
 test:level ERROR
ERROR openschemas.main.validate.criteria.base.dummy

The funny print statements come from the functions themselves. They can do whatever they like - they are given the data structure of the specification and the custom list of kwargs (if defined) and then can go to town. You can also call the validator from within Python:

from openschemas.main import Client
validator = Client.validator(infile="Container.yml")
validator.validate_criteria(criteria="dummy.yml")

I need to take a look at pySHACL and the associated content, and probably get some sleep :) Next I'm going to write some actual criteria and test how they do! This is a cool little library because it could be used very generally (for other things too :O))


RichardWallis commented Sep 24, 2018

@vsoch I have been watching your enthusiastic (and apparently sleep depriving) approach to this evolve in this thread with interest.

Putting aside for the moment the practical consideration of which technology you use to capture validation constraints, and to drive software with them (yaml vs SHACL, etc.):

I am wondering what those constraints are, how and by whom they will be defined, and how limited they are to individual, specific use cases.

For example, for the Schema.org vocabulary itself there are exceedingly few constraints - no properties are 'required'; any property can be repeated; a property can have one or more expected types as a value, but in addition could also have a type Text, URL, or Role as a value; an entity can be described as being any combination of one or more types (for example, something could be validly described as a Motorcycle and a CreativeWork and a TouristAttraction).

Alternatively, Google's Structured Data Testing Tool applies more stringent constraints (e.g. a LocalBusiness must have an image property value) to validate supplied data for suitability for their services, such as rich display, voice search, etc.

As Schema.org is fundamentally an RDF vocabulary, and some constraints may be more semantic than the simple "a Book must have a name", SHACL is potentially a good match for defining what many might call 'profiles': use-specific constraints on the generic vocabulary. Producing library-focused data - use a bibliographic SHACL profile; producing vehicle data - use an automotive SHACL profile; etc.

I am no SHACL expert; I am only channelling my understanding of the thoughts of others.

I understand your comment about communities being more used to, and therefore open to, yaml. However, surely that is only an issue for those defining a set of validation constraints - I would hope that most of a community would just apply them and consume human-readable errors (or, hopefully, the lack of them).

Just my 2 cents from the sidelines.....


vsoch commented Sep 24, 2018

@RichardWallis the criteria for a general specification I expect to be very minimal - mostly that the fields are provided (and in the correct locations) for rendering in the specifications repository. Since these specifications will drive labeling of data types (e.g., a container recipe) for programmatic discovery, the more important and detailed criteria I would expect to be provided with the individual specifications. I can talk about containers because that's more of a domain of expertise for me. If I am designing entire software around containers (and specifically functions to interact with them), my software will interact with the specification and then the metadata described by it. I need to have absolute certainty that changes to the specification, or a contribution of a new entity for the software to munch on, don't break my software, and the only way to do that is with continuous integration testing that is constantly run over all levels (specification, data, and software) to catch the bugs before they break anything.

In this light, it would be more akin to Google's Structured Data Tool, but geared for developers / scientists, because it will come down to a Github repository you fork with a circleci script, or even just testing that happens without you doing anything when you open a PR to a repository. It has to be easy enough to use and customize that a researcher who doesn't know a thing about RDF or ontologies (and maybe doesn't care) can come along and still easily write specifications and run tests. With the current technology out there, if you've been around it a long time, or are very familiar with RDF, you may not realize how high a barrier to entry it is. The approach I'm taking gives the developer (who is likely a researcher) freedom to write tests that can be seamlessly plugged into testing the data structure (produced with a Google Drive export) without knowing how to do much more than write a bulleted list (yaml) and then write functions in python. As I've said before, Python is huge in the scientific community. In developing for the audience the tooling is intended for, I believe this approach will have the strongest, quickest impact. Also, I hope it doesn't go unnoticed that some prominent tools (TravisCI, CircleCI and other CI, jekyll to render static sites, docker-compose, kubernetes, etc.) use yaml. I think RDF might be the best choice if we are exclusively querying graphs, but for these tooly things, for the reasons above, I believe in yaml/json over it.

I hope this clears up my thinking on why yaml is a most reasonable start. I want to empower non-ontology/RDF experts to create and contribute tools for specifications to drive development in the space.


vsoch commented Sep 24, 2018

hey everyone (and @RichardWallis)! I finished up the first go at the default specification criteria, and added the start of the docs (please excuse their brevity and incompleteness, I will make these much better: https://openschemas.github.io/openschemas-python/html/usage.html). Funnily enough, I found a lot of invalid things (mostly missing fields) when I validated my Container.html spec. Next I'm going to add the validator (now served by the openschemas docker container) to the specifications repository, and go through what would be a reasonable back and forth between a specification maintainer and a PR submitter for a new spec. I haven't started writing the circleCI recipes yet (where this would all happen), but I'll post another update when I have something useful to look at! Here is the example running locally for my Container.html:

# This is cd-ing into the specifications Github repo I have locally
$ cd ../specifications/_specifications/
# I'm going to run the test against the Container.html file, using the default specification.yml
$ openschemas validate --infile Container.html 
Found Container.html, valid name
[criteria:specification.yml]
Found /home/vanessa/Documents/Dropbox/Code/openschemas/specifications/_specifications/Container.html, valid name
[group|start:global]
[field:description]
[field:edit_url]
Testing URL https://github.com/openschemas/specifications/tree/master/_specifications/Container.html
[field:gh_tasks]
Testing URL https://github.com/openschemas/specifications/labels/type%3A%20Container
[field:hierarchy]
[field:mapping]
[field:name]
[field:parent_type]
[field:spec_info]
[field:spec_type]
[field:status]
[field:subtitle]
[field:use_cases_url]
Testing URL https://www.github.com/openschemas/spec-container
[field:version]
[check:Check for required global sections and metadata]
 test:function openschemas.main.validate.criteria.structure.required
 test:result pass
 test:level ERROR
[group|end:global]
[group|start:metadata]
[field:gh_folder]
Testing URL https://github.com/openschemas/specifications/tree/master/_specifications/Container.html
[check:Check for suggested (not required) fields]
 test:function openschemas.main.validate.criteria.structure.optional
 test:result pass
 test:level WARNING
[group|end:metadata]
[group|start:spec_info]
[field:description]
[field:full_example]
ERROR full_example is missing, invalid

It didn't pass because I had a required field as an empty string... whomp whomp! 😆

I think I will also build an additional docker container with entrypoints just for the openschemas console script. The schema-builder could be used, because it has openbases installed inside, but it's a different console script, so the user wouldn't find it easily - which isn't great. So I think I probably will.


vsoch commented Sep 24, 2018

Glancing at this, I also need to fine-tune the level output - right now each test has an associated level, but it's misleading because it will print test:level ERROR even when the test passes. I should probably skip this print line if the test passes, so the user isn't confused, and only print the level when there is some non-pass state.
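Something like this minimal sketch, where check and passed are illustrative names rather than the library's actual code:

def report_level(check, passed):
    '''Only print the check's level when it did not pass, so a passing
       check never prints a misleading "ERROR" label.'''
    if not passed:
        print(" test:level %s" % check.get("level", "warning").upper())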


vsoch commented Sep 25, 2018

okey doke, validation is live in the specifications repository - you can (maybe?) see it working for the Container specification here --> https://circleci.com/gh/openschemas/specifications/6 The main categories of things tested are:

  • required global fields
  • optional global fields
    • both of these include testing if empty / not found, that urls return 200 responses, and data types (see the sketch after this list)
  • semantic versioning
  • metadata in the list of mappings and the "spec_info" group (note these are both what bioschemas is using; I am following suit!)
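As referenced in the list above, a minimal sketch of what the url-returns-200 criterion could look like (the function name and field argument are illustrative, and it assumes the requests library):

import requests

def url_returns_200(spec, field):
    '''Criterion: the URL stored under spec[field] must answer with HTTP 200.'''
    url = spec.get(field)
    if not url:
        return False
    try:
        response = requests.head(url, allow_redirects=True, timeout=10)
    except requests.RequestException:
        return False
    return response.status_code == 200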

I'm definitely not an experienced ontologist, so if I'm missing something, please open an issue to discuss so we can add it. Speaking of issues - I've created issues in the respective repositories for the other things discussed in this thread, so I'm going to close here! If you have feedback or want to otherwise help with improving the validator, it would be fantastic! For reference, here are the issues that are still open that were created for our discussion:

Although it's lower priority, I definitely want to get generation of json-ld and RDF into the workflow here - it wasn't something I did off the bat because I didn't see it in the bioschemas specifications jekyll template. If anyone has insights on that (insert bright orange airplane pointers directed toward the issue board) ✈️ 🛩️

Thanks again! Closing.

vsoch closed this as completed Sep 25, 2018

pierreozoux commented Nov 13, 2018

@vsoch we are working on https://libreho.st/directory.json

We are obviously beginners with json-ld.

I'd like to have a simple tool to check the json that each hoster sends us in a Pull Request.

I'm sure by now you know whether it exists or not and can point me in the right direction ;)

Thanks for your help!

I also like to document things for myself, as it might help somebody else or my future self ;)


vsoch commented Nov 13, 2018

hey @pierreozoux, cool! I'm a beginner too; I can't say I distinguish it from json other than that it has attributes that start with @. I've finished up schemaorg python --> https://vsoch.github.io/2018/schemaorg/ and am very interested in the validation bit - I created a simple recipe.yml to run one locally, but the idea of an api is super awesome!

Would you be interested in schemaorg python adding a wrapper for your API so it works programmatically and easily with python? Meaning that it wraps your endpoints and can help drive the validation. If you can tell me how it works, I'd be glad to implement this for you, write up docs, etc. It's much more fun to work together, and validation is an important (and often overlooked) thing generally.
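A minimal sketch of what such a wrapper could look like (the endpoint URL and payload shape are hypothetical - the real API had not been defined at this point in the thread):

import requests

def validate_entry(entry, endpoint="https://example.org/validate"):
    '''POST a directory entry (dict) to a hypothetical validation endpoint
       and return the parsed JSON verdict.'''
    response = requests.post(endpoint, json=entry)
    response.raise_for_status()
    return response.json()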

@pierreozoux

Great that you want to work together ;)
Come to the matrix listed here: https://libreho.st and we can discuss further details.
(I'll create a specific channel for the "catalogue".)


vsoch commented Nov 14, 2018

Cool! What exactly do I do? tal.libreho.st is a Discourse forum - is that where you mean?

@pierreozoux

You can join us on matrix :)
https://riot.allmende.io/#/room/#librehosters-techtalk:chat.weho.st


vsoch commented Nov 14, 2018

I'll make sure to drop by! What time zone are you in? I got eaten up by TODOs today, but if I know the best time when you guys are around, I can jump in then. If it's somewhere in Europe, that might be okay too, because I have a (non-existent) sleep schedule, heh.

@pierreozoux

We are mainly in Europe. (I'm in Germany).
Sleep is important! Burnout is a real disease in the free software community, and we have to care for each other :)

Drop by the chat, there might be some people at any time, or come to the gitlab:

https://lab.libreho.st/librehosters/librehost-api
