This repository provides example extractors and recipes for the schemaorg Python module, to help you extract metadata from your datasets, software, and other entities described in schema.org.
The following specifications have Dockerfiles (and associated GitHub Actions) for you to use! See the subdirectories for usage:
- Dataset is an example starter script to extract a Dataset.
- ImageDefinition is a kind of SoftwareSourceCode extended to describe containers. We provide a Dockerfile that builds the extractor to generate a static page for an input Dockerfile.
- ContainerTree extends ImageDefinition to also include a filesystem listing that can be used to generate a container tree.
For both ImageDefinition and ContainerTree, when you deploy to GitHub Pages for the first time, you need to switch GitHub Pages to deploy from master and then back to the gh-pages branch on deploy. There is a known issue with permissions if you deploy to the branch without activating it (as an admin) from the repository first.
The following examples for entities (children of "Thing") defined in schema.org are also provided. These specifications don't yet have Docker containers or GitHub Actions extractors.
- DataCatalog: a collection or grouping of Datasets
- Organization: a complete organization, with a ContactPoint
- SoftwareSourceCode: an example extraction, shown here for a Dockerfile.
For each of the above, the metadata shown is also embedded in the page as JSON-LD (visible when you "View Source").
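As a point of reference, embedded JSON-LD is simply a script tag in the rendered page. The entity and values below are illustrative, not copied from an actual example page here:

```html
<!-- Illustrative only: the entity type and values are placeholders -->
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "SoftwareSourceCode",
  "name": "extract-dockerfile",
  "description": "An example extraction for a Dockerfile."
}
</script>
```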
Each folder above includes an example Python script to extract metadata (extract.py), a recipe to follow (recipe.yml), and the specification in YAML format (in the case of a specification not served by production schema.org).
For the Docker and GitHub Actions usage, see inside the ImageDefinition folder. For all other schema.org entities and local usage, details are provided here. Before running these examples, make sure you have installed the module (and note this module is under development; contributions are welcome!):
```bash
pip install schemaorg
```
To extract a recipe for a particular datatype, you can modify extract.py and recipe.yml for your particular needs, or use them as is. Generally we:
- Read in a specific version of the schemaorg definitions provided by the library
- Read in a recipe for a template that we want to populate (e.g., google/dataset)
- Use helper functions provided by the template (or our own) to extract
- Extract, validate, and generate the final dataset (a sketch follows this list)
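Here is a minimal sketch of that flow, modeled loosely on the Dataset example. The property values are illustrative, and the import paths and the validate call are assumptions that may differ across versions of the schemaorg module:

```python
from schemaorg.main import Schema
from schemaorg.main.parse import RecipeParser
from schemaorg.templates.google import make_dataset

# Read in the schema.org definition for the entity we want to describe
spec = Schema("Dataset")

# Populate the properties we care about (values here are illustrative)
spec.add_property("name", "Dinosaur Dataset")
spec.add_property("description", "A collection of Dockerfiles.")
spec.add_property("version", "1.0.0")

# Read in the recipe that says which properties are required
recipe = RecipeParser("recipe.yml")

# Validate the populated spec against the recipe (method name assumed;
# check the module for the exact validation hook)
recipe.validate(spec)

# Generate the final template page with the metadata embedded as JSON-LD
make_dataset(spec, "index.html")
```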
The goal of the software is to provide enough structure to help the user (typically a developer), but not so much that it becomes burdensome to use.
If I am a provider of a service and want my users to label their data for my service, I need to tell them how to do this. I do so by way of a recipe file: in each example folder there is a file called recipe.yml that is a simple listing of the required fields defined for the entities that are needed. For example, the recipe.yml in the "SoftwareSourceCode" folder tells the parser that we need to define properties for "SoftwareSourceCode" and an Organization or Person. With the schemaorg Python module I can learn that the "SoftwareSourceCode" definition has 121 properties, but the recipe tells us that we only need a subset of those properties for a valid extraction.
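A recipe along these lines gives the general idea; the key names here are assumptions, and the recipe.yml files in the example folders are the authoritative reference:

```yaml
# Illustrative sketch of a recipe.yml: key names are assumptions,
# see the example folders for the actual format.
version: 1
schemas:
  SoftwareSourceCode:
    required:
      - name
      - description
      - codeRepository
  Person:
    required:
      - name
```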
The extract.py script in each folder shows how to extract metadata and use the schemaorg Python module to generate the final template page (see the sketch above). This file could be run in multiple places:
- In a continuous integration setup, so that each change to master updates the GitHub Pages metadata (see the workflow sketch after this list).
- Using a tool like datalad that allows for version control of such metadata, and definition of extractors (also in Python).
- As a GitHub hook (or Action) that is run at any stage in the development process.
- Rendered by a web server that provides container recipes that should be indexed by Google Search (e.g., Singularity Hub).
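As one example of the continuous integration option, a hypothetical GitHub Actions workflow might look like the following. The workflow name, branch, and deploy step are assumptions; the containers described below bundle the same extraction logic ready to use:

```yaml
# Hypothetical workflow sketch: names and steps are illustrative, and the
# Dockerfiles in this repository provide the real, ready-to-use extractors.
name: extract-metadata
on:
  push:
    branches: [master]

jobs:
  extract:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Run the extractor
        run: |
          pip install schemaorg
          python extract.py
      - name: Publish
        # Push the generated index.html to the gh-pages branch here,
        # e.g., with a pages deployment action of your choice.
        run: echo "deploy index.html to gh-pages"
```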
For the folders with associated containers, you will find a Dockerfile (and an associated entrypoint.sh)! These build the extractor into an image that can be used with GitHub Actions.
- extract-dockerfile writeup: demonstrates extraction for a Dockerfile.
- extract-dockerfile repository: the extractor itself.
- dockerfiles: a scaled extraction (under development) for ~30-60K Dockerfiles, a subset of the Dinosaur Dataset.