The following describes the architecture of the eFlows4HPC Data Catalog. The service provides information about the data sets used in the project. The catalog stores information about their locations, schemas, and additional metadata.
Main features:

- keep track of data sources
- enable registration of new data sources
- provide a user view as well as a simple API to access the information
ID | Requirement | Explanation |
---|---|---|
R1 | View data sources | View the list of data sources and details on particular ones (web page + API) |
R2 | Register data sets | Authenticated users should be able to register/change data sets with additional metadata |
R3 | No metadata schema | We don't impose a schema on the metadata (it is not yet known what will be relevant) |
R4 | Documented API | Swagger/OpenAPI |
Constraint | Explanation |
---|---|
Authentication | OAuth-based for admin users |
Deployment | We shall use CI/CD; this project is also a playing field to set it up and test it before the Data Logistics Service |
Docker-based deployment | This technology will be used in the project anyway |
This product is not mission-critical, so we want to keep it simple; a solution even without a backend database would be possible. The API is documented with Swagger/OpenAPI (e.g. FastAPI), and the front-end is a static page with JavaScript calls to the API.
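As a rough sketch of this shape, assuming FastAPI as suggested above: a public listing endpoint plus an OAuth2-protected registration endpoint. The route names, the token check, and the in-memory store are placeholders, not the actual catalog API:

```python
# Hypothetical sketch of the catalog API with FastAPI: a public listing
# endpoint plus an OAuth2-protected registration endpoint.
from fastapi import Depends, FastAPI, HTTPException
from fastapi.security import OAuth2PasswordBearer

app = FastAPI(title="Data Catalog")
oauth2_scheme = OAuth2PasswordBearer(tokenUrl="token")

# Placeholder in-memory store; the real service would use a storage backend.
datasets = {"example": {"name": "Example", "url": "https://example.org"}}

def get_admin(token: str = Depends(oauth2_scheme)) -> str:
    # Placeholder check; a real deployment would validate the token
    # against the OAuth provider.
    if token != "valid-admin-token":
        raise HTTPException(status_code=401, detail="Not authenticated")
    return "admin"

@app.get("/dataset")
def list_datasets():
    return datasets

@app.put("/dataset/{dataset_id}")
def register_dataset(dataset_id: str, data: dict, user: str = Depends(get_admin)):
    datasets[dataset_id] = data
    return {"registered": dataset_id}
```

Running this with uvicorn exposes the interactive Swagger UI at `/docs` and the OpenAPI description at `/openapi.json` automatically, which covers requirement R4 without extra work.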
- Code in GitLab
- Resources on HDF Cloud
- Automatic deployment with Docker + docker-compose, OpenStack API
We use the Docker image registry in GitLab to build new images.
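For illustration, a compose file along these lines could pull the image from the GitLab registry; the image path, port, and volume are assumptions:

```yaml
# Hypothetical docker-compose.yml; image path, port, and volume are assumptions.
services:
  datacatalog:
    image: registry.gitlab.com/eflows4hpc/datacatalog:latest
    ports:
      - "80:8000"          # expose the API / front-end
    volumes:
      - ./data:/app/data   # file-based storage backend lives here
    restart: unless-stopped
```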
The main data model is based on JSON and uses pydantic. Resources in the catalog belong to one of two storage classes (sources and targets); the number of classes may change in the future.
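A minimal sketch of such a model, assuming pydantic as stated above; the class and field names are illustrative, not the project's actual schema:

```python
# Hypothetical sketch of the catalog data model with pydantic.
from enum import Enum
from typing import Dict
from pydantic import BaseModel

class StorageClass(str, Enum):
    SOURCE = "source"   # data sources
    TARGET = "target"   # storage targets

class LocationData(BaseModel):
    name: str
    url: str
    storage_class: StorageClass = StorageClass.SOURCE
    # Free-form metadata: no schema is imposed (requirement R3).
    metadata: Dict[str, str] = {}
```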
The actual storage of the catalog information happens behind an abstract interface; the first implementation stores the data in a file, and other backends can be added later. The API uses this backend abstraction to manage the information.
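The separation could look roughly like this; the class and method names are assumptions sketching the abstraction, not the project's actual interface:

```python
# Hypothetical sketch of the storage abstraction: an abstract interface
# with a simple JSON-file backend as the first implementation.
import json
from abc import ABC, abstractmethod
from typing import Dict

class AbstractStorageBackend(ABC):
    @abstractmethod
    def list_datasets(self) -> Dict[str, dict]: ...

    @abstractmethod
    def add_dataset(self, dataset_id: str, data: dict) -> None: ...

class JsonFileBackend(AbstractStorageBackend):
    def __init__(self, path: str):
        self.path = path

    def list_datasets(self) -> Dict[str, dict]:
        try:
            with open(self.path) as f:
                return json.load(f)
        except FileNotFoundError:
            return {}

    def add_dataset(self, dataset_id: str, data: dict) -> None:
        datasets = self.list_datasets()
        datasets[dataset_id] = data
        with open(self.path, "w") as f:
            json.dump(datasets, f)
```

A database-backed class implementing the same interface could later replace the file backend without touching the API layer.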
The web front-end consists of static HTML files generated from templates. This gives a lot of flexibility and allows for easy scaling if required.
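One way to generate those pages, assuming a Jinja2-style template engine; the template file and variables are hypothetical:

```python
# Hypothetical sketch: render static HTML pages from Jinja2 templates.
import os
from jinja2 import Environment, FileSystemLoader

env = Environment(loader=FileSystemLoader("templates"))
template = env.get_template("datasets.html")  # assumed template file

# In the real service this data would come from the catalog API.
datasets = {"example": {"name": "Example", "url": "https://example.org"}}

os.makedirs("site", exist_ok=True)
with open("site/datasets.html", "w") as f:
    f.write(template.render(datasets=datasets))
```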