The following describes the architecture of the eFlows4HPC Data Catalog. The service provides information about the data sets used in the project. The catalog stores information about their locations, schemas, and additional metadata.
Main features:

- keep track of data sources
- enable registration of new data sources
- provide a user view as well as a simple API to access the information
ID | Requirement | Explanation |
---|---|---|
R1 | View data sources | View the list of data sources and details on particular ones (web page + API) |
R2 | Register data sets | Authenticated users should be able to register/change data sets with additional metadata |
R3 | No metadata schema | We don't impose a schema on the metadata (it is not yet known what will be relevant) |
R4 | Documented API | Swagger/OpenAPI |
Constraint | Explanation |
---|---|
Authentication | OAuth-based for admin users |
Deployment | We shall use CI/CD; this project is also a playing field to set it up and test it before the Data Logistics Service |
Docker-based deployment | This technology will be used in the project anyway |
This product is not mission-critical, so we want to keep it simple; a solution even without a backend database would be possible. The API is documented with Swagger/OpenAPI (e.g. FastAPI), and the front-end is a static page with JavaScript calls to the API.
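As a rough sketch of this shape, assuming FastAPI as suggested above: a public listing endpoint plus an OAuth2-protected registration endpoint. The route names, the token check, and the in-memory store are placeholders, not the actual catalog API:

```python
# Hypothetical sketch of the catalog API with FastAPI: a public listing
# endpoint plus an OAuth2-protected registration endpoint.
from fastapi import Depends, FastAPI, HTTPException
from fastapi.security import OAuth2PasswordBearer

app = FastAPI(title="Data Catalog")
oauth2_scheme = OAuth2PasswordBearer(tokenUrl="token")

# Placeholder in-memory store; the real service would use a storage backend.
datasets = {"example": {"name": "Example", "url": "https://example.org"}}

def get_admin(token: str = Depends(oauth2_scheme)) -> str:
    # Placeholder check; a real deployment would validate the token
    # against the OAuth provider.
    if token != "valid-admin-token":
        raise HTTPException(status_code=401, detail="Not authenticated")
    return "admin"

@app.get("/dataset")
def list_datasets():
    return datasets

@app.put("/dataset/{dataset_id}")
def register_dataset(dataset_id: str, data: dict, user: str = Depends(get_admin)):
    datasets[dataset_id] = data
    return {"registered": dataset_id}
```

Running this with uvicorn exposes the interactive Swagger UI at `/docs` and the OpenAPI description at `/openapi.json` automatically, which covers requirement R4 without extra work.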
- Code in GitLab
- Resources on HDF Cloud
- Automatic deployment with Docker + docker-compose, OpenStack API
We use the Docker image registry in GitLab to build new images.
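For illustration, a compose file along these lines could pull the image from the GitLab registry; the image path, port, and volume are assumptions:

```yaml
# Hypothetical docker-compose.yml; image path, port, and volume are assumptions.
services:
  datacatalog:
    image: registry.gitlab.com/eflows4hpc/datacatalog:latest
    ports:
      - "80:8000"          # expose the API / front-end
    volumes:
      - ./data:/app/data   # file-based storage backend lives here
    restart: unless-stopped
```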
The main data model is based on JSON and uses pydantic. Resources in the catalog belong to one of two storage classes (sources and targets); the number of classes may change in the future.
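A minimal sketch of such a model, assuming pydantic as stated above; the class and field names are illustrative, not the project's actual schema:

```python
# Hypothetical sketch of the catalog data model with pydantic.
from enum import Enum
from typing import Dict
from pydantic import BaseModel

class StorageClass(str, Enum):
    SOURCE = "source"   # data sources
    TARGET = "target"   # storage targets

class LocationData(BaseModel):
    name: str
    url: str
    storage_class: StorageClass = StorageClass.SOURCE
    # Free-form metadata: no schema is imposed (requirement R3).
    metadata: Dict[str, str] = {}
```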
The actual storage of the catalog information happens behind an abstract interface; the first implementation stores the data in a file, and other backends can be added later. The API uses this backend abstraction to manage the information.
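The separation could look roughly like this; the class and method names are assumptions sketching the abstraction, not the project's actual interface:

```python
# Hypothetical sketch of the storage abstraction: an abstract interface
# with a simple JSON-file backend as the first implementation.
import json
from abc import ABC, abstractmethod
from typing import Dict

class AbstractStorageBackend(ABC):
    @abstractmethod
    def list_datasets(self) -> Dict[str, dict]: ...

    @abstractmethod
    def add_dataset(self, dataset_id: str, data: dict) -> None: ...

class JsonFileBackend(AbstractStorageBackend):
    def __init__(self, path: str):
        self.path = path

    def list_datasets(self) -> Dict[str, dict]:
        try:
            with open(self.path) as f:
                return json.load(f)
        except FileNotFoundError:
            return {}

    def add_dataset(self, dataset_id: str, data: dict) -> None:
        datasets = self.list_datasets()
        datasets[dataset_id] = data
        with open(self.path, "w") as f:
            json.dump(datasets, f)
```

A database-backed class implementing the same interface could later replace the file backend without touching the API layer.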
The web front-end consists of static HTML files generated from templates. This gives a lot of flexibility and allows for easy scaling if required.
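One way to generate those pages, assuming a Jinja2-style template engine; the template file and variables are hypothetical:

```python
# Hypothetical sketch: render static HTML pages from Jinja2 templates.
import os
from jinja2 import Environment, FileSystemLoader

env = Environment(loader=FileSystemLoader("templates"))
template = env.get_template("datasets.html")  # assumed template file

# In the real service this data would come from the catalog API.
datasets = {"example": {"name": "Example", "url": "https://example.org"}}

os.makedirs("site", exist_ok=True)
with open("site/datasets.html", "w") as f:
    f.write(template.render(datasets=datasets))
```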