This repository provides a project framework that students can use to jump-start their work.
We recommend that students clone this repository and use it as the starting point (initial commit) for their own GitHub repository, which will hold all of the project's data, files, and scripts going forward (PDFs, Python and shell scripts, Postgres dumps, JSON files, etc.).
caldarium_med_parsing
├── Dockerfiles/ - (Provided Folder)
│ └── helper.Dockerfile - (Provided File) Needed to run the Helper Container
├── schemas/ - (Provided Folder) To house the .json schemas
│ └── invoice.json - (Provided File) The .json schema for the invoice pdfs
├── minio_buckets/ - (Provided Folder) To house the buckets loaded from and into MinIO
│ └── invoices/ - (Provided Folder) This folder represents the MinIO invoices Bucket which can be loaded into and from MinIO
│ ├── invoice_T1_gen1.pdf - (Provided File) This is an invoice pdf
│ ├── invoice_T1_gen2.pdf - (Provided File) Another invoice pdf...
│ └── the rest of the invoices... - (Provided File) ...
├── postgres_dump/ - (Provided Folder) This folder houses the Postgres dump file
│ └── med_parsing_dump.sql - (Provided File) This file will contain your Postgres data that can be loaded into/from the Postgres server
├── FastAPI/ - (Not Provided Folder) This folder would contain the files that you create, including FastAPI files
│ ├── your first api files - (Not Provided File) FastAPI file
│ └── the rest of your api files or other files - (Not Provided File) FastAPI file or other project file
├── Another Project Folder/ - (Not Provided Folder) Another folder, perhaps to house .json files if parsing happens before Postgres integration
│ └── Some Project Files - (Not Provided Files) .json or other files
├── README.md - (Provided File) This file is the one you are reading right now. Once you start this project, it should be modified to represent both this starting framework as well as your contributions.
├── docker-compose.yml - (Provided File) Defines the Docker services and their properties.
├── export_data.sh - (Provided File) Script to export data from the MinIO and Postgres servers.
├── bootstrap.sh - (Provided File) Script to load data into the MinIO and Postgres servers.
├── schema_validation.py - (Provided File) Python script to validate parsed .json files against the provided schema.
└── .gitignore - (Provided File) Tells Git which files to ignore.
You need to have both Docker and Git installed beforehand.
We recommend Docker Desktop as well. You can use it to easily see whether the containers/images are running, and to delete them if you need a clean wipe.
You will also need Python installed to run schema_validation.py.
First, you need to download the code onto your computer:
git clone https://github.com/masshu0504/Caldarium
cd caldarium_med_parsing_start
git clone downloads the repository from GitHub.
cd moves you into the project directory so you can run Docker and scripts from the correct location.
Next, start all required services (Label Studio, MinIO, Postgres, and any helper containers) using Docker Compose:
docker compose up -d --build
docker compose up starts all services defined in docker-compose.yml.
-d runs the containers in the background (detached mode).
--build ensures that Docker rebuilds images if there are any changes.
To execute the export_data.sh and bootstrap.sh scripts, you will need to enter the helper container from your command line:
docker compose run --rm helper bash
This opens an interactive bash session in the helper container.
--rm ensures the container is removed automatically after you exit.
After committing changes to the Postgres DB or uploading PDFs to MinIO, you can export these changes by running the export_data.sh script.
All Postgres data will be dumped into the postgres_dump folder as a file named med_parsing_dump.sql.
All files uploaded to MinIO will be saved into folders within the minio_buckets folder. For example, a PDF named file.pdf stored in a MinIO bucket named test will be saved to the following path: minio_buckets/test/file.pdf.
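As a sketch of that mapping (the helper function below is illustrative only, not part of the provided scripts): the bucket name becomes a folder under minio_buckets/, and the object key becomes the file path beneath it.

```python
def exported_path(bucket: str, object_key: str, root: str = "minio_buckets") -> str:
    # Bucket name becomes a folder; the object key becomes the file name.
    return f"{root}/{bucket}/{object_key}"

print(exported_path("test", "file.pdf"))  # minio_buckets/test/file.pdf
```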
On Windows:
./export_data.sh
On Mac/Linux:
bash export_data.sh
This script loads PDFs and Postgres data into the MinIO and Postgres servers.
MinIO: if you are pulling a version of the GitHub repo that contains additional PDFs in the minio_buckets folder, those additional files will be loaded into the MinIO server when this script is run.
Postgres: if you are pulling a version of the GitHub repo that contains different Postgres data in the med_parsing_dump.sql file, that data will be committed to the Postgres server.
On Windows:
./bootstrap.sh
On Mac/Linux:
bash bootstrap.sh
Label Studio (Annotation Tool): http://localhost:8080
For Label Studio you'll have to create your own account.
MinIO (Object Storage): http://localhost:9000
MinIO Login -> username: [minio], password: [minio123]
Postgres Login -> host: localhost, port: 5432, user: postgres, password: postgres, database: med_parsing
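If you write a script that connects to this database (for example with psycopg2), the credentials above translate into a libpq-style connection string. A minimal sketch, assuming those default credentials; the helper function is hypothetical and not part of the provided files:

```python
def postgres_dsn(host="localhost", port=5432, user="postgres",
                 password="postgres", dbname="med_parsing") -> str:
    # libpq keyword/value DSN, the form accepted by psycopg2.connect()
    return f"host={host} port={port} user={user} password={password} dbname={dbname}"

print(postgres_dsn())
# host=localhost port=5432 user=postgres password=postgres dbname=med_parsing
```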
Once you are ready to validate your .json file (hopefully containing the fields and values parsed from the PDFs), run the schema_validation.py script.
This project is provided under the Apache 2.0 license.
See NOTICE.txt for details.
The script takes the following arguments:
- instance: [REQUIRED] the file path to the .json file you want to compare to the schema.
- schema: [OPTIONAL, set with the '--schema' flag] the file path to the .json schema that the instance will be compared against. Defaults to schemas/invoice.json.
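Conceptually, the script checks each field of the instance against the schema. A minimal pure-Python sketch of the idea (the real script presumably uses a JSON Schema library, and the field names and types here are simplified assumptions, not the contents of schemas/invoice.json):

```python
# Simplified stand-in for a schema -- field names/types are assumptions.
SCHEMA = {
    "required": ["invoice_id", "patient_id", "date", "total"],
    "types": {"invoice_id": str, "patient_id": str, "date": str, "total": float},
}

def validate(instance: dict, schema: dict = SCHEMA) -> list:
    """Return a list of validation error messages (empty list = valid)."""
    errors = [f"missing field: {k}" for k in schema["required"] if k not in instance]
    for field, expected in schema["types"].items():
        if field in instance and not isinstance(instance[field], expected):
            errors.append(f"{field}: expected {expected.__name__}")
    return errors

good = {"invoice_id": "INV-1", "patient_id": "P-9", "date": "2024-01-31", "total": 99.5}
print(validate(good))                     # []
print(validate({"invoice_id": "INV-1"}))  # three 'missing field' errors
```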
This repo includes a lightweight Great Expectations (GX) suite to sanity-check parsed JSON.
(Note: This project uses the new GX folder layout: gx/.)
- Presence: invoice_id, patient_id, date, total
- Types/Format: invoice_id → string, patient_id → string, date → YYYY-MM-DD (regex), total → float
- Checkpoint: invoices_checkpoint runs the suite and writes a summary to validation_logs/.
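The expectations above can be expressed in a few lines of plain Python. This is a sketch of what the suite checks on each row, not the GX suite itself:

```python
import re

DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")  # YYYY-MM-DD

def check_invoice(row: dict) -> bool:
    # Presence, then type/format checks mirroring the expectations above.
    return (
        all(k in row for k in ("invoice_id", "patient_id", "date", "total"))
        and isinstance(row["invoice_id"], str)
        and isinstance(row["patient_id"], str)
        and isinstance(row["total"], float)
        and bool(DATE_RE.match(row["date"]))
    )
```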
- Make sure the Docker services are running and dependencies are installed:
docker compose up -d --build
docker compose run --rm helper bash -lc "pip3 install --break-system-packages -r requirements.txt"
- Run the validation:
docker compose run --rm helper bash -lc "python3 validate_invoices.py"
- To see a failure, drop a required field from the dummy data and re-run:
docker compose run --rm helper bash -lc '
python3 - <<PY
import json
p = "data/dummy_invoices.json"
d = json.load(open(p))
d[0].pop("total", None)  # remove required field
json.dump(d, open(p, "w"), indent=2)
PY
python3 validate_invoices.py
'