Blood test PDF parser (pdfplumber) #69

@EduardPetraeus

Problem

Blood test results from GetTested (and other labs) are delivered only as PDFs, so there is no structured data to query or track over time.

Proposed Solution

  1. PDF parser using pdfplumber (better than tabula for structured tables)
  2. Extract: biomarker name, value, unit, reference range, date
  3. Normalize biomarker names to standard vocabulary
  4. Output: parquet files in blood_tests/raw/ with date partitioning
  5. Silver merge: merge_blood_test_results.sql
  6. Support multiple lab formats (GetTested, Sundhed.dk)
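
A minimal sketch of steps 1 to 3, assuming the lab PDF lays results out as a table whose first four columns are name, value, unit, and reference range. That column order, the `BIOMARKER_ALIASES` map, the `BloodTestResult` record, and all function names are hypothetical and would need adjusting per lab format. Only `pdfplumber.open`, `pdf.pages`, and `page.extract_tables()` are real pdfplumber calls.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import Iterator, Optional

# Hypothetical alias map; a real vocabulary would be far larger.
BIOMARKER_ALIASES = {
    "hba1c": "hemoglobin_a1c",
    "hemoglobin a1c": "hemoglobin_a1c",
    "vitamin d": "vitamin_d_25_oh",
    "25-oh vitamin d": "vitamin_d_25_oh",
}

# Matches ranges such as "20 - 42" or "3,5-5,0" (decimal comma or point).
RANGE_RE = re.compile(r"^\s*(\d+(?:[.,]\d+)?)\s*[-–]\s*(\d+(?:[.,]\d+)?)\s*$")


@dataclass
class BloodTestResult:
    biomarker: str
    value: float
    unit: str
    ref_low: Optional[float]
    ref_high: Optional[float]
    test_date: date


def to_float(text: str) -> float:
    """Parse a number that may use a decimal comma."""
    return float(text.strip().replace(",", "."))


def normalize_biomarker(raw_name: str) -> str:
    """Map a lab-specific name to a canonical one; fall back to snake_case."""
    key = " ".join(raw_name.lower().split())
    return BIOMARKER_ALIASES.get(key, key.replace(" ", "_"))


def parse_row(row, test_date: date) -> Optional[BloodTestResult]:
    """Turn one table row into a record; return None for headers and blanks."""
    if row is None or len(row) < 4 or any(cell is None for cell in row[:4]):
        return None
    name, value, unit, ref = row[:4]
    try:
        numeric = to_float(value)
    except ValueError:
        return None  # non-numeric value cell, e.g. a header row
    match = RANGE_RE.match(ref)
    low = to_float(match.group(1)) if match else None
    high = to_float(match.group(2)) if match else None
    return BloodTestResult(
        normalize_biomarker(name), numeric, unit.strip(), low, high, test_date
    )


def extract_results(pdf_path: str, test_date: date) -> Iterator[BloodTestResult]:
    """Yield parsed results from every table on every page of the PDF."""
    import pdfplumber  # third-party; imported lazily so the helpers work without it

    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for table in page.extract_tables():
                for row in table:
                    result = parse_row(row, test_date)
                    if result is not None:
                        yield result
```

One-sided ranges such as `<300` are left as `None` bounds here. Writing the records to parquet (step 4) and extracting the test date from the PDF itself are not covered by this sketch.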

Acceptance Criteria

  • Parse GetTested PDF to structured parquet
  • Biomarker names normalized
  • Reference ranges extracted
  • Silver table queryable with historical trends
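
As a rough illustration of what "queryable with historical trends" could mean, the sketch below groups results by biomarker, orders them by date, and reports the change since the previous measurement. In the actual pipeline this would presumably be a query against the silver table; the `biomarker_trends` function and its tuple layout are hypothetical.

```python
from collections import defaultdict
from datetime import date


def biomarker_trends(results):
    """Group (biomarker, test_date, value) tuples into per-biomarker time series.

    Returns a dict mapping each biomarker to a date-ordered list of
    (test_date, value, change_since_previous) tuples; the first entry
    of each series has a change of None.
    """
    series = defaultdict(list)
    for biomarker, test_date, value in results:
        series[biomarker].append((test_date, value))

    trends = {}
    for biomarker, points in series.items():
        points.sort()  # chronological order
        trends[biomarker] = [
            (d, v, None if i == 0 else v - points[i - 1][1])
            for i, (d, v) in enumerate(points)
        ]
    return trends
```

Passing normalized biomarker names matters here: without normalization, the same marker reported under two lab-specific names would end up in two separate series.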
