fix(biospecimen): rebuild ICD-10 and ICD-O-3 enums from canonical sources #186

Open
adamjtaylor wants to merge 4 commits into main from 185-icd10-and-icdo3-codes-missing

Conversation

Collaborator

@adamjtaylor adamjtaylor commented May 11, 2026

Closes #185.

Summary

  • ICD-10: previously parsed from the CDC alphabetical index (lay-term keyword lookup) → only ~20k codes with descriptions like "Eberth's disease" for A01.00, and many cancer codes including C78.7 and all of C77.x missing.
  • ICD-O-3: previously sourced from a hematologic-only subset → only 184 codes, the entire 8xxx solid-tumor block absent, including 8140/3 Adenocarcinoma NOS.

Both enums are regenerated from canonical sources via two new end-to-end builder scripts (committed in 4b954e3):

| Enum | Before | After | Source |
|---|---|---|---|
| Icd10DiseaseEnum | 20,347 | 98,186 | CDC FY2026 icd10cm-order-2026.txt |
| IcdO3MorphologyEnum | 184 | 1,211 | NAACCR Histology3_v26_20250903.xlsx |

The 98,186 ICD-10 figure matches caDSR CDE 11479873 and Kristen's "~98K" reference. Full code sets are included (no subsetting) — both C77.x and C78.7 are restored on the ICD-10 side, and the full 8xxx block is restored on the ICD-O-3 side.

Scripts

  • scripts/build_icd10_enum.py — fetches the CDC FY2026 zip, extracts the fixed-width order file, restores the decimal (C787 → C78.7), and writes the enum YAML. --billable-only excludes the ~23k category-header rows. Accepts --order-file for offline reproducibility.
  • scripts/build_icdo3_enum.py — fetches NAACCR Histology3 v26 xlsx, takes one row per code using the Preferred=True label, writes the enum YAML. --behaviors 2,3 filters to in-situ + malignant. Accepts --histology-xlsx for offline reproducibility.

Both run with zero args (download + write to the canonical path under modules/Biospecimen/domains/).
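
The order-file parsing and decimal restoration described above could look roughly like the sketch below. This is not the code in scripts/build_icd10_enum.py. The column offsets are an assumption about the CDC fixed-width layout, and `Icd10Row`, `restore_decimal`, `parse_order_line`, and `parse_order_file` are hypothetical names used only for illustration.

```python
from typing import Iterable, Iterator, NamedTuple


class Icd10Row(NamedTuple):
    code: str         # dotted form, e.g. "C78.7"
    billable: bool    # False for category-header rows
    description: str  # long description taken from the order file


def restore_decimal(raw: str) -> str:
    """Insert the decimal point after the third character (C787 -> C78.7)."""
    raw = raw.strip()
    return raw if len(raw) <= 3 else f"{raw[:3]}.{raw[3:]}"


def parse_order_line(line: str) -> Icd10Row:
    # Assumed layout (0-based offsets): 0-4 order number, 6-12 code,
    # 14 billable flag ("1" = billable, "0" = header),
    # 16-75 short description, 77+ long description.
    return Icd10Row(
        code=restore_decimal(line[6:13]),
        billable=line[14] == "1",
        description=line[77:].strip(),
    )


def parse_order_file(
    lines: Iterable[str], billable_only: bool = False
) -> Iterator[Icd10Row]:
    """Yield one row per non-empty line, optionally dropping header rows."""
    for line in lines:
        if not line.strip():
            continue
        row = parse_order_line(line)
        if billable_only and not row.billable:
            continue
        yield row
```

Calling `parse_order_file(lines, billable_only=True)` mirrors what the --billable-only flag is described as doing: rows whose flag column is "0" are skipped.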

Test plan

  • make modules-test — existing biospecimen tests pin only enum names + slot ranges (no specific code values), so the expansion should be non-breaking.
  • Spot-check C78.7 and 8140/3 resolve in the regenerated YAML.
  • Confirm that downstream JSON schema regeneration (separate PR) still succeeds with the larger permissible-value set.
  • Verify behavior against real WASHU data that triggered the report.

🤖 Generated with Claude Code

adamjtaylor and others added 3 commits May 11, 2026 20:46
Adds two scripts that re-extract the ICD-10-CM and ICD-O-3 morphology
permissible values from canonical sources, replacing the incomplete
enums flagged in #185 (C78.7 missing from ICD-10, 8140/3 missing from
ICD-O-3).

- scripts/build_icd10_enum.py: fetches the CDC FY2026 ICD-10-CM
  Code Descriptions zip, parses the fixed-width order file
  (98,186 rows), restores decimals, and writes the enum YAML.
  --billable-only excludes ~23k category headers.

- scripts/build_icdo3_enum.py: fetches the NAACCR Histology3 v26
  xlsx (2025-09-03 release), extracts one row per code using the
  Preferred label (1,211 codes), and writes the enum YAML.
  --behaviors filters to specific behavior digits (e.g. 2,3 for
  in-situ + malignant).

Both scripts run with no args (download + write to default path) or
accept a local-file flag for offline reproducibility.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…sources

Closes #185.

The previous enums were extracted from incomplete sources, omitting codes
that real HTAN data needs:
- ICD-10: parsed from the ICD-10-CM *alphabetical index*, capturing only
  ~20k of the ~98k valid codes. C78.7 and all of C77 were missing.
- ICD-O-3: only the 9xxx hematologic block (184 codes); the entire 8xxx
  solid-tumor block was absent, including 8140/3 Adenocarcinoma NOS.

Regenerated via scripts/build_icd10_enum.py and
scripts/build_icdo3_enum.py:
- Icd10DiseaseEnum:     20,347 → 98,186 codes (CDC FY2026 order file)
- IcdO3MorphologyEnum:     184 →  1,211 codes (NAACCR Histology3 v26)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@adamjtaylor adamjtaylor linked an issue May 11, 2026 that may be closed by this pull request
@adamjtaylor adamjtaylor self-assigned this May 11, 2026
Contributor

@github-actions github-actions Bot left a comment

HTAN Schema Review — ✅ Approved

Automated review by htan-claude · commit c8c1a26cccbc112393d24bb3a1f1d4c4c8aaf0e8

The enum rebuilds are solid and well-motivated — a few biological/governance questions need team decisions before merge, but no structural breakage.

Warning

2 warnings to address (no blocking issues)

Files Changed

  • modules/Biospecimen/domains/icd10_disease_enum.yaml
  • modules/Biospecimen/domains/icdo3_morphology_enum.yaml

Checklist Results

| Check | Result | Notes |
|---|---|---|
| Inheritance correctness | N/A | No classes or slots changed |
| Inlining of nested objects | N/A | No classes or slots changed |
| Slot completeness (range, title, description) | PASS | Every permissible value carries a non-empty description; no slots modified |
| Enum integrity (alphabetical, descriptions) | PASS | Lexicographic ordering correct; old malformed wildcard keys ("A40-", "A40.-", etc.) correctly removed; no duplicates introduced |
| Generated artifacts | N/A | JSON schema regeneration deferred to separate PR per PR description |

Findings

Warnings

  • icd10_disease_enum.yaml — Non-billable category-header codes included by default (biological + coverage)
    The rebuilt enum includes ~23k ICD-10-CM category-header rows (e.g., "A00" → "Cholera", "A01" → "Typhoid and paratyphoid fevers") that are not valid for clinical coding on patient records. The builder script provides a --billable-only flag to exclude these rows, but the default zero-arg run retains them. A submitter selecting "A00" instead of "A00.0"–"A00.9" will pass HTAN validation but fail payer/registry adjudication in downstream systems. The team should make an explicit data-governance decision: if only billable codes are intended, re-run with --billable-only and document the choice in the schema description field; if non-billable headers are intentionally included (e.g., for research flexibility), document that decision equally clearly.

  • icdo3_morphology_enum.yaml — Behavior-digit /6 (metastatic) codes present in primary-tumor morphology enum (biological)
    The rebuilt enum includes codes such as "8000/6" ("Neoplasm, metastatic"), "8010/6" ("Carcinoma, metastatic, NOS"), "8140/6" ("Adenocarcinoma, metastatic, NOS"), and "8480/6" ("Pseudomyxoma peritonei"), which carry ICD-O-3 behavior digit /6 (secondary/metastatic site). SEER and NAACCR registries prohibit /6 codes as the primary histology code for a tumor record — they are only valid in secondary-site fields. If this enum drives primary tumor morphology coding (the expected HTAN biospecimen use case), retaining /6 values creates a path for submitters to erroneously classify a primary tumor as metastatic. Consider either excluding /6 codes entirely from the enum, or adding an explicit description-level warning on each /6 entry stating it is invalid for primary tumor classification.
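
Excluding /6 codes amounts to filtering on the digit after the slash. The sketch below illustrates that filter on a plain code-to-label mapping. `behavior_digit` and `filter_by_behavior` are hypothetical helpers, not functions from scripts/build_icdo3_enum.py, and how the script implements --behaviors is not shown in this thread.

```python
from typing import Dict, Iterable


def behavior_digit(code: str) -> str:
    """Return the behavior digit of a morphology code such as '8140/3'."""
    histology, sep, behavior = code.partition("/")
    if not sep or len(histology) != 4 or len(behavior) != 1:
        raise ValueError(f"not an ICD-O-3 morphology code: {code!r}")
    return behavior


def filter_by_behavior(
    labels: Dict[str, str], keep: Iterable[str]
) -> Dict[str, str]:
    """Keep only the codes whose behavior digit is in `keep`."""
    allowed = set(keep)
    return {
        code: label
        for code, label in labels.items()
        if behavior_digit(code) in allowed
    }


# Labels taken from the codes quoted in this PR thread.
sample = {
    "8000/6": "Neoplasm, metastatic",
    "8140/3": "Adenocarcinoma NOS",
    "8140/6": "Adenocarcinoma, metastatic, NOS",
}
primary_only = filter_by_behavior(sample, {"0", "1", "2", "3"})
```

With the sample above, `primary_only` retains only "8140/3", since the other two entries carry behavior digit 6.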

Informational

  • icdo3_morphology_enum.yaml — NAACCR registry-guidance annotations embedded in description strings (biological)
    Several descriptions carry inline NAACCR site-restriction and year-range guidance (e.g., "Warty carcinoma [2018+ CASES FOR C60._, USE 8054/3. ALL OTHER SITES USE 8051/3 FOR ALL YEARS]"). These improve submitter clarity but are sourced verbatim from the NAACCR spreadsheet and will become stale when the source is updated. No structural issue — just be aware these annotations require maintenance on each rebuild.

  • icd10_disease_enum.yaml — Schema version bump may be warranted (coverage)
    The ICD-10 enum grows ~5× (20,347 → 98,186 values) and existing descriptions change (e.g., A01.00 shifts from "Eberth's disease" to "Typhoid fever, unspecified"). If the biospecimen schema header carries a version: field, bumping it would help downstream consumers track the magnitude of this update. No action required if the project doesn't track schema versions at that level.

  • Both files — No tests assert newly restored codes are valid enum values (coverage)
    Existing biospecimen tests pin only enum names and slot ranges, so the expansion is non-breaking by default. However, there are no tests confirming that newly restored codes like "C78.7", "C77.0", and "8140/3" resolve as valid, nor that a clearly invalid code raises a validation error. Adding a small spot-check test set would prevent silent regression if the builder scripts are re-run with different flags in the future. The PR's own test plan calls this out — this is a reminder to close that loop.
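
A spot-check along these lines might be as small as the sketch below. It assumes the enum YAML has already been loaded into a dict (for example with `yaml.safe_load`) and follows the usual LinkML layout of `enums` → enum name → `permissible_values`. `missing_codes` is a hypothetical helper, and the inline `schema` dict stands in for the real file.

```python
from typing import Iterable, List, Mapping


def missing_codes(
    schema: Mapping, enum_name: str, required: Iterable[str]
) -> List[str]:
    """Return the required codes that are absent from a LinkML enum."""
    values = schema["enums"][enum_name]["permissible_values"]
    return [code for code in required if code not in values]


# Stand-in for the parsed icd10_disease_enum.yaml (assumed structure).
schema = {
    "enums": {
        "Icd10DiseaseEnum": {
            "permissible_values": {
                "C77.0": {"description": "placeholder"},
                "C78.7": {"description": "placeholder"},
            }
        }
    }
}

# Restored codes must be present; an obviously invalid code must not be.
restored_gaps = missing_codes(schema, "Icd10DiseaseEnum", ["C77.0", "C78.7"])
invalid_gaps = missing_codes(schema, "Icd10DiseaseEnum", ["ZZZ.9"])
```

In a real test, `restored_gaps` would be asserted empty and `invalid_gaps` asserted non-empty, with the same pattern repeated for "8140/3" against IcdO3MorphologyEnum.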

Verdict

APPROVE


Rules defined in CLAUDE.md · To update review rules, edit CLAUDE.md and open a PR to main

Collaborator Author

adamjtaylor commented May 11, 2026

@aditigopalan @kristenanton — two questions for your call:

  1. Do we want the full 98k codes (current PR), or --billable-only (~74k, no headers)?
  2. Is ICD_10_DISEASE_CODE intended only for precancerous lesions, or for any disease diagnosis associated with the biospecimen? Description says the former; real data implies the latter.

Auto-generated Python classes from LinkML schema updates.

**Auto-generated by GitHub Actions workflow**
[skip ci]
@github-actions
Contributor

Python classes auto-updated!

The Python classes have been automatically regenerated and committed to this PR branch.

**Updated files:**
```
modules/Biospecimen/src/htan_biospecimen/datamodel/biospecimen.py

modules/Clinical/src/htan_clinical/datamodel/clinical.py
modules/DigitalPathology/src/htan_digitalpathology/datamodel/digital_pathology.py
modules/Imaging/src/htan_imaging/datamodel/imaging.py
modules/MultiplexMicroscopy/src/htan_multiplexmicroscopy/datamodel/multiplex_microscopy.py
modules/Sequencing/src/htan_sequencing/datamodel/sequencing.py
modules/SpatialOmics/src/htan_spatial/datamodel/spatial.py
modules/WES/src/htan_wes/datamodel/wes.py
modules/scRNA-seq/src/htan_scrna_seq/datamodel/scrna_seq.py
```

The generated classes are now up to date with the schema changes.

@aditigopalan
Collaborator

HTAN Schema Review — ✅ Approved

Automated review by htan-claude · commit 2c5e1a58a0c88e718be56224bca010deb9bf9b4f

Warning

2 warnings carried forward (no blocking issues)

The previous automated review (on c8c1a26) covered the core YAML changes thoroughly. This review covers the full PR including the chore: auto-generate commit that was added afterward.

Files Changed

| File | Type | Notes |
|---|---|---|
| modules/Biospecimen/domains/icd10_disease_enum.yaml | Source YAML | 20,347 → 98,186 codes; CDC FY2026 canonical source |
| modules/Biospecimen/domains/icdo3_morphology_enum.yaml | Source YAML | 184 → 1,211 codes; NAACCR Histology3 v26 source |
| scripts/build_icd10_enum.py | New build script | End-to-end CDC order-file fetcher |
| scripts/build_icdo3_enum.py | New build script | End-to-end NAACCR xlsx fetcher |
| modules/Biospecimen/src/.../biospecimen.py | Generated (auto) | Updated by GitHub Actions chore commit |
| modules/*/src/.../ (8 other modules) | Generated (auto) | Timestamp-only bump (1 line each) |

Checklist Results

| Check | Result | Notes |
|---|---|---|
| Inheritance correctness | N/A | No class hierarchy changes |
| Inlining of nested objects | N/A | No new class-ranged slots |
| Slot completeness (range, title, description) | PASS | No slots added or modified |
| Enum integrity — descriptions | PASS | All 98,186 ICD-10 and all 1,211 ICD-O-3 permissible values carry a non-empty description |
| Enum integrity — ordering | PASS | Both enums are sorted lexicographically (sorted(set(rows), key=lambda kv: kv[0])); no duplicates |
| Enum integrity — key codes restored | PASS | C77.0–C77.9, C78.7, and 8140/3 all confirmed present with correct descriptions |
| Generated artifacts | INFO | Python dataclasses updated in-PR via GitHub Actions chore commit; JSON schemas correctly deferred to a downstream PR (consistent with CLAUDE.md) |
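
The ordering expression quoted in the checklist removes exact duplicate (code, description) pairs and sorts by code. It would not, on its own, collapse a code that appears with two different descriptions. The sketch below shows both behaviors. `sorted_unique` and `conflicting_codes` are hypothetical helpers, and whether the builder scripts guard against conflicting labels elsewhere is not visible from this thread.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Row = Tuple[str, str]  # (code, description)


def sorted_unique(rows: Iterable[Row]) -> List[Row]:
    """Drop exact duplicate pairs, then sort lexicographically by code."""
    return sorted(set(rows), key=lambda kv: kv[0])


def conflicting_codes(rows: Iterable[Row]) -> Dict[str, Set[str]]:
    """Return codes that occur with more than one distinct description."""
    seen: Dict[str, Set[str]] = defaultdict(set)
    for code, description in rows:
        seen[code].add(description)
    return {code: found for code, found in seen.items() if len(found) > 1}


rows = [
    ("A01", "Typhoid and paratyphoid fevers"),
    ("A00", "Cholera"),
    ("A01", "Typhoid and paratyphoid fevers"),  # exact duplicate, removed
]
ordered = sorted_unique(rows)
```

Running `conflicting_codes` before writing the YAML is one cheap way to confirm that "no duplicates" also holds at the key level.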

Findings

Warnings (carried from prior review — unresolved at time of this review)

Warning

icd10_disease_enum.yaml — Non-billable category-header codes included by default

The enum includes ~23k ICD-10-CM category headers (e.g., "A00" → "Cholera", "A01" → "Typhoid and paratyphoid fevers") that are not valid for clinical coding. The builder provides --billable-only to exclude them, but the committed output retains them. A submitter selecting "A00" instead of "A00.9" passes HTAN validation but fails payer/registry adjudication. The team should make an explicit data-governance decision and document it in the enum description field — either re-run with --billable-only or explicitly state that header codes are included by design.

Warning

icdo3_morphology_enum.yaml — Behavior /6 (metastatic) codes present in primary-tumor morphology enum

Codes such as "8000/6", "8010/6", "8140/6" carry ICD-O-3 behavior digit /6 (secondary/metastatic). SEER and NAACCR prohibit /6 as the primary histology code for a tumor record. If this enum drives primary tumor morphology coding (the expected HTAN biospecimen use case), retaining /6 values creates a path for submitters to erroneously classify a primary tumor as metastatic. Consider filtering them out with --behaviors 0,1,2,3 or adding an explicit per-entry description warning on each /6 code.

Informational

Note

Generated Python dataclasses included alongside YAML changes

The chore: auto-generate Python classes from schema changes commit (GitHub Actions) adds the regenerated biospecimen.py (372k additions) and timestamps 8 other module .py files. CLAUDE.md calls for generated artifacts to go in a separate downstream PR; however, the rule targets missing or stale artifacts — the artifacts here are present and current, so this is not a blocking issue. JSON schemas are correctly deferred. Worth being aware of for future PRs to keep YAML and generated-artifact changes cleanly separated.

Note

No spot-check tests for newly restored codes

Existing tests pin only enum names and slot range references, so the expansion is non-breaking. There are no tests asserting that "C78.7", "C77.0", or "8140/3" resolve as valid enum values, nor that a clearly invalid code raises an error. The PR test plan flags this — adding a small spot-check test would prevent silent regression if the builder scripts are re-run with different flags. Not blocking, but closing this loop would be good hygiene.

Note

NAACCR registry-guidance annotations in description strings

Several ICD-O-3 descriptions carry inline NAACCR site-restriction and year-range guidance verbatim from the source spreadsheet (e.g., "Warty carcinoma [2018+ CASES FOR C60._, USE 8054/3. ALL OTHER SITES USE 8051/3 FOR ALL YEARS]"). These improve submitter clarity but will become stale on future rebuilds. No structural issue — just note for maintenance.

Verdict

APPROVE


Rules defined in CLAUDE.md · To update review rules, edit CLAUDE.md and open a PR to main

Development

Successfully merging this pull request may close these issues.

ICD10 and ICDO3 Codes Missing

3 participants