fix(biospecimen): rebuild ICD-10 and ICD-O-3 enums from canonical sources #186
adamjtaylor wants to merge 4 commits into
Adds two scripts that re-extract the ICD-10-CM and ICD-O-3 morphology permissible values from canonical sources, replacing the incomplete enums flagged in #185 (C78.7 missing from ICD-10, 8140/3 missing from ICD-O-3).

- scripts/build_icd10_enum.py: fetches the CDC FY2026 ICD-10-CM Code Descriptions zip, parses the fixed-width order file (98,186 rows), restores decimals, and writes the enum YAML. --billable-only excludes ~23k category headers.
- scripts/build_icdo3_enum.py: fetches the NAACCR Histology3 v26 xlsx (2025-09-03 release), extracts one row per code using the Preferred label (1,211 codes), and writes the enum YAML. --behaviors filters to specific behavior digits (e.g. 2,3 for in-situ + malignant).

Both scripts run with no args (download + write to default path) or accept a local-file flag for offline reproducibility.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…sources

Closes #185.

The previous enums were extracted from incomplete sources, omitting codes that real HTAN data needs:

- ICD-10: parsed from the ICD-10-CM *alphabetical index*, capturing only ~20k of the ~98k valid codes. C78.7 and all of C77 were missing.
- ICD-O-3: only the 9xxx hematologic block (184 codes); the entire 8xxx solid-tumor block was absent, including 8140/3 Adenocarcinoma NOS.

Regenerated via scripts/build_icd10_enum.py and scripts/build_icdo3_enum.py:

- Icd10DiseaseEnum: 20,347 → 98,186 codes (CDC FY2026 order file)
- IcdO3MorphologyEnum: 184 → 1,211 codes (NAACCR Histology3 v26)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
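The decimal-restoration step described above can be sketched as follows. This is a minimal illustration, not the code in scripts/build_icd10_enum.py: the function names are hypothetical, and the column offsets (code in columns 7-13, billable flag in column 15, long description from column 78) are an assumption about the CDC order-file layout that should be checked against the file's own documentation.

```python
def restore_decimal(raw_code: str) -> str:
    """Insert the decimal point after the third character (C787 -> C78.7).

    Three-character category codes such as "A00" are returned unchanged.
    """
    code = raw_code.strip()
    return code if len(code) <= 3 else f"{code[:3]}.{code[3:]}"


def parse_order_line(line: str) -> dict:
    """Parse one fixed-width row of the order file.

    Assumed layout (0-based slices): code in [6:13], billable flag at [14],
    long description from [77] onward. Verify before relying on it.
    """
    return {
        "code": restore_decimal(line[6:13]),
        "billable": line[14] == "1",
        "description": line[77:].strip(),
    }


# Illustrative row, padded to match the assumed column layout.
sample = (
    "00001 "
    + "C787".ljust(7)
    + " 1 "
    + "Secondary malig neoplasm of liver".ljust(60)
    + " "
    + "Secondary malignant neoplasm of liver and intrahepatic bile duct"
)
row = parse_order_line(sample)
```

Under these assumptions, `row["code"]` is `"C78.7"` and `row["billable"]` is `True`; a category header such as `A00` would keep its three-character form.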
HTAN Schema Review — ✅ Approved
Automated review by htan-claude · commit c8c1a26cccbc112393d24bb3a1f1d4c4c8aaf0e8
The enum rebuilds are solid and well-motivated — a few biological/governance questions need team decisions before merge, but no structural breakage.
Warning
2 warnings to address (no blocking issues)
Files Changed
- modules/Biospecimen/domains/icd10_disease_enum.yaml
- modules/Biospecimen/domains/icdo3_morphology_enum.yaml
Checklist Results
| Check | Result | Notes |
|---|---|---|
| Inheritance correctness | N/A | No classes or slots changed |
| Inlining of nested objects | N/A | No classes or slots changed |
| Slot completeness (range, title, description) | PASS | Every permissible value carries a non-empty description; no slots modified |
| Enum integrity (alphabetical, descriptions) | PASS | Lexicographic ordering correct; old malformed wildcard keys ("A40-", "A40.-", etc.) correctly removed; no duplicates introduced |
| Generated artifacts | N/A | JSON schema regeneration deferred to separate PR per PR description |
Findings
Warnings
- icd10_disease_enum.yaml: Non-billable category-header codes included by default (biological + coverage)
  The rebuilt enum includes ~23k ICD-10-CM category-header rows (e.g., "A00" → "Cholera", "A01" → "Typhoid and paratyphoid fevers") that are not valid for clinical coding on patient records. The builder script provides a --billable-only flag to exclude these rows, but the default zero-arg run retains them. A submitter selecting "A00" instead of "A00.0"–"A00.9" will pass HTAN validation but fail payer/registry adjudication in downstream systems. The team should make an explicit data-governance decision: if only billable codes are intended, re-run with --billable-only and document the choice in the schema description field; if non-billable headers are intentionally included (e.g., for research flexibility), document that decision equally clearly.
- icdo3_morphology_enum.yaml: Behavior-digit /6 (metastatic) codes present in primary-tumor morphology enum (biological)
  The rebuilt enum includes codes such as "8000/6" ("Neoplasm, metastatic"), "8010/6" ("Carcinoma, metastatic, NOS"), "8140/6" ("Adenocarcinoma, metastatic, NOS"), and "8480/6" ("Pseudomyxoma peritonei"), which carry ICD-O-3 behavior digit /6 (secondary/metastatic site). SEER and NAACCR registries prohibit /6 codes as the primary histology code for a tumor record; they are only valid in secondary-site fields. If this enum drives primary tumor morphology coding (the expected HTAN biospecimen use case), retaining /6 values creates a path for submitters to erroneously classify a primary tumor as metastatic. Consider either excluding /6 codes entirely from the enum, or adding an explicit description-level warning on each /6 entry stating it is invalid for primary tumor classification.
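As a rough illustration of the /6 exclusion suggested above, the following sketch filters an ICD-O-3 code mapping by behavior digit. It is not the builder's actual --behaviors implementation; `behavior_digit` and `filter_behaviors` are hypothetical names, and the input is assumed to be a plain dict of code to description.

```python
def behavior_digit(code: str) -> str:
    """Return the behavior digit, i.e. the part after the slash ("8140/3" -> "3")."""
    return code.rsplit("/", 1)[1]


def filter_behaviors(codes: dict, allowed: set) -> dict:
    """Keep only the codes whose behavior digit is in `allowed`."""
    return {
        code: label
        for code, label in codes.items()
        if behavior_digit(code) in allowed
    }


# A few entries in the shape of the rebuilt enum (code -> description).
sample = {
    "8000/6": "Neoplasm, metastatic",
    "8140/2": "Adenocarcinoma in situ, NOS",
    "8140/3": "Adenocarcinoma, NOS",
    "8140/6": "Adenocarcinoma, metastatic, NOS",
}

# Keep in-situ (/2) and malignant (/3) codes, dropping the metastatic /6 entries.
primary_only = filter_behaviors(sample, {"2", "3"})
```

Filtering at build time keeps the enum itself free of /6 values; the alternative raised in the review, annotating each /6 description, would leave the codes selectable.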
Informational
- icdo3_morphology_enum.yaml: NAACCR registry-guidance annotations embedded in description strings (biological)
  Several descriptions carry inline NAACCR site-restriction and year-range guidance (e.g., "Warty carcinoma [2018+ CASES FOR C60._, USE 8054/3. ALL OTHER SITES USE 8051/3 FOR ALL YEARS]"). These improve submitter clarity but are sourced verbatim from the NAACCR spreadsheet and will become stale when the source is updated. No structural issue; just be aware these annotations require maintenance on each rebuild.
- icd10_disease_enum.yaml: Schema version bump may be warranted (coverage)
  The ICD-10 enum grows ~5× (20,347 → 98,186 values) and existing descriptions change (e.g., A01.00 shifts from "Eberth's disease" to "Typhoid fever, unspecified"). If the biospecimen schema header carries a version: field, bumping it would help downstream consumers track the magnitude of this update. No action required if the project doesn't track schema versions at that level.
- Both files: No tests assert newly restored codes are valid enum values (coverage)
  Existing biospecimen tests pin only enum names and slot ranges, so the expansion is non-breaking by default. However, there are no tests confirming that newly restored codes like "C78.7", "C77.0", and "8140/3" resolve as valid, nor that a clearly invalid code raises a validation error. Adding a small spot-check test set would prevent silent regression if the builder scripts are re-run with different flags in the future. The PR's own test plan calls this out; this is a reminder to close that loop.
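A spot-check of the kind described in the last finding could look roughly like this. It is a sketch only: `check_enum` is a hypothetical helper, and the permissible values are passed in as a plain dict. In a real test they would first be loaded from the regenerated YAML, whose exact key layout is not shown here.

```python
def check_enum(permissible_values, must_include, must_exclude):
    """Return (missing, unexpected) codes for a spot-check.

    missing:    required codes absent from the enum
    unexpected: codes that should be rejected but are present
    """
    missing = sorted(c for c in must_include if c not in permissible_values)
    unexpected = sorted(c for c in must_exclude if c in permissible_values)
    return missing, unexpected


# Stand-in for the loaded enum; descriptions are omitted in this sketch.
icd10_values = {"C77.0": {}, "C78.7": {}, "A01.00": {}}

missing, unexpected = check_enum(
    icd10_values,
    must_include=["C78.7", "C77.0"],  # codes restored by this PR
    must_exclude=["NOT-A-CODE"],      # clearly invalid value
)
assert not missing and not unexpected
```

The same helper would apply to the ICD-O-3 enum with "8140/3" in the required list. Returning both lists, rather than a boolean, lets a failing test report exactly which codes regressed.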
Verdict
APPROVE
Rules defined in CLAUDE.md · To update review rules, edit CLAUDE.md and open a PR to main
@aditigopalan @kristenanton — two questions for your call:
Auto-generated Python classes from LinkML schema updates. **Auto-generated by GitHub Actions workflow** [skip ci]
✅ Python classes auto-updated! modules/Clinical/src/htan_clinical/datamodel/clinical.py
HTAN Schema Review — ✅ Approved
Warning
2 warnings carried forward (no blocking issues)
The previous automated review (on …
Files Changed
Checklist Results
Findings
Warnings (carried from prior review, unresolved at time of this review)
Warning
The enum includes ~23k ICD-10-CM category headers (e.g., …
Warning
Codes such as …
Informational
Note
Generated Python dataclasses included alongside YAML changes
The …
Note
No spot-check tests for newly restored codes
Existing tests pin only enum names and slot …
Note
NAACCR registry-guidance annotations in …
Several ICD-O-3 descriptions carry inline NAACCR site-restriction and year-range guidance verbatim from the source spreadsheet (e.g., …
Verdict
APPROVE
Closes #185.
Summary
Both enums are regenerated from canonical sources via two new end-to-end builder scripts (committed in 4b954e3):

| Enum | Source file |
|---|---|
| Icd10DiseaseEnum | icd10cm-order-2026.txt |
| IcdO3MorphologyEnum | Histology3_v26_20250903.xlsx |

The 98,186 ICD-10 figure matches caDSR CDE 11479873 and Kristen's "~98K" reference. Full code sets are included (no subsetting): both C77.x and C78.7 are restored on the ICD-10 side, and the full 8xxx block is restored on the ICD-O-3 side.

Scripts
- scripts/build_icd10_enum.py: fetches the CDC FY2026 zip, extracts the fixed-width order file, restores the decimal (C787 → C78.7), and writes the enum YAML. --billable-only excludes the ~23k category-header rows. Accepts --order-file for offline reproducibility.
- scripts/build_icdo3_enum.py: fetches NAACCR Histology3 v26 xlsx, takes one row per code using the Preferred=True label, writes the enum YAML. --behaviors 2,3 filters to in-situ + malignant. Accepts --histology-xlsx for offline reproducibility.

Both run with zero args (download + write to the canonical path under modules/Biospecimen/domains/).

Test plan
- make modules-test: existing biospecimen tests pin only enum names + slot ranges (no specific code values), so the expansion should be non-breaking.
- C78.7 and 8140/3 resolve in the regenerated YAML.

🤖 Generated with Claude Code
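The "one row per code using the Preferred label" step for ICD-O-3 can be sketched as below. This is an assumption-laden illustration rather than the logic in scripts/build_icdo3_enum.py: the `(code, label, preferred)` tuple shape and the function name are hypothetical, the synonym row is made up for the example, and reading the xlsx itself is left out.

```python
def preferred_labels(rows):
    """Collapse (code, label, preferred) rows to one label per code.

    Only rows flagged as preferred are kept; if a code somehow has more
    than one preferred row, the first one encountered wins.
    """
    labels = {}
    for code, label, preferred in rows:
        if preferred:
            labels.setdefault(code, label)
    # Sort by code so the emitted enum has a stable, lexicographic order.
    return dict(sorted(labels.items()))


# Hypothetical rows: several labels per code, one of them preferred.
rows = [
    ("8140/6", "Adenocarcinoma, metastatic, NOS", True),
    ("8140/3", "Synonym label (illustrative)", False),
    ("8140/3", "Adenocarcinoma, NOS", True),
]
enum_values = preferred_labels(rows)
```

Sorting the result keeps the lexicographic ordering that the review's enum-integrity check expects.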