Skip to content

Therapy rules flagging wrong fields as required#189

Open
aditigopalan wants to merge 5 commits into
mainfrom
fix/htan-859-therapy-multivalued-pattern
Open

Therapy rules flagging wrong fields as required#189
aditigopalan wants to merge 5 commits into
mainfrom
fix/htan-859-therapy-multivalued-pattern

Conversation

@aditigopalan
Copy link
Copy Markdown
Collaborator

@aditigopalan aditigopalan commented May 18, 2026

Problem

Therapy records were being flagged for THERAPEUTIC_AGENTS, NUMBER_OF_CYCLES, and REGIMEN_OR_LINE_OF_THERAPY even when treatment was Surgical Procedure only. The rules were firing for every record regardless of treatment type.

Why

The YAML used pattern: ".Pharmacotherapy." against TREATMENT_TYPE, which is a multivalued (array) field. JSON Schema's pattern keyword is string-only — when applied to an array, it vacuously passes. So the if clause was always satisfied and the then requirements always fired.

Fix

Two layers:

  1. YAML (therapy.yaml)— switched from pattern to LinkML's has_member + equals_string / equals_string_in, which is the semantically correct way to express "this array contains value X".
  2. Converter (scripts/linkml_to_flat_synapse_jsonschema.py) — added fix_multivalued_member_constraints() post-processor. LinkML's stock JSON Schema generator drops has_member constraints inside rule preconditions (known gap), so this walks the output and rewrites them into contains: {const|enum: ...} using the source schema as a lookup.

Verified

  • 5 new unit tests in (test_linkml_schema_conversion.py), all passing
  • End-to-end tested in Synapse against bound schema HTAN2Organization-Therapy-0.0.999 on folder syn75075405 (test project htan2-testing1). All 13 test cases behave as expected — surgical-only and mixed Surgical+Pharma records now pass without spurious required-field flags.

Notes for reviewers

  • For the new JSON Schema, look at and test syn75075405
  • The Surgical/Radiation rule (rule 1) now explicitly enumerates radiation enum values. The original pattern: ".Radiation." regex matched the same set, but worth a glance to confirm the enumeration matches curator intent.
  • Related but separate: PR Fix fix_boolean_patterns recursion under nested combinators in if (#187) #188 fixes the equivalent bug for boolean slots (SpatialLevel3 PANEL_ID/PANEL_NAME). This PR fixes only the multivalued-array variantProblem
  • Therapy records were being flagged for THERAPEUTIC_AGENTS, NUMBER_OF_CYCLES, and REGIMEN_OR_LINE_OF_THERAPY even when treatment was Surgical Procedure only. The rules were firing for every record regardless of treatment type.

…ntains

Adds fix_multivalued_member_constraints() post-processor that walks
generated JSON Schema rule preconditions and rewrites empty constraint
fragments on multivalued slots into JSON Schema's `contains` keyword
using has_member content from the source LinkML schema.

LinkML's JsonSchemaGenerator processes rule preconditions with
omit_type=True, which skips the array-aware path that would emit
`contains`. Without this fix, has_member constraints on multivalued
slots compile to empty {} fragments, causing if-clauses to vacuously
pass and then-clauses to fire unconditionally.

Related: HTAN-859
Adds five unit tests for fix_multivalued_member_constraints covering
has_member with equals_string, has_member with equals_string_in,
any_of of has_member branches, scalar slots (no-op), and missing
class_name (skip).

Related: HTAN-859
Replaces pattern: ".*X.*" workaround in Therapy rule preconditions
with LinkML's has_member + equals_string / equals_string_in. The
old pattern construct silently passed against array values (pattern
is string-only), making conditional rules fire unconditionally for
every record regardless of treatment type.

Resolves HTAN-859: surgical-only Therapy records were being flagged
for THERAPEUTIC_AGENTS, NUMBER_OF_CYCLES, REGIMEN_OR_LINE_OF_THERAPY.
Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HTAN Schema Review — ✅ Approved

Automated review by htan-claude · commit fcf867f3be576d5972166e562424db1d2955bde5

Solid work on this PR — the core fix is correct and well-tested, with just a couple of warnings about radiation enum coverage that need curator confirmation before merge.

Warning

2 warnings to address (no blocking issues)

Files Changed

  • modules/Clinical/domains/therapy.yaml
  • scripts/linkml_to_flat_synapse_jsonschema.py
  • tests/test_linkml_schema_conversion.py

Checklist Results

Check Result Notes
Inheritance correctness PASS Therapy class inheritance unchanged; no new classes introduced
Inlining of nested objects N/A No new slots with class ranges introduced
Slot completeness (range, title, description) PASS No new slots introduced; existing slots unchanged
Enum integrity (alphabetical, descriptions) PASS No new enums; inline equals_string_in values verified against TreatmentTypeEnum
Generated artifacts N/A Generated files produced in a separate downstream PR per repo convention

Findings

Warnings

  • modules/Clinical/domains/therapy.yaml — Radiation rule equals_string_in list omits several radiation-type enum values (structural + biological + coverage)
    The new equals_string_in list for Rule 1 (anatomic site required) enumerates 8 values but omits "Brachytherapy", "High-Dose Rate Brachytherapy", "Low-Dose Rate Brachytherapy", "Conventional Radiotherapy", "Stereotactic Radiosurgery", and "Radiopharmaceutical" — all present in TreatmentTypeEnum and clinically requiring a directed anatomic site. The old pattern: ".*Radiation.*" regex also missed most of these (e.g. "Brachytherapy", "Stereotactic Radiosurgery"), so some gaps pre-existed, but the explicit enumeration now makes the exclusion look intentional. Similarly, "Concurrent Chemoradiation" involves a radiation component yet is absent from Rule 1, meaning records with that treatment type will not be required to supply THERAPY_ANATOMIC_SITE_UBERON_CODE. The PR description acknowledges the radiation list is "worth a glance" — please confirm the enumeration matches curator intent before merge, and consider adding a test that asserts which TREATMENT_TYPE values trigger the anatomic-site rule to prevent future silent drift.

  • tests/test_linkml_schema_conversion.py — No test for rule-count / if-then-block count mismatch guard (coverage)
    fix_multivalued_member_constraints() contains an explicit defensive branch that prints a warning and returns the schema unchanged when len(rules) != len(if_then_blocks). None of the 5 new tests exercise this path. Since this guard protects against silent partial rewrites, it should have a test that supplies mismatched counts and asserts the schema is returned unmodified.

Informational

  • modules/Clinical/domains/therapy.yaml — Schema version bump may be warranted (coverage)
    The rule precondition rewrite is a behavioural breaking change: records that previously passed spuriously will now correctly fail. If therapy.yaml carries a version: field, incrementing it would help downstream consumers track the change. No version field is visible in the provided content, so this is noted for awareness only — no action required if versioning is managed elsewhere.

Verdict

APPROVE


Rules defined in CLAUDE.md · To update review rules, edit CLAUDE.md and open a PR to main

Auto-generated Python classes from LinkML schema updates.

**Auto-generated by GitHub Actions workflow**
[skip ci]
@github-actions
Copy link
Copy Markdown
Contributor

Python classes auto-updated!

The Python classes have been automatically regenerated and committed to this PR branch.

**Updated files:**
```
modules/Biospecimen/src/htan_biospecimen/datamodel/biospecimen.py

modules/Clinical/src/htan_clinical/datamodel/clinical.py
modules/DigitalPathology/src/htan_digitalpathology/datamodel/digital_pathology.py
modules/Imaging/src/htan_imaging/datamodel/imaging.py
modules/MultiplexMicroscopy/src/htan_multiplexmicroscopy/datamodel/multiplex_microscopy.py
modules/Sequencing/src/htan_sequencing/datamodel/sequencing.py
modules/SpatialOmics/src/htan_spatial/datamodel/spatial.py
modules/WES/src/htan_wes/datamodel/wes.py
modules/scRNA-seq/src/htan_scrna_seq/datamodel/scrna_seq.py
```

The generated classes are now up to date with the schema changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants