feat: synthetic ecosystem generator (constraint tests + demonstrator)#150
adamjtaylor wants to merge 3 commits into
Add three scripts for automated schema verification and synthetic data:

- generate_constraint_tests.py: produces valid.json + one invalid file per constraint (missing required, bad enum, bad pattern, wrong type, conditional violation) for every JSON schema
- validate_synthetic_data.py: validates constraint test suites against schemas
- generate_demonstrator.py: Claude-powered generator producing a coherent joinable dataset across all HTAN schemas (clinical, biospecimen, assays) with deterministic post-processing (enum snapping, type coercion, conditional field filling) and single-pass LLM fix fallback

Also adds CI workflow steps to run these at release time and commits outputs alongside JSON_Schemas.

Closes #146

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add tests for deterministic helpers (_extract_level, _assay_group, _id_preassignments, _summarise_schema) and validate_synthetic_data.py (valid-that-fails, invalid-that-passes, missing-dir paths). 48 tests total.
- Guard demonstrator CI step with secret-presence check so missing ANTHROPIC_API_KEY doesn't block release workflow.
- Tighten anthropic lower bound to ^0.50 (model IDs require >=0.50).
- Regenerate poetry.lock to match pyproject.toml changes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Superseded by updated review on commit eab0486
HTAN Schema Review — ✅ Approved
Automated review by htan-claude · commit eab048641b3fa3494b9dd7a7c5b26bf17ec6905b
I'll review this PR against the defined scope: structural issues, coverage warnings, and dependency notes. This PR adds Python scripts and CI workflow changes — no domains/*.yaml schema files are modified.
Review Summary
🔵 [INFO] No YAML schema changes — generated artifacts checklist not applicable
This PR modifies only scripts/, .github/workflows/, pyproject.toml, and poetry.lock. No domains/*.yaml files are touched. The generated-artifacts checklist item does not apply.
🔵 [INFO] anthropic added as a main (production) dependency
In pyproject.toml / poetry.lock, anthropic is placed in groups = ["main"] rather than a dev/optional group. The demonstrator generator is a release-time tool that requires an API key and is conditionally skipped in CI (if: ${{ secrets.ANTHROPIC_API_KEY != '' }}). Shipping anthropic, faker, and exrex as unconditional production dependencies will bloat the install footprint for anyone who pip installs this package without needing the synthetic data tooling.
Consider moving these to an optional extras group (e.g. [tool.poetry.extras] synthetic = ["anthropic", "faker", "exrex"]) or a dedicated dev/tools dependency group, consistent with how jsonschema is already available in both groups.
🔵 [INFO] poetry.lock regenerated with Poetry 2.3.2 (was 1.6.1)
The lockfile header changed from Poetry 1.6.1 to 2.3.2. This is a significant major-version bump. If CI or other contributors are still on Poetry 1.x this will cause lock-file incompatibility errors. Ensure the team's Poetry version is documented and pinned in CI accordingly.
🟡 [WARNING] No unit tests cover the CI workflow integration paths
The PR test plan confirms 34 unit tests pass for constraint generation and post-processing, but there are no tests covering:
- The validate_synthetic_data.py script's exit-non-zero behaviour when a constraint test fails
- The generate_demonstrator.py post-processor's enum-snapping and type-coercion logic against actual HTAN schema enum values (tests appear to use synthetic/mock schemas rather than the real JSON_Schemas/ fixtures)
Per the review guidelines, for each new script added the following coverage is expected:
- At least one invalid input raises the correct error
- At least one valid input loads without error
The demonstrator script is conditionally skipped when no API key is present, so integration tests are understood to be impractical. However, validate_synthetic_data.py's failure path should have a unit test asserting it exits non-zero on invalid data.
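A minimal sketch of what such a failure-path test could look like. The `check_suite` function and the stand-in validators are hypothetical and are not taken from `validate_synthetic_data.py`; the sketch only assumes the `valid.json` / `invalid_*.json` naming described in this PR and reduces "exit code" to a returned integer.

```python
from typing import Callable, Dict


def check_suite(is_valid: Callable[[dict], bool], suite: Dict[str, dict]) -> int:
    """Return 0 if every record behaves as its file name promises, else 1.

    Hypothetical helper: files named valid*.json must pass validation,
    files named invalid_*.json must fail it.
    """
    failures = []
    for name, record in sorted(suite.items()):
        expected_valid = name.startswith("valid")
        if is_valid(record) != expected_valid:
            failures.append(name)
    return 1 if failures else 0


# Stand-in validators for illustration only (a real run would validate
# each record against its JSON schema).
def requires_id(record: dict) -> bool:
    return "id" in record


def always_passes(record: dict) -> bool:
    return True


suite = {"valid.json": {"id": "x"}, "invalid_missing_id.json": {}}


def test_exit_zero_when_suite_behaves():
    assert check_suite(requires_id, suite) == 0


def test_exit_non_zero_when_invalid_record_passes():
    # A schema that lost its constraint accepts the invalid record.
    assert check_suite(always_passes, suite) == 1
```

Keeping the pass/fail decision in a function that returns an integer lets the test assert on the exit code without spawning a subprocess.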
🟡 [WARNING] git add synthetic_data/… 2>/dev/null || true silently suppresses errors
In the release workflow's commit step:
    git add synthetic_data/${{ steps.version.outputs.version }}/ 2>/dev/null || true

If the demonstrator step was skipped (no API key), this silently succeeds and the demonstrator output is simply absent from the release commit, which is the intended behaviour and is noted in the PR description. However, if the demonstrator step ran and produced output but git add failed for a different reason (permissions, path issue), that failure will also be silently swallowed. Consider separating the two cases:

    # Only attempt if directory exists
    if [ -d "synthetic_data/${{ steps.version.outputs.version }}/" ]; then
      git add synthetic_data/${{ steps.version.outputs.version }}/
    fi

This is not blocking but could mask real failures in future runs.
🔵 [INFO] Constraint test step runs unconditionally; demonstrator step is conditional
The "Generate constraint test suites" and "Validate constraint test data" steps run unconditionally at release time and will fail hard if JSON_Schemas/$VERSION/ is empty or does not exist (e.g. if the schema generation steps earlier in the workflow failed). The existing schema generation steps do not appear to have explicit failure guards that would prevent these downstream steps from running. Verify that the workflow's job-level needs: or step ordering guarantees the schemas exist before constraint generation is attempted, or add an explicit existence check.
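One way to add the explicit existence check is to do it inside the scripts themselves, as sketched below. The `require_schemas` helper is hypothetical, it assumes the schemas are `*.json` files directly under the versioned directory, and `Demographics.json` is only an illustrative file name.

```python
import tempfile
from pathlib import Path


def require_schemas(schema_dir) -> list:
    """Return the sorted schema paths, or raise if the directory is missing or empty."""
    directory = Path(schema_dir)
    if not directory.is_dir():
        raise FileNotFoundError(f"Schema directory not found: {directory}")
    schemas = sorted(directory.glob("*.json"))
    if not schemas:
        raise FileNotFoundError(f"No *.json schemas found in {directory}")
    return schemas


# Demonstration against a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    schema_dir = Path(tmp)
    try:
        require_schemas(schema_dir)  # empty directory: the guard should trip
        guard_tripped = False
    except FileNotFoundError:
        guard_tripped = True

    (schema_dir / "Demographics.json").write_text("{}")
    found = [path.name for path in require_schemas(schema_dir)]
```

Failing with a named directory in the message makes it clear that the upstream schema generation is at fault rather than the constraint generator.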
Rules defined in CLAUDE.md · To update review rules, edit CLAUDE.md and open a PR to main
The `dict | None` and `dict[str, set]` type hint syntax requires Python 3.10+. CI runs 3.9, so add `from __future__ import annotations` to both scripts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Superseded by updated review on commit e9cd8b2
HTAN Schema Review — ✅ Approved
Automated review by htan-claude · commit e9cd8b280e022827ad2878988afd9725e20a2511
I have reviewed PR #150 against the HTAN Data Model AI Review Guidelines. This PR adds Python scripts, CI workflow changes, and dependency updates — it does not modify any domains/*.yaml schema files or src/*/datamodel/*.py / JSON_Schemas/ generated artifacts.
Applying the review rules exactly:
Structural (🔴 BLOCKING)
No schema YAML changes are present in this diff. No inheritance, inlining, slot completeness, or enum integrity issues can be flagged. No generated artifacts are claimed to be included in this PR.
No blocking issues found.
Coverage (🟡 WARNING)
No new LinkML classes, slots, or enums are added in this PR. The coverage checklist items (new class, new required slot, new enum, new module) all apply to schema changes, not to Python scripts.
The PR description states "34 unit tests covering constraint test generation and demonstrator post-processing" and the test plan shows pytest tests/test_constraint_test_generator.py passing. However, per the guidelines, coverage warnings apply only to new schema-level constructs (classes, slots, enums). Script-level test coverage is outside the defined review scope.
No coverage warnings triggered under the defined scope.
Dependency (🔵 INFO)
[INFO] poetry.lock is regenerated from Poetry 1.6.1 → Poetry 2.3.2. The lock file format update is mechanical and expected when the Poetry version changes. No action required, but note that anyone on the team still using Poetry 1.x will need to upgrade to resolve the lock file.
[INFO] Three new runtime dependencies are introduced in this PR: anthropic==0.50.0, exrex==0.11.0, and faker==24.14.1. These are added to the main group (not dev), meaning they will be installed in all environments including production schema-generation runs. If the anthropic package (which requires an API key) is not needed outside the demonstrator step, consider moving it (and faker/exrex) to an optional extras group or a dev/scripts group to avoid bloating the default install.
[INFO] linkml==1.8.5 remains pinned in the lock file. The current LinkML stable release is 1.9.x. The pin is 1 minor version behind, which is within the 2-minor-version tolerance defined in the guidelines — no action required.
Summary
No blocking issues. No schema-level coverage warnings triggered. Two informational notes on dependency grouping and the Poetry version bump in the lock file.
Every release generates JSON schemas from LinkML YAML sources, but until now there was no automated way to verify that those schemas actually enforce their constraints, and no synthetic dataset demonstrating the full data model working end-to-end. This PR adds both.
The constraint test suite is pure Python and needs no API key. For each JSON schema it produces one valid record and a set of minimal invalid records — one per constraint (missing required field, invalid enum, broken pattern, wrong type, violated conditional). If a future schema change accidentally drops a constraint, exactly one test flips red.
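The idea can be sketched as follows. This is an illustrative, standard-library-only sketch and not the code in `generate_constraint_tests.py`: the function name, the output names, and the example schema fields are all assumptions, and conditional (`if`/`then`) violations are left out for brevity.

```python
import copy
import re


def make_invalid_variants(schema: dict, valid: dict) -> dict:
    """Return {name: record}, each record targeting one constraint of `schema`."""
    variants = {}

    # Missing required: drop each required field in turn.
    for field in schema.get("required", []):
        record = copy.deepcopy(valid)
        record.pop(field, None)
        variants[f"invalid_missing_{field}"] = record

    for field, spec in schema.get("properties", {}).items():
        if field not in valid:
            continue

        # Bad enum: a value that is not in the permitted list.
        if "enum" in spec:
            record = copy.deepcopy(valid)
            record[field] = "__NOT_A_PERMITTED_VALUE__"
            variants[f"invalid_enum_{field}"] = record

        # Bad pattern: first candidate string the pattern rejects.
        # Python's `re` is used as an approximation of JSON Schema regexes.
        if "pattern" in spec:
            for candidate in ("", "!!!"):
                if re.search(spec["pattern"], candidate) is None:
                    record = copy.deepcopy(valid)
                    record[field] = candidate
                    variants[f"invalid_pattern_{field}"] = record
                    break

        # Wrong type: a number for strings, a string for everything else.
        json_type = spec.get("type")
        if isinstance(json_type, str):
            record = copy.deepcopy(valid)
            record[field] = 12345 if json_type == "string" else "wrong-type"
            variants[f"invalid_type_{field}"] = record

    return variants


# Hypothetical schema and record, for illustration only.
schema = {
    "type": "object",
    "required": ["HTAN_PARTICIPANT_ID", "SEX"],
    "properties": {
        "HTAN_PARTICIPANT_ID": {"type": "string", "pattern": "^HTA[0-9]+_[0-9]+$"},
        "SEX": {"type": "string", "enum": ["Male", "Female"]},
        "AGE": {"type": "integer"},
    },
}
valid = {"HTAN_PARTICIPANT_ID": "HTA1_1001", "SEX": "Female", "AGE": 42}
variants = make_invalid_variants(schema, valid)
```

Because each variant starts from a deep copy of the valid record and changes a single field, a validation failure can be attributed to the constraint named in the variant's key.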
The demonstrator generator uses Claude to produce a small but semantically coherent dataset that spans the entire HTAN data model: clinical records across 8 schemas, biospecimens linked to participants, and assay files chained through levels with consistent HTAN IDs. A deterministic post-processor fixes the mechanical errors LLMs commonly make (enum case mismatches, scalars where arrays are expected, missing conditional fields) so that a single lightweight LLM fix pass is enough to clean up anything remaining.
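Two of the deterministic fixes mentioned above, enum snapping and scalar-to-array coercion, might look roughly like this. The sketch is not the post-processor in `generate_demonstrator.py`: the function names and example fields are hypothetical, matching is assumed to be case-insensitive after trimming whitespace, and conditional field filling is omitted.

```python
def _snap(value, allowed):
    """Map `value` onto the enum member that matches it case-insensitively."""
    lookup = {str(member).casefold(): member for member in allowed}
    return lookup.get(str(value).strip().casefold(), value)


def postprocess(record: dict, schema: dict) -> dict:
    """Return a copy of `record` with mechanical fixes applied."""
    fixed = dict(record)
    for field, spec in schema.get("properties", {}).items():
        if field not in fixed:
            continue
        value = fixed[field]
        is_array = spec.get("type") == "array"

        # Type coercion: wrap a scalar where the schema expects an array.
        if is_array and not isinstance(value, list):
            value = [value]

        # Enum snapping: for arrays the enum lives on the item schema.
        item_spec = spec.get("items", {}) if is_array else spec
        allowed = item_spec.get("enum")
        if allowed:
            if isinstance(value, list):
                value = [_snap(item, allowed) for item in value]
            else:
                value = _snap(value, allowed)

        fixed[field] = value
    return fixed


# Hypothetical schema and LLM output, for illustration only.
schema = {
    "properties": {
        "SEX": {"type": "string", "enum": ["Male", "Female"]},
        "RACE": {
            "type": "array",
            "items": {"type": "string", "enum": ["Asian", "White"]},
        },
    }
}
raw = {"SEX": " female", "RACE": "asian", "NOTE": "free text"}
cleaned = postprocess(raw, schema)
```

Values with no case-insensitive match are left unchanged, so anything the deterministic pass cannot repair remains visible to the later LLM fix pass.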
Both run automatically in CI at release time. The constraint tests are validated immediately; the demonstrator output is committed alongside the JSON schemas in the release PR.
Changes
- scripts/generate_constraint_tests.py: generates valid.json + one invalid_*.json per constraint per schema (missing required, bad enum, bad pattern, wrong type, conditional violation). Pure Python, no API key needed.
- scripts/validate_synthetic_data.py: validates constraint test suites against schemas, exits non-zero on failure.
- scripts/generate_demonstrator.py: Claude-powered generator producing a coherent, joinable synthetic dataset across all HTAN schemas (clinical → biospecimen → assays) with deterministic post-processing (enum snapping, type coercion, conditional field filling) and single-pass LLM fix fallback.
- CI workflow steps run these at release time and commit the outputs alongside JSON_Schemas/.
- Adds exrex, faker, jsonschema, anthropic to pyproject.toml.

Test plan
- pytest tests/test_constraint_test_generator.py: 34/34 pass
- ANTHROPIC_API_KEY secret is configured for demonstrator step
- synthetic_data/ is committed in release PR

Closes #146
🤖 Generated with Claude Code