
feat: synthetic ecosystem generator (constraint tests + demonstrator)#150

Open
adamjtaylor wants to merge 3 commits into main from feat/synthetic-ecosystem-v2

Conversation

@adamjtaylor
Collaborator

@adamjtaylor adamjtaylor commented Mar 15, 2026

Every release generates JSON schemas from LinkML YAML sources, but until now there was no automated way to verify that those schemas actually enforce their constraints, and no synthetic dataset demonstrating the full data model working end-to-end. This PR adds both.

The constraint test suite is pure Python and needs no API key. For each JSON schema it produces one valid record and a set of minimal invalid records — one per constraint (missing required field, invalid enum, broken pattern, wrong type, violated conditional). If a future schema change accidentally drops a constraint, exactly one test flips red.
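The one-invalid-record-per-constraint idea can be sketched as follows. This is a minimal illustration, not the code in scripts/generate_constraint_tests.py: the toy schema, the `invalid_variants` and `violations` helpers, and the file-name pattern are all assumptions made for the example, and only `required` and `enum` constraints are covered.

```python
import copy

# Hypothetical toy schema; not taken from the HTAN JSON_Schemas directory.
SCHEMA = {
    "type": "object",
    "required": ["participant_id", "sex"],
    "properties": {
        "participant_id": {"type": "string"},
        "sex": {"type": "string", "enum": ["Male", "Female"]},
    },
}

VALID = {"participant_id": "P1", "sex": "Female"}


def invalid_variants(schema, valid):
    """Return {name: record}, where each record breaks exactly one constraint."""
    variants = {}
    # One variant per required field: drop that field only.
    for field in schema.get("required", []):
        broken = copy.deepcopy(valid)
        broken.pop(field, None)
        variants[f"invalid_missing_{field}"] = broken
    # One variant per enum-constrained field: swap in a value outside the enum.
    for field, spec in schema.get("properties", {}).items():
        if "enum" in spec:
            broken = copy.deepcopy(valid)
            broken[field] = "__not_a_valid_enum_value__"
            variants[f"invalid_enum_{field}"] = broken
    return variants


def violations(schema, record):
    """Tiny stand-in checker covering only `required` and `enum`."""
    errors = []
    for field in schema.get("required", []):
        if field not in record:
            errors.append(f"missing:{field}")
    for field, spec in schema.get("properties", {}).items():
        if field in record and "enum" in spec and record[field] not in spec["enum"]:
            errors.append(f"enum:{field}")
    return errors


variants = invalid_variants(SCHEMA, VALID)
for name, record in variants.items():
    print(name, violations(SCHEMA, record))
```

Because every variant differs from the valid record in a single place, a dropped constraint shows up as exactly one variant that unexpectedly validates.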

The demonstrator generator uses Claude to produce a small but semantically coherent dataset that spans the entire HTAN data model: clinical records across 8 schemas, biospecimens linked to participants, and assay files chained through levels with consistent HTAN IDs. A deterministic post-processor fixes the mechanical errors LLMs commonly make (enum case mismatches, scalars where arrays are expected, missing conditional fields) so that a single lightweight LLM fix pass is enough to clean up anything remaining.
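A rough sketch of what deterministic post-processing of this kind might look like is shown below. It covers only enum case snapping and scalar-to-array coercion; conditional field filling is omitted. The function names `snap_enum` and `post_process` and the example schema are hypothetical and are not taken from scripts/generate_demonstrator.py.

```python
def snap_enum(value, allowed):
    """Map a value onto the allowed member that matches case-insensitively."""
    if value in allowed:
        return value
    if isinstance(value, str):
        by_lower = {a.lower(): a for a in allowed if isinstance(a, str)}
        # Unknown values are returned unchanged so a later pass can report them.
        return by_lower.get(value.strip().lower(), value)
    return value


def post_process(record, schema):
    """Return a copy of `record` with mechanical errors repaired."""
    fixed = dict(record)
    for field, spec in schema.get("properties", {}).items():
        if field not in fixed:
            continue
        value = fixed[field]
        if spec.get("type") == "array":
            # Wrap a bare scalar where the schema expects an array.
            if not isinstance(value, list):
                value = [value]
            item_enum = spec.get("items", {}).get("enum")
            if item_enum:
                value = [snap_enum(v, item_enum) for v in value]
        elif "enum" in spec:
            value = snap_enum(value, spec["enum"])
        fixed[field] = value
    return fixed


# Hypothetical schema fragment and LLM output, for illustration only.
EXAMPLE_SCHEMA = {
    "properties": {
        "tissue": {"type": "string", "enum": ["Tumor", "Normal"]},
        "sites": {"type": "array", "items": {"type": "string", "enum": ["Lung", "Breast"]}},
    }
}
print(post_process({"tissue": "tumor", "sites": "lung"}, EXAMPLE_SCHEMA))
```

Leaving unrecognised values untouched, rather than guessing, keeps the deterministic pass safe and leaves the genuinely ambiguous cases for the LLM fix pass.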

Both run automatically in CI at release time. The constraint tests are validated immediately; the demonstrator output is committed alongside the JSON schemas in the release PR.

Changes

  • scripts/generate_constraint_tests.py — generates valid.json + one invalid_*.json per constraint per schema (missing required, bad enum, bad pattern, wrong type, conditional violation). Pure Python, no API key needed.
  • scripts/validate_synthetic_data.py — validates constraint test suites against schemas, exits non-zero on failure.
  • scripts/generate_demonstrator.py — Claude-powered generator producing a coherent, joinable synthetic dataset across all HTAN schemas (clinical → biospecimen → assays) with deterministic post-processing (enum snapping, type coercion, conditional field filling) and single-pass LLM fix fallback.
  • CI integration — three new workflow steps run at release time: generate constraint tests, validate them, generate demonstrator. Outputs committed alongside JSON_Schemas/.
  • Dependencies — adds exrex, faker, jsonschema, anthropic to pyproject.toml.
  • Tests — 34 unit tests covering constraint test generation and demonstrator post-processing.

Test plan

  • pytest tests/test_constraint_test_generator.py — 34/34 pass
  • Local end-to-end: constraint tests generated + validated for all 23 schemas
  • Local end-to-end: demonstrator generated with 3 participants, post-processor resolves most validation errors deterministically
  • CI: verify ANTHROPIC_API_KEY secret is configured for demonstrator step
  • CI: verify synthetic_data/ is committed in release PR

Closes #146

🤖 Generated with Claude Code

Add three scripts for automated schema verification and synthetic data:

- generate_constraint_tests.py: produces valid.json + one invalid file per
  constraint (missing required, bad enum, bad pattern, wrong type,
  conditional violation) for every JSON schema
- validate_synthetic_data.py: validates constraint test suites against schemas
- generate_demonstrator.py: Claude-powered generator producing a coherent
  joinable dataset across all HTAN schemas (clinical, biospecimen, assays)
  with deterministic post-processing (enum snapping, type coercion,
  conditional field filling) and single-pass LLM fix fallback

Also adds CI workflow steps to run these at release time and commits
outputs alongside JSON_Schemas.

Closes #146

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
github-actions[bot]

This comment was marked as resolved.

- Add tests for deterministic helpers (_extract_level, _assay_group,
  _id_preassignments, _summarise_schema) and validate_synthetic_data.py
  (valid-that-fails, invalid-that-passes, missing-dir paths). 48 tests total.
- Guard demonstrator CI step with secret-presence check so missing
  ANTHROPIC_API_KEY doesn't block release workflow.
- Tighten anthropic lower bound to ^0.50 (model IDs require >=0.50).
- Regenerate poetry.lock to match pyproject.toml changes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions github-actions Bot dismissed their stale review March 15, 2026 16:41

Superseded by updated review on commit eab0486

github-actions[bot]
github-actions Bot previously approved these changes Mar 15, 2026
Contributor

@github-actions github-actions Bot left a comment

HTAN Schema Review — ✅ Approved

Automated review by htan-claude · commit eab048641b3fa3494b9dd7a7c5b26bf17ec6905b

I'll review this PR against the defined scope: structural issues, coverage warnings, and dependency notes. This PR adds Python scripts and CI workflow changes — no domains/*.yaml schema files are modified.


Review Summary

🔵 [INFO] No YAML schema changes — generated artifacts checklist not applicable

This PR modifies only scripts/, .github/workflows/, pyproject.toml, and poetry.lock. No domains/*.yaml files are touched. The generated-artifacts checklist item does not apply.


🔵 [INFO] anthropic added as a main (production) dependency

In pyproject.toml / poetry.lock, anthropic is placed in groups = ["main"] rather than a dev/optional group. The demonstrator generator is a release-time tool that requires an API key and is conditionally skipped in CI (if: ${{ secrets.ANTHROPIC_API_KEY != '' }}). Shipping anthropic, faker, and exrex as unconditional production dependencies will bloat the install footprint for anyone who pip installs this package without needing the synthetic data tooling.

Consider moving these to an optional extras group (e.g. [tool.poetry.extras] synthetic = ["anthropic", "faker", "exrex"]) or a dedicated dev/tools dependency group, consistent with how jsonschema is already available in both groups.


🔵 [INFO] poetry.lock regenerated with Poetry 2.3.2 (was 1.6.1)

The lockfile header changed from Poetry 1.6.1 to 2.3.2. This is a significant major-version bump. If CI or other contributors are still on Poetry 1.x this will cause lock-file incompatibility errors. Ensure the team's Poetry version is documented and pinned in CI accordingly.


🟡 [WARNING] No unit tests cover the CI workflow integration paths

The PR test plan confirms 34 unit tests pass for constraint generation and post-processing, but there are no tests covering:

  • The validate_synthetic_data.py script's exit-non-zero behaviour when a constraint test fails
  • The generate_demonstrator.py post-processor's enum-snapping and type-coercion logic against actual HTAN schema enum values (tests appear to use synthetic/mock schemas rather than the real JSON_Schemas/ fixtures)

Per the review guidelines, for each new script added the following coverage is expected:

  • At least one invalid input raises the correct error
  • At least one valid input loads without error

The demonstrator script is conditionally skipped when no API key is present, so integration tests are understood to be impractical. However, validate_synthetic_data.py's failure path should have a unit test asserting it exits non-zero on invalid data.
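A failure-path test of the kind requested here could look roughly like the sketch below. The real entry point and signature of validate_synthetic_data.py are not shown in this PR page, so `check_suite`, `main`, and `exit_code` are hypothetical stand-ins, and the rule that only valid.json is expected to pass is an assumption based on the file naming in the PR description.

```python
import sys


def check_suite(results):
    """Stand-in for the validation logic.

    `results` maps a file name to True if that record passed schema validation.
    valid.json must pass; every other file must fail.
    """
    failures = []
    for name, passed in results.items():
        expected_to_pass = name == "valid.json"
        if passed != expected_to_pass:
            failures.append(name)
    return failures


def main(results):
    """Exit non-zero when any record behaves unexpectedly."""
    failures = check_suite(results)
    if failures:
        print("unexpected results:", ", ".join(failures), file=sys.stderr)
        sys.exit(1)


def exit_code(results):
    """Run main() and report the exit code it would hand to the shell."""
    try:
        main(results)
    except SystemExit as exc:
        return exc.code
    return 0


# A valid record that fails validation must produce a non-zero exit code.
print(exit_code({"valid.json": False, "invalid_enum_sex.json": False}))
```

Catching `SystemExit` keeps the test in-process; with pytest the same check would typically be written with `pytest.raises(SystemExit)`.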


🟡 [WARNING] git add synthetic_data/… 2>/dev/null || true silently suppresses errors

In the release workflow's commit step:

git add synthetic_data/${{ steps.version.outputs.version }}/ 2>/dev/null || true

If the demonstrator step was skipped (no API key), this silently succeeds and the demonstrator output is simply absent from the release commit — which is the intended behaviour and is noted in the PR description. However, if the demonstrator step ran and produced output but git add failed for a different reason (permissions, path issue), that failure will also be silently swallowed. Consider separating the two cases:

# Only attempt if directory exists
if [ -d "synthetic_data/${{ steps.version.outputs.version }}/" ]; then
  git add synthetic_data/${{ steps.version.outputs.version }}/
fi

This is not blocking but could mask real failures in future runs.


🔵 [INFO] Constraint test step runs unconditionally; demonstrator step is conditional

The "Generate constraint test suites" and "Validate constraint test data" steps run unconditionally at release time and will fail hard if JSON_Schemas/$VERSION/ is empty or does not exist (e.g. if the schema generation steps earlier in the workflow failed). The existing schema generation steps do not appear to have explicit failure guards that would prevent these downstream steps from running. Verify that the workflow's job-level needs: or step ordering guarantees the schemas exist before constraint generation is attempted, or add an explicit existence check.


Rules defined in CLAUDE.md · To update review rules, edit CLAUDE.md and open a PR to main

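The `dict | None` union syntax in type hints requires Python 3.10+
(built-in generics such as `dict[str, set]` already work on 3.9).
CI runs 3.9, so add `from __future__ import annotations` to both
scripts so that annotations are not evaluated at definition time.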

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
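The effect of the future import can be illustrated with a small experiment. The two module sources are held in strings so that both variants can be executed side by side; the `load` function and its signature are made up for the example and do not come from either script.

```python
import sys

# Same function definition, with and without the future import.
WITH_FUTURE = '''
from __future__ import annotations

def load(path: str, cache: dict | None = None) -> dict[str, set]:
    return {}
'''

WITHOUT_FUTURE = '''
def load(path: str, cache: dict | None = None) -> dict[str, set]:
    return {}
'''


def defines_cleanly(source):
    """Return True if executing the module source raises no TypeError."""
    try:
        exec(compile(source, "<sketch>", "exec"), {})
    except TypeError:
        # On Python 3.9, `dict | None` is evaluated when the def statement
        # runs, and the `|` operator is not defined between a type and None.
        return False
    return True


print(sys.version_info[:2], defines_cleanly(WITH_FUTURE), defines_cleanly(WITHOUT_FUTURE))
```

With the future import, annotations are stored as strings and never evaluated when the function is defined, which is why the 3.10-style syntax is accepted on older interpreters.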
@github-actions github-actions Bot dismissed their stale review March 15, 2026 16:47

Superseded by updated review on commit e9cd8b2

Contributor

@github-actions github-actions Bot left a comment

HTAN Schema Review — ✅ Approved

Automated review by htan-claude · commit e9cd8b280e022827ad2878988afd9725e20a2511

I have reviewed PR #150 against the HTAN Data Model AI Review Guidelines. This PR adds Python scripts, CI workflow changes, and dependency updates — it does not modify any domains/*.yaml schema files or src/*/datamodel/*.py / JSON_Schemas/ generated artifacts.

Applying the review rules exactly:


Structural (🔴 BLOCKING)

No schema YAML changes are present in this diff. No inheritance, inlining, slot completeness, or enum integrity issues can be flagged. No generated artifacts are claimed to be included in this PR.

No blocking issues found.


Coverage (🟡 WARNING)

No new LinkML classes, slots, or enums are added in this PR. The coverage checklist items (new class, new required slot, new enum, new module) all apply to schema changes, not to Python scripts.

The PR description states "34 unit tests covering constraint test generation and demonstrator post-processing" and the test plan shows pytest tests/test_constraint_test_generator.py passing. However, per the guidelines, coverage warnings apply only to new schema-level constructs (classes, slots, enums). Script-level test coverage is outside the defined review scope.

No coverage warnings triggered under the defined scope.


Dependency (🔵 INFO)

[INFO] poetry.lock is regenerated from Poetry 1.6.1 → Poetry 2.3.2. The lock file format update is mechanical and expected when the Poetry version changes. No action required, but note that anyone on the team still using Poetry 1.x will need to upgrade to resolve the lock file.

[INFO] Three new runtime dependencies are introduced in this PR: anthropic==0.50.0, exrex==0.11.0, and faker==24.14.1. These are added to the main group (not dev), meaning they will be installed in all environments including production schema-generation runs. If the anthropic package (which requires an API key) is not needed outside the demonstrator step, consider moving it (and faker/exrex) to an optional extras group or a dev/scripts group to avoid bloating the default install.

[INFO] linkml==1.8.5 remains pinned in the lock file. The current LinkML stable release is 1.9.x. The pin is 1 minor version behind, which is within the 2-minor-version tolerance defined in the guidelines — no action required.


Summary

No blocking issues. No schema-level coverage warnings triggered. Two informational notes on dependency grouping and the Poetry version bump in the lock file.


Rules defined in CLAUDE.md · To update review rules, edit CLAUDE.md and open a PR to main

Development

Successfully merging this pull request may close these issues.

feat: generate fully connected synthetic demonstrator ecosystem with per-schema constraint tests

1 participant