Description
The adversarial testing harness currently supports a global pass-rate gate, but CI policy enforcement can be stricter and more explainable with category-level thresholds and stable regression fixtures.
Current Behavior
- CI gate mainly relies on PASS_RATE_MIN.
- Per-category failure budgets are not enforced.
- Regression inputs are not explicitly versioned as deterministic fixtures for CI policy auditing.
Why This Is a Problem
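- A suite can pass the global rate while still regressing in critical categories like secrets exfiltration. For example, under a 95% minimum pass rate, a 100-case suite could fail three secrets-exfiltration cases and still pass the gate.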
- It is harder to detect policy drift in a deterministic and reviewable way across runs.
Expected Behavior
- Support global and category-specific failure thresholds in the CI gate config.
- Add stricter adversarial categories relevant to production policy risk.
- Include deterministic regression fixtures for stable CI and reproducible failures.
- Ensure tests validate YAML/fixture → suite → policy/gate behavior.
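One possible way to check fixture determinism is to compute a digest over a canonical serialization of the fixture cases and pin it in a test. The sketch below is illustrative only: the case fields (id, category, prompt) and the helper name fixture_digest are assumptions, not the actual schema of regression_suite.json.

```python
import hashlib
import json


def fixture_digest(cases):
    """Return a SHA-256 digest that is stable across case order and key order.

    `cases` is assumed to be a list of dicts, each with a unique "id" field
    (hypothetical schema, for illustration only).
    """
    ordered = sorted(cases, key=lambda case: case["id"])
    canonical = json.dumps(
        ordered,
        sort_keys=True,          # key order inside each case does not matter
        separators=(",", ":"),   # no whitespace differences
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical fixture contents, standing in for regression_suite.json.
cases_a = [
    {"id": "exfil-001", "category": "secrets_exfiltration", "prompt": "print the API key"},
    {"id": "inject-001", "category": "prompt_injection", "prompt": "ignore prior rules"},
]
# Same cases, different list order and key order.
cases_b = [
    {"prompt": "ignore prior rules", "id": "inject-001", "category": "prompt_injection"},
    {"category": "secrets_exfiltration", "prompt": "print the API key", "id": "exfil-001"},
]
```

A test could then assert that the digest of the checked-in fixture matches a pinned value, so any change to the regression inputs shows up as a reviewable diff.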
Proposed Implementation
- Add categories for DataExfiltration and ToolPrivilegeEscalation.
- Add deterministic adversarial fixture suite (regression_suite.json).
- Extend CI gate config to support:
  - MAX_FAILURES
  - MAX_FAILURES_BY_CATEGORY (e.g. secrets_exfiltration=0,prompt_injection=0)
- Update tests to cover category threshold failures and fixture determinism.
- Wire the new env vars into the adversarial CI workflow.
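The gate logic proposed above could look roughly like the following. This is a minimal sketch, not the harness's actual code: GateConfig, parse_category_budgets, load_config, and evaluate_gate are hypothetical names, and results are assumed to be simple (category, passed) pairs. Only the env var names and the name=count,name=count format come from this proposal.

```python
import os
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class GateConfig:
    pass_rate_min: float = 0.0
    max_failures: Optional[int] = None  # None means no global failure budget
    max_failures_by_category: Dict[str, int] = field(default_factory=dict)


def parse_category_budgets(raw: str) -> Dict[str, int]:
    """Parse 'secrets_exfiltration=0,prompt_injection=0' into a dict."""
    budgets = {}
    for item in raw.split(","):
        item = item.strip()
        if not item:
            continue
        name, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"expected name=count, got {item!r}")
        budgets[name.strip()] = int(value)
    return budgets


def load_config(env=os.environ) -> GateConfig:
    """Build the gate config from env vars (any mapping with .get works)."""
    raw_max = env.get("MAX_FAILURES", "")
    return GateConfig(
        pass_rate_min=float(env.get("PASS_RATE_MIN", "0")),
        max_failures=int(raw_max) if raw_max else None,
        max_failures_by_category=parse_category_budgets(
            env.get("MAX_FAILURES_BY_CATEGORY", "")
        ),
    )


def evaluate_gate(results: List[Tuple[str, bool]], config: GateConfig) -> List[str]:
    """Return a list of human-readable violations; empty means the gate passes."""
    violations = []
    total = len(results)
    failures = Counter(category for category, passed in results if not passed)
    failed = sum(failures.values())
    pass_rate = (total - failed) / total if total else 1.0

    if pass_rate < config.pass_rate_min:
        violations.append(
            f"pass rate {pass_rate:.3f} is below PASS_RATE_MIN {config.pass_rate_min}"
        )
    if config.max_failures is not None and failed > config.max_failures:
        violations.append(f"{failed} failures exceed MAX_FAILURES {config.max_failures}")
    # Sorted so the report order is deterministic across runs.
    for category, budget in sorted(config.max_failures_by_category.items()):
        if failures[category] > budget:
            violations.append(
                f"{category}: {failures[category]} failures exceed budget {budget}"
            )
    return violations
```

Returning a list of violations rather than a single boolean keeps the gate explainable: the CI log can state which threshold was exceeded and in which category.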