This guide explains:
- How test suites are represented and loaded.
- The exact assertion-engine semantics (critical for writing valid tests).
- A practical workflow for extending Box, Slack, Calendar, and Linear suites safely.
It is based on the current code paths in:
- `backend/utils/seed_tests.py`
- `backend/src/platform/api/routes.py`
- `backend/src/platform/evaluationEngine/{compiler.py, dsl_schema.json, assertion.py, differ.py}`
Use code behavior as source-of-truth when this guide and older docs differ.
Current benchmark suites are JSON files under:
- `examples/box/testsuites/box_bench.json`
- `examples/slack/testsuites/slack_bench_v2.json`
- `examples/calendar/testsuites/calendar_bench.json`
- `examples/linear/testsuites/linear_bench.json`
For Docker/runtime seeding, mirrored files live in backend/seeds/testsuites/.
Top-level shape:
{
"id": "suite-id",
"name": "Suite Name",
"description": "What this suite measures",
"service": "slack|box|calendar|linear",
"ignore_fields": {
"global": ["created_at", "updated_at"]
},
"tests": [ ... ]
}
Notes:
- `ignore_fields` is merged into each test's evaluation spec during seeding.
- `service` is metadata for humans/tooling; the DB row stores the resolved `template_schema` per test.
In benchmark files, tests are authored with assertions shorthand:
{
"id": "test_12",
"name": "Human-readable name",
"prompt": "Natural-language task for the agent",
"type": "actionEval",
"seed_template": "slack_bench_v2",
"impersonate_user_id": "U01AGENBOT9",
"assertions": [ ... ],
"metadata": { ... }
}
Important seeding behavior from `seed_tests.py`:
- Required/used fields for DB `Test`: `name`, `prompt`, `type`, `seed_template`, `impersonate_user_id`, plus `assertions`/`expected_output`.
- `metadata` and Slack `_step_sequence` are currently ignored by runtime evaluation.
- If `expected_output` is present, it is used directly; otherwise the shorthand `assertions` becomes `{"assertions": ...}`.
- `ignore_fields` at suite level is merged into each test's `expected_output.ignore_fields`.
- Test/suite UUIDs are deterministic from suite name + test id (`uuid5`), so changing `id` changes DB identity.
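The seeding rules above can be sketched roughly as follows. This is an illustration, not the code in `seed_tests.py`: the function names, the UUID namespace, the `"suite:test"` name format, and the union-style merge of `ignore_fields` are all assumptions made for the example.

```python
import uuid
from typing import Optional

# Assumption: the real seed script may use a different namespace and name format.
HYPOTHETICAL_NAMESPACE = uuid.NAMESPACE_DNS


def build_expected_output(test: dict, suite_ignore: Optional[dict]) -> dict:
    """Resolve a test's evaluation spec: explicit expected_output wins,
    otherwise the shorthand assertions list is wrapped."""
    if "expected_output" in test:
        spec = dict(test["expected_output"])
    else:
        spec = {"assertions": test.get("assertions", [])}

    if suite_ignore:
        # Assumed merge strategy: per-scope union, preserving order.
        merged = {k: list(v) for k, v in spec.get("ignore_fields", {}).items()}
        for scope, fields in suite_ignore.items():
            bucket = merged.setdefault(scope, [])
            bucket.extend(f for f in fields if f not in bucket)
        spec["ignore_fields"] = merged
    return spec


def derive_test_uuid(suite_name: str, test_id: str) -> uuid.UUID:
    """Deterministic id: the same suite name and test id always give the same UUID."""
    return uuid.uuid5(HYPOTHETICAL_NAMESPACE, f"{suite_name}:{test_id}")
```

Because the id is derived rather than random, renaming a test's `id` yields a different UUID, which is why the DB treats it as a new test.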
You can also create tests via platform API:
- `POST /api/platform/testSuites`
- `POST /api/platform/testSuites/{suite_id}/tests`
In this path, each test item carries expected_output and environmentTemplate. DSL is validated before persistence (CoreTestManager.validate_dsl).
This section is the most important part for writing correct tests.
At evaluateRun:
- A diff is computed (snapshot-based or replication journal).
- A spec is selected:
  - request `expectedOutput` if provided, else
  - stored test `expected_output` (when run has `test_id`), else
  - `{"assertions": []}`.
- Spec is compiled (`DSLCompiler.compile`):
  - JSON Schema validation against `dsl_schema.json`.
  - Predicate normalization.
- Assertions run against diff via `AssertionEngine.evaluate`.
Implication:
- If a run has no `test_id` and you do not pass `expectedOutput`, route code builds `{"assertions": []}`.
- Current DSL schema requires at least one assertion (`minItems: 1`), so this path can fail validation and produce run status `error`.
- For reliable evaluation, always provide a real spec (via `test_id` or `expectedOutput`).
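The selection order and its failure mode can be shown in a few lines. This is a minimal sketch with hypothetical function names; it mirrors the fallback chain described above rather than the actual route code.

```python
from typing import Optional


def select_spec(request_expected_output: Optional[dict],
                stored_expected_output: Optional[dict]) -> dict:
    """Pick the evaluation spec: request override, then stored test spec,
    then the empty fallback."""
    if request_expected_output is not None:
        return request_expected_output
    if stored_expected_output is not None:
        return stored_expected_output
    return {"assertions": []}


def satisfies_min_items(spec: dict) -> bool:
    """Stand-in for the schema's `minItems: 1` rule on `assertions`."""
    return len(spec.get("assertions", [])) >= 1


# A run with neither a test_id nor an expectedOutput lands on the fallback,
# which the minItems rule then rejects.
fallback = select_spec(None, None)
fallback_is_valid = satisfies_min_items(fallback)
```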
Diff model:
{
"inserts": [ { "__table__": "...", "...": "..." } ],
"updates": [ { "__table__": "...", "before": {...}, "after": {...} } ],
"deletes": [ { "__table__": "...", "...": "..." } ]
}
Routing by `diff_type`:
- `added` -> matches rows in `inserts`
- `removed` -> matches rows in `deletes`
- `changed` -> matches rows in `updates`
Rows are filtered by entity via row["__table__"] == entity.
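As a rough illustration of this routing (the helper name is hypothetical and the engine's real code may be organized differently), candidate rows for an assertion could be picked like this:

```python
# Maps each supported diff_type to the diff bucket it reads from.
BUCKET_BY_DIFF_TYPE = {
    "added": "inserts",
    "removed": "deletes",
    "changed": "updates",
}


def candidate_rows(diff: dict, diff_type: str, entity: str) -> list:
    """Return rows from the bucket for diff_type whose __table__ equals entity."""
    bucket = BUCKET_BY_DIFF_TYPE[diff_type]
    return [row for row in diff.get(bucket, []) if row.get("__table__") == entity]


example_diff = {
    "inserts": [
        {"__table__": "messages", "message_text": "hello"},
        {"__table__": "channels", "name": "general"},
    ],
    "updates": [],
    "deletes": [{"__table__": "messages", "message_text": "old"}],
}
added_messages = candidate_rows(example_diff, "added", "messages")
```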
From dsl_schema.json:
- Supported `diff_type`: `added`, `removed`, `changed`.
- `unchanged` is not accepted by schema.
- `assertions` is required and must contain at least 1 item.
- `where` values must be either:
  - predicate object, or
  - primitive shorthand (string/number/bool/null), normalized to `{"eq": value}`.
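The primitive shorthand for `where` amounts to a small rewrite step. The sketch below uses a hypothetical helper name and covers only the normalization described above, not the compiler's full behavior.

```python
def normalize_where(where: dict) -> dict:
    """Expand primitive shorthand values into explicit {"eq": value} predicates."""
    normalized = {}
    for field, value in where.items():
        if isinstance(value, dict):
            normalized[field] = value          # already a predicate object
        else:
            normalized[field] = {"eq": value}  # string/number/bool/null shorthand
    return normalized


shorthand = {"channel_id": "C01ABCD1234", "message_text": {"contains": "hello"}}
expanded = normalize_where(shorthand)
```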
Some older tests/docs mention unchanged and logical combinators like and/or. Current schema/engine path does not support these as benchmark DSL inputs.
Implemented operators:
- Equality: `eq`, `ne`
- Membership: `in`, `not_in`
- String: `contains`, `not_contains`, `i_contains`, `starts_with`, `ends_with`, `i_starts_with`, `i_ends_with`, `regex`
- Numeric/order: `gt`, `gte`, `lt`, `lte`
- Existence/list: `exists`, `has_any`, `has_all`
Behavior details:
- Multiple operators in one predicate object are ANDed.
- Multiple `where` fields are ANDed.
- Dot paths are supported for nested objects (`start.timeZone`).
- Date/datetime values are normalized to ISO strings before comparison.
- For `contains`/`i_contains` on dict/list, runtime stringifies JSON first.
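To make the AND semantics, dot paths, and JSON stringification concrete, here is a simplified matcher covering a subset of the operators. It is a sketch, not `AssertionEngine`: the names are hypothetical, date normalization is omitted, and treating a missing field as a failed match for every operator except `exists` is an assumption.

```python
import json
import re
from typing import Any

_MISSING = object()  # sentinel for "path not present in row"


def get_path(row: dict, path: str) -> Any:
    """Resolve a dot path such as "start.timeZone" against nested dicts."""
    current: Any = row
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return _MISSING
        current = current[part]
    return current


def as_text(value: Any) -> str:
    """Stringify dicts/lists as JSON so substring operators can search them."""
    return json.dumps(value) if isinstance(value, (dict, list)) else str(value)


# Subset of the operators listed above; the rest would follow the same pattern.
OPERATORS = {
    "eq": lambda v, arg: v == arg,
    "ne": lambda v, arg: v != arg,
    "in": lambda v, arg: v in arg,
    "not_in": lambda v, arg: v not in arg,
    "contains": lambda v, arg: str(arg) in as_text(v),
    "i_contains": lambda v, arg: str(arg).lower() in as_text(v).lower(),
    "regex": lambda v, arg: re.search(arg, as_text(v)) is not None,
    "gt": lambda v, arg: v > arg,
    "gte": lambda v, arg: v >= arg,
    "lt": lambda v, arg: v < arg,
    "lte": lambda v, arg: v <= arg,
}


def matches(row: dict, where: dict) -> bool:
    """True only if every operator of every where field holds (AND of ANDs)."""
    for path, predicate in where.items():
        value = get_path(row, path)
        for op, arg in predicate.items():
            if op == "exists":
                if (value is not _MISSING) != bool(arg):
                    return False
            elif value is _MISSING or not OPERATORS[op](value, arg):
                return False
    return True
```

For instance, `{"count": {"gte": 3, "lt": 5}}` matches a row only when both bounds hold, and `{"start": {"contains": "Los_Angeles"}}` searches the JSON text of the nested `start` object.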
For each assertion:
- Candidate rows are selected by table + bucket.
- `where` filter applied.
- Match count checked against `expected_count`.
If expected_count is omitted:
- Default is at least 1 match (`actual >= 1`).
For each update row:
- Row matches if `where` matches `after` OR `before`.
- Changed fields are computed by keywise inequality (`before.get(k) != after.get(k)`), excluding ignore fields.
- `expected_changes` field checks are applied.
When strict=true, changed field set must be a subset of expected field names.
This is the key rule:
- Any extra changed field not listed in `expected_changes` fails the assertion.
When strict=false:
- Extra changed fields are allowed; only declared expected fields are validated.
Per field:
- Must have changed (field must be in computed changed set).
- Optional `from` predicate checks `before[field]`.
- Optional `to` predicate checks `after[field]`.
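The strict/non-strict rules and the per-field checks can be sketched together. This is a simplified illustration with hypothetical names: `from`/`to` predicates are limited to `eq`, and the failure messages are invented for the example.

```python
def changed_fields(before: dict, after: dict, ignore: set) -> set:
    """Keys whose values differ between before and after, minus ignored fields."""
    keys = set(before) | set(after)
    return {k for k in keys if k not in ignore and before.get(k) != after.get(k)}


def check_update(row: dict, expected_changes: dict, strict: bool, ignore: set) -> list:
    """Return failure messages for one update row (empty list means it passes)."""
    before, after = row["before"], row["after"]
    changed = changed_fields(before, after, ignore)
    failures = []

    if strict:
        # Strict mode: the changed set must be a subset of the expected fields.
        extra = changed - set(expected_changes)
        if extra:
            failures.append(f"unexpected changed fields: {sorted(extra)}")

    for field, spec in expected_changes.items():
        if field not in changed:
            failures.append(f"{field}: expected a change, found none")
            continue
        # Simplification: only `eq` predicates are handled here.
        if "from" in spec and before.get(field) != spec["from"]["eq"]:
            failures.append(f"{field}: 'from' mismatch")
        if "to" in spec and after.get(field) != spec["to"]["eq"]:
            failures.append(f"{field}: 'to' mismatch")
    return failures


update_row = {
    "__table__": "issues",
    "before": {"assigneeId": "old-user", "priority": 1, "updated_at": "t0"},
    "after": {"assigneeId": "new-user", "priority": 2, "updated_at": "t1"},
}
expected = {"assigneeId": {"from": {"eq": "old-user"}, "to": {"eq": "new-user"}}}
strict_failures = check_update(update_row, expected, strict=True, ignore={"updated_at"})
lenient_failures = check_update(update_row, expected, strict=False, ignore={"updated_at"})
```

Here the undeclared `priority` change fails the strict check but is tolerated when `strict` is false, while `updated_at` never counts because it is ignored.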
Shorthand normalization by compiler:
- `"expected_changes": {"status": "done"}` becomes `"status": {"to": {"eq": "done"}}`.
- Primitive `from`/`to` also normalize to `{"eq": ...}`.
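A minimal sketch of that shorthand expansion is shown below. The helper name is hypothetical, and passing through dict values under keys other than `from`/`to` unchanged is an assumption, since the guide does not describe that case.

```python
def normalize_expected_changes(expected_changes: dict) -> dict:
    """Expand shorthand so every field maps to {"from": pred?, "to": pred?}."""
    normalized = {}
    for field, spec in expected_changes.items():
        if not isinstance(spec, dict):
            # Bare primitive means "the field ends up equal to this value".
            normalized[field] = {"to": {"eq": spec}}
            continue
        entry = {}
        for key, value in spec.items():
            if key in ("from", "to") and not isinstance(value, dict):
                entry[key] = {"eq": value}  # primitive from/to shorthand
            else:
                entry[key] = value          # already a predicate (or untouched)
        normalized[field] = entry
    return normalized


expanded_changes = normalize_expected_changes(
    {"status": "done", "assigneeId": {"from": "old-user", "to": {"eq": "new-user"}}}
)
```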
Default remains at least one matched update (actual >= 1).
Supported forms:
- Exact: integer
- Range: object with `min` and/or `max`
Examples:
- `1`
- `{"min": 1}`
- `{"max": 3}`
- `{"min": 1, "max": 3}`
Use explicit expected_count whenever possible for deterministic scoring.
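The count rules reduce to a small check. The following sketch (hypothetical function name) covers the omitted, exact, and range forms described above.

```python
from typing import Optional, Union


def count_ok(actual: int, expected_count: Optional[Union[int, dict]] = None) -> bool:
    """Check a match count against an expected_count in any supported form."""
    if expected_count is None:
        return actual >= 1                      # default: at least one match
    if isinstance(expected_count, int):
        return actual == expected_count         # exact form
    lower = expected_count.get("min")
    upper = expected_count.get("max")
    if lower is not None and actual < lower:
        return False
    if upper is not None and actual > upper:
        return False
    return True                                 # range form satisfied
```

Note how an omitted `expected_count` accepts 5 matches just as readily as 1, which is why an explicit value gives more deterministic scoring.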
Ignore set for each assertion is union of:
- suite/test spec `ignore_fields.global`
- suite/test spec `ignore_fields[entity]`
- assertion-level `ignore` (or `ignore_fields` alias in runtime)
Ignored fields are excluded from changed-field computation in changed assertions.
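Put as code, the effective ignore set for one assertion is a plain union of those three sources. This is a sketch with a hypothetical helper name; it assumes the assertion-level values are lists of field names.

```python
def effective_ignore_set(spec: dict, assertion: dict) -> set:
    """Union of global, per-entity, and assertion-level ignored fields."""
    ignore_fields = spec.get("ignore_fields", {})
    fields = set(ignore_fields.get("global", []))
    fields |= set(ignore_fields.get(assertion["entity"], []))
    fields |= set(assertion.get("ignore", []))
    fields |= set(assertion.get("ignore_fields", []))  # runtime alias
    return fields


example_spec = {
    "ignore_fields": {"global": ["created_at", "updated_at"], "files": ["etag"]},
    "assertions": [],
}
example_assertion = {"diff_type": "changed", "entity": "files", "ignore": ["sha1"]}
ignored = effective_ignore_set(example_spec, example_assertion)
```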
Output score:
{
"passed": <bool>,
"score": { "passed": X, "total": Y, "percent": ... },
"failures": [ ... ]
}
Scoring is assertion-level:
- One assertion index failing multiple rows still counts as one failed assertion in score.
- Failure list may include multiple messages for that assertion.
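A small sketch of assertion-level scoring follows. The function name is hypothetical, and both the `percent` scale (0 to 100) and the result for an empty assertion list are assumptions, since the output shape above leaves them unspecified.

```python
def summarize(per_assertion_failures: list) -> dict:
    """Build the score payload from one list of failure messages per assertion."""
    total = len(per_assertion_failures)
    # An assertion counts as failed once, however many messages it produced.
    passed = sum(1 for messages in per_assertion_failures if not messages)
    failures = [msg for messages in per_assertion_failures for msg in messages]
    percent = (passed / total) * 100.0 if total else 0.0  # assumed 0-100 scale
    return {
        "passed": passed == total,
        "score": {"passed": passed, "total": total, "percent": percent},
        "failures": failures,
    }


result = summarize([[], ["row 1: 'to' mismatch", "row 2: 'to' mismatch"]])
```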
`aggregates` is defined in schema but not used by `AssertionEngine.evaluate`.
Do not rely on aggregates unless runtime evaluator is extended first.
- Choose stable template/user:
  - `seed_template` must exist and contain all prerequisite entities.
  - `impersonate_user_id` must map to realistic permissions.
- Write prompt with clear end-state:
  - Describe target objects and final conditions, not implementation details.
- Add minimal-but-sufficient assertions:
  - Prefer 1-4 precise assertions over many weak ones.
- For `changed` assertions:
  - Include `expected_changes`.
  - Decide `strict` intentionally (default strict is often good, but brittle if noisy fields change).
- Set `expected_count` explicitly when cardinality matters.
- Add/maintain `ignore_fields` for nondeterministic fields (timestamps, etags, etc.).
- Avoid unsupported DSL constructs (`unchanged`, `and`/`or`, `aggregates` runtime assumptions).
Added row:
{
"diff_type": "added",
"entity": "messages",
"where": {
"channel_id": { "eq": "C01ABCD1234" },
"message_text": { "contains": "hello" }
},
"expected_count": 1
}
Changed row with `from`/`to`:
{
"diff_type": "changed",
"entity": "issues",
"where": { "id": { "eq": "ISSUE-1" } },
"expected_changes": {
"assigneeId": {
"from": { "eq": "old-user" },
"to": { "eq": "new-user" }
}
},
"expected_count": 1
}
Changed with minimum affected rows:
{
"diff_type": "changed",
"entity": "calendar_events",
"where": { "calendar_id": { "eq": "cal_ops" } },
"expected_changes": {
"status": { "to": { "eq": "cancelled" } }
},
"expected_count": { "min": 1 }
}
Nested JSON field check:
{
"diff_type": "added",
"entity": "calendar_events",
"where": {
"start.timeZone": { "eq": "America/Los_Angeles" }
},
"expected_count": 1
}
- Missing `expected_count` unintentionally allowing "any >=1" semantics.
- `changed` assertion failing because strict mode catches extra changed fields.
- Using stale DSL forms (`unchanged`, combinators) that fail schema validation.
- Over-asserting volatile fields (`updated_at`, `etag`, generated ids).
- Writing prompts that do not force a unique, verifiable state change.
- Starting/evaluating runs without `test_id` and without `expectedOutput` (this can hit the empty-assertion fallback and fail schema validation).
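Several of these pitfalls can be caught before seeding with a quick lint pass over each assertion. The helper below is a hypothetical sketch, not part of the repository, and its warning texts are invented; it complements schema validation rather than replacing it.

```python
SUPPORTED_DIFF_TYPES = {"added", "removed", "changed"}
VOLATILE_FIELDS = {"updated_at", "created_at", "etag"}  # illustrative list only


def lint_assertion(assertion: dict) -> list:
    """Return human-readable warnings for common authoring mistakes."""
    warnings = []
    diff_type = assertion.get("diff_type")
    if diff_type not in SUPPORTED_DIFF_TYPES:
        warnings.append(f"unsupported diff_type: {diff_type!r}")
    if "expected_count" not in assertion:
        warnings.append("expected_count omitted: any count >= 1 will pass")
    if diff_type == "changed" and "expected_changes" not in assertion:
        warnings.append("changed assertion without expected_changes")
    where = assertion.get("where", {})
    for combinator in ("and", "or"):
        if combinator in where:
            warnings.append(f"logical combinator {combinator!r} is not supported")
    for field in sorted(VOLATILE_FIELDS & set(where)):
        warnings.append(f"where filters on volatile field {field!r}")
    if "aggregates" in assertion:
        warnings.append("aggregates is not evaluated at runtime")
    return warnings


stale = {"diff_type": "unchanged", "entity": "messages", "where": {"or": []}}
stale_warnings = lint_assertion(stale)
```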
- Edit canonical file in `examples/<service>/testsuites/*.json`.
- Mirror to `backend/seeds/testsuites/*.json` for Docker seed path.
- Reseed:

      cd backend
      python utils/seed_tests.py

- Run integration/perf validations relevant to your service.
- Slack:
  - Channel/message ids are stable in seed; good for deterministic `where`.
  - Consider thread/reaction side effects when using strict `changed`.
- Box:
  - Ignore volatile metadata (`etag`, timestamps, hash/version fields) at suite level.
  - Use entity-specific assertions for files/folders/comments/tasks as needed.
- Calendar:
  - JSON-like fields (`start`, `end`) are best checked with dot-path predicates.
  - Cancellation/clear flows often affect multiple rows; use range counts.
- Linear:
  - Many workflows are relational; combine one primary state-change assertion with one integrity assertion.
  - `from` + `to` checks are especially useful for assignee/state transitions.
- Prompt describes a unique end state.
- Assertions map directly to diffable DB effects.
- DSL validates under current schema.
- No unsupported DSL constructs used.
- `expected_count` and `strict` choices are intentional.
- Volatile fields ignored appropriately.
- Seed template and impersonated user are correct for the scenario.