Rework old tests and add validator step #9

khromov · 2025-12-22T20:11:30Z

Since we decided to do a single run per model rather than pass@k, I brought back the old v1 tests for now. The goal is to merge some of them into more complex tests in the future and to add new ones. All the old tests have been rewritten with the "test first" approach. This also gives a percentage score for each model.
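
As a rough illustration of how a per-model percentage score could be derived from a single run, here is a minimal sketch. The `SuiteResult` shape and the `scorePercent` name are hypothetical, and the formula (passed unit tests divided by total unit tests) is an assumption, not necessarily what this PR implements.

```typescript
// Hypothetical per-suite result; the real result JSON in this PR may look different.
interface SuiteResult {
  name: string;
  passed: number;
  failed: number;
}

// Assumed formula: share of passed unit tests across all suites, as a whole percentage.
function scorePercent(results: SuiteResult[]): number {
  const passed = results.reduce((sum, r) => sum + r.passed, 0);
  const total = results.reduce((sum, r) => sum + r.passed + r.failed, 0);
  // Avoid dividing by zero when no tests ran.
  return total === 0 ? 0 : Math.round((passed / total) * 100);
}
```

With one run per model, a score like this is a single number per model rather than an average over k samples.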

I've also added validator.ts support for tests, which lets us check that the AI actually uses the intended functions. For example, a validator can require that $derived.by is actually used, rather than accepting a component that passes the tests with $effect or something else.
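
To make the idea concrete, a validator for the derived-by test might look roughly like the sketch below. The `ValidationResult` shape, the `validate` signature, and the regex-based check are all assumptions for illustration; the actual validator.ts contract in this PR may differ.

```typescript
// Hypothetical result shape; not confirmed by the PR.
interface ValidationResult {
  valid: boolean;
  errors: string[];
}

// Strip comments so a rune that only appears in a comment does not count as usage.
function stripComments(code: string): string {
  return code
    .replace(/<!--[\s\S]*?-->/g, "") // HTML comments
    .replace(/\/\*[\s\S]*?\*\//g, "") // block comments
    .replace(/(^|[^:])\/\/.*$/gm, "$1"); // line comments (leaves "://" in URLs alone)
}

// Assumed entry point: receives the generated component source as a string.
function validate(code: string): ValidationResult {
  const errors: string[] = [];
  const source = stripComments(code);
  if (!/\$derived\.by\s*\(/.test(source)) {
    errors.push("Component must use $derived.by");
  }
  return { valid: errors.length === 0, errors };
}
```

A plain text check like this is cheap but approximate. Parsing the component would be more robust, at the cost of more code per validator.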

Main changes

Settings persistence: Save/restore preferences in .ai-settings.json
Code validation: Pre-test validation with custom validators
Retry logic: 10 attempts with exponential backoff via p-retry
Improved reporting: Score calculation, unit test totals, validation results
Enhanced HTML reports: Score badges, validation section, unit test counts
CI updates: Split self-tests and benchmark verification
New tests: Added 7 test suites (derived, derived-by, each, effect, inspect, props, snippets)
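
For readers unfamiliar with the retry behaviour listed above, the following hand-rolled sketch shows what 10 attempts with exponential backoff amounts to. It deliberately does not use p-retry, so it is not the PR's code; the function names, the 1000 ms base delay, and the factor of 2 are assumptions.

```typescript
// Delay schedule between attempts: minTimeoutMs * factor^i for i = 0..count-1.
function backoffDelays(count: number, minTimeoutMs = 1000, factor = 2): number[] {
  return Array.from({ length: count }, (_, i) => minTimeoutMs * Math.pow(factor, i));
}

// Run `task` up to `maxAttempts` times, waiting longer after each failure.
// `sleep` is injectable so the waiting can be skipped in tests.
async function retryWithBackoff<T>(
  task: (attempt: number) => Promise<T>,
  maxAttempts = 10,
  minTimeoutMs = 1000,
  factor = 2,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await task(attempt);
    } catch (error) {
      lastError = error;
      // No point in waiting after the final attempt has failed.
      if (attempt < maxAttempts) {
        await sleep(minTimeoutMs * Math.pow(factor, attempt - 1));
      }
    }
  }
  throw lastError;
}
```

With these assumed defaults the waits between attempts double from 1 second up to 256 seconds, so a persistently failing call gives up after roughly eight and a half minutes of waiting.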

khromov added 17 commits December 22, 2025 20:30

add 1.0 tests, rewritten

ba3709f

wip

a954c91

wip

cb6abd8

Update Reference.svelte

3f2830a

add validators

b21cd6e

Update index.ts

7d82289

wip

839c050

fix

b899e3e

Create result-2025-12-22-20-36-07-anthropic-claude-haiku-4.5.json

b9e41c7

save old settings

e1fcbe7

wip

c0bf019

Update test.yml

ec32cb4

wip

f106225

Update report-template.ts

a8cc474

Update report-template.ts

d2c215e

Update index.ts

8af63f7

remove unnecessary comments

2d88284
