Add --keep-sessions flag and rejudge command to skill-validator by lewing · Pull Request #149 · dotnet/skills

lewing · 2026-02-27T22:28:20Z

Summary

Adds two opt-in features to the skill-validator eval framework:

--keep-sessions flag — Preserves agent session data (SDK events, config dirs) in a SQLite database (sessions.db) under the timestamped results directory instead of deleting them after evaluation.
rejudge command — Re-runs judges on previously saved sessions without re-running the expensive agent LLM calls. Useful for iterating on judge prompts or trying different judge models.

What's in the sessions database

Each eval run pair (baseline + with-skill) is tracked with:

Session ID, skill name/path, scenario name, run index, role, model
Config and work directory paths
Prompt — the eval scenario prompt text (for browsing/replay)
Skill SHA — a truncated SHA-256 hash of the skill directory contents for change detection
Run metrics, judge results, and pairwise comparison results
Status tracking (running → completed/timed_out)

Design decisions

Fully opt-in — without --keep-sessions, behavior is unchanged. sessionDb stays null and all sessionDb?. calls are no-ops.
Thread-safe SQLite — WAL mode + SemaphoreSlim write lock for concurrent scenario execution. Each eval process gets its own timestamped results dir → separate sessions.db → no cross-process conflicts.
Work dirs always cleaned — only config dirs (containing events.jsonl) are preserved; temp work dirs are always deleted.
Skill SHA over full content — stores a 12-char hex hash of the skill directory rather than the full SKILL.md text, giving change detection without bloating the DB.

Files changed

File	Change
`SessionDatabase.cs`	New — thread-safe SQLite service with schema, CRUD, and `ComputeDirectorySha`
`RejudgeCommand.cs`	New — subcommand to re-run judges on saved sessions
`ValidateCommand.cs`	Wire `--keep-sessions` flag, session registration, prompt + skill SHA
`AgentRunner.cs`	Separate config/work dir lifecycle, extended `RunOptions`
`Models.cs`	Added `KeepSessions` to `ValidatorConfig`
`Program.cs`	Register `RejudgeCommand`
`SkillValidator.csproj`	Added `Microsoft.Data.Sqlite` dependency
`SessionDatabaseTests.cs`	New — 11 tests covering CRUD, concurrency, isolation, SHA determinism

Testing

All 192 tests pass (pre-existing + 11 new session DB tests).

Supersedes #135 (reopened from non-forked branch per review feedback).

- Add --keep-sessions flag to preserve agent session data in SQLite DB - Create SessionDatabase service with WAL mode and thread-safe writes - Update AgentRunner to separate config/work dir lifecycle - Track session metadata, RunMetrics, judge and pairwise results - Add rejudge command to re-run judges on saved sessions - Add 8 tests for SessionDatabase (CRUD, concurrency, isolation) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Store the eval scenario prompt and SKILL.md content in the sessions table so they are available for browsing and replay without needing the original skill directory. Both columns are optional (nullable) for backward compatibility. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Compute a SHA-256 hash (truncated to 12 hex chars) over all files in the skill directory, sorted by relative path for determinism. This gives lightweight change detection without storing potentially large skill content in the database. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

github-actions · 2026-02-27T22:50:46Z

Skill Validation Results

Skill	Scenario	Baseline	With Skill	Δ	Skills Loaded	Overfit	Verdict
microbenchmarking	Investigate runtime upgrade performance impact	2.5/5 ⏰ timeout	4.0/5	+1.5	✅ microbenchmarking; tools: skill, stop_bash, glob, read_bash	✅ 0.11	✅
csharp-scripts	Test a C# language feature with a script	2.5/5	4.5/5	+2.0	✅ csharp-scripts; tools: skill, create, edit	🟡 0.32	✅
thread-abort-migration	Worker thread with abort-based cancellation	5.0/5	5.0/5	0.0	✅ thread-abort-migration; tools: skill	✅ 0.13	❌
thread-abort-migration	Timeout enforcement via Thread.Abort	4.0/5	5.0/5	+1.0	✅ thread-abort-migration; tools: report_intent, skill	✅ 0.13	✅
thread-abort-migration	Blocking WaitHandle with Thread.Interrupt	3.5/5	4.0/5	+0.5	✅ thread-abort-migration; tools: skill	✅ 0.13	✅
thread-abort-migration	ASP.NET Response.End and Response.Redirect with Thread.Abort	4.0/5	4.5/5	+0.5	✅ thread-abort-migration; tools: skill, report_intent, bash	✅ 0.13	❌
thread-abort-migration	Thread.Join and Thread.Sleep only — should not migrate	4.0/5	5.0/5	+1.0	✅ thread-abort-migration; tools: skill	✅ 0.13	✅
analyzing-dotnet-performance	Detects compiled regex startup budget and regex chain allocations	1.0/5	4.0/5 ⏰ timeout	+3.0	✅ analyzing-dotnet-performance; tools: skill	✅ 0.14	✅
analyzing-dotnet-performance	Detects CurrentCulture comparer and compiled regex budget in inflection rules	1.0/5	5.0/5 ⏰ timeout	+4.0	✅ analyzing-dotnet-performance; tools: skill	✅ 0.14	✅
analyzing-dotnet-performance	Finds per-call Dictionary allocation not hoisted to static	1.0/5	5.0/5	+4.0	✅ analyzing-dotnet-performance; tools: skill	✅ 0.14	✅
analyzing-dotnet-performance	Catches compound allocations in recursive number converter with ToLower	1.0/5	4.5/5	+3.5	✅ analyzing-dotnet-performance; tools: skill	✅ 0.14	✅
analyzing-dotnet-performance	Finds StringComparison.Ordinal missing and FrozenDictionary opportunities	1.0/5	5.0/5	+4.0	✅ analyzing-dotnet-performance; tools: skill	✅ 0.14	✅
analyzing-dotnet-performance	Detects Aggregate+Replace chain and struct missing IEquatable	1.0/5	5.0/5	+4.0	✅ analyzing-dotnet-performance; tools: skill	✅ 0.14	✅
analyzing-dotnet-performance	Finds branched Replace chain in format string manipulation	1.0/5	3.5/5	+2.5	✅ analyzing-dotnet-performance; tools: skill	✅ 0.14	✅
analyzing-dotnet-performance	Catches LINQ on hot-path string processing and All(char.IsUpper)	1.0/5	5.0/5	+4.0	✅ analyzing-dotnet-performance; tools: skill	✅ 0.14	✅
analyzing-dotnet-performance	Detects LINQ pipeline in TimeSpan formatting and collection processing	1.5/5	5.0/5	+3.5	✅ analyzing-dotnet-performance; tools: skill	✅ 0.14	✅
analyzing-dotnet-performance	Flags Span inconsistencies and compound method chains in truncation library	1.0/5	5.0/5	+4.0	✅ analyzing-dotnet-performance; tools: skill	✅ 0.14	✅
analyzing-dotnet-performance	Identifies unsealed leaf classes and locale hierarchy patterns	1.0/5	4.5/5 ⏰ timeout	+3.5	✅ analyzing-dotnet-performance; tools: skill, write_bash, stop_bash	✅ 0.14	✅
optimizing-ef-core-queries	Optimize bulk operations with EF Core 7+ ExecuteUpdate and ExecuteDelete	4.5/5	5.0/5	+0.5	✅ optimizing-ef-core-queries; tools: skill	🟡 0.35	✅
dotnet-pinvoke	Generate LibraryImport declaration from C header (.NET 8+)	3.5/5	5.0/5	+1.5	✅ dotnet-pinvoke; tools: skill	✅ 0.11	✅
dotnet-pinvoke	Generate LibraryImport declaration from C header (.NET Framework)	4.5/5	5.0/5	+0.5	✅ dotnet-pinvoke; tools: skill, view	✅ 0.11	✅
dump-collect	Configure automatic crash dumps for CoreCLR app on Linux	5.0/5	5.0/5	0.0	✅ dump-collect; tools: skill, report_intent, view, glob, bash	✅ 0.09	✅
dump-collect	Set up NativeAOT crash dumps with createdump in Kubernetes	2.5/5	5.0/5	+2.5	✅ dump-collect; tools: skill	✅ 0.09	✅
dump-collect	Recover crash dump from macOS NativeAOT without createdump	3.5/5	4.0/5	+0.5	✅ dump-collect; tools: skill, report_intent, view, glob, bash	✅ 0.09	✅
dump-collect	Configure CoreCLR dump collection in Alpine Docker as non-root	3.5/5	4.0/5	+0.5	✅ dump-collect; tools: skill, bash	✅ 0.09	✅
dump-collect	Advisory: macOS NativeAOT crash dump recovery steps	3.5/5	5.0/5	+1.5	✅ dump-collect; tools: skill	✅ 0.09	✅
dump-collect	Advisory: CoreCLR Alpine Docker non-root configuration	3.0/5	5.0/5	+2.0	✅ dump-collect; tools: skill	✅ 0.09	✅
dump-collect	Advisory: NativeAOT Kubernetes dump collection setup	2.5/5	3.0/5	+0.5	✅ dump-collect; tools: skill, bash	✅ 0.09	❌
dump-collect	Detect runtime and configure crash dumps for unknown .NET app on Linux	4.0/5	4.5/5	+0.5	✅ dump-collect; tools: skill, glob	✅ 0.09	✅
dump-collect	Decline dump analysis request	2.5/5	4.5/5	+2.0	ℹ️ not activated (expected)	✅ 0.09	✅
build-parallelism	Analyze build parallelism bottlenecks	1.0/5 ⏰ timeout	3.0/5 ⏰ timeout	+2.0	✅ build-parallelism; binlog-generation; binlog-failure-analysis; tools: skill, task, glob	✅ 0.16	✅
including-generated-files	Diagnose generated file inclusion failure	3.0/5	5.0/5	+2.0	✅ including-generated-files; tools: skill	✅ 0.17	✅
msbuild-antipatterns	Review MSBuild files for anti-patterns and style issues	5.0/5	5.0/5	0.0	✅ msbuild-antipatterns; tools: skill, glob, edit	✅ 0.08	✅
build-perf-baseline	Establish build performance baseline and recommend optimizations	4.0/5	5.0/5	+1.0	✅ build-perf-baseline; tools: skill	🟡 0.36	✅
msbuild-modernization	Modernize legacy project to SDK-style	5.0/5	5.0/5	0.0	✅ msbuild-modernization; tools: skill	✅ 0.06	✅
directory-build-organization	Organize build infrastructure for a multi-project repo	3.5/5	5.0/5	+1.5	✅ directory-build-organization; msbuild-antipatterns; tools: skill	✅ 0.20	✅
check-bin-obj-clash	Diagnose bin/obj output path clashes	3.5/5	5.0/5	+1.5	✅ check-bin-obj-clash; tools: skill, glob, edit	✅ 0.14	✅
incremental-build	Analyze incremental build issues	3.0/5	4.0/5	+1.0	✅ incremental-build; tools: skill	✅ 0.14	✅
eval-performance	Analyze MSBuild evaluation performance issues	4.0/5	4.5/5	+0.5	✅ eval-performance; tools: skill	🟡 0.21	✅
build-perf-diagnostics	Analyze analyzer performance impact on builds	5.0/5	5.0/5	0.0	✅ build-perf-diagnostics; binlog-generation; binlog-failure-analysis; tools: skill	🟡 0.23	❌
binlog-generation	Build project with /bl flag	1.0/5	5.0/5	+4.0	✅ binlog-generation; tools: skill	✅ 0.00	✅
binlog-generation	Build with /bl in PowerShell	2.5/5	5.0/5	+2.5	✅ binlog-generation; tools: skill	✅ 0.00	✅
binlog-generation	Build multiple configurations with unique binlogs	2.0/5	5.0/5	+3.0	✅ binlog-generation; tools: skill	✅ 0.00	✅
binlog-failure-analysis	Diagnose build failures from binlog only (no source files)	4.0/5	4.0/5	0.0	✅ binlog-failure-analysis; tools: skill, load_binlog, edit, view	✅ 0.11	✅

⏰ timeout — run hit the scenario timeout limit; scoring may be impacted by aborting model execution before it could produce its full output

Model: claude-opus-4.6 | Judge: claude-opus-4.6

Full results

steveisok · 2026-02-28T18:52:36Z

@ViktorHofer @JanKrivanek I think this will be immensly helpful for validation. We should prioritize getting this in.

I'm trying to help tweak #122 and validation cycles are loooong. I inquired about how this PR could help and this is what it said:

Yes, this would help enormously. Here's why:

The rejudge command directly addresses our biggest pain point. Right now, each eval run costs ~35 min (parallel) to ~105 min (sequential), and the vast majority of that time is the agent LLM calls — not the judging. When we tweak rubric wording (like adding
"suppressed warnings don't count as resolved"), we have to re-run the entire pipeline to see the effect. With rejudge, we could:

 - Re-score existing sessions with updated rubric text in seconds
 - Try different judge models (e.g., judge with Sonnet instead of Opus) without re-running agents
 - Iterate on rubric fairness without burning hours of Opus compute

Concrete examples from our experience where this would have saved hours:

 1. When we updated the rubric to penalize pragma suppressions — that was a rubric change, not a skill change. We had to run the full pipeline again to see the impact.
 2. The +8.2% run where baseline "won" on final build by cheating with pragmas — with rejudge we could have immediately tested whether clarified rubric language fixed the scoring without another 35-minute run.
 3. Comparing our two 3-run results (+46.9% vs +25.6%) — if we could rejudge both with identical rubric text, we'd know how much of the variance is judge calibration vs actual agent behavior.

The --keep-sessions flag is also valuable for our debugging workflow. We spent significant time figuring out why the agent timed out (analysis paralysis, sed breakage, ResponseError wall). Having events.jsonl preserved would make that forensics much faster.

In short: it turns our iteration loop from "change → wait 35+ min → analyze" into "change rubric → rejudge in seconds → analyze." That's a 100× speedup on the judging iteration cycle.

/cc @agocke @jeffschwMSFT

lewing · 2026-02-28T19:01:49Z

I've also had several runs die thanks to copilot exiting while they were 20 minutes in with multiple runs completed. My other reason for working on this it to make it easy to resume one of the eval sessions and ask the agent for feedback on why it made certain choices.

lewing

Some review from a different agent:

Nice work — the schema design is clean and the thread-safety approach is solid. A few suggestions from the dotnet-replay integration perspective:

1. Store config_dir as relative to the DB directory

Currently config_dir is stored as an absolute path (e.g. /home/user/.../results-2026-02-28T.../sessions/<guid>). Since the results directory is self-contained (sessions.db + sessions/<guid>/events.jsonl are siblings), storing it relative to the DB directory (e.g. sessions/<guid>) would make the results dir portable — you can zip it, move it to another machine, and tools like dotnet-replay --db sessions.db just work without broken paths.

The validator only reads config_dir back in RejudgeCommand, which already knows resultsDir, so resolving via Path.Combine(resultsDir, relativeConfigDir) is straightforward. The change would be in RegisterSession — compute the relative path before storing:

var relativeConfigDir = Path.GetRelativePath(Path.GetDirectoryName(dbPath)!, configDir);

2. Add a schema version marker

The Copilot CLI session-store.db has a schema_version table that makes DB type detection trivial for consumers. Without one here, tools must probe column names via PRAGMA table_info(sessions). A small metadata table also future-proofs schema migrations:

CREATE TABLE IF NOT EXISTS schema_info (key TEXT PRIMARY KEY, value TEXT NOT NULL);
INSERT OR IGNORE INTO schema_info (key, value) VALUES ('type', 'skill-validator');
INSERT OR IGNORE INTO schema_info (key, value) VALUES ('version', '1');

3. Consider a display_name column (minor / follow-up)

When browsing sessions in a TUI, Copilot CLI sessions have a summary field for display. For skill-validator sessions, a consumer has to synthesize something like "scenario_name (role) — skill_name". Pre-computing a display_name in the DB makes it self-describing. This could be a follow-up though.

- Use session ID as config dir folder name (sessions/<id>) so the DB record links directly to its files on disk - Store config_dir as relative path for portable results directories - Add schema_info table with type='skill-validator' and version='1' for easy DB detection by external tools like dotnet-replay - Add GetSchemaInfo() method for consumers - Extract GetSessions(whereClause) for reuse - Add tests for schema_info and relative config_dir Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Detect and load skill-validator databases (from dotnet/skills#149) alongside existing Copilot CLI session-store.db files. The --db flag now accepts either database type, auto-detected via schema_info vs schema_version tables. Changes: - Add SessionDbType enum and BrowserSession record to replace tuple - Add DetectDbType() for schema-based DB type detection - Add LoadSkillValidatorSessions() with config_dir relative path resolution - Show eval metadata (metrics, judge, pairwise) in preview panel - Adapt browser icons and display for skill-validator sessions - Handle sessions without events.jsonl gracefully Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

When --keep-sessions is active, pass the already-created timestamped directory to Reporter.ReportResults so sessions.db, session files, and report outputs (results.json, summary.md) all live in the same folder instead of creating two separate timestamped directories. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix test failures from .NET 10 terminal logger stdout noise - Guard SafeGetString against undefined JsonElement (ValueKind check) - Add -v q to dotnet run in test helpers to suppress build output leaking into stdout, which corrupted JSON parsing in tests Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Add skill-validator sessions.db support Detect and load skill-validator databases (from dotnet/skills#149) alongside existing Copilot CLI session-store.db files. The --db flag now accepts either database type, auto-detected via schema_info vs schema_version tables. Changes: - Add SessionDbType enum and BrowserSession record to replace tuple - Add DetectDbType() for schema-based DB type detection - Add LoadSkillValidatorSessions() with config_dir relative path resolution - Show eval metadata (metrics, judge, pairwise) in preview panel - Adapt browser icons and display for skill-validator sessions - Handle sessions without events.jsonl gracefully Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Fix cross-platform events.jsonl path resolution Normalize Windows backslashes in config_dir for Linux/WSL compatibility. Search nested session-state/<guid>/events.jsonl when direct path not found. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Detect copilot CLI install method for resume Try standalone 'copilot' first, fall back to 'gh copilot' for users who installed via the GitHub CLI extension. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Removed GitHub.Copilot.SDK package reference.

lewing and others added 3 commits February 27, 2026 11:19

lewing commented Mar 1, 2026

View reviewed changes

lewing mentioned this pull request Mar 1, 2026

Add skill-validator sessions.db support lewing/dotnet-replay#17

Merged

steveisok mentioned this pull request Mar 1, 2026

Add skill for AOT-compatibility #122

Merged

lewing added 2 commits March 4, 2026 14:00

Merge branch 'main' into incremental-results

5b5541b

Remove GitHub.Copilot.SDK from project dependencies

8768775

Removed GitHub.Copilot.SDK package reference.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add --keep-sessions flag and rejudge command to skill-validator#149

Add --keep-sessions flag and rejudge command to skill-validator#149
lewing wants to merge 7 commits intomainfrom
incremental-results

lewing commented Feb 27, 2026

Uh oh!

github-actions bot commented Feb 27, 2026 •

edited

Loading

Uh oh!

steveisok commented Feb 28, 2026

Uh oh!

lewing commented Feb 28, 2026

Uh oh!

lewing left a comment •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

lewing commented Feb 27, 2026

Summary

What's in the sessions database

Design decisions

Files changed

Testing

Uh oh!

github-actions bot commented Feb 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Skill Validation Results

Uh oh!

steveisok commented Feb 28, 2026

Uh oh!

lewing commented Feb 28, 2026

Uh oh!

lewing left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

github-actions bot commented Feb 27, 2026 •

edited

Loading

lewing left a comment •

edited

Loading