Add --keep-sessions flag and rejudge command to skill-validator#149
- Add --keep-sessions flag to preserve agent session data in SQLite DB
- Create SessionDatabase service with WAL mode and thread-safe writes
- Update AgentRunner to separate config/work dir lifecycle
- Track session metadata, RunMetrics, judge and pairwise results
- Add rejudge command to re-run judges on saved sessions
- Add 8 tests for SessionDatabase (CRUD, concurrency, isolation)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
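As a rough illustration of "WAL mode and thread-safe writes", the sketch below opens a SQLite file in WAL mode and serializes writes behind a lock. It is written in Python only so that it is self-contained; the PR itself is C#. The class name `SessionStore`, the column set, and the method names are hypothetical and are not taken from the PR's SessionDatabase.

```python
import sqlite3
import threading


class SessionStore:
    """Hypothetical stand-in for a session database with serialized writes."""

    def __init__(self, db_path: str) -> None:
        # check_same_thread=False lets worker threads share one connection;
        # the lock below ensures only one thread uses it at a time.
        self._conn = sqlite3.connect(db_path, check_same_thread=False)
        self._lock = threading.Lock()
        # The PRAGMA returns the journal mode actually in effect
        # ("wal" for a file-backed database).
        self.journal_mode = self._conn.execute(
            "PRAGMA journal_mode=WAL"
        ).fetchone()[0]
        # Illustrative columns only; the real schema is not shown in the PR text.
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS sessions ("
            "id TEXT PRIMARY KEY, scenario TEXT NOT NULL, role TEXT NOT NULL)"
        )
        self._conn.commit()

    def register_session(self, session_id: str, scenario: str, role: str) -> None:
        # Every write takes the lock, so concurrent scenarios cannot interleave.
        with self._lock:
            self._conn.execute(
                "INSERT INTO sessions (id, scenario, role) VALUES (?, ?, ?)",
                (session_id, scenario, role),
            )
            self._conn.commit()

    def count(self) -> int:
        with self._lock:
            return self._conn.execute("SELECT COUNT(*) FROM sessions").fetchone()[0]

    def close(self) -> None:
        with self._lock:
            self._conn.close()
```

The lock plays the role that a write semaphore would play in the C# service: WAL lets readers proceed during a write, while the lock keeps writers from colliding on the shared connection.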
Store the eval scenario prompt and SKILL.md content in the sessions table so they are available for browsing and replay without needing the original skill directory. Both columns are optional (nullable) for backward compatibility. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Compute a SHA-256 hash (truncated to 12 hex chars) over all files in the skill directory, sorted by relative path for determinism. This gives lightweight change detection without storing potentially large skill content in the database. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
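A minimal sketch of the hashing scheme described above, in Python for self-containment (the PR's ComputeDirectorySha is C#). Whether the real implementation mixes the relative path into the hash, and how it separates entries, is not stated in the commit message; both are assumptions here, as is the function name `compute_directory_sha`.

```python
import hashlib
from pathlib import Path


def compute_directory_sha(root: str, length: int = 12) -> str:
    """Hash every file under root in a stable order; return a short hex prefix."""
    root_path = Path(root)
    digest = hashlib.sha256()
    # Sort by POSIX-style relative path so the order is the same on every OS.
    files = sorted(
        (p for p in root_path.rglob("*") if p.is_file()),
        key=lambda p: p.relative_to(root_path).as_posix(),
    )
    for path in files:
        rel = path.relative_to(root_path).as_posix()
        # Assumption: include the path so that renames change the hash,
        # with NUL separators between path and content.
        digest.update(rel.encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    # Truncate to the first `length` hex characters (12 in the commit message).
    return digest.hexdigest()[:length]
```

Truncating to 12 hex characters keeps 48 bits of the digest, which is enough for change detection but should not be treated as collision-resistant.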
Skill Validation Results
Model: claude-opus-4.6 | Judge: claude-opus-4.6
@ViktorHofer @JanKrivanek I think this will be immensely helpful for validation. We should prioritize getting this in. I'm trying to help tweak #122 and validation cycles are loooong. I inquired about how this PR could help and this is what it said:
/cc @agocke @jeffschwMSFT
I've also had several runs die thanks to Copilot exiting while they were 20 minutes in with multiple runs completed. My other reason for working on this is to make it easy to resume one of the eval sessions and ask the agent for feedback on why it made certain choices.
Some review from a different agent:
Nice work — the schema design is clean and the thread-safety approach is solid. A few suggestions from the dotnet-replay integration perspective:
1. Store config_dir as relative to the DB directory
Currently config_dir is stored as an absolute path (e.g. /home/user/.../results-2026-02-28T.../sessions/<guid>). Since the results directory is self-contained (sessions.db + sessions/<guid>/events.jsonl are siblings), storing it relative to the DB directory (e.g. sessions/<guid>) would make the results dir portable — you can zip it, move it to another machine, and tools like dotnet-replay --db sessions.db just work without broken paths.
The validator only reads config_dir back in RejudgeCommand, which already knows resultsDir, so resolving via Path.Combine(resultsDir, relativeConfigDir) is straightforward. The change would be in RegisterSession — compute the relative path before storing:
var relativeConfigDir = Path.GetRelativePath(Path.GetDirectoryName(dbPath)!, configDir);
2. Add a schema version marker
The Copilot CLI session-store.db has a schema_version table that makes DB type detection trivial for consumers. Without one here, tools must probe column names via PRAGMA table_info(sessions). A small metadata table also future-proofs schema migrations:
CREATE TABLE IF NOT EXISTS schema_info (key TEXT PRIMARY KEY, value TEXT NOT NULL);
INSERT OR IGNORE INTO schema_info (key, value) VALUES ('type', 'skill-validator');
INSERT OR IGNORE INTO schema_info (key, value) VALUES ('version', '1');
3. Consider a display_name column (minor / follow-up)
When browsing sessions in a TUI, Copilot CLI sessions have a summary field for display. For skill-validator sessions, a consumer has to synthesize something like "scenario_name (role) — skill_name". Pre-computing a display_name in the DB makes it self-describing. This could be a follow-up though.
- Use session ID as config dir folder name (sessions/<id>) so the DB record links directly to its files on disk
- Store config_dir as relative path for portable results directories
- Add schema_info table with type='skill-validator' and version='1' for easy DB detection by external tools like dotnet-replay
- Add GetSchemaInfo() method for consumers
- Extract GetSessions(whereClause) for reuse
- Add tests for schema_info and relative config_dir

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
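The relative config_dir and schema_info changes in this commit can be sketched as follows. This is a Python approximation for illustration only; the C# code in the PR is not reproduced here, and the function names below are hypothetical. The SQL statements mirror the ones suggested in the review.

```python
import os
import sqlite3


def to_relative_config_dir(db_path: str, config_dir: str) -> str:
    """Express config_dir relative to the directory that holds the database."""
    db_dir = os.path.dirname(os.path.abspath(db_path))
    return os.path.relpath(os.path.abspath(config_dir), db_dir)


def resolve_config_dir(results_dir: str, relative_config_dir: str) -> str:
    """Inverse step: rebuild a usable path from the results directory."""
    return os.path.join(results_dir, relative_config_dir)


def init_schema_info(conn: sqlite3.Connection) -> None:
    """Create the marker table; INSERT OR IGNORE keeps this idempotent."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_info "
        "(key TEXT PRIMARY KEY, value TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT OR IGNORE INTO schema_info (key, value) VALUES (?, ?)",
        [("type", "skill-validator"), ("version", "1")],
    )
    conn.commit()


def get_schema_info(conn: sqlite3.Connection) -> dict:
    """Return the marker rows as a plain dictionary."""
    return dict(conn.execute("SELECT key, value FROM schema_info"))
```

Because only the relative path is stored, moving or zipping the whole results directory does not invalidate the rows; a reader supplies its own `results_dir` when resolving.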
Detect and load skill-validator databases (from dotnet/skills#149) alongside existing Copilot CLI session-store.db files. The --db flag now accepts either database type, auto-detected via schema_info vs schema_version tables.

Changes:
- Add SessionDbType enum and BrowserSession record to replace tuple
- Add DetectDbType() for schema-based DB type detection
- Add LoadSkillValidatorSessions() with config_dir relative path resolution
- Show eval metadata (metrics, judge, pairwise) in preview panel
- Adapt browser icons and display for skill-validator sessions
- Handle sessions without events.jsonl gracefully

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
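Schema-based detection of the kind described in this commit might look like the sketch below. It is Python rather than the C# used by dotnet-replay, and the enum members (including `UNKNOWN`) and the helper `_table_exists` are assumptions; only the table names schema_info and schema_version and the 'skill-validator' type value come from the discussion above.

```python
import sqlite3
from enum import Enum


class SessionDbType(Enum):
    # Member names and values are illustrative, not the tool's actual enum.
    COPILOT_CLI = "copilot-cli"
    SKILL_VALIDATOR = "skill-validator"
    UNKNOWN = "unknown"


def _table_exists(conn: sqlite3.Connection, name: str) -> bool:
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (name,),
    ).fetchone()
    return row is not None


def detect_db_type(conn: sqlite3.Connection) -> SessionDbType:
    """Classify a database by which marker table it carries."""
    if _table_exists(conn, "schema_info"):
        row = conn.execute(
            "SELECT value FROM schema_info WHERE key = 'type'"
        ).fetchone()
        if row is not None and row[0] == "skill-validator":
            return SessionDbType.SKILL_VALIDATOR
    if _table_exists(conn, "schema_version"):
        return SessionDbType.COPILOT_CLI
    return SessionDbType.UNKNOWN
```

Checking a dedicated marker table avoids probing column names with PRAGMA table_info, which is the fragility the review comment pointed out.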
When --keep-sessions is active, pass the already-created timestamped directory to Reporter.ReportResults so sessions.db, session files, and report outputs (results.json, summary.md) all live in the same folder instead of creating two separate timestamped directories. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Fix test failures from .NET 10 terminal logger stdout noise

- Guard SafeGetString against undefined JsonElement (ValueKind check)
- Add -v q to dotnet run in test helpers to suppress build output leaking into stdout, which corrupted JSON parsing in tests

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Add skill-validator sessions.db support

Detect and load skill-validator databases (from dotnet/skills#149) alongside existing Copilot CLI session-store.db files. The --db flag now accepts either database type, auto-detected via schema_info vs schema_version tables.

Changes:
- Add SessionDbType enum and BrowserSession record to replace tuple
- Add DetectDbType() for schema-based DB type detection
- Add LoadSkillValidatorSessions() with config_dir relative path resolution
- Show eval metadata (metrics, judge, pairwise) in preview panel
- Adapt browser icons and display for skill-validator sessions
- Handle sessions without events.jsonl gracefully

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix cross-platform events.jsonl path resolution

Normalize Windows backslashes in config_dir for Linux/WSL compatibility. Search nested session-state/<guid>/events.jsonl when direct path not found.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Detect copilot CLI install method for resume

Try standalone 'copilot' first, fall back to 'gh copilot' for users who installed via the GitHub CLI extension.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
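The cross-platform path fix above (normalize backslashes, then fall back to a nested session-state/<guid>/events.jsonl) could be approximated as below. This is a Python sketch under assumptions: the function name `find_events_file` is hypothetical, config_dir is assumed to be relative to the database directory, and picking the first match in sorted order is a choice made here, not something the commit message specifies.

```python
import os
from typing import Optional


def find_events_file(db_dir: str, config_dir: str) -> Optional[str]:
    """Locate events.jsonl for a session, tolerating Windows-style separators."""
    # A config_dir written on Windows may contain backslashes; treat both
    # separators the same before rebuilding the path for the current OS.
    parts = [p for p in config_dir.replace("\\", "/").split("/") if p]
    base = os.path.join(db_dir, *parts)

    # First try the direct location.
    direct = os.path.join(base, "events.jsonl")
    if os.path.isfile(direct):
        return direct

    # Otherwise look one level down: session-state/<guid>/events.jsonl.
    nested_root = os.path.join(base, "session-state")
    if os.path.isdir(nested_root):
        for entry in sorted(os.listdir(nested_root)):
            candidate = os.path.join(nested_root, entry, "events.jsonl")
            if os.path.isfile(candidate):
                return candidate

    # No events file: callers are expected to handle this gracefully.
    return None
```

Returning `None` rather than raising matches the "handle sessions without events.jsonl gracefully" item in the same commit.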
Removed GitHub.Copilot.SDK package reference.
Summary
Adds two opt-in features to the skill-validator eval framework:
- --keep-sessions flag: Preserves agent session data (SDK events, config dirs) in a SQLite database (sessions.db) under the timestamped results directory instead of deleting them after evaluation.
- rejudge command: Re-runs judges on previously saved sessions without re-running the expensive agent LLM calls. Useful for iterating on judge prompts or trying different judge models.
What's in the sessions database
Each eval run pair (baseline + with-skill) is tracked with:
Design decisions
- Without --keep-sessions, behavior is unchanged. sessionDb stays null and all sessionDb?. calls are no-ops.
- WAL mode plus a SemaphoreSlim write lock for concurrent scenario execution. Each eval process gets its own timestamped results dir → separate sessions.db → no cross-process conflicts.
- Config dirs (containing events.jsonl) are preserved; temp work dirs are always deleted.
Files changed
- SessionDatabase.cs: ComputeDirectorySha
- RejudgeCommand.cs
- ValidateCommand.cs: --keep-sessions flag, session registration, prompt + skill SHA
- AgentRunner.cs: RunOptions
- Models.cs: KeepSessions to ValidatorConfig
- Program.cs: RejudgeCommand
- SkillValidator.csproj: Microsoft.Data.Sqlite dependency
- SessionDatabaseTests.cs
Testing
All 192 tests pass (pre-existing + 11 new session DB tests).
Supersedes #135 (reopened from non-forked branch per review feedback).