Skip to content

feat: eval JSONL support, SQLite session store, --db CLI arg#14

Merged
lewing merged 4 commits intomainfrom
evalj
Feb 24, 2026
Merged

feat: eval JSONL support, SQLite session store, --db CLI arg#14
lewing merged 4 commits intomainfrom
evalj

Conversation

@lewing
Copy link
Owner

@lewing lewing commented Feb 24, 2026

What's new on �valj

Eval JSONL Support

  • Parse and display waza/Arena evaluation transcripts (.jsonl format)
  • Auto-detect eval format via \IsEvalFormat()\ checking for \�val.start\ events
  • \EvalData\ class tracks test suites with per-test-case results, assertion feedback, real-time progress, and tool usage

SQLite Session Store

  • Load sessions from Copilot CLI \session-store.db\ databases
  • Read-only access (\Mode=ReadOnly) for safe concurrent reads

--db\ CLI Arg & DB Polling

  • --db \ flag to specify an external session database
  • Auto-detect .db\ file extension as positional argument
  • Live polling in browser mode — re-queries DB every 5 seconds for new/updated sessions
  • Updated help text

Tests

  • 5 new subprocess-based tests for --db\ feature (\DbPathTests.cs)
  • All 67 tests passing

lewing and others added 3 commits February 23, 2026 19:45
Team: Deckard (Lead), Batty (Core Dev), Rachael (Tester) + Scribe + Ralph
Universe: Blade Runner
Seeded with project history from session store

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add --db <path> flag to load sessions from external session-store.db
- .db files passed as positional args auto-route to browser mode
- Browser scan thread now polls DB every 5s for new sessions
- Only incremental queries (WHERE updated_at > last_seen) for efficiency
- Skip filesystem/Claude scan when using external DB

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Five subprocess-based tests covering:
- --db with nonexistent file
- --db with valid DB but piped output (redirected TTY)
- Positional .db file auto-detection
- --db missing argument error
- Help text includes --db documentation

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request adds support for evaluating JSONL transcripts from waza/Arena evaluation runs and introduces a new Squad team configuration system. However, the PR has significant discrepancies between its description and actual implementation.

Changes:

  • Adds eval JSONL format detection and parsing with auto-detection via IsEvalFormat() checking for eval.start events
  • Implements EvalData and EvalCaseResult classes to track test suites with per-test-case results, assertions, and tool usage
  • Adds Squad team configuration files (.squad/) for project organization with Blade Runner-themed agent personas
  • Tests for --db CLI argument (though the actual feature is not implemented in this PR)

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
tests/DbPathTests.cs Adds 5 tests for --db flag behavior, but the actual --db feature is not implemented in the code
replay.cs Implements comprehensive eval JSONL support including format detection, parsing, rendering, streaming, and pager integration
.squad/team.md Introduces Squad team roster with Blade Runner characters and project context
.squad/skills/squad-conventions/SKILL.md Documents Squad CLI conventions (unrelated to replay.cs functionality)
.squad/routing.md Defines work routing rules for the Squad team
.squad/identity/wisdom.md Empty template for team learnings
.squad/identity/now.md Documents current focus area mentioning features not in this PR
.squad/decisions.md Claims --db and DB polling features are completed (but they're not in the code)
.squad/ceremonies.md Defines team meeting patterns
.squad/casting/registry.json Registers three Blade Runner agents
.squad/casting/policy.json Defines casting policy for agent universes
.squad/casting/history.json Records universe assignment history
.squad/agents/*/history.md Agent-specific project context and history files
.squad/agents/*/charter.md Agent role definitions and boundaries

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +1284 to +1377
void ProcessEvalEvent(EvalData eval, string line)
{
JsonDocument doc;
try { doc = JsonDocument.Parse(line); }
catch { return; }

var root = doc.RootElement;
var evType = SafeGetString(root, "type");
var data = root.TryGetProperty("data", out var d) ? d : default;
if (data.ValueKind == JsonValueKind.Undefined) return;

var currentCase = eval.Cases.LastOrDefault(c => c.Name == eval.CurrentCase);

switch (evType)
{
case "case.start":
var newCase = new EvalCaseResult
{
Name = SafeGetString(data, "case"),
Prompt = SafeGetString(data, "prompt")
};
eval.CurrentCase = newCase.Name;
eval.Cases.Add(newCase);
break;

case "message":
currentCase?.MessageAccumulator.Append(SafeGetString(data, "content"));
break;

case "tool.start":
if (currentCase is not null)
{
var toolName = SafeGetString(data, "tool_name");
var toolId = SafeGetString(data, "tool_call_id");
if (!currentCase.ToolsUsed.Contains(toolName))
currentCase.ToolsUsed.Add(toolName);
}
break;

case "tool.complete":
if (currentCase is not null)
{
var tcName = SafeGetString(data, "tool_name");
var tcId = SafeGetString(data, "tool_call_id");
double durMs = 0;
if (data.TryGetProperty("duration_ms", out var dur))
dur.TryGetDouble(out durMs);
currentCase.ToolEvents.Add((tcName, tcId, durMs));
}
break;

case "assertion.result":
if (currentCase is not null)
{
var fb = SafeGetString(data, "feedback");
currentCase.AssertionFeedback = string.IsNullOrEmpty(currentCase.AssertionFeedback)
? fb : currentCase.AssertionFeedback + "; " + fb;
}
break;

case "case.complete":
if (currentCase is not null)
{
if (data.TryGetProperty("passed", out var p))
currentCase.Passed = p.GetBoolean();
if (data.TryGetProperty("duration_ms", out var dm))
{ dm.TryGetDouble(out var dmv3); currentCase.DurationMs = dmv3; }
if (data.TryGetProperty("tool_call_count", out var tcc) && tcc.TryGetInt32(out var tccv))
currentCase.ToolCallCount = tccv;
if (data.TryGetProperty("response_length", out var rl) && rl.TryGetInt32(out var rlv))
currentCase.ResponseLength = rlv;
}
eval.CurrentCase = null;
break;

case "eval.complete":
if (data.TryGetProperty("passed", out var ep) && ep.TryGetInt32(out var epv))
eval.TotalPassed = epv;
if (data.TryGetProperty("failed", out var ef) && ef.TryGetInt32(out var efv))
eval.TotalFailed = efv;
if (data.TryGetProperty("skipped", out var es) && es.TryGetInt32(out var esv))
eval.TotalSkipped = esv;
if (data.TryGetProperty("total_duration_ms", out var etd))
{ etd.TryGetDouble(out var etdv3); eval.TotalDurationMs = etdv3; }
if (data.TryGetProperty("total_tool_calls", out var etc2) && etc2.TryGetInt32(out var etc2v))
eval.TotalToolCalls = etc2v;
break;

case "error":
if (currentCase is not null)
currentCase.Error = SafeGetString(data, "message");
break;
}
}
Copy link

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function contains significant code duplication with the ParseEvalData function (lines 1160-1282). Both functions handle the same event types ("case.start", "tool.start", "tool.complete", "assertion.result", "case.complete", "eval.complete", "error") with nearly identical logic. The main difference is that ProcessEvalEvent operates on an existing EvalData object while ParseEvalData creates a new one. Consider extracting the common event processing logic into a shared helper function to improve maintainability and reduce the risk of bugs when updating one but not the other.

Copilot uses AI. Check for mistakes.
Comment on lines +1272 to +1279
// If eval.complete wasn't emitted, compute from cases
if (eval.TotalPassed == 0 && eval.TotalFailed == 0 && eval.Cases.Count > 0)
{
eval.TotalPassed = eval.Cases.Count(c => c.Passed == true);
eval.TotalFailed = eval.Cases.Count(c => c.Passed == false);
eval.TotalDurationMs = eval.Cases.Sum(c => c.DurationMs);
eval.TotalToolCalls = eval.Cases.Sum(c => c.ToolCallCount);
}
Copy link

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fallback logic for computing totals when eval.complete wasn't emitted has a logic issue. It only computes totals when "eval.TotalPassed == 0 && eval.TotalFailed == 0", which means if the eval.complete event was emitted with TotalPassed=1 and TotalFailed=0, or any non-zero values, the actual case counts won't be recomputed. However, if eval.complete was emitted with incorrect values or partial data, those values will be used without validation. Consider changing the condition to check for the presence of the eval.complete event explicitly (e.g., using a flag) rather than inferring it from zero totals, as a legitimate eval could have 0 passed and 0 failed tests.

Copilot uses AI. Check for mistakes.
Comment on lines +1 to +96
using System.Text.Json;
using Xunit;

namespace ReplayTests;

public class DbPathTests
{
[Fact]
public void DbFlag_NonExistentFile_ShowsError()
{
var nonExistentDb = Path.Combine(Path.GetTempPath(), $"nonexistent-{Guid.NewGuid()}.db");
var (stdout, stderr) = RunReplayWithArgs($"--db {nonExistentDb}");

// Should show an error on stderr (app can't browse when output is redirected)
Assert.Contains("Error: Cannot use --db in redirected output", stderr);
}

[Fact]
public void DbFlag_ValidDbButPipedOutput_ShowsRedirectionError()
{
var tempDb = CreateEmptyDbFile();
var (stdout, stderr) = RunReplayWithArgs($"--db {tempDb}");

Assert.Contains("Error: Cannot use --db in redirected output", stderr);

// Clean up
try { File.Delete(tempDb); } catch { }
}

[Fact]
public void PositionalDbFile_AutoDetection_BehavesLikeDbFlag()
{
var tempDb = CreateEmptyDbFile();
var (stdout, stderr) = RunReplayWithArgs(tempDb);

// Should behave like --db: show redirected output error
Assert.Contains("Error: Cannot use --db in redirected output", stderr);

// Clean up
try { File.Delete(tempDb); } catch { }
}

[Fact]
public void DbFlag_MissingArgument_ShowsError()
{
var (stdout, stderr) = RunReplayWithArgs("--db");

Assert.Contains("Error: --db requires a file path", stderr);
}

[Fact]
public void HelpText_IncludesDbFlag()
{
var (stdout, stderr) = RunReplayWithArgs("--help");

Assert.Contains("--db", stdout);
Assert.Contains("Browse sessions from an external session-store.db file", stdout);
}

// Helper to create an empty DB file
private string CreateEmptyDbFile()
{
var path = Path.Combine(Path.GetTempPath(), $"test-{Guid.NewGuid()}.db");
// Create an empty file (doesn't need to be a valid SQLite DB for these tests)
File.WriteAllText(path, "");
return path;
}

private static readonly string ReplayCs = Path.GetFullPath(
Path.Combine(AppContext.BaseDirectory, "..", "..", "..", "..", "replay.cs"));

private (string stdout, string stderr) RunReplayWithArgs(string args)
{
var startInfo = new System.Diagnostics.ProcessStartInfo
{
FileName = "dotnet",
Arguments = $"run {ReplayCs} -- {args}",
RedirectStandardOutput = true,
RedirectStandardError = true,
RedirectStandardInput = true, // Need to redirect input too for proper TTY detection
UseShellExecute = false,
CreateNoWindow = true,
WorkingDirectory = Path.GetDirectoryName(ReplayCs)!
};

using var process = System.Diagnostics.Process.Start(startInfo);
if (process == null) throw new InvalidOperationException("Failed to start process");

process.StandardInput.Close(); // Close stdin immediately
var stdout = process.StandardOutput.ReadToEnd();
var stderr = process.StandardError.ReadToEnd();
process.WaitForExit();

return (stdout, stderr);
}
}
Copy link

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests are checking for error messages related to the --db flag ("Error: Cannot use --db in redirected output", "Error: --db requires a file path"), but the actual --db functionality does not appear to be implemented in the replay.cs code changes included in this PR. The PR description mentions "--db CLI arg & DB Polling" as a feature, but there's no corresponding code in the diff to parse --db from command line arguments or to handle it. Either the implementation is missing from this PR, or the tests are testing未来 functionality that doesn't exist yet.

Copilot uses AI. Check for mistakes.
Comment on lines +12 to +16
Two features implemented for session database management:

1. **Periodic DB Re-query in Browser Mode**: Background scan thread now polls SQLite DB every 5 seconds for new sessions. Tracks known session IDs in `HashSet<string>` to avoid duplicates. Only re-polls DB; filesystem and Claude Code scans remain one-shot.

2. **`--db <path>` CLI Argument**: Enables browsing sessions from external session-store.db files. CLI parsing treats `.db` files as dbPath. `LoadSessionsFromDb` and `BrowseSessions` accept optional `dbPathOverride` parameter. When external DB is used, filesystem fallback is skipped.
Copy link

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The decision describes two features that were "Completed", including a "--db CLI Argument" feature. However, the actual implementation of the --db CLI argument is missing from the replay.cs changes in this PR. The decision claims the feature enables "browsing sessions from external session-store.db files" with CLI parsing treating ".db files as dbPath" and accepting a "dbPathOverride parameter", but none of this code appears in the diff. This creates an inconsistency between what's documented as completed and what's actually in the code.

Copilot uses AI. Check for mistakes.
return false;
}

bool IsEvalFormat(string path)
Copy link

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR adds comprehensive eval JSONL support including IsEvalFormat detection, ParseEvalData parsing, StreamEvalEvents streaming, RenderEvalHeaderLines/RenderEvalContentLines rendering, OutputEvalSummary output, ProcessEvalEvent live updates, and integration with the interactive pager and stats command. However, there are no tests for this new functionality. Given that other features in the codebase have comprehensive test coverage (as evidenced by 67 total tests mentioned in the PR), the eval functionality should also have test coverage including tests for format detection, parsing various eval event types, handling edge cases like missing fields, and integration with different output modes.

Copilot uses AI. Check for mistakes.
break;
case "case.start":
Console.WriteLine($"\n{Bold($"━━━ {SafeGetString(data, "case")}")} {Dim(timeStr)}");
Console.WriteLine(Dim($" Prompt: {SafeGetString(data, "prompt")[..Math.Min(100, SafeGetString(data, "prompt").Length)]}..."));
Copy link

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential IndexOutOfRangeException when truncating prompt string. The code uses [..Math.Min(100, SafeGetString(data, "prompt").Length)] but calls SafeGetString(data, "prompt") twice, which could return different lengths if the data changes between calls (unlikely but possible in concurrent scenarios). More importantly, if the prompt is shorter than 100 characters, the ellipsis "..." will still be appended incorrectly. The ellipsis should only be added when the string is actually truncated. Recommend storing the result in a variable and checking the length before adding "...".

Suggested change
Console.WriteLine(Dim($" Prompt: {SafeGetString(data, "prompt")[..Math.Min(100, SafeGetString(data, "prompt").Length)]}..."));
var prompt = SafeGetString(data, "prompt");
const int maxPromptLength = 100;
var displayPrompt = prompt.Length > maxPromptLength
? prompt[..maxPromptLength] + "..."
: prompt;
Console.WriteLine(Dim($" Prompt: {displayPrompt}"));

Copilot uses AI. Check for mistakes.
lines.Add(Row(Bold(" 📋 Eval Suite: " + d.Suite)));
lines.Add(mid);
if (d.Description != "") lines.Add(Row($" {Dim(d.Description.TrimEnd())}"));
lines.Add(Row($" Cases: {Bold(d.Cases.Count.ToString())}/{d.CaseCount} " +
Copy link

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When displaying cases progress, the format shows "Cases.Count/CaseCount" (e.g., "3/5"). However, if the "case_count" field is missing or not parsed from the "eval.start" event, CaseCount will remain at its default value of 0, resulting in output like "3/0" which is confusing and misleading. Consider either checking if CaseCount is 0 and omitting the denominator in that case (showing just "3 cases" instead), or defaulting CaseCount to Cases.Count when it's not explicitly set in the eval.start event.

Suggested change
lines.Add(Row($" Cases: {Bold(d.Cases.Count.ToString())}/{d.CaseCount} " +
int displayCaseCount = d.CaseCount == 0 ? d.Cases.Count : d.CaseCount;
lines.Add(Row($" Cases: {Bold(d.Cases.Count.ToString())}/{displayCaseCount} " +

Copilot uses AI. Check for mistakes.
replay.cs Outdated
if (parsed is null) return PagerAction.Quit;
headerLines = RenderEvalHeaderLines(parsed);
contentLines = RenderEvalContentLines(parsed, filterType, expandTools);
return RunInteractivePager(headerLines, contentLines, parsed, isJsonlFormat: false, noFollow: !noFollow);
Copy link

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The noFollow parameter passed to RunInteractivePager is negated (!noFollow) for eval format on line 982, which seems inconsistent with the other formats. For JSONL format (line 990), the parameter is "isClaude || noFollow" (no negation of noFollow), and for Waza format (line 997) it's hardcoded to "true". The variable noFollow comes from the CLI argument --no-follow, so when a user specifies --no-follow, noFollow=true, and passing !noFollow=false would actually enable following, which is the opposite of what the user requested. This appears to be a logic error. The parameter should likely be just "noFollow" without negation to match the user's intent.

Suggested change
return RunInteractivePager(headerLines, contentLines, parsed, isJsonlFormat: false, noFollow: !noFollow);
return RunInteractivePager(headerLines, contentLines, parsed, isJsonlFormat: false, noFollow: noFollow);

Copilot uses AI. Check for mistakes.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@lewing lewing merged commit 6af7826 into main Feb 24, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants