Skip to content

Latest commit

 

History

History
214 lines (152 loc) · 7.19 KB

File metadata and controls

214 lines (152 loc) · 7.19 KB

A/B Benchmark: Clio vs Claude Code

Benchmark comparing Clio (v0.0.1) against Claude Code (v2.1.86) on identical tasks through the same API proxy, measuring latency, token usage, cache efficiency, and output correctness.

Test Environment

Item Value
Date 2026-03-28
Host macOS (Apple Silicon, 32GB) — Device A
Node.js v22.22.0
API Proxy http://192.168.10.6:8080 (same for both Clio and CC)
Upstream Model claude-sonnet-4-20250514
Runs per task 3

Both tools use the same proxy, same API key, same model. The only variable is the CLI itself.

Tasks

ID Description Tools Expected
simple-qa "What is the capital of France?" None
math "What is 17 * 23 + 42?" None
read-file Read package.json, extract project name Read
search-code Find all TS files containing 'export' Grep/Glob
multi-step Read tsconfig + package.json, describe TS config Read x2 + analyze

Round 1: Clio (OpenAI format) vs CC

Clio: -p --output-format json --api-format openai CC: -p --output-format json (Anthropic format, native)

Latency (median, ms)

Task Clio CC Delta Winner
simple-qa 1,292 1,826 -534 Clio
math 2,782 2,312 +470 CC
read-file 10,193 8,909 +1,284 CC
search-code 22,022 18,110 +3,912 CC
multi-step 11,502 11,165 +337 ~Tie

Input Tokens (avg per run)

Task Clio CC Clio saves
simple-qa 2,487 22,225 -89%
math 2,490 22,228 -89%
read-file 11,463 67,190 -83%
search-code 14,854 44,857 -67%
multi-step 16,038 67,654 -76%

Cache

Task Clio hit% CC hit%
All 0% 100%

Clio 0% cache hit — OpenAI format does not carry cache_control fields, so the proxy cannot cache prompt segments.

Issue Found

Clio in OpenAI format loses all caching capability. This is a format limitation, not a code bug.


Round 2: Clio (Anthropic format, broken usage parsing) vs CC

Clio: -p --output-format json (Anthropic format) CC: -p --output-format json

Both now use Anthropic format through the same proxy.

Latency (median, ms)

Task Clio CC Delta
simple-qa 2,853 1,595 +1,258
math 2,457 2,376 +81
read-file 2,751 7,310 -4,559
search-code 13,766 12,846 +920
multi-step 12,082 10,185 +1,897

Issue Found

Clio reported input_tokens: 0 and cache: 0% for all tasks. Investigation revealed the proxy returns usage data in the message_delta SSE event, not message_start. Clio only extracted input_tokens and cache fields from message_start.

Root cause: Proxy puts input_tokens + cache_read_input_tokens in message_delta.usage, but Clio only read output_tokens from that event.

Fix: Added input_tokens, cache_creation_input_tokens, cache_read_input_tokens extraction to the message_delta handler in agent.ts.


Round 3: Clio (Anthropic format, fixed) vs CC — Final Results

After fixing usage parsing. Both tools on equal footing: same proxy, same format, same model.

Latency (median, ms)

Task Clio CC Delta Winner
simple-qa 1,261 1,527 -266 Clio
math 1,812 2,581 -769 Clio
read-file 8,343 6,793 +1,550 CC
search-code 14,439 12,436 +2,003 CC
multi-step 10,894 10,227 +667 ~Tie

No-tool tasks: Clio is faster (smaller prompt = less processing). Tool tasks: CC is faster (fewer round-trips).

Input Tokens (avg per run)

Task Clio CC Clio saves
simple-qa 2,499 22,230 -89%
math 4,301 22,233 -81%
read-file 14,416 67,205 -79%
search-code 15,232 44,863 -66%
multi-step 14,884 67,671 -78%

Cache

Task Clio cache_read CC cache_read Clio hit% CC hit%
simple-qa 0 57,984 0%* 100%
math 2,304 53,760 100% 100%
read-file 17,536 115,840 100% 100%
search-code 29,312 35,840 100% 100%
multi-step 32,384 111,616 100% 100%

* simple-qa first run is always a cache miss (cold start). Subsequent runs would hit cache.

Both tools now achieve 100% cache hit rate through the same proxy. Clio's cache_read is 40-83% smaller because its cached prompt is smaller.

Tool Turns (avg)

Task Clio CC
simple-qa 1.0 1.0
math 1.7 1.0
read-file 5.0 3.0
search-code 5.3 2.0
multi-step 5.0 5.0

CC uses fewer turns on tool tasks. Its larger system prompt provides more detailed tool-use guidance, leading to more efficient tool selection.

Correctness

All tasks returned correct results from both tools across all 3 rounds:

Task Clio CC Match
simple-qa Paris Paris Yes
math 433 433 Yes
read-file clio-cli clio-cli Yes
search-code All files All files Yes
multi-step ES2022 + Node16 ES2022 + Node16 Yes

Summary

Clio Advantages

  • 70-89% fewer input tokens — smaller system prompt (~2.5k vs ~22k tokens)
  • 40-83% less cache volume — less data to cache and read back
  • Faster on simple tasks — less prompt to process means lower latency
  • Lower cost potential — fewer tokens at full price + fewer tokens at cache price

CC Advantages

  • Fewer tool turns — detailed system prompt guides more efficient tool selection
  • Faster on tool-heavy tasks — fewer API round-trips compensates for larger prompt
  • Mature caching — battle-tested cache_control placement

Net Assessment

Metric Winner Margin
Token efficiency Clio 70-89% fewer input tokens
Cache efficiency Clio 40-83% less cache volume
Simple task latency Clio 250-770ms faster
Tool task latency CC 650-2000ms faster
Tool turn efficiency CC 2-3x fewer turns on search/read
Correctness Tie 100% on all tasks

Bugs Found & Fixed During Benchmarking

  1. Print mode stdout leak — JSON output mode leaked streaming text to stdout. Fixed by suppressing process.stdout.write during agent loop in JSON mode.
  2. Usage parsing from message_delta — Proxy returns input_tokens and cache_read_input_tokens in message_delta, not message_start. Added extraction from both events.

Reproducing

git clone https://github.com/icebear0828/c2a.git
cd c2a && npm install && npx tsc
npm install -g @anthropic-ai/claude-code

export ANTHROPIC_BASE_URL=http://your-proxy:8080
export ANTHROPIC_API_KEY=your-key

# Anthropic format (recommended)
node benchmark/run.mjs --runs 3

# OpenAI format
node benchmark/run.mjs --runs 3 --clio-format openai

# Subset
node benchmark/run.mjs --runs 1 --tasks simple-qa,math