Skip to content

feat: add Anthropic SDK evaluator (#11)#19

Open
don-petry wants to merge 4 commits intoJoaolfelicio:mainfrom
don-petry:feat/anthropic-sdk-evaluator
Open

feat: add Anthropic SDK evaluator (#11)#19
don-petry wants to merge 4 commits intoJoaolfelicio:mainfrom
don-petry:feat/anthropic-sdk-evaluator

Conversation

@don-petry
Copy link
Copy Markdown
Collaborator

@don-petry don-petry commented Apr 4, 2026

Why

ClaudeEvaluator currently shells out to the claude CLI with a hard 120s timeout, no retry logic, and requires the CLI binary installed on the host. Users without the CLI installed cannot use Claude for evaluation at all. The Anthropic Python SDK provides built-in retry/backoff, proper error handling, and model selection — and only requires an API key, not a CLI installation.

Summary

  • Add AnthropicEvaluator that calls the Anthropic API directly via the Python SDK instead of shelling out to the Claude CLI
  • Register it in the evaluator factory as --evaluator anthropic
  • Auto-detection falls back to the SDK when no CLI tools are found but ANTHROPIC_API_KEY is set
  • anthropic is an optional dependency: pip install context-scribe[anthropic]
  • Gracefully skips registration if the SDK isn't installed (existing evaluators unaffected)
  • Module-level import anthropic so the try/except ImportError in __init__.py correctly gates registration
  • 120s client timeout matching other evaluators

Closes #11

Testing evidence (Python 3.12 + mcp + anthropic SDK installed)

Check Result
Full test suite: 74/74 passed ✅ PASS
Lazy import: 'anthropic' not in EVALUATOR_REGISTRY when SDK absent ✅ PASS
SDK installed: 'anthropic' in EVALUATOR_REGISTRY = True ✅ PASS
get_evaluator('anthropic') returns AnthropicEvaluator with correct model ✅ PASS
Module works without SDK: get_evaluator imports fine ✅ PASS
Optional dep in pyproject.toml: anthropic>=0.40.0 ✅ PASS
_detect_evaluator fallback test: returns "anthropic" when API key set ✅ PASS
Live API: real HTTP request to Anthropic API ✅ PASS
Live API: authenticates successfully (BadRequestError, not AuthenticationError) ✅ PASS
Live API: error handled gracefully (no crash, returns None) ✅ PASS

Live end-to-end rule extraction could not be verified — test API key had insufficient credits (billing error, not auth error). All transport, authentication, and error handling paths confirmed working.

Test plan

  • 5 new tests (rule extraction, NO_RULE, model/params, missing key, SDK fallback)
  • Full suite: 74/74 pass
  • Live: SDK registers, evaluator constructs, API call reaches Anthropic, auth succeeds, errors handled
  • Blocked: full rule extraction (requires funded API key)

🤖 Generated with Claude Code

Adds AnthropicEvaluator that calls the Anthropic API directly via the
SDK instead of shelling out to the Claude CLI. Benefits:
- Built-in retry/backoff from the SDK
- No CLI installation required (just ANTHROPIC_API_KEY)
- Proper error handling and model selection

The SDK is an optional dependency (pip install context-scribe[anthropic]).
Auto-detection falls back to the SDK evaluator when no CLI tools are
available but ANTHROPIC_API_KEY is set.

Closes Joaolfelicio#11

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 4, 2026 19:40
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new Anthropic SDK–backed evaluator to context-scribe so Claude rule extraction can run without shelling out to the claude CLI, and updates evaluator auto-detection accordingly.

Changes:

  • Introduce AnthropicEvaluator that calls Anthropic via the Python SDK and add tests for rule/NO_RULE/model/key behaviors.
  • Add an anthropic optional dependency extra in pyproject.toml.
  • Update evaluator auto-detection to fall back to the SDK evaluator when no CLIs are present but ANTHROPIC_API_KEY is set.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
context_scribe/evaluator/anthropic_llm.py New SDK-based evaluator implementation.
context_scribe/evaluator/__init__.py Conditionally registers the new evaluator in the factory/registry.
context_scribe/main.py Adds auto-detection fallback to the Anthropic evaluator and updates the error message.
pyproject.toml Adds anthropic as an optional dependency extra.
tests/test_anthropic_evaluator.py New unit tests for Anthropic evaluator behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

DJ and others added 2 commits April 4, 2026 12:47
- Move anthropic import to module level so try/except in __init__.py
  correctly skips registration when SDK is missing
- Add 120s timeout to Anthropic client (matching other evaluators)
- Update error message assertion in test_main.py
- Fix test mocking to use sys.modules patching for optional dependency

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The evaluator factory pattern removed direct imports of evaluator
classes from main.py. Tests must patch get_evaluator instead.

Found via full test suite run with Python 3.12 + mcp installed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 5, 2026 17:29
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

- Remove unused evaluator_class from test_daemons parametrize
- Remove unused importlib import from test_anthropic_evaluator
- Add test for _detect_evaluator ANTHROPIC_API_KEY fallback

Found via full test suite run with Python 3.12 + mcp installed (74/74 pass).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: Replace CLI subprocess with Anthropic SDK for Claude evaluation

2 participants