diff --git a/README.md b/README.md
index 81aea73e..a48e0e61 100644
--- a/README.md
+++ b/README.md
@@ -150,6 +150,7 @@ OpenHarness is an open-source Python implementation designed for **researchers,
Start here:
Quick Start ·
Provider Compatibility ·
+ LLM Providers ·
Showcase ·
Contributing ·
Changelog
@@ -242,8 +243,55 @@ oh -p "Fix the bug" --output-format stream-json
## 🔌 Provider Compatibility
+OpenHarness supports a wide variety of LLM providers through its extensible provider registry system. The system automatically detects providers based on API keys, base URLs, and model names.
+
+**📖 [Complete Provider Guide](README_PROVIDERS.md)** | **🚀 [Interactive Demo](scripts/demo_providers.py)**
+
OpenHarness supports three API formats: **Anthropic** (default), **OpenAI-compatible** (`--api-format openai`), and **GitHub Copilot** (`--api-format copilot`). The OpenAI format covers a wide range of providers.
+### Quick Provider Setup
+
+```bash
+# OpenRouter (100+ models)
+export OPENROUTER_API_KEY="sk-or-v1-..."
+oh --model anthropic/claude-3-haiku
+
+# DeepSeek
+export DEEPSEEK_API_KEY="your-key"
+oh --model deepseek-chat
+
+# Groq (fast inference)
+export GROQ_API_KEY="gsk_..."
+oh --model llama3-70b-8192
+
+# Ollama (local)
+oh --base-url http://localhost:11434/v1 --model llama2
+```
+
+### Adding New Providers
+
+To add support for a new LLM provider:
+
+1. Edit `src/openharness/api/registry.py`
+2. Add a `ProviderSpec` to the `PROVIDERS` tuple
+3. Test with `python scripts/demo_providers.py`
+
+Example:
+```python
+ProviderSpec(
+    name="myprovider",
+    keywords=("myprovider", "myai"),
+    env_key="MYPROVIDER_API_KEY",
+    display_name="MyProvider",
+    backend_type="openai_compat",
+    default_base_url="https://api.myprovider.com/v1",
+    detect_by_key_prefix="mp_",  # optional
+    detect_by_base_keyword="myprovider",  # optional
+),
+```
+
### Anthropic Format (default)
| Provider profile | Detection signal | Notes |
diff --git a/README_PROVIDERS.md b/README_PROVIDERS.md
new file mode 100644
index 00000000..e22b2f55
--- /dev/null
+++ b/README_PROVIDERS.md
@@ -0,0 +1,103 @@
+# Adding LLM Providers to OpenHarness
+
+OpenHarness supports a wide variety of LLM providers through its extensible provider registry system. This guide shows you how to add new providers and configure them.
+
+## Quick Start
+
+### For Users: Using Existing Providers
+
+OpenHarness already supports many popular providers. Here's how to use them:
+
+```bash
+# OpenRouter (access to 100+ models)
+export OPENROUTER_API_KEY="sk-or-v1-..."
+oh --model anthropic/claude-3-haiku "Hello world"
+
+# DeepSeek
+export DEEPSEEK_API_KEY="your-key"
+oh --model deepseek-chat "Code review this function"
+
+# Groq (fast inference)
+export GROQ_API_KEY="gsk_..."
+oh --model llama3-70b-8192 "Analyze this code"
+
+# Ollama (local models)
+oh --base-url http://localhost:11434/v1 --model llama2 "Local AI chat"
+```
+
+### For Developers: Adding New Providers
+
+To add a new LLM provider:
+
+1. **Edit the registry** (`src/openharness/api/registry.py`)
+2. **Add your provider spec** to the `PROVIDERS` tuple
+3. **Test the configuration**
+
+Example provider spec:
+```python
+ProviderSpec(
+    name="myprovider",
+    keywords=("myprovider", "myai"),
+    env_key="MYPROVIDER_API_KEY",
+    display_name="MyProvider",
+    backend_type="openai_compat",  # or "anthropic"
+    default_base_url="https://api.myprovider.com/v1",
+    detect_by_key_prefix="mp_",  # optional
+    detect_by_base_keyword="myprovider",  # optional
+    is_gateway=False,
+    is_local=False,
+    is_oauth=False,
+),
+```
+
+## Demo
+
+Run the interactive demo to see how providers work:
+
+```bash
+python scripts/demo_providers.py
+```
+
+This shows:
+- How provider detection works
+- How to add new providers
+- Configuration examples
+- CLI usage patterns
+
+## Supported Providers
+
+OpenHarness currently supports:
+
+- **Anthropic** (Claude models)
+- **OpenAI** (GPT models)
+- **OpenRouter** (100+ models via gateway)
+- **DeepSeek**
+- **Groq** (fast inference)
+- **GitHub Copilot** (OAuth)
+- **Ollama** (local models)
+- And many more...
+
+See `docs/LLM_PROVIDERS.md` for the complete list and detailed configuration instructions.
+
+## Key Concepts
+
+### Provider Detection Priority
+1. API key prefix (e.g., `sk-or-` → OpenRouter)
+2. Base URL keywords (e.g., `deepseek.com` → DeepSeek)
+3. Model name keywords (e.g., `claude` → Anthropic)
+
+### Backend Types
+- `anthropic`: Native Anthropic SDK (best for Claude)
+- `openai_compat`: OpenAI SDK (works with most providers)
+- `copilot`: GitHub Copilot OAuth
+
+### Configuration Methods
+- Environment variables: `export PROVIDER_API_KEY="..."`
+- Command line: `oh --model model-name --base-url https://...`
+- Settings file: `~/.openharness/settings.json`
+
+## Need Help?
+
+- Check the [full documentation](docs/LLM_PROVIDERS.md)
+- Run the demo: `python scripts/demo_providers.py`
+- Test detection: `oh --model your-model-name --dry-run`
\ No newline at end of file
diff --git a/docs/LLM_PROVIDERS.md b/docs/LLM_PROVIDERS.md
new file mode 100644
index 00000000..e9eb0375
--- /dev/null
+++ b/docs/LLM_PROVIDERS.md
@@ -0,0 +1,274 @@
+# Adding LLM Providers to OpenHarness
+
+OpenHarness supports a wide variety of LLM providers through its extensible provider registry system. This guide shows how to add new providers and configure them for use.
+
+## How Provider Detection Works
+
+OpenHarness automatically detects providers using a priority system:
+
+1. **API Key Prefix**: Special key prefixes (e.g., `sk-or-` for OpenRouter)
+2. **Base URL Keywords**: Substrings in the API base URL
+3. **Model Name Keywords**: Keywords in model names
+
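+As a rough sketch (a simplified illustration, not the actual implementation in `src/openharness/api/registry.py`), the priority system amounts to exhausting one signal across all providers before falling back to the next:
+
+```python
+def detect_provider(providers, model="", api_key="", base_url=""):
+    """Return the first provider matching the highest-priority signal."""
+    # 1. API key prefix beats everything else.
+    for spec in providers:
+        if api_key and spec.detect_by_key_prefix and api_key.startswith(spec.detect_by_key_prefix):
+            return spec
+    # 2. Then base URL keywords.
+    for spec in providers:
+        if base_url and spec.detect_by_base_keyword and spec.detect_by_base_keyword in base_url.lower():
+            return spec
+    # 3. Finally, keywords in the model name.
+    for spec in providers:
+        if model and any(kw in model.lower() for kw in spec.keywords):
+            return spec
+    return None
+```
+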
+## Adding a New Provider
+
+### Step 1: Add to the Provider Registry
+
+Edit `src/openharness/api/registry.py` and add your provider to the `PROVIDERS` tuple:
+
+```python
+ProviderSpec(
+    name="your_provider",
+    keywords=("keyword1", "keyword2"),      # Model name keywords
+    env_key="YOUR_PROVIDER_API_KEY",        # Environment variable name
+    display_name="Your Provider",           # Human-readable name
+    backend_type="openai_compat",           # "anthropic" | "openai_compat" | "copilot"
+    default_base_url="https://api.yourprovider.com/v1",
+    detect_by_key_prefix="",                # API key prefix (optional)
+    detect_by_base_keyword="yourprovider",  # Base URL keyword (optional)
+    is_gateway=False,                       # True if routes to multiple models
+    is_local=False,                         # True for local deployments
+    is_oauth=False,                         # True for OAuth providers
+),
+```
+
+### Step 2: Configuration
+
+Users can configure the provider in several ways:
+
+#### Environment Variables
+```bash
+export YOUR_PROVIDER_API_KEY="your-api-key-here"
+```
+
+#### Command Line
+```bash
+# Auto-detection by model name
+oh --model your-model-name
+
+# Explicit base URL
+oh --base-url https://api.yourprovider.com/v1
+
+# Explicit API format (if needed)
+oh --api-format openai
+```
+
+#### Settings File
+```json
+{
+ "api_key": "your-api-key-here",
+ "base_url": "https://api.yourprovider.com/v1",
+ "model": "your-model-name"
+}
+```
+
+## Examples
+
+### OpenRouter
+
+OpenRouter is already configured in the registry:
+
+```python
+ProviderSpec(
+    name="openrouter",
+    keywords=("openrouter",),
+    env_key="OPENROUTER_API_KEY",
+    display_name="OpenRouter",
+    backend_type="openai_compat",
+    default_base_url="https://openrouter.ai/api/v1",
+    detect_by_key_prefix="sk-or-",
+    detect_by_base_keyword="openrouter",
+    is_gateway=True,
+    is_local=False,
+    is_oauth=False,
+),
+```
+
+Usage:
+```bash
+export OPENROUTER_API_KEY="sk-or-..."
+oh --model openai/gpt-4o-mini
+```
+
+### Adding a Custom Provider
+
+Let's add support for a hypothetical provider called "ExampleAI":
+
+1. **Add to registry**:
+```python
+ProviderSpec(
+    name="exampleai",
+    keywords=("example", "exampleai"),
+    env_key="EXAMPLEAI_API_KEY",
+    display_name="ExampleAI",
+    backend_type="openai_compat",
+    default_base_url="https://api.exampleai.com/v1",
+    detect_by_key_prefix="exa_",
+    detect_by_base_keyword="exampleai",
+    is_gateway=False,
+    is_local=False,
+    is_oauth=False,
+),
+```
+
+2. **Usage**:
+```bash
+export EXAMPLEAI_API_KEY="exa_your_key_here"
+oh --model example/gpt-4
+
+# Or with explicit base URL
+oh --base-url https://api.exampleai.com/v1 --model gpt-4
+```
+
+### Popular Providers
+
+Here are some popular providers and their configurations:
+
+#### Anthropic (Native)
+- **Backend**: `anthropic`
+- **Models**: `claude-3-5-sonnet-20241022`, `claude-3-haiku-20240307`
+- **Key**: `ANTHROPIC_API_KEY`
+
+#### OpenAI
+- **Backend**: `openai_compat`
+- **Models**: `gpt-4o`, `gpt-4-turbo`
+- **Key**: `OPENAI_API_KEY`
+- **Base URL**: `https://api.openai.com/v1`
+
+#### DeepSeek
+```python
+ProviderSpec(
+    name="deepseek",
+    keywords=("deepseek",),
+    env_key="DEEPSEEK_API_KEY",
+    display_name="DeepSeek",
+    backend_type="openai_compat",
+    default_base_url="https://api.deepseek.com/v1",
+    detect_by_key_prefix="",
+    detect_by_base_keyword="deepseek",
+    is_gateway=False,
+    is_local=False,
+    is_oauth=False,
+),
+```
+
+Usage:
+```bash
+export DEEPSEEK_API_KEY="your-key"
+oh --model deepseek-chat
+```
+
+#### Groq
+```python
+ProviderSpec(
+    name="groq",
+    keywords=("groq",),
+    env_key="GROQ_API_KEY",
+    display_name="Groq",
+    backend_type="openai_compat",
+    default_base_url="https://api.groq.com/openai/v1",
+    detect_by_key_prefix="gsk_",
+    detect_by_base_keyword="groq",
+    is_gateway=False,
+    is_local=False,
+    is_oauth=False,
+),
+```
+
+Usage:
+```bash
+export GROQ_API_KEY="gsk_..."
+oh --model llama3-70b-8192
+```
+
+#### Ollama (Local)
+```python
+ProviderSpec(
+    name="ollama",
+    keywords=("ollama",),
+    env_key="",
+    display_name="Ollama",
+    backend_type="openai_compat",
+    default_base_url="http://localhost:11434/v1",
+    detect_by_key_prefix="",
+    detect_by_base_keyword="localhost:11434",
+    is_gateway=False,
+    is_local=True,
+    is_oauth=False,
+),
+```
+
+Usage:
+```bash
+# Start Ollama server locally
+ollama serve
+
+# Use with OpenHarness
+oh --base-url http://localhost:11434/v1 --model llama2
+```
+
+## Backend Types
+
+### Anthropic Backend
+- Uses the official Anthropic Python SDK
+- Best for Claude models
+- Supports advanced features like tool calling
+
+### OpenAI Compatible Backend
+- Uses the OpenAI Python SDK
+- Works with any OpenAI-compatible API
+- Most providers use this backend
+
+### Copilot Backend
+- Special OAuth flow for GitHub Copilot
+- Requires `api_format=copilot`
+
+## Detection Priority
+
+The system checks for providers in this order:
+
+1. **API Key Prefix**: `sk-or-` → OpenRouter, `gsk_` → Groq
+2. **Base URL**: `openrouter.ai` → OpenRouter, `deepseek.com` → DeepSeek
+3. **Model Keywords**: `claude` → Anthropic, `gpt` → OpenAI
+
+## Testing Your Provider
+
+1. **Add to registry**
+2. **Set environment variable**
+3. **Test detection**:
+   ```bash
+   oh --model your-model-name --dry-run
+   ```
+4. **Test actual usage**:
+   ```bash
+   oh --model your-model-name "Hello world"
+   ```
+
+## Troubleshooting
+
+### Provider Not Detected
+- Check that keywords match your model names
+- Verify API key prefix or base URL keywords
+- Use explicit `--base-url` and `--api-format openai`
+
+### Authentication Errors
+- Verify API key is set correctly
+- Check API key format (some providers have specific prefixes)
+- Ensure the API key has necessary permissions
+
+### Connection Issues
+- Verify base URL is correct
+- Check network connectivity
+- Some providers require specific regions or endpoints
+
+## Contributing
+
+When adding a new provider:
+
+1. Test with multiple models
+2. Verify API compatibility
+3. Add appropriate keywords for detection
+4. Update this documentation
+5. Consider adding tests
+
+The provider registry in `src/openharness/api/registry.py` is the single source of truth for all provider configurations.
\ No newline at end of file
diff --git a/scripts/demo_providers.py b/scripts/demo_providers.py
new file mode 100644
index 00000000..a948ebb8
--- /dev/null
+++ b/scripts/demo_providers.py
@@ -0,0 +1,286 @@
+#!/usr/bin/env python3
+"""
+Demo script for adding and testing LLM providers in OpenHarness.
+
+This script demonstrates the provider registry structure and how to add new providers.
+It runs without requiring OpenHarness dependencies to be installed.
+
+Usage:
+ python scripts/demo_providers.py
+"""
+
+from __future__ import annotations
+
+from dataclasses import dataclass
+
+
+@dataclass(frozen=True)
+class ProviderSpec:
+    """One LLM provider's metadata."""
+    name: str
+    keywords: tuple[str, ...]
+    env_key: str
+    display_name: str = ""
+    backend_type: str = "openai_compat"
+    default_base_url: str = ""
+    detect_by_key_prefix: str = ""
+    detect_by_base_keyword: str = ""
+    is_gateway: bool = False
+    is_local: bool = False
+    is_oauth: bool = False
+
+    @property
+    def label(self) -> str:
+        return self.display_name or self.name.title()
+
+
+# Sample providers (subset from the actual registry)
+SAMPLE_PROVIDERS = (
+    ProviderSpec(
+        name="anthropic",
+        keywords=("anthropic", "claude"),
+        env_key="ANTHROPIC_API_KEY",
+        display_name="Anthropic",
+        backend_type="anthropic",
+    ),
+    ProviderSpec(
+        name="openai",
+        keywords=("openai", "gpt", "o1", "o3", "o4"),
+        env_key="OPENAI_API_KEY",
+        display_name="OpenAI",
+        backend_type="openai_compat",
+    ),
+    ProviderSpec(
+        name="openrouter",
+        keywords=("openrouter",),
+        env_key="OPENROUTER_API_KEY",
+        display_name="OpenRouter",
+        backend_type="openai_compat",
+        default_base_url="https://openrouter.ai/api/v1",
+        detect_by_key_prefix="sk-or-",
+        detect_by_base_keyword="openrouter",
+        is_gateway=True,
+    ),
+    ProviderSpec(
+        name="deepseek",
+        keywords=("deepseek",),
+        env_key="DEEPSEEK_API_KEY",
+        display_name="DeepSeek",
+        backend_type="openai_compat",
+        default_base_url="https://api.deepseek.com/v1",
+        detect_by_base_keyword="deepseek",
+    ),
+    ProviderSpec(
+        name="groq",
+        keywords=("groq",),
+        env_key="GROQ_API_KEY",
+        display_name="Groq",
+        backend_type="openai_compat",
+        default_base_url="https://api.groq.com/openai/v1",
+        detect_by_key_prefix="gsk_",
+        detect_by_base_keyword="groq",
+    ),
+    ProviderSpec(
+        name="ollama",
+        keywords=("ollama",),
+        env_key="",
+        display_name="Ollama",
+        backend_type="openai_compat",
+        default_base_url="http://localhost:11434/v1",
+        detect_by_base_keyword="localhost:11434",
+        is_local=True,
+    ),
+)
+
+
+def demo_provider_detection():
+    """Demonstrate how provider detection works."""
+    print("🔍 Provider Detection Demo")
+    print("=" * 50)
+
+    def detect_provider(model: str, api_key: str | None = None, base_url: str | None = None) -> ProviderSpec | None:
+        """Simplified detection logic."""
+        # 1. API key prefix
+        if api_key:
+            for spec in SAMPLE_PROVIDERS:
+                if spec.detect_by_key_prefix and api_key.startswith(spec.detect_by_key_prefix):
+                    return spec
+
+        # 2. Base URL keyword
+        if base_url:
+            base_lower = base_url.lower()
+            for spec in SAMPLE_PROVIDERS:
+                if spec.detect_by_base_keyword and spec.detect_by_base_keyword in base_lower:
+                    return spec
+
+        # 3. Model keyword
+        if model:
+            model_lower = model.lower()
+            for spec in SAMPLE_PROVIDERS:
+                if any(kw in model_lower for kw in spec.keywords):
+                    return spec
+        return None
+
+    test_cases = [
+        # (model, api_key, base_url, expected_provider)
+        ("claude-3-5-sonnet-20241022", None, None, "anthropic"),
+        ("gpt-4o", None, None, "openai"),
+        ("deepseek-chat", None, None, "deepseek"),
+        ("openai/gpt-4o-mini", "sk-or-v1-123", None, "openrouter"),
+        ("llama3-70b-8192", "gsk_123", None, "groq"),
+        ("custom-model", None, "https://api.deepseek.com/v1", "deepseek"),
+        ("ollama-model", None, "http://localhost:11434/v1", "ollama"),
+    ]
+
+    for model, api_key, base_url, expected in test_cases:
+        detected = detect_provider(model, api_key, base_url)
+        provider_name = detected.name if detected else "unknown"
+        status = "✅" if provider_name == expected else "❌"
+        print(f"{status} {model} → {provider_name} (expected: {expected})")
+
+
+def demo_adding_provider():
+    """Demonstrate adding a new provider."""
+    print("\n🆕 Adding a New Provider Demo")
+    print("=" * 50)
+
+    # Example: Adding a fictional provider "ExampleAI"
+    new_provider = ProviderSpec(
+        name="exampleai",
+        keywords=("example", "exampleai"),
+        env_key="EXAMPLEAI_API_KEY",
+        display_name="ExampleAI",
+        backend_type="openai_compat",
+        default_base_url="https://api.exampleai.com/v1",
+        detect_by_key_prefix="exa_",
+        detect_by_base_keyword="exampleai",
+        is_gateway=False,
+        is_local=False,
+        is_oauth=False,
+    )
+
+    print("New provider spec:")
+    print(f"  Name: {new_provider.name}")
+    print(f"  Display: {new_provider.display_name}")
+    print(f"  Backend: {new_provider.backend_type}")
+    print(f"  Base URL: {new_provider.default_base_url}")
+    print(f"  Keywords: {new_provider.keywords}")
+    print(f"  Key Prefix: {new_provider.detect_by_key_prefix}")
+
+    # Test detection with the new provider
+    print("\nTesting detection with new provider:")
+
+    def test_detection(model, api_key=None, base_url=None):
+        """Test detection with the new provider included, honoring priority order."""
+        all_providers = list(SAMPLE_PROVIDERS) + [new_provider]
+
+        # Exhaust each detection signal across all providers before falling
+        # back to the next, lower-priority signal.
+        if api_key:
+            for spec in all_providers:
+                if spec.detect_by_key_prefix and api_key.startswith(spec.detect_by_key_prefix):
+                    return spec
+        if base_url:
+            for spec in all_providers:
+                if spec.detect_by_base_keyword and spec.detect_by_base_keyword in base_url.lower():
+                    return spec
+        if model:
+            model_lower = model.lower()
+            for spec in all_providers:
+                if any(kw in model_lower for kw in spec.keywords):
+                    return spec
+        return None
+
+    test_cases = [
+        ("exampleai-chat", None, None),
+        ("custom-model", "exa_123", None),
+        ("any-model", None, "https://api.exampleai.com/v1"),
+    ]
+
+    for model, api_key, base_url in test_cases:
+        detected = test_detection(model, api_key, base_url)
+        result = detected.name if detected else "not detected"
+        print(f"  {model} → {result}")
+
+
+def demo_provider_configuration():
+    """Show different ways to configure providers."""
+    print("\n⚙️ Provider Configuration Examples")
+    print("=" * 50)
+
+    providers = [
+        ("Anthropic", "ANTHROPIC_API_KEY", "claude-3-5-sonnet-20241022", None),
+        ("OpenAI", "OPENAI_API_KEY", "gpt-4o", "https://api.openai.com/v1"),
+        ("OpenRouter", "OPENROUTER_API_KEY", "openai/gpt-4o-mini", "https://openrouter.ai/api/v1"),
+        ("DeepSeek", "DEEPSEEK_API_KEY", "deepseek-chat", "https://api.deepseek.com/v1"),
+        ("Groq", "GROQ_API_KEY", "llama3-70b-8192", "https://api.groq.com/openai/v1"),
+        ("Ollama", None, "llama2", "http://localhost:11434/v1"),
+    ]
+
+    for name, env_var, model, base_url in providers:
+        print(f"\n{name}:")
+        if env_var:
+            print(f"  export {env_var}='your-key-here'")
+        print(f"  oh --model {model}")
+        if base_url:
+            print(f"  # Base URL: {base_url}")
+        else:
+            print("  # Uses default base URL from registry")
+
+
+def demo_registry_inspection():
+    """Show what's currently in the provider registry."""
+    print("\n📋 Sample Provider Registry")
+    print("=" * 50)
+
+    print(f"Total providers: {len(SAMPLE_PROVIDERS)}")
+
+    categories = {
+        "Gateways": [p for p in SAMPLE_PROVIDERS if p.is_gateway],
+        "Cloud Providers": [p for p in SAMPLE_PROVIDERS if not p.is_gateway and not p.is_local and not p.is_oauth],
+        "Local Deployments": [p for p in SAMPLE_PROVIDERS if p.is_local],
+        "OAuth Providers": [p for p in SAMPLE_PROVIDERS if p.is_oauth],
+    }
+
+    for category, providers in categories.items():
+        if providers:
+            print(f"\n{category} ({len(providers)}):")
+            for provider in providers:
+                keywords = ", ".join(provider.keywords)
+                print(f"  - {provider.display_name} ({provider.name}): {keywords}")
+
+
+def demo_cli_usage():
+    """Show example CLI commands for different providers."""
+    print("\n💻 CLI Usage Examples")
+    print("=" * 50)
+
+    examples = [
+        ("Anthropic Claude", "oh --model claude-3-5-sonnet-20241022 'Hello world'"),
+        ("OpenAI GPT-4", "oh --model gpt-4o 'Write a function'"),
+        ("OpenRouter (any model)", "export OPENROUTER_API_KEY='sk-or-...'\noh --model anthropic/claude-3-haiku 'Quick task'"),
+        ("DeepSeek", "export DEEPSEEK_API_KEY='...'\noh --model deepseek-chat 'Code review'"),
+        ("Groq (fast inference)", "export GROQ_API_KEY='gsk_...'\noh --model llama3-70b-8192 'Analyze this'"),
+        ("Ollama (local)", "oh --base-url http://localhost:11434/v1 --model llama2 'Local AI chat'"),
+        ("Custom provider", "export CUSTOM_API_KEY='...'\noh --base-url https://api.custom.com/v1 --model gpt-4 'Use custom API'"),
+    ]
+
+    for description, command in examples:
+        print(f"\n{description}:")
+        print(f"  {command}")
+
+
+def main():
+    """Run all demos."""
+    print("🚀 OpenHarness LLM Provider Demo")
+    print("=" * 60)
+
+    demo_registry_inspection()
+    demo_provider_detection()
+    demo_adding_provider()
+    demo_provider_configuration()
+    demo_cli_usage()
+
+    print("\n" + "=" * 60)
+    print("✨ Demo complete! Check docs/LLM_PROVIDERS.md for more details.")
+
+
+if __name__ == "__main__":
+    main()
\ No newline at end of file
diff --git a/src/openharness/cli.py b/src/openharness/cli.py
index d7e21cc5..a625e123 100644
--- a/src/openharness/cli.py
+++ b/src/openharness/cli.py
@@ -4,6 +4,7 @@
import json
import sys
+import time
from pathlib import Path
from typing import Optional
@@ -29,11 +30,13 @@
plugin_app = typer.Typer(name="plugin", help="Manage plugins")
auth_app = typer.Typer(name="auth", help="Manage authentication")
cron_app = typer.Typer(name="cron", help="Manage cron scheduler and jobs")
+evidence_app = typer.Typer(name="evidence", help="Manage run evidence archives")
app.add_typer(mcp_app)
app.add_typer(plugin_app)
app.add_typer(auth_app)
app.add_typer(cron_app)
+app.add_typer(evidence_app)
# ---- mcp subcommands ----
@@ -246,6 +249,102 @@ def cron_logs_cmd(
print(line)
+# ---- evidence subcommands ----
+
+@evidence_app.command("list")
+def evidence_list() -> None:
+    """List all runs with evidence."""
+    from openharness.evidence import EvidenceStore
+
+    store = EvidenceStore()
+    runs = store.list_runs()
+    if not runs:
+        print("No runs with evidence found.")
+        return
+
+    print(f"Found {len(runs)} runs:")
+    for run_id in runs:
+        summary = store.get_run_summary(run_id)
+        evidence_count = summary["total_records"]
+        print(f"  {run_id} ({evidence_count} records)")
+
+
+@evidence_app.command("summary")
+def evidence_summary(
+    run_id: str = typer.Argument(..., help="Run ID to summarize"),
+) -> None:
+    """Show detailed summary of evidence for a run."""
+    from openharness.evidence import EvidenceStore
+
+    store = EvidenceStore()
+    summary = store.get_run_summary(run_id)
+
+    print(f"Run: {run_id}")
+    print(f"Total Records: {summary['total_records']}")
+
+    if summary['time_range']['start'] and summary['time_range']['end']:
+        duration = summary['time_range']['end'] - summary['time_range']['start']
+        print(f"Duration: {duration:.2f} seconds")
+        print(f"Time Range: {time.ctime(summary['time_range']['start'])} - {time.ctime(summary['time_range']['end'])}")
+
+    print("\nEvidence Counts:")
+    for evidence_type, count in summary['evidence_counts'].items():
+        print(f"  {evidence_type}: {count}")
+
+
+@evidence_app.command("export")
+def evidence_export(
+    run_id: str = typer.Argument(..., help="Run ID to export"),
+    output: str | None = typer.Option(None, "--output", "-o", help="Output file path"),
+    format: str = typer.Option("json", "--format", "-f", help="Export format (json, archive)"),
+) -> None:
+    """Export evidence for a run."""
+    from pathlib import Path
+    from openharness.evidence import EvidenceArchiver
+
+    archiver = EvidenceArchiver()
+    output_path = Path(output) if output else None
+
+    if format == "json":
+        result_path = archiver.export_run_to_json(run_id, output_path)
+        print(f"Exported to: {result_path}")
+    elif format == "archive":
+        result_path = archiver.create_run_archive(run_id, output_path)
+        print(f"Archived to: {result_path}")
+    else:
+        print(f"Unsupported format: {format}", file=sys.stderr)
+        raise typer.Exit(1)
+
+
+@evidence_app.command("report")
+def evidence_report(
+    run_id: str = typer.Argument(..., help="Run ID to report on"),
+    output: str | None = typer.Option(None, "--output", "-o", help="Output file path"),
+) -> None:
+    """Generate a human-readable report for a run."""
+    from pathlib import Path
+    from openharness.evidence import EvidenceArchiver
+
+    archiver = EvidenceArchiver()
+    output_path = Path(output) if output else None
+
+    result_path = archiver.create_run_report(run_id, output_path)
+    print(f"Report generated: {result_path}")
+
+
+@evidence_app.command("cleanup")
+def evidence_cleanup(
+    days: int = typer.Option(30, "--days", "-d", help="Remove evidence older than this many days"),
+) -> None:
+    """Clean up old evidence archives and run data."""
+    from openharness.evidence import EvidenceArchiver
+
+    archiver = EvidenceArchiver()
+    results = archiver.cleanup_archives(days)
+
+    print(f"Cleaned up {results['removed_runs']} old runs and {results['removed_archives']} old archives")
+
+
# ---- auth subcommands ----
# Mapping from provider name to human-readable label for interactive prompts.
diff --git a/src/openharness/evidence/README.md b/src/openharness/evidence/README.md
new file mode 100644
index 00000000..3fbaa35d
--- /dev/null
+++ b/src/openharness/evidence/README.md
@@ -0,0 +1,157 @@
+# Run-Level Evidence Layer
+
+The run-level evidence layer provides structured archiving for agent runs in OpenHarness. It captures comprehensive evidence of agent execution, including conversations, tasks, performance metrics, and errors.
+
+## Overview
+
+The evidence layer consists of several components:
+
+- **Evidence Types**: Data models for different types of evidence records
+- **Evidence Store**: Storage and retrieval system using JSON Lines format
+- **Evidence Collector**: Collection utilities for capturing evidence during runs
+- **Evidence Archiver**: Archiving, export, and reporting utilities
+- **CLI Commands**: Command-line interface for managing evidence
+
+## Evidence Types
+
+The system captures the following types of evidence:
+
+- `run_start` / `run_end`: Run lifecycle events
+- `task_start` / `task_progress` / `task_end`: Task execution evidence
+- `conversation_message`: Chat messages and tool calls
+- `hook_execution`: Hook execution results
+- `state_change`: Application state transitions
+- `performance_metric`: Performance measurements
+- `error`: Errors and exceptions
+
+## Usage
+
+### Basic Collection
+
+```python
+from openharness.evidence import EvidenceCollector
+
+collector = EvidenceCollector(run_id="my-run-123")
+
+# Record run start
+collector.record_run_start(
+    session_id="session-456",
+    cwd="/workspace",
+    command_line="oh --model gpt-4"
+)
+
+# Record task execution
+collector.record_task_start(task_record)
+
+# Record conversation
+collector.record_conversation_message(message)
+
+# Record run end
+collector.record_run_end()
+```
+
+### Context Manager
+
+```python
+from openharness.evidence import EvidenceCollector
+
+collector = EvidenceCollector()
+
+async with collector.collect_run_evidence(
+    session_id="session-456",
+    cwd="/workspace"
+) as collector:
+    # Run your agent logic here
+    # Evidence is automatically collected
+    pass
+```
+
+### CLI Commands
+
+```bash
+# List all runs with evidence
+oh evidence list
+
+# Show summary of a run
+oh evidence summary <run-id>
+
+# Export evidence to JSON
+oh evidence export <run-id> --format json
+
+# Create compressed archive
+oh evidence export <run-id> --format archive
+
+# Generate human-readable report
+oh evidence report <run-id>
+
+# Clean up old evidence
+oh evidence cleanup --days 30
+```
+
+## Storage Format
+
+Evidence is stored in JSON Lines format under `~/.openharness/evidence/<run-id>/`:
+
+```
+evidence/
+├── run-123/
+│   ├── run_start.jsonl
+│   ├── task_start.jsonl
+│   ├── conversation_message.jsonl
+│   └── run_end.jsonl
+└── run-456/
+    └── ...
+```
+
+Each line contains a complete evidence record:
+
+```json
+{
+  "id": "uuid",
+  "timestamp": 1234567890.123,
+  "type": "run_start",
+  "run_id": "run-123",
+  "agent_id": "agent-1",
+  "session_id": "session-456",
+  "cwd": "/workspace",
+  "command_line": "oh --model gpt-4"
+}
+```
+
+## Integration Points
+
+The evidence layer integrates with existing OpenHarness components:
+
+- **Task Manager**: Automatically records task lifecycle events
+- **Query Engine**: Captures conversation history and tool usage
+- **Hook System**: Records hook execution results
+- **Swarm Coordinator**: Tracks multi-agent interactions
+- **Error Handling**: Captures exceptions and failures
+
+## Configuration
+
+Evidence collection can be configured through:
+
+- Environment variables
+- Configuration files
+- Programmatic settings
+
+The evidence directory location can be customized by setting the `EvidenceStore` base directory.
+
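+For example, to point the store at a non-default location (a sketch; the keyword argument name is an assumption, so check the `EvidenceStore` constructor for the actual parameter):
+
+```python
+from pathlib import Path
+from openharness.evidence import EvidenceStore
+
+# Store evidence outside the default ~/.openharness/evidence location.
+store = EvidenceStore(base_dir=Path("/var/log/openharness/evidence"))
+```
+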
+## Performance Considerations
+
+- Evidence is written asynchronously to minimize impact on agent performance
+- Large evidence collections can be archived and cleaned up automatically
+- JSON Lines format allows for efficient streaming and partial reads
+- Compression is used for long-term storage
+
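+As a rough sketch of the streaming point (assuming the per-type `.jsonl` layout shown above), a run's evidence can be consumed lazily, one record at a time, without loading the whole run into memory:
+
+```python
+import json
+from pathlib import Path
+
+def iter_records(run_dir):
+    """Yield evidence records from a run directory, one JSON line at a time."""
+    for path in sorted(Path(run_dir).glob("*.jsonl")):
+        with open(path, encoding="utf-8") as f:
+            for line in f:
+                if line.strip():
+                    yield json.loads(line)
+```
+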
+## Security
+
+Evidence may contain sensitive information such as:
+
+- API keys (redacted in storage)
+- File paths and contents
+- Conversation history
+- Error messages
+
+Consider access controls and encryption for production deployments.
\ No newline at end of file
diff --git a/src/openharness/evidence/__init__.py b/src/openharness/evidence/__init__.py
new file mode 100644
index 00000000..cc01406e
--- /dev/null
+++ b/src/openharness/evidence/__init__.py
@@ -0,0 +1,33 @@
+"""Run-level evidence layer for structured archiving of agent runs."""
+
+from __future__ import annotations
+
+from openharness.evidence.archiver import EvidenceArchiver
+from openharness.evidence.collector import EvidenceCollector
+from openharness.evidence.store import EvidenceStore
+from openharness.evidence.types import (
+    EvidenceRecord,
+    EvidenceType,
+    RunEvidence,
+    TaskEvidence,
+    ConversationEvidence,
+    HookEvidence,
+    StateEvidence,
+    PerformanceEvidence,
+    ErrorEvidence,
+)
+
+__all__ = [
+    "EvidenceArchiver",
+    "EvidenceCollector",
+    "EvidenceStore",
+    "EvidenceRecord",
+    "EvidenceType",
+    "RunEvidence",
+    "TaskEvidence",
+    "ConversationEvidence",
+    "HookEvidence",
+    "StateEvidence",
+    "PerformanceEvidence",
+    "ErrorEvidence",
+]
\ No newline at end of file
diff --git a/src/openharness/evidence/archiver.py b/src/openharness/evidence/archiver.py
new file mode 100644
index 00000000..b6f312e6
--- /dev/null
+++ b/src/openharness/evidence/archiver.py
@@ -0,0 +1,175 @@
+"""Evidence archiving and management utilities."""
+
+from __future__ import annotations
+
+import json
+import time
+from pathlib import Path
+from typing import Any
+from uuid import uuid4
+
+from openharness.evidence.store import EvidenceStore
+
+
+class EvidenceArchiver:
+ """Utilities for archiving and managing evidence collections."""
+
+ def __init__(self, store: EvidenceStore | None = None) -> None:
+ self.store = store or EvidenceStore()
+
+ def create_run_archive(
+ self,
+ run_id: str,
+ archive_path: Path | None = None,
+ include_metadata: bool = True,
+ ) -> Path:
+ """Create a compressed archive of all evidence for a run."""
+ if archive_path is None:
+ timestamp = int(time.time())
+ archive_path = self.store.base_dir / f"{run_id}_{timestamp}.tar.gz"
+
+ self.store.archive_run(run_id, archive_path)
+ return archive_path
+
+ def export_run_to_json(
+ self,
+ run_id: str,
+ output_path: Path | None = None,
+ pretty: bool = True,
+ ) -> Path:
+ """Export all evidence for a run to a single JSON file."""
+ if output_path is None:
+ output_path = self.store.base_dir / f"{run_id}_export.json"
+
+ evidence_list = list(self.store.get_evidence(run_id))
+ evidence_data = [evidence.__dict__ for evidence in evidence_list]
+
+ with open(output_path, "w", encoding="utf-8") as f:
+ json.dump(
+ {
+ "run_id": run_id,
+ "export_timestamp": time.time(),
+ "evidence_count": len(evidence_data),
+ "evidence": evidence_data,
+ },
+ f,
+ indent=2 if pretty else None,
+ ensure_ascii=False,
+ )
+
+ return output_path
+
+ def import_run_from_json(self, json_path: Path, new_run_id: str | None = None) -> str:
+ """Import evidence from a JSON export file."""
+ with open(json_path, "r", encoding="utf-8") as f:
+ data = json.load(f)
+
+ run_id = new_run_id or data["run_id"] or str(uuid4())
+
+        from openharness.evidence.types import EvidenceRecord
+
+        # Rebuild each exported dict as a generic EvidenceRecord
+        for evidence_dict in data["evidence"]:
+
+ evidence = EvidenceRecord()
+ for key, value in evidence_dict.items():
+ if hasattr(evidence, key):
+ setattr(evidence, key, value)
+
+ # Override run_id if specified
+ if new_run_id:
+ evidence.run_id = new_run_id
+
+ self.store.store_evidence(evidence)
+
+ return run_id
+
+ def create_run_report(
+ self,
+ run_id: str,
+ report_path: Path | None = None,
+ include_details: bool = True,
+ ) -> Path:
+ """Create a human-readable report of a run's evidence."""
+ if report_path is None:
+ report_path = self.store.base_dir / f"{run_id}_report.md"
+
+ summary = self.store.get_run_summary(run_id)
+ evidence_list = list(self.store.get_evidence(run_id))
+
+ with open(report_path, "w", encoding="utf-8") as f:
+ f.write(f"# Run Evidence Report: {run_id}\n\n")
+
+ f.write("## Summary\n\n")
+ f.write(f"- **Total Records**: {summary['total_records']}\n")
+ if summary['time_range']['start'] and summary['time_range']['end']:
+ duration = summary['time_range']['end'] - summary['time_range']['start']
+ f.write(f"- **Duration**: {duration:.2f} seconds\n")
+ f.write(f"- **Time Range**: {time.ctime(summary['time_range']['start'])} - {time.ctime(summary['time_range']['end'])}\n")
+
+ f.write("\n## Evidence Counts\n\n")
+ for evidence_type, count in summary['evidence_counts'].items():
+ f.write(f"- **{evidence_type}**: {count}\n")
+
+ if include_details:
+ f.write("\n## Detailed Evidence\n\n")
+
+ # Group by type
+ by_type = {}
+ for evidence in evidence_list:
+ by_type.setdefault(evidence.type, []).append(evidence)
+
+ for evidence_type, records in by_type.items():
+ f.write(f"### {evidence_type.title()}\n\n")
+
+ for record in sorted(records, key=lambda r: r.timestamp):
+ f.write(f"**{time.ctime(record.timestamp)}**\n\n")
+
+ # Show relevant fields based on type
+ if hasattr(record, 'description') and record.description:
+ f.write(f"- Description: {record.description}\n")
+ if hasattr(record, 'status') and record.status:
+ f.write(f"- Status: {record.status}\n")
+ if hasattr(record, 'error_message') and record.error_message:
+ f.write(f"- Error: {record.error_message}\n")
+ if hasattr(record, 'content') and record.content:
+                        content_preview = (record.content[:200] + "...") if len(record.content) > 200 else record.content
+ f.write(f"- Content: {content_preview}\n")
+
+ f.write("\n")
+
+ return report_path
+
+ def cleanup_archives(self, max_age_days: int = 30) -> dict[str, int]:
+ """Clean up old evidence archives and runs."""
+ results = {
+ "removed_runs": self.store.cleanup_old_runs(max_age_days),
+ "removed_archives": 0,
+ }
+
+        # Also clean up archive files (*.tar.gz in the base directory)
+        # that are older than the cutoff
+ cutoff_time = time.time() - (max_age_days * 24 * 60 * 60)
+
+ for archive_file in self.store.base_dir.glob("*.tar.gz"):
+ if archive_file.stat().st_mtime < cutoff_time:
+ archive_file.unlink()
+ results["removed_archives"] += 1
+
+ return results
+
+ def list_archives(self) -> list[dict[str, Any]]:
+ """List all available evidence archives."""
+ archives = []
+
+ for archive_file in self.store.base_dir.glob("*.tar.gz"):
+ stat = archive_file.stat()
+ archives.append({
+ "path": archive_file,
+ "name": archive_file.name,
+ "size": stat.st_size,
+ "created": stat.st_ctime,
+ "modified": stat.st_mtime,
+ })
+
+ return sorted(archives, key=lambda x: x["created"], reverse=True)
\ No newline at end of file
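The archive/list cycle in `EvidenceArchiver` reduces to `tarfile` plus `Path.glob`. A minimal standalone sketch of that pattern, using only the stdlib (the helper names here are illustrative, not part of the package):

```python
import tarfile
import tempfile
from pathlib import Path
from typing import Any


def archive_run_dir(run_dir: Path, archive_path: Path) -> None:
    # Pack the whole run directory, keyed by its name, as archive_run does
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(run_dir, arcname=run_dir.name)


def list_archives(base: Path) -> list[dict[str, Any]]:
    # Newest-first listing of *.tar.gz files, mirroring list_archives
    archives = []
    for f in base.glob("*.tar.gz"):
        st = f.stat()
        archives.append({"name": f.name, "size": st.st_size, "modified": st.st_mtime})
    return sorted(archives, key=lambda a: a["modified"], reverse=True)


with tempfile.TemporaryDirectory() as d:
    base = Path(d)
    run_dir = base / "run-1"
    run_dir.mkdir()
    (run_dir / "run_start.jsonl").write_text('{"type": "run_start"}\n')

    archive_run_dir(run_dir, base / "run-1.tar.gz")
    assert list_archives(base)[0]["name"] == "run-1.tar.gz"

    # The archive preserves the run-id prefix on member paths
    with tarfile.open(base / "run-1.tar.gz") as tar:
        assert "run-1/run_start.jsonl" in tar.getnames()
```

Because `arcname` is the run id, extracting an archive recreates the same `<run_id>/<type>.jsonl` layout the store reads from.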
diff --git a/src/openharness/evidence/collector.py b/src/openharness/evidence/collector.py
new file mode 100644
index 00000000..37ad79a5
--- /dev/null
+++ b/src/openharness/evidence/collector.py
@@ -0,0 +1,306 @@
+"""Evidence collection system for capturing run-level data."""
+
+from __future__ import annotations
+
+import time
+import traceback
+from contextlib import asynccontextmanager
+from typing import Any, AsyncIterator
+from uuid import uuid4
+
+from openharness.engine.messages import ConversationMessage
+from openharness.evidence.store import EvidenceStore
+from openharness.evidence.types import (
+ ConversationEvidence,
+ ErrorEvidence,
+ HookEvidence,
+ PerformanceEvidence,
+ RunEvidence,
+ StateEvidence,
+ TaskEvidence,
+)
+from openharness.hooks.types import AggregatedHookResult
+from openharness.tasks.types import TaskRecord
+
+
+class EvidenceCollector:
+ """Collects and stores evidence from agent runs."""
+
+ def __init__(self, run_id: str | None = None, store: EvidenceStore | None = None) -> None:
+ self.run_id = run_id or str(uuid4())
+ self.store = store or EvidenceStore()
+ self.agent_id = ""
+ self._start_time = time.time()
+
+ def set_agent_id(self, agent_id: str) -> None:
+ """Set the current agent ID for evidence records."""
+ self.agent_id = agent_id
+
+ def record_run_start(
+ self,
+ session_id: str = "",
+ cwd: str = "",
+ command_line: str = "",
+ config: dict[str, Any] | None = None,
+ environment: dict[str, str] | None = None,
+ ) -> None:
+ """Record the start of a run."""
+ evidence = RunEvidence(
+ type="run_start",
+ run_id=self.run_id,
+ agent_id=self.agent_id,
+ session_id=session_id,
+ cwd=cwd,
+ command_line=command_line,
+ config=config or {},
+ environment=environment or {},
+ timestamp=self._start_time,
+ )
+ self.store.store_evidence(evidence)
+
+ def record_run_end(self, final_status: str = "completed") -> None:
+ """Record the end of a run."""
+ evidence = RunEvidence(
+ type="run_end",
+ run_id=self.run_id,
+ agent_id=self.agent_id,
+ metadata={"final_status": final_status, "duration": time.time() - self._start_time},
+ )
+ self.store.store_evidence(evidence)
+
+ def record_task_start(self, task: TaskRecord) -> None:
+ """Record the start of a task."""
+ evidence = TaskEvidence(
+ type="task_start",
+ run_id=self.run_id,
+ agent_id=self.agent_id,
+ task_id=task.id,
+ task_type=task.type,
+ description=task.description,
+ status=task.status,
+ command=task.command,
+ cwd=task.cwd,
+ output_file=str(task.output_file),
+ metadata={"created_at": task.created_at, "started_at": task.started_at},
+ )
+ self.store.store_evidence(evidence)
+
+ def record_task_progress(self, task_id: str, progress_data: dict[str, Any]) -> None:
+ """Record progress on a task."""
+ evidence = TaskEvidence(
+ type="task_progress",
+ run_id=self.run_id,
+ agent_id=self.agent_id,
+ task_id=task_id,
+ metadata=progress_data,
+ )
+ self.store.store_evidence(evidence)
+
+ def record_task_end(self, task: TaskRecord) -> None:
+ """Record the end of a task."""
+ duration = 0.0
+ if task.started_at and task.ended_at:
+ duration = task.ended_at - task.started_at
+
+ evidence = TaskEvidence(
+ type="task_end",
+ run_id=self.run_id,
+ agent_id=self.agent_id,
+ task_id=task.id,
+ status=task.status,
+ return_code=task.return_code,
+ duration=duration,
+ metadata={
+ "ended_at": task.ended_at,
+ "return_code": task.return_code,
+ "duration": duration,
+ },
+ )
+ self.store.store_evidence(evidence)
+
+ def record_conversation_message(
+ self,
+ message: ConversationMessage,
+ token_count: int = 0,
+ model: str = "",
+ ) -> None:
+ """Record a conversation message."""
+ evidence = ConversationEvidence(
+ type="conversation_message",
+ run_id=self.run_id,
+ agent_id=self.agent_id,
+ message_type=message.message_type,
+ content=message.content,
+ role=getattr(message, "role", ""),
+ tool_calls=getattr(message, "tool_calls", []),
+ tool_results=getattr(message, "tool_results", []),
+ token_count=token_count,
+ model=model,
+ metadata={"message_id": getattr(message, "id", "")},
+ )
+ self.store.store_evidence(evidence)
+
+ def record_tool_call(
+ self,
+ tool_name: str,
+ arguments: dict[str, Any],
+ tool_call_id: str = "",
+ ) -> None:
+ """Record a tool call."""
+ evidence = ConversationEvidence(
+ type="tool_call",
+ run_id=self.run_id,
+ agent_id=self.agent_id,
+ metadata={
+ "tool_name": tool_name,
+ "arguments": arguments,
+ "tool_call_id": tool_call_id,
+ },
+ )
+ self.store.store_evidence(evidence)
+
+ def record_tool_result(
+ self,
+ tool_call_id: str,
+ result: Any,
+ success: bool = True,
+ error_message: str = "",
+ ) -> None:
+ """Record a tool result."""
+ evidence = ConversationEvidence(
+ type="tool_result",
+ run_id=self.run_id,
+ agent_id=self.agent_id,
+ metadata={
+ "tool_call_id": tool_call_id,
+ "result": str(result),
+ "success": success,
+ "error_message": error_message,
+ },
+ )
+ self.store.store_evidence(evidence)
+
+ def record_hook_execution(
+ self,
+ event: str,
+ result: AggregatedHookResult,
+ duration: float = 0.0,
+ ) -> None:
+ """Record hook execution results."""
+ for hook_result in result.results:
+ evidence = HookEvidence(
+ type="hook_execution",
+ run_id=self.run_id,
+ agent_id=self.agent_id,
+ event=event,
+ hook_type=hook_result.hook_type,
+ success=hook_result.success,
+ output=hook_result.output,
+ blocked=hook_result.blocked,
+ reason=hook_result.reason,
+ duration=duration,
+ metadata=hook_result.metadata,
+ )
+ self.store.store_evidence(evidence)
+
+ def record_state_change(
+ self,
+ state_type: str,
+ previous_state: dict[str, Any],
+ new_state: dict[str, Any],
+ change_reason: str = "",
+ ) -> None:
+ """Record a state change."""
+ evidence = StateEvidence(
+ type="state_change",
+ run_id=self.run_id,
+ agent_id=self.agent_id,
+ state_type=state_type,
+ previous_state=previous_state,
+ new_state=new_state,
+ change_reason=change_reason,
+ )
+ self.store.store_evidence(evidence)
+
+ def record_performance_metric(
+ self,
+ metric_name: str,
+ value: float,
+ unit: str = "",
+ category: str = "",
+ context: dict[str, Any] | None = None,
+ ) -> None:
+ """Record a performance metric."""
+ evidence = PerformanceEvidence(
+ type="performance_metric",
+ run_id=self.run_id,
+ agent_id=self.agent_id,
+ metric_name=metric_name,
+ value=value,
+ unit=unit,
+ category=category,
+ context=context or {},
+ )
+ self.store.store_evidence(evidence)
+
+ def record_error(
+ self,
+ error_type: str,
+ error_message: str,
+ context: dict[str, Any] | None = None,
+ exc: Exception | None = None,
+ recoverable: bool = False,
+ ) -> None:
+ """Record an error or exception."""
+ tb_str = ""
+ if exc:
+ tb_str = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
+
+ evidence = ErrorEvidence(
+ type="error",
+ run_id=self.run_id,
+ agent_id=self.agent_id,
+ error_type=error_type,
+ error_message=error_message,
+ traceback=tb_str,
+ context=context or {},
+ recoverable=recoverable,
+ )
+ self.store.store_evidence(evidence)
+
+ @asynccontextmanager
+ async def collect_run_evidence(
+ self,
+ session_id: str = "",
+ cwd: str = "",
+ command_line: str = "",
+ config: dict[str, Any] | None = None,
+ environment: dict[str, str] | None = None,
+ ) -> AsyncIterator[EvidenceCollector]:
+ """Context manager for collecting evidence for an entire run."""
+ try:
+ self.record_run_start(
+ session_id=session_id,
+ cwd=cwd,
+ command_line=command_line,
+ config=config,
+ environment=environment,
+ )
+ yield self
+ except Exception as e:
+ self.record_error(
+ "run_execution_error",
+ str(e),
+ context={"phase": "run_execution"},
+ exc=e,
+ )
+ raise
+ finally:
+ self.record_run_end()
+
+ def get_run_summary(self) -> dict[str, Any]:
+ """Get a summary of the current run's evidence."""
+ return self.store.get_run_summary(self.run_id)
\ No newline at end of file
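`collect_run_evidence` follows a standard start/error/end lifecycle: the error record is written only on failure, while the end record is written unconditionally. Stripped of the store, the pattern can be sketched like this (event strings are illustrative):

```python
import asyncio
from contextlib import asynccontextmanager

events: list[str] = []


@asynccontextmanager
async def collect(run_id: str):
    events.append(f"run_start:{run_id}")
    try:
        yield run_id
    except Exception as e:
        # Failure is recorded, then re-raised to the caller
        events.append(f"error:{e}")
        raise
    finally:
        # run_end is recorded even when the body raises
        events.append(f"run_end:{run_id}")


async def main() -> None:
    async with collect("run-1"):
        pass
    try:
        async with collect("run-2"):
            raise ValueError("boom")
    except ValueError:
        pass


asyncio.run(main())
assert events == [
    "run_start:run-1", "run_end:run-1",
    "run_start:run-2", "error:boom", "run_end:run-2",
]
```

The re-raise in the `except` branch matters: evidence collection observes failures but never swallows them, so callers still see the original exception.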
diff --git a/src/openharness/evidence/store.py b/src/openharness/evidence/store.py
new file mode 100644
index 00000000..8b800b9d
--- /dev/null
+++ b/src/openharness/evidence/store.py
@@ -0,0 +1,179 @@
+"""Evidence storage and retrieval system."""
+
+from __future__ import annotations
+
+import json
+import time
+from pathlib import Path
+from typing import Any, Iterator
+
+from openharness.evidence.types import EvidenceRecord
+
+
+class EvidenceStore:
+ """Structured storage for run-level evidence."""
+
+ def __init__(self, base_dir: Path | None = None) -> None:
+ if base_dir is None:
+ # Lazy import to avoid dependency issues during testing
+ try:
+ from openharness.config.paths import get_data_dir
+ self.base_dir = get_data_dir() / "evidence"
+ except ImportError:
+ # Fallback for testing without full environment
+ self.base_dir = Path.home() / ".openharness" / "evidence"
+ else:
+ self.base_dir = base_dir
+ self.base_dir.mkdir(parents=True, exist_ok=True)
+
+ def _get_run_dir(self, run_id: str) -> Path:
+ """Get the directory for a specific run."""
+ return self.base_dir / run_id
+
+ def _get_evidence_file(self, run_id: str, evidence_type: str) -> Path:
+ """Get the file path for evidence of a specific type."""
+ run_dir = self._get_run_dir(run_id)
+ run_dir.mkdir(parents=True, exist_ok=True)
+ return run_dir / f"{evidence_type}.jsonl"
+
+ def store_evidence(self, evidence: EvidenceRecord) -> None:
+ """Store an evidence record."""
+ if not evidence.timestamp:
+ evidence.timestamp = time.time()
+
+ file_path = self._get_evidence_file(evidence.run_id, evidence.type)
+ record_data = {
+ "id": evidence.id,
+ "timestamp": evidence.timestamp,
+ "type": evidence.type,
+ "run_id": evidence.run_id,
+ "agent_id": evidence.agent_id,
+ "metadata": evidence.metadata,
+ **{
+ k: v for k, v in evidence.__dict__.items()
+ if k not in {"id", "timestamp", "type", "run_id", "agent_id", "metadata"}
+ and v is not None and v != "" and v != [] and v != {}
+ }
+ }
+
+ with open(file_path, "a", encoding="utf-8") as f:
+ json.dump(record_data, f, ensure_ascii=False)
+ f.write("\n")
+
+ def get_evidence(
+ self,
+ run_id: str,
+ evidence_type: str | None = None,
+ start_time: float | None = None,
+ end_time: float | None = None,
+ ) -> Iterator[EvidenceRecord]:
+ """Retrieve evidence records for a run."""
+        if evidence_type:
+            # Build the path directly so a read never creates directories
+            files = [self._get_run_dir(run_id) / f"{evidence_type}.jsonl"]
+ else:
+ run_dir = self._get_run_dir(run_id)
+ if not run_dir.exists():
+ return
+ files = list(run_dir.glob("*.jsonl"))
+
+ for file_path in files:
+ if not file_path.exists():
+ continue
+
+ with open(file_path, "r", encoding="utf-8") as f:
+ for line in f:
+ if not line.strip():
+ continue
+
+ try:
+ data = json.loads(line)
+                        if start_time is not None and data["timestamp"] < start_time:
+                            continue
+                        if end_time is not None and data["timestamp"] > end_time:
+                            continue
+
+ # Create the appropriate evidence record type
+ evidence = EvidenceRecord(
+ id=data["id"],
+ timestamp=data["timestamp"],
+ type=data["type"],
+ run_id=data["run_id"],
+ agent_id=data.get("agent_id", ""),
+ metadata=data.get("metadata", {}),
+ )
+
+ # Add type-specific fields
+ for k, v in data.items():
+ if k not in {"id", "timestamp", "type", "run_id", "agent_id", "metadata"}:
+ setattr(evidence, k, v)
+
+ yield evidence
+ except (json.JSONDecodeError, KeyError):
+ continue
+
+ def list_runs(self) -> list[str]:
+ """List all run IDs that have evidence."""
+ if not self.base_dir.exists():
+ return []
+
+ return [d.name for d in self.base_dir.iterdir() if d.is_dir()]
+
+ def get_run_summary(self, run_id: str) -> dict[str, Any]:
+ """Get a summary of evidence for a run."""
+ summary = {
+ "run_id": run_id,
+ "evidence_counts": {},
+ "time_range": {"start": None, "end": None},
+ "total_records": 0,
+ }
+
+ for evidence in self.get_evidence(run_id):
+ summary["total_records"] += 1
+
+ # Count by type
+ summary["evidence_counts"][evidence.type] = (
+ summary["evidence_counts"].get(evidence.type, 0) + 1
+ )
+
+ # Track time range
+ if summary["time_range"]["start"] is None or evidence.timestamp < summary["time_range"]["start"]:
+ summary["time_range"]["start"] = evidence.timestamp
+ if summary["time_range"]["end"] is None or evidence.timestamp > summary["time_range"]["end"]:
+ summary["time_range"]["end"] = evidence.timestamp
+
+ return summary
+
+ def archive_run(self, run_id: str, archive_path: Path) -> None:
+ """Archive all evidence for a run to a compressed file."""
+ import tarfile
+
+ run_dir = self._get_run_dir(run_id)
+ if not run_dir.exists():
+ raise FileNotFoundError(f"No evidence found for run {run_id}")
+
+ with tarfile.open(archive_path, "w:gz") as tar:
+ tar.add(run_dir, arcname=run_id)
+
+ def cleanup_old_runs(self, max_age_days: int) -> int:
+ """Remove evidence for runs older than the specified age."""
+ import shutil
+
+ cutoff_time = time.time() - (max_age_days * 24 * 60 * 60)
+ removed_count = 0
+
+ for run_dir in self.base_dir.iterdir():
+ if not run_dir.is_dir():
+ continue
+
+ # Check if any evidence file is older than cutoff
+ should_remove = True
+ for evidence_file in run_dir.glob("*.jsonl"):
+ if evidence_file.stat().st_mtime > cutoff_time:
+ should_remove = False
+ break
+
+ if should_remove:
+ shutil.rmtree(run_dir)
+ removed_count += 1
+
+ return removed_count
\ No newline at end of file
diff --git a/src/openharness/evidence/types.py b/src/openharness/evidence/types.py
new file mode 100644
index 00000000..0a89481b
--- /dev/null
+++ b/src/openharness/evidence/types.py
@@ -0,0 +1,121 @@
+"""Evidence data models for run-level archiving."""
+
+from __future__ import annotations
+
+from dataclasses import dataclass, field
+from typing import Any, Literal
+from uuid import uuid4
+
+
+EvidenceType = Literal[
+ "run_start",
+ "run_end",
+ "task_start",
+ "task_progress",
+ "task_end",
+ "conversation_message",
+ "tool_call",
+ "tool_result",
+ "hook_execution",
+ "state_change",
+ "performance_metric",
+ "error",
+]
+
+
+@dataclass
+class EvidenceRecord:
+ """Base class for all evidence records."""
+
+ id: str = field(default_factory=lambda: str(uuid4()))
+ timestamp: float = 0.0
+ type: EvidenceType = "run_start"
+ run_id: str = ""
+ agent_id: str = ""
+ metadata: dict[str, Any] = field(default_factory=dict)
+
+
+@dataclass
+class RunEvidence(EvidenceRecord):
+ """Evidence for run lifecycle events."""
+
+ session_id: str = ""
+ cwd: str = ""
+ command_line: str = ""
+ config: dict[str, Any] = field(default_factory=dict)
+ environment: dict[str, str] = field(default_factory=dict)
+
+
+@dataclass
+class TaskEvidence(EvidenceRecord):
+ """Evidence for task execution."""
+
+ task_id: str = ""
+ task_type: str = ""
+ description: str = ""
+ status: str = ""
+ command: str = ""
+ cwd: str = ""
+ output_file: str = ""
+ return_code: int | None = None
+ duration: float = 0.0
+ error_message: str = ""
+
+
+@dataclass
+class ConversationEvidence(EvidenceRecord):
+ """Evidence for conversation messages."""
+
+ message_type: str = "" # "user", "assistant", "system", "tool"
+ content: str = ""
+ role: str = ""
+ tool_calls: list[dict[str, Any]] = field(default_factory=list)
+ tool_results: list[dict[str, Any]] = field(default_factory=list)
+ token_count: int = 0
+ model: str = ""
+
+
+@dataclass
+class HookEvidence(EvidenceRecord):
+ """Evidence for hook executions."""
+
+ event: str = ""
+ hook_type: str = ""
+ success: bool = True
+ output: str = ""
+ blocked: bool = False
+ reason: str = ""
+ duration: float = 0.0
+
+
+@dataclass
+class StateEvidence(EvidenceRecord):
+ """Evidence for application state changes."""
+
+ state_type: str = "" # "app_state", "task_state", "swarm_state"
+ previous_state: dict[str, Any] = field(default_factory=dict)
+ new_state: dict[str, Any] = field(default_factory=dict)
+ change_reason: str = ""
+
+
+@dataclass
+class PerformanceEvidence(EvidenceRecord):
+ """Evidence for performance metrics."""
+
+ metric_name: str = ""
+ value: float = 0.0
+ unit: str = ""
+ category: str = "" # "cost", "latency", "throughput", "resource"
+ context: dict[str, Any] = field(default_factory=dict)
+
+
+@dataclass
+class ErrorEvidence(EvidenceRecord):
+ """Evidence for errors and exceptions."""
+
+ error_type: str = ""
+ error_message: str = ""
+ traceback: str = ""
+ context: dict[str, Any] = field(default_factory=dict)
+ recoverable: bool = False
\ No newline at end of file
diff --git a/src/openharness/platforms.py b/src/openharness/platforms.py
index bfd66ad9..ccebf609 100644
--- a/src/openharness/platforms.py
+++ b/src/openharness/platforms.py
@@ -36,7 +36,7 @@ def detect_platform(
if system == "darwin":
return "macos"
- if system == "windows":
+ if system in ("windows", "win32"):
return "windows"
if system == "linux":
if "microsoft" in kernel_release or env_map.get("WSL_DISTRO_NAME") or env_map.get("WSL_INTEROP"):
diff --git a/src/openharness/swarm/lockfile.py b/src/openharness/swarm/lockfile.py
index 335696d9..0480eafe 100644
--- a/src/openharness/swarm/lockfile.py
+++ b/src/openharness/swarm/lockfile.py
@@ -40,7 +40,10 @@ def exclusive_file_lock(
@contextmanager
def _exclusive_posix_lock(lock_path: Path) -> Iterator[None]:
- import fcntl
+ try:
+ import fcntl
+ except ImportError as e:
+ raise SwarmLockUnavailableError(f"fcntl not available: {e}") from e
lock_path.parent.mkdir(parents=True, exist_ok=True)
lock_path.touch(exist_ok=True)
@@ -54,7 +57,10 @@ def _exclusive_posix_lock(lock_path: Path) -> Iterator[None]:
@contextmanager
def _exclusive_windows_lock(lock_path: Path) -> Iterator[None]:
- import msvcrt
+ try:
+ import msvcrt
+ except ImportError as e:
+ raise SwarmLockUnavailableError(f"msvcrt not available: {e}") from e
lock_path.parent.mkdir(parents=True, exist_ok=True)
with lock_path.open("a+b") as lock_file:
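The guarded imports above turn a missing platform module into a typed, catchable error instead of a bare `ImportError` at lock time. The pattern in isolation (the exception name mirrors the diff; the generic module lookup is illustrative):

```python
import importlib
from types import ModuleType


class SwarmLockUnavailableError(RuntimeError):
    """Raised when no file-locking backend exists on this platform."""


def load_lock_module(name: str) -> ModuleType:
    try:
        return importlib.import_module(name)
    except ImportError as e:
        # Chain the original error so the root cause stays visible
        raise SwarmLockUnavailableError(f"{name} not available: {e}") from e


# One of fcntl (POSIX) or msvcrt (Windows) exists on any supported platform
try:
    mod = load_lock_module("fcntl")
except SwarmLockUnavailableError:
    mod = load_lock_module("msvcrt")
assert mod is not None

# A module that exists nowhere surfaces as the typed error
raised = False
try:
    load_lock_module("definitely_missing_module")
except SwarmLockUnavailableError as e:
    raised = "not available" in str(e)
assert raised
```

Callers can then catch `SwarmLockUnavailableError` and fall back (or fail with a clear message) without special-casing `ImportError` from deep inside the lock helpers.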
diff --git a/tests/test_evidence.py b/tests/test_evidence.py
new file mode 100644
index 00000000..5253e63d
--- /dev/null
+++ b/tests/test_evidence.py
@@ -0,0 +1,124 @@
+"""Tests for the evidence layer."""
+
+from __future__ import annotations
+
+import tempfile
+from pathlib import Path
+
+from openharness.evidence import EvidenceCollector, EvidenceStore, EvidenceArchiver
+from openharness.evidence.types import RunEvidence, TaskEvidence
+
+
+def test_evidence_store():
+ """Test basic evidence storage and retrieval."""
+ with tempfile.TemporaryDirectory() as temp_dir:
+ store = EvidenceStore(Path(temp_dir))
+
+ # Create and store evidence
+ evidence = RunEvidence(
+ type="run_start",
+ run_id="test-run-123",
+ agent_id="test-agent",
+ session_id="test-session",
+ cwd="/tmp",
+ command_line="test command",
+ )
+ store.store_evidence(evidence)
+
+ # Retrieve evidence
+ records = list(store.get_evidence("test-run-123"))
+ assert len(records) == 1
+ assert records[0].run_id == "test-run-123"
+ assert records[0].type == "run_start"
+
+
+def test_evidence_collector():
+ """Test evidence collection."""
+ with tempfile.TemporaryDirectory() as temp_dir:
+ store = EvidenceStore(Path(temp_dir))
+ collector = EvidenceCollector("test-run-456", store)
+
+ # Record run start
+ collector.record_run_start(
+ session_id="test-session",
+ cwd="/tmp",
+ command_line="test command",
+ )
+
+        # Record a task via a lightweight stub exposing the attributes
+        # record_task_start reads (id, type, created_at, started_at, ...);
+        # TaskEvidence lacks created_at/started_at and would raise here
+        from types import SimpleNamespace
+
+        collector.record_task_start(
+            SimpleNamespace(
+                id="task-123",
+                type="local_agent",
+                description="Test task",
+                status="running",
+                command="echo hello",
+                cwd="/tmp",
+                output_file=Path("/tmp/task.log"),
+                created_at=0.0,
+                started_at=0.0,
+            )
+        )
+
+ # Check evidence was stored
+ records = list(store.get_evidence("test-run-456"))
+ assert len(records) == 2
+
+ run_records = [r for r in records if r.type == "run_start"]
+ task_records = [r for r in records if r.type == "task_start"]
+
+ assert len(run_records) == 1
+ assert len(task_records) == 1
+ assert task_records[0].task_id == "task-123"
+
+
+def test_evidence_archiver():
+ """Test evidence archiving."""
+ with tempfile.TemporaryDirectory() as temp_dir:
+ temp_path = Path(temp_dir)
+ store = EvidenceStore(temp_path)
+ archiver = EvidenceArchiver(store)
+
+ # Create some evidence
+ evidence = RunEvidence(
+ type="run_start",
+ run_id="archive-test-run",
+ agent_id="test-agent",
+ )
+ store.store_evidence(evidence)
+
+ # Export to JSON
+ json_path = archiver.export_run_to_json("archive-test-run")
+ assert json_path.exists()
+
+ # Create archive
+ archive_path = archiver.create_run_archive("archive-test-run")
+ assert archive_path.exists()
+
+ # Create report
+ report_path = archiver.create_run_report("archive-test-run")
+ assert report_path.exists()
+ assert "Run Evidence Report" in report_path.read_text()
+
+
+def test_run_summary():
+ """Test run summary generation."""
+ with tempfile.TemporaryDirectory() as temp_dir:
+ store = EvidenceStore(Path(temp_dir))
+
+ # Create multiple evidence records
+ records = [
+ RunEvidence(type="run_start", run_id="summary-test", agent_id="agent1"),
+ TaskEvidence(type="task_start", run_id="summary-test", agent_id="agent1", task_id="task1"),
+ TaskEvidence(type="task_end", run_id="summary-test", agent_id="agent1", task_id="task1"),
+ RunEvidence(type="run_end", run_id="summary-test", agent_id="agent1"),
+ ]
+
+ for record in records:
+ store.store_evidence(record)
+
+ summary = store.get_run_summary("summary-test")
+ assert summary["run_id"] == "summary-test"
+ assert summary["total_records"] == 4
+ assert summary["evidence_counts"]["run_start"] == 1
+ assert summary["evidence_counts"]["run_end"] == 1
+ assert summary["evidence_counts"]["task_start"] == 1
+ assert summary["evidence_counts"]["task_end"] == 1
\ No newline at end of file