diff --git a/README.md b/README.md index 81aea73e..a48e0e61 100644 --- a/README.md +++ b/README.md @@ -150,6 +150,7 @@ OpenHarness is an open-source Python implementation designed for **researchers, Start here: Quick Start · Provider Compatibility · + LLM Providers · Showcase · Contributing · Changelog @@ -242,8 +243,55 @@ oh -p "Fix the bug" --output-format stream-json ## 🔌 Provider Compatibility +OpenHarness supports a wide variety of LLM providers through its extensible provider registry system. The system automatically detects providers based on API keys, base URLs, and model names. + +**📖 [Complete Provider Guide](README_PROVIDERS.md)** | **🚀 [Interactive Demo](scripts/demo_providers.py)** + OpenHarness supports three API formats: **Anthropic** (default), **OpenAI-compatible** (`--api-format openai`), and **GitHub Copilot** (`--api-format copilot`). The OpenAI format covers a wide range of providers. +### Quick Provider Setup + +```bash +# OpenRouter (100+ models) +export OPENROUTER_API_KEY="sk-or-v1-..." +oh --model anthropic/claude-3-haiku + +# DeepSeek +export DEEPSEEK_API_KEY="your-key" +oh --model deepseek-chat + +# Groq (fast inference) +export GROQ_API_KEY="gsk_..." +oh --model llama3-70b-8192 + +# Ollama (local) +oh --base-url http://localhost:11434/v1 --model llama2 +``` + +### Adding New Providers + +To add support for a new LLM provider: + +1. Edit `src/openharness/api/registry.py` +2. Add a `ProviderSpec` to the `PROVIDERS` tuple +3. 
Test with `python scripts/demo_providers.py`
+
+Example:
+```python
+ProviderSpec(
+    name="myprovider",
+    keywords=("myprovider", "myai"),
+    env_key="MYPROVIDER_API_KEY",
+    display_name="MyProvider",
+    backend_type="openai_compat",
+    default_base_url="https://api.myprovider.com/v1",
+    detect_by_key_prefix="mp_",  # optional
+    detect_by_base_keyword="myprovider",  # optional
+),
+```
+
 ### Anthropic Format (default)
 
 | Provider profile | Detection signal | Notes |
diff --git a/README_PROVIDERS.md b/README_PROVIDERS.md
new file mode 100644
index 00000000..e22b2f55
--- /dev/null
+++ b/README_PROVIDERS.md
@@ -0,0 +1,103 @@
+# Adding LLM Providers to OpenHarness
+
+OpenHarness supports a wide variety of LLM providers through its extensible provider registry system. This guide shows you how to add new providers and configure them.
+
+## Quick Start
+
+### For Users: Using Existing Providers
+
+OpenHarness already supports many popular providers. Here's how to use them:
+
+```bash
+# OpenRouter (access to 100+ models)
+export OPENROUTER_API_KEY="sk-or-v1-..."
+oh --model anthropic/claude-3-haiku "Hello world"
+
+# DeepSeek
+export DEEPSEEK_API_KEY="your-key"
+oh --model deepseek-chat "Code review this function"
+
+# Groq (fast inference)
+export GROQ_API_KEY="gsk_..."
+oh --model llama3-70b-8192 "Analyze this code"
+
+# Ollama (local models)
+oh --base-url http://localhost:11434/v1 --model llama2 "Local AI chat"
+```
+
+### For Developers: Adding New Providers
+
+To add a new LLM provider:
+
+1. **Edit the registry** (`src/openharness/api/registry.py`)
+2. **Add your provider spec** to the `PROVIDERS` tuple
+3. 
**Test the configuration** + +Example provider spec: +```python +ProviderSpec( + name="myprovider", + keywords=("myprovider", "myai"), + env_key="MYPROVIDER_API_KEY", + display_name="MyProvider", + backend_type="openai_compat", # or "anthropic" + default_base_url="https://api.myprovider.com/v1", + detect_by_key_prefix="mp_", # optional + detect_by_base_keyword="myprovider", # optional + is_gateway=False, + is_local=False, + is_oauth=False, +), +``` + +## Demo + +Run the interactive demo to see how providers work: + +```bash +python scripts/demo_providers.py +``` + +This shows: +- How provider detection works +- How to add new providers +- Configuration examples +- CLI usage patterns + +## Supported Providers + +OpenHarness currently supports: + +- **Anthropic** (Claude models) +- **OpenAI** (GPT models) +- **OpenRouter** (100+ models via gateway) +- **DeepSeek** +- **Groq** (fast inference) +- **GitHub Copilot** (OAuth) +- **Ollama** (local models) +- And many more... + +See `docs/LLM_PROVIDERS.md` for the complete list and detailed configuration instructions. + +## Key Concepts + +### Provider Detection Priority +1. API key prefix (e.g., `sk-or-` → OpenRouter) +2. Base URL keywords (e.g., `deepseek.com` → DeepSeek) +3. Model name keywords (e.g., `claude` → Anthropic) + +### Backend Types +- `anthropic`: Native Anthropic SDK (best for Claude) +- `openai_compat`: OpenAI SDK (works with most providers) +- `copilot`: GitHub Copilot OAuth + +### Configuration Methods +- Environment variables: `export PROVIDER_API_KEY="..."` +- Command line: `oh --model model-name --base-url https://...` +- Settings file: `~/.openharness/settings.json` + +## Need Help? 
+ +- Check the [full documentation](docs/LLM_PROVIDERS.md) +- Run the demo: `python scripts/demo_providers.py` +- Test detection: `oh --model your-model-name --dry-run` \ No newline at end of file diff --git a/docs/LLM_PROVIDERS.md b/docs/LLM_PROVIDERS.md new file mode 100644 index 00000000..e9eb0375 --- /dev/null +++ b/docs/LLM_PROVIDERS.md @@ -0,0 +1,274 @@ +# Adding LLM Providers to OpenHarness + +OpenHarness supports a wide variety of LLM providers through its extensible provider registry system. This guide shows how to add new providers and configure them for use. + +## How Provider Detection Works + +OpenHarness automatically detects providers using a priority system: + +1. **API Key Prefix**: Special key prefixes (e.g., `sk-or-` for OpenRouter) +2. **Base URL Keywords**: Substrings in the API base URL +3. **Model Name Keywords**: Keywords in model names + +## Adding a New Provider + +### Step 1: Add to the Provider Registry + +Edit `src/openharness/api/registry.py` and add your provider to the `PROVIDERS` tuple: + +```python +ProviderSpec( + name="your_provider", + keywords=("keyword1", "keyword2"), # Model name keywords + env_key="YOUR_PROVIDER_API_KEY", # Environment variable name + display_name="Your Provider", # Human-readable name + backend_type="openai_compat", # "anthropic" | "openai_compat" | "copilot" + default_base_url="https://api.yourprovider.com/v1", + detect_by_key_prefix="", # API key prefix (optional) + detect_by_base_keyword="yourprovider", # Base URL keyword (optional) + is_gateway=False, # True if routes to multiple models + is_local=False, # True for local deployments + is_oauth=False, # True for OAuth providers +), +``` + +### Step 2: Configuration + +Users can configure the provider in several ways: + +#### Environment Variables +```bash +export YOUR_PROVIDER_API_KEY="your-api-key-here" +``` + +#### Command Line +```bash +# Auto-detection by model name +oh --model your-model-name + +# Explicit base URL +oh --base-url 
https://api.yourprovider.com/v1 + +# Explicit API format (if needed) +oh --api-format openai +``` + +#### Settings File +```json +{ + "api_key": "your-api-key-here", + "base_url": "https://api.yourprovider.com/v1", + "model": "your-model-name" +} +``` + +## Examples + +### OpenRouter + +OpenRouter is already configured in the registry: + +```python +ProviderSpec( + name="openrouter", + keywords=("openrouter",), + env_key="OPENROUTER_API_KEY", + display_name="OpenRouter", + backend_type="openai_compat", + default_base_url="https://openrouter.ai/api/v1", + detect_by_key_prefix="sk-or-", + detect_by_base_keyword="openrouter", + is_gateway=True, + is_local=False, + is_oauth=False, +), +``` + +Usage: +```bash +export OPENROUTER_API_KEY="sk-or-..." +oh --model openai/gpt-4o-mini +``` + +### Adding a Custom Provider + +Let's add support for a hypothetical provider called "ExampleAI": + +1. **Add to registry**: +```python +ProviderSpec( + name="exampleai", + keywords=("example", "exampleai"), + env_key="EXAMPLEAI_API_KEY", + display_name="ExampleAI", + backend_type="openai_compat", + default_base_url="https://api.exampleai.com/v1", + detect_by_key_prefix="exa_", + detect_by_base_keyword="exampleai", + is_gateway=False, + is_local=False, + is_oauth=False, +), +``` + +2. 
**Usage**: +```bash +export EXAMPLEAI_API_KEY="exa_your_key_here" +oh --model example/gpt-4 + +# Or with explicit base URL +oh --base-url https://api.exampleai.com/v1 --model gpt-4 +``` + +### Popular Providers + +Here are some popular providers and their configurations: + +#### Anthropic (Native) +- **Backend**: `anthropic` +- **Models**: `claude-3-5-sonnet-20241022`, `claude-3-haiku-20240307` +- **Key**: `ANTHROPIC_API_KEY` + +#### OpenAI +- **Backend**: `openai_compat` +- **Models**: `gpt-4o`, `gpt-4-turbo` +- **Key**: `OPENAI_API_KEY` +- **Base URL**: `https://api.openai.com/v1` + +#### DeepSeek +```python +ProviderSpec( + name="deepseek", + keywords=("deepseek",), + env_key="DEEPSEEK_API_KEY", + display_name="DeepSeek", + backend_type="openai_compat", + default_base_url="https://api.deepseek.com/v1", + detect_by_key_prefix="", + detect_by_base_keyword="deepseek", + is_gateway=False, + is_local=False, + is_oauth=False, +), +``` + +Usage: +```bash +export DEEPSEEK_API_KEY="your-key" +oh --model deepseek-chat +``` + +#### Groq +```python +ProviderSpec( + name="groq", + keywords=("groq",), + env_key="GROQ_API_KEY", + display_name="Groq", + backend_type="openai_compat", + default_base_url="https://api.groq.com/openai/v1", + detect_by_key_prefix="gsk_", + detect_by_base_keyword="groq", + is_gateway=False, + is_local=False, + is_oauth=False, +), +``` + +Usage: +```bash +export GROQ_API_KEY="gsk_..." 
+oh --model llama3-70b-8192 +``` + +#### Ollama (Local) +```python +ProviderSpec( + name="ollama", + keywords=("ollama",), + env_key="", + display_name="Ollama", + backend_type="openai_compat", + default_base_url="http://localhost:11434/v1", + detect_by_key_prefix="", + detect_by_base_keyword="localhost:11434", + is_gateway=False, + is_local=True, + is_oauth=False, +), +``` + +Usage: +```bash +# Start Ollama server locally +ollama serve + +# Use with OpenHarness +oh --base-url http://localhost:11434/v1 --model llama2 +``` + +## Backend Types + +### Anthropic Backend +- Uses the official Anthropic Python SDK +- Best for Claude models +- Supports advanced features like tool calling + +### OpenAI Compatible Backend +- Uses the OpenAI Python SDK +- Works with any OpenAI-compatible API +- Most providers use this backend + +### Copilot Backend +- Special OAuth flow for GitHub Copilot +- Requires `api_format=copilot` + +## Detection Priority + +The system checks for providers in this order: + +1. **API Key Prefix**: `sk-or-` → OpenRouter, `gsk_` → Groq +2. **Base URL**: `openrouter.ai` → OpenRouter, `deepseek.com` → DeepSeek +3. **Model Keywords**: `claude` → Anthropic, `gpt` → OpenAI + +## Testing Your Provider + +1. **Add to registry** +2. **Set environment variable** +3. **Test detection**: + ```bash + oh --model your-model-name --dry-run + ``` +4. 
**Test actual usage**:
+   ```bash
+   oh --model your-model-name "Hello world"
+   ```
+
+## Troubleshooting
+
+### Provider Not Detected
+- Check that keywords match your model names
+- Verify the API key prefix or base URL keywords
+- Use explicit `--base-url` and `--api-format openai`
+
+### Authentication Errors
+- Verify the API key is set correctly
+- Check the API key format (some providers use specific prefixes)
+- Ensure the API key has the necessary permissions
+
+### Connection Issues
+- Verify the base URL is correct
+- Check network connectivity
+- Some providers require specific regions or endpoints
+
+## Contributing
+
+When adding a new provider:
+
+1. Test with multiple models
+2. Verify API compatibility
+3. Add appropriate keywords for detection
+4. Update this documentation
+5. Consider adding tests
+
+The provider registry in `src/openharness/api/registry.py` is the single source of truth for all provider configurations.
\ No newline at end of file
diff --git a/scripts/demo_providers.py b/scripts/demo_providers.py
new file mode 100644
index 00000000..a948ebb8
--- /dev/null
+++ b/scripts/demo_providers.py
@@ -0,0 +1,286 @@
+#!/usr/bin/env python3
+"""
+Demo script for adding and testing LLM providers in OpenHarness.
+
+This script demonstrates the provider registry structure and how to add new providers.
+It runs without requiring OpenHarness dependencies to be installed.
+
+Usage:
+    python scripts/demo_providers.py
+"""
+
+from __future__ import annotations
+
+from dataclasses import dataclass
+
+
+@dataclass(frozen=True)
+class ProviderSpec:
+    """One LLM provider's metadata."""
+    name: str
+    keywords: tuple[str, ...]
+ env_key: str + display_name: str = "" + backend_type: str = "openai_compat" + default_base_url: str = "" + detect_by_key_prefix: str = "" + detect_by_base_keyword: str = "" + is_gateway: bool = False + is_local: bool = False + is_oauth: bool = False + + @property + def label(self) -> str: + return self.display_name or self.name.title() + + +# Sample providers (subset from the actual registry) +SAMPLE_PROVIDERS = ( + ProviderSpec( + name="anthropic", + keywords=("anthropic", "claude"), + env_key="ANTHROPIC_API_KEY", + display_name="Anthropic", + backend_type="anthropic", + ), + ProviderSpec( + name="openai", + keywords=("openai", "gpt", "o1", "o3", "o4"), + env_key="OPENAI_API_KEY", + display_name="OpenAI", + backend_type="openai_compat", + ), + ProviderSpec( + name="openrouter", + keywords=("openrouter",), + env_key="OPENROUTER_API_KEY", + display_name="OpenRouter", + backend_type="openai_compat", + default_base_url="https://openrouter.ai/api/v1", + detect_by_key_prefix="sk-or-", + detect_by_base_keyword="openrouter", + is_gateway=True, + ), + ProviderSpec( + name="deepseek", + keywords=("deepseek",), + env_key="DEEPSEEK_API_KEY", + display_name="DeepSeek", + backend_type="openai_compat", + default_base_url="https://api.deepseek.com/v1", + detect_by_base_keyword="deepseek", + ), + ProviderSpec( + name="groq", + keywords=("groq",), + env_key="GROQ_API_KEY", + display_name="Groq", + backend_type="openai_compat", + default_base_url="https://api.groq.com/openai/v1", + detect_by_key_prefix="gsk_", + detect_by_base_keyword="groq", + ), + ProviderSpec( + name="ollama", + keywords=("ollama",), + env_key="", + display_name="Ollama", + backend_type="openai_compat", + default_base_url="http://localhost:11434/v1", + detect_by_base_keyword="localhost:11434", + is_local=True, + ), +) + + +def demo_provider_detection(): + """Demonstrate how provider detection works.""" + print("🔍 Provider Detection Demo") + print("=" * 50) + + def detect_provider(model: str, api_key: str | None 
= None, base_url: str | None = None) -> ProviderSpec | None: + """Simplified detection logic.""" + # 1. API key prefix + if api_key: + for spec in SAMPLE_PROVIDERS: + if spec.detect_by_key_prefix and api_key.startswith(spec.detect_by_key_prefix): + return spec + + # 2. Base URL keyword + if base_url: + base_lower = base_url.lower() + for spec in SAMPLE_PROVIDERS: + if spec.detect_by_base_keyword and spec.detect_by_base_keyword in base_lower: + return spec + + # 3. Model keyword + if model: + model_lower = model.lower() + for spec in SAMPLE_PROVIDERS: + if any(kw in model_lower for kw in spec.keywords): + return spec + return None + + test_cases = [ + # (model, api_key, base_url, expected_provider) + ("claude-3-5-sonnet-20241022", None, None, "anthropic"), + ("gpt-4o", None, None, "openai"), + ("deepseek-chat", None, None, "deepseek"), + ("openai/gpt-4o-mini", "sk-or-v1-123", None, "openrouter"), + ("llama3-70b-8192", "gsk_123", None, "groq"), + ("custom-model", None, "https://api.deepseek.com/v1", "deepseek"), + ("ollama-model", None, "http://localhost:11434/v1", "ollama"), + ] + + for model, api_key, base_url, expected in test_cases: + detected = detect_provider(model, api_key, base_url) + provider_name = detected.name if detected else "unknown" + status = "✅" if provider_name == expected else "❌" + print(f"{status} {model} → {provider_name} (expected: {expected})") + + +def demo_adding_provider(): + """Demonstrate adding a new provider.""" + print("\n🆕 Adding a New Provider Demo") + print("=" * 50) + + # Example: Adding a fictional provider "ExampleAI" + new_provider = ProviderSpec( + name="exampleai", + keywords=("example", "exampleai"), + env_key="EXAMPLEAI_API_KEY", + display_name="ExampleAI", + backend_type="openai_compat", + default_base_url="https://api.exampleai.com/v1", + detect_by_key_prefix="exa_", + detect_by_base_keyword="exampleai", + is_gateway=False, + is_local=False, + is_oauth=False, + ) + + print("New provider spec:") + print(f" Name: 
{new_provider.name}") + print(f" Display: {new_provider.display_name}") + print(f" Backend: {new_provider.backend_type}") + print(f" Base URL: {new_provider.default_base_url}") + print(f" Keywords: {new_provider.keywords}") + print(f" Key Prefix: {new_provider.detect_by_key_prefix}") + + # Test detection with the new provider + print("\nTesting detection with new provider:") + + def test_detection(model, api_key=None, base_url=None): + """Test detection with the new provider included.""" + all_providers = list(SAMPLE_PROVIDERS) + [new_provider] + + # Check against all providers + for spec in all_providers: + if api_key and spec.detect_by_key_prefix and api_key.startswith(spec.detect_by_key_prefix): + return spec + if base_url and spec.detect_by_base_keyword and spec.detect_by_base_keyword in base_url.lower(): + return spec + if model and any(kw in model.lower() for kw in spec.keywords): + return spec + return None + + test_cases = [ + ("example-gpt-4", None, None), + ("custom-model", "exa_123", None), + ("any-model", None, "https://api.exampleai.com/v1"), + ] + + for model, api_key, base_url in test_cases: + detected = test_detection(model, api_key, base_url) + result = detected.name if detected else "not detected" + print(f" {model} → {result}") + + +def demo_provider_configuration(): + """Show different ways to configure providers.""" + print("\n⚙️ Provider Configuration Examples") + print("=" * 50) + + providers = [ + ("Anthropic", "ANTHROPIC_API_KEY", "claude-3-5-sonnet-20241022", None), + ("OpenAI", "OPENAI_API_KEY", "gpt-4o", "https://api.openai.com/v1"), + ("OpenRouter", "OPENROUTER_API_KEY", "openai/gpt-4o-mini", "https://openrouter.ai/api/v1"), + ("DeepSeek", "DEEPSEEK_API_KEY", "deepseek-chat", "https://api.deepseek.com/v1"), + ("Groq", "GROQ_API_KEY", "llama3-70b-8192", "https://api.groq.com/openai/v1"), + ("Ollama", None, "llama2", "http://localhost:11434/v1"), + ] + + for name, env_var, model, base_url in providers: + print(f"\n{name}:") + if env_var: 
+ print(f" export {env_var}='your-key-here'") + print(f" oh --model {model}") + if base_url: + print(f" # Base URL: {base_url}") + else: + print(" # Uses default base URL from registry") + + +def demo_registry_inspection(): + """Show what's currently in the provider registry.""" + print("\n📋 Sample Provider Registry") + print("=" * 50) + + print(f"Total providers: {len(SAMPLE_PROVIDERS)}") + + categories = { + "Gateways": [p for p in SAMPLE_PROVIDERS if p.is_gateway], + "Cloud Providers": [p for p in SAMPLE_PROVIDERS if not p.is_gateway and not p.is_local and not p.is_oauth], + "Local Deployments": [p for p in SAMPLE_PROVIDERS if p.is_local], + "OAuth Providers": [p for p in SAMPLE_PROVIDERS if p.is_oauth], + } + + for category, providers in categories.items(): + if providers: + print(f"\n{category} ({len(providers)}):") + for provider in providers: + keywords = ", ".join(provider.keywords) + print(f" - {provider.display_name} ({provider.name}): {keywords}") + + +def demo_cli_usage(): + """Show example CLI commands for different providers.""" + print("\n💻 CLI Usage Examples") + print("=" * 50) + + examples = [ + ("Anthropic Claude", "oh --model claude-3-5-sonnet-20241022 'Hello world'"), + ("OpenAI GPT-4", "oh --model gpt-4o 'Write a function'"), + ("OpenRouter (any model)", "export OPENROUTER_API_KEY='sk-or-...'\noh --model anthropic/claude-3-haiku 'Quick task'"), + ("DeepSeek", "export DEEPSEEK_API_KEY='...'\noh --model deepseek-chat 'Code review'"), + ("Groq (fast inference)", "export GROQ_API_KEY='gsk_...'\noh --model llama3-70b-8192 'Analyze this'"), + ("Ollama (local)", "oh --base-url http://localhost:11434/v1 --model llama2 'Local AI chat'"), + ("Custom provider", "export CUSTOM_API_KEY='...'\noh --base-url https://api.custom.com/v1 --model gpt-4 'Use custom API'"), + ] + + for description, command in examples: + print(f"\n{description}:") + print(f" {command}") + + +def main(): + """Run all demos.""" + print("🚀 OpenHarness LLM Provider Demo") + print("=" * 
60) + + demo_registry_inspection() + demo_provider_detection() + demo_adding_provider() + demo_provider_configuration() + demo_cli_usage() + + print("\n" + "=" * 60) + print("✨ Demo complete! Check docs/LLM_PROVIDERS.md for more details.") + + +if __name__ == "__main__": + main() \ No newline at end of file diff --git a/src/openharness/cli.py b/src/openharness/cli.py index d7e21cc5..a625e123 100644 --- a/src/openharness/cli.py +++ b/src/openharness/cli.py @@ -4,6 +4,7 @@ import json import sys +import time from pathlib import Path from typing import Optional @@ -29,11 +30,13 @@ plugin_app = typer.Typer(name="plugin", help="Manage plugins") auth_app = typer.Typer(name="auth", help="Manage authentication") cron_app = typer.Typer(name="cron", help="Manage cron scheduler and jobs") +evidence_app = typer.Typer(name="evidence", help="Manage run evidence archives") app.add_typer(mcp_app) app.add_typer(plugin_app) app.add_typer(auth_app) app.add_typer(cron_app) +app.add_typer(evidence_app) # ---- mcp subcommands ---- @@ -246,6 +249,102 @@ def cron_logs_cmd( print(line) +# ---- evidence subcommands ---- + +@evidence_app.command("list") +def evidence_list() -> None: + """List all runs with evidence.""" + from openharness.evidence import EvidenceStore + + store = EvidenceStore() + runs = store.list_runs() + if not runs: + print("No runs with evidence found.") + return + + print(f"Found {len(runs)} runs:") + for run_id in runs: + summary = store.get_run_summary(run_id) + evidence_count = summary["total_records"] + print(f" {run_id} ({evidence_count} records)") + + +@evidence_app.command("summary") +def evidence_summary( + run_id: str = typer.Argument(..., help="Run ID to summarize"), +) -> None: + """Show detailed summary of evidence for a run.""" + from openharness.evidence import EvidenceStore + + store = EvidenceStore() + summary = store.get_run_summary(run_id) + + print(f"Run: {run_id}") + print(f"Total Records: {summary['total_records']}") + + if 
summary['time_range']['start'] and summary['time_range']['end']: + duration = summary['time_range']['end'] - summary['time_range']['start'] + print(f"Duration: {duration:.2f} seconds") + print(f"Time Range: {time.ctime(summary['time_range']['start'])} - {time.ctime(summary['time_range']['end'])}") + + print("\nEvidence Counts:") + for evidence_type, count in summary['evidence_counts'].items(): + print(f" {evidence_type}: {count}") + + +@evidence_app.command("export") +def evidence_export( + run_id: str = typer.Argument(..., help="Run ID to export"), + output: str | None = typer.Option(None, "--output", "-o", help="Output file path"), + format: str = typer.Option("json", "--format", "-f", help="Export format (json, archive)"), +) -> None: + """Export evidence for a run.""" + from pathlib import Path + from openharness.evidence import EvidenceArchiver + + archiver = EvidenceArchiver() + output_path = Path(output) if output else None + + if format == "json": + result_path = archiver.export_run_to_json(run_id, output_path) + print(f"Exported to: {result_path}") + elif format == "archive": + result_path = archiver.create_run_archive(run_id, output_path) + print(f"Archived to: {result_path}") + else: + print(f"Unsupported format: {format}", file=sys.stderr) + raise typer.Exit(1) + + +@evidence_app.command("report") +def evidence_report( + run_id: str = typer.Argument(..., help="Run ID to report on"), + output: str | None = typer.Option(None, "--output", "-o", help="Output file path"), +) -> None: + """Generate a human-readable report for a run.""" + from pathlib import Path + from openharness.evidence import EvidenceArchiver + + archiver = EvidenceArchiver() + output_path = Path(output) if output else None + + result_path = archiver.create_run_report(run_id, output_path) + print(f"Report generated: {result_path}") + + +@evidence_app.command("cleanup") +def evidence_cleanup( + days: int = typer.Option(30, "--days", "-d", help="Remove evidence older than this many days"), 
+) -> None: + """Clean up old evidence archives and run data.""" + from openharness.evidence import EvidenceArchiver + + archiver = EvidenceArchiver() + results = archiver.cleanup_archives(days) + + print(f"Cleaned up {results['removed_runs']} old runs and {results['removed_archives']} old archives") + + # ---- auth subcommands ---- # Mapping from provider name to human-readable label for interactive prompts. diff --git a/src/openharness/evidence/README.md b/src/openharness/evidence/README.md new file mode 100644 index 00000000..3fbaa35d --- /dev/null +++ b/src/openharness/evidence/README.md @@ -0,0 +1,157 @@ +# Run-Level Evidence Layer + +The run-level evidence layer provides structured archiving for agent runs in OpenHarness. It captures comprehensive evidence of agent execution, including conversations, tasks, performance metrics, and errors. + +## Overview + +The evidence layer consists of several components: + +- **Evidence Types**: Data models for different types of evidence records +- **Evidence Store**: Storage and retrieval system using JSON Lines format +- **Evidence Collector**: Collection utilities for capturing evidence during runs +- **Evidence Archiver**: Archiving, export, and reporting utilities +- **CLI Commands**: Command-line interface for managing evidence + +## Evidence Types + +The system captures the following types of evidence: + +- `run_start` / `run_end`: Run lifecycle events +- `task_start` / `task_progress` / `task_end`: Task execution evidence +- `conversation_message`: Chat messages and tool calls +- `hook_execution`: Hook execution results +- `state_change`: Application state transitions +- `performance_metric`: Performance measurements +- `error`: Errors and exceptions + +## Usage + +### Basic Collection + +```python +from openharness.evidence import EvidenceCollector + +collector = EvidenceCollector(run_id="my-run-123") + +# Record run start +collector.record_run_start( + session_id="session-456", + cwd="/workspace", + 
command_line="oh --model gpt-4"
+)
+
+# Record task execution
+collector.record_task_start(task_record)
+
+# Record conversation
+collector.record_conversation_message(message)
+
+# Record run end
+collector.record_run_end()
+```
+
+### Context Manager
+
+```python
+from openharness.evidence import EvidenceCollector
+
+collector = EvidenceCollector()
+
+async with collector.collect_run_evidence(
+    session_id="session-456",
+    cwd="/workspace"
+) as collector:
+    # Run your agent logic here
+    # Evidence is automatically collected
+    pass
+```
+
+### CLI Commands
+
+```bash
+# List all runs with evidence
+oh evidence list
+
+# Show summary of a run
+oh evidence summary <run-id>
+
+# Export evidence to JSON
+oh evidence export <run-id> --format json
+
+# Create compressed archive
+oh evidence export <run-id> --format archive
+
+# Generate human-readable report
+oh evidence report <run-id>
+
+# Clean up old evidence
+oh evidence cleanup --days 30
+```
+
+## Storage Format
+
+Evidence is stored in JSON Lines format under `~/.openharness/evidence/<run-id>/`:
+
+```
+evidence/
+├── run-123/
+│   ├── run_start.jsonl
+│   ├── task_start.jsonl
+│   ├── conversation_message.jsonl
+│   └── run_end.jsonl
+└── run-456/
+    └── ...
+``` + +Each line contains a complete evidence record: + +```json +{ + "id": "uuid", + "timestamp": 1234567890.123, + "type": "run_start", + "run_id": "run-123", + "agent_id": "agent-1", + "session_id": "session-456", + "cwd": "/workspace", + "command_line": "oh --model gpt-4" +} +``` + +## Integration Points + +The evidence layer integrates with existing OpenHarness components: + +- **Task Manager**: Automatically records task lifecycle events +- **Query Engine**: Captures conversation history and tool usage +- **Hook System**: Records hook execution results +- **Swarm Coordinator**: Tracks multi-agent interactions +- **Error Handling**: Captures exceptions and failures + +## Configuration + +Evidence collection can be configured through: + +- Environment variables +- Configuration files +- Programmatic settings + +The evidence directory location can be customized by setting the `EvidenceStore` base directory. + +## Performance Considerations + +- Evidence is written asynchronously to minimize impact on agent performance +- Large evidence collections can be archived and cleaned up automatically +- JSON Lines format allows for efficient streaming and partial reads +- Compression is used for long-term storage + +## Security + +Evidence may contain sensitive information such as: + +- API keys (redacted in storage) +- File paths and contents +- Conversation history +- Error messages + +Consider access controls and encryption for production deployments. 
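The "redacted in storage" note above can be implemented as a scrubbing pass over evidence payloads before they are written. The sketch below is illustrative, not the actual OpenHarness implementation; the pattern list and function name are assumptions, using key prefixes mentioned elsewhere in this guide:

```python
import re

# Illustrative key-shaped patterns (e.g. the "sk-..." and "gsk_..." prefixes
# used by providers in this guide); not OpenHarness's real redaction rules.
KEY_PATTERNS = (
    re.compile(r"sk-[A-Za-z0-9-]{8,}"),
    re.compile(r"gsk_[A-Za-z0-9]{8,}"),
)


def redact_secrets(text: str) -> str:
    """Return text with anything that looks like an API key replaced."""
    for pattern in KEY_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text
```

A pass like this would run once per evidence record before it is appended to the JSON Lines file, so keys never reach disk in the first place.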
\ No newline at end of file diff --git a/src/openharness/evidence/__init__.py b/src/openharness/evidence/__init__.py new file mode 100644 index 00000000..cc01406e --- /dev/null +++ b/src/openharness/evidence/__init__.py @@ -0,0 +1,33 @@ +"""Run-level evidence layer for structured archiving of agent runs.""" + +from __future__ import annotations + +from openharness.evidence.archiver import EvidenceArchiver +from openharness.evidence.collector import EvidenceCollector +from openharness.evidence.store import EvidenceStore +from openharness.evidence.types import ( + EvidenceRecord, + EvidenceType, + RunEvidence, + TaskEvidence, + ConversationEvidence, + HookEvidence, + StateEvidence, + PerformanceEvidence, + ErrorEvidence, +) + +__all__ = [ + "EvidenceArchiver", + "EvidenceCollector", + "EvidenceStore", + "EvidenceRecord", + "EvidenceType", + "RunEvidence", + "TaskEvidence", + "ConversationEvidence", + "HookEvidence", + "StateEvidence", + "PerformanceEvidence", + "ErrorEvidence", +] \ No newline at end of file diff --git a/src/openharness/evidence/archiver.py b/src/openharness/evidence/archiver.py new file mode 100644 index 00000000..b6f312e6 --- /dev/null +++ b/src/openharness/evidence/archiver.py @@ -0,0 +1,175 @@ +"""Evidence archiving and management utilities.""" + +from __future__ import annotations + +import json +import time +from pathlib import Path +from typing import Any +from uuid import uuid4 + +from openharness.evidence.store import EvidenceStore + + +class EvidenceArchiver: + """Utilities for archiving and managing evidence collections.""" + + def __init__(self, store: EvidenceStore | None = None) -> None: + self.store = store or EvidenceStore() + + def create_run_archive( + self, + run_id: str, + archive_path: Path | None = None, + include_metadata: bool = True, + ) -> Path: + """Create a compressed archive of all evidence for a run.""" + if archive_path is None: + timestamp = int(time.time()) + archive_path = self.store.base_dir / 
f"{run_id}_{timestamp}.tar.gz" + + self.store.archive_run(run_id, archive_path) + return archive_path + + def export_run_to_json( + self, + run_id: str, + output_path: Path | None = None, + pretty: bool = True, + ) -> Path: + """Export all evidence for a run to a single JSON file.""" + if output_path is None: + output_path = self.store.base_dir / f"{run_id}_export.json" + + evidence_list = list(self.store.get_evidence(run_id)) + evidence_data = [evidence.__dict__ for evidence in evidence_list] + + with open(output_path, "w", encoding="utf-8") as f: + json.dump( + { + "run_id": run_id, + "export_timestamp": time.time(), + "evidence_count": len(evidence_data), + "evidence": evidence_data, + }, + f, + indent=2 if pretty else None, + ensure_ascii=False, + ) + + return output_path + + def import_run_from_json(self, json_path: Path, new_run_id: str | None = None) -> str: + """Import evidence from a JSON export file.""" + with open(json_path, "r", encoding="utf-8") as f: + data = json.load(f) + + run_id = new_run_id or data["run_id"] or str(uuid4()) + + # Import each evidence record + for evidence_dict in data["evidence"]: + # Create a generic EvidenceRecord from the dict + from openharness.evidence.types import EvidenceRecord + + evidence = EvidenceRecord() + for key, value in evidence_dict.items(): + if hasattr(evidence, key): + setattr(evidence, key, value) + + # Override run_id if specified + if new_run_id: + evidence.run_id = new_run_id + + self.store.store_evidence(evidence) + + return run_id + + def create_run_report( + self, + run_id: str, + report_path: Path | None = None, + include_details: bool = True, + ) -> Path: + """Create a human-readable report of a run's evidence.""" + if report_path is None: + report_path = self.store.base_dir / f"{run_id}_report.md" + + summary = self.store.get_run_summary(run_id) + evidence_list = list(self.store.get_evidence(run_id)) + + with open(report_path, "w", encoding="utf-8") as f: + f.write(f"# Run Evidence Report: 
{run_id}\n\n") + + f.write("## Summary\n\n") + f.write(f"- **Total Records**: {summary['total_records']}\n") + if summary['time_range']['start'] and summary['time_range']['end']: + duration = summary['time_range']['end'] - summary['time_range']['start'] + f.write(f"- **Duration**: {duration:.2f} seconds\n") + f.write(f"- **Time Range**: {time.ctime(summary['time_range']['start'])} - {time.ctime(summary['time_range']['end'])}\n") + + f.write("\n## Evidence Counts\n\n") + for evidence_type, count in summary['evidence_counts'].items(): + f.write(f"- **{evidence_type}**: {count}\n") + + if include_details: + f.write("\n## Detailed Evidence\n\n") + + # Group by type + by_type = {} + for evidence in evidence_list: + by_type.setdefault(evidence.type, []).append(evidence) + + for evidence_type, records in by_type.items(): + f.write(f"### {evidence_type.title()}\n\n") + + for record in sorted(records, key=lambda r: r.timestamp): + f.write(f"**{time.ctime(record.timestamp)}**\n\n") + + # Show relevant fields based on type + if hasattr(record, 'description') and record.description: + f.write(f"- Description: {record.description}\n") + if hasattr(record, 'status') and record.status: + f.write(f"- Status: {record.status}\n") + if hasattr(record, 'error_message') and record.error_message: + f.write(f"- Error: {record.error_message}\n") + if hasattr(record, 'content') and record.content: + content_preview = record.content[:200] + "..." 
if len(record.content) > 200 else record.content + f.write(f"- Content: {content_preview}\n") + + f.write("\n") + + return report_path + + def cleanup_archives(self, max_age_days: int = 30) -> dict[str, int]: + """Clean up old evidence archives and runs.""" + results = { + "removed_runs": self.store.cleanup_old_runs(max_age_days), + "removed_archives": 0, + } + + # Also clean up archive files + # (archives are .tar.gz files stored directly under base_dir) + cutoff_time = time.time() - (max_age_days * 24 * 60 * 60) + + for archive_file in self.store.base_dir.glob("*.tar.gz"): + if archive_file.stat().st_mtime < cutoff_time: + archive_file.unlink() + results["removed_archives"] += 1 + + return results + + def list_archives(self) -> list[dict[str, Any]]: + """List all available evidence archives.""" + archives = [] + + for archive_file in self.store.base_dir.glob("*.tar.gz"): + stat = archive_file.stat() + archives.append({ + "path": archive_file, + "name": archive_file.name, + "size": stat.st_size, + "created": stat.st_ctime, + "modified": stat.st_mtime, + }) + + return sorted(archives, key=lambda x: x["created"], reverse=True) \ No newline at end of file diff --git a/src/openharness/evidence/collector.py b/src/openharness/evidence/collector.py new file mode 100644 index 00000000..37ad79a5 --- /dev/null +++ b/src/openharness/evidence/collector.py @@ -0,0 +1,304 @@ +"""Evidence collection system for capturing run-level data.""" + +from __future__ import annotations + +import time +import traceback +from contextlib import asynccontextmanager +from typing import Any, AsyncIterator +from uuid import uuid4 + +from openharness.engine.messages import ConversationMessage +from openharness.evidence.store import EvidenceStore +from openharness.evidence.types import ( + ConversationEvidence, + ErrorEvidence, + EvidenceRecord, + HookEvidence, + PerformanceEvidence, + RunEvidence, + StateEvidence, + TaskEvidence, +) +from openharness.hooks.types
import AggregatedHookResult, HookResult +from openharness.tasks.types import TaskRecord + + +class EvidenceCollector: + """Collects and stores evidence from agent runs.""" + + def __init__(self, run_id: str | None = None, store: EvidenceStore | None = None) -> None: + self.run_id = run_id or str(uuid4()) + self.store = store or EvidenceStore() + self.agent_id = "" + self._start_time = time.time() + + def set_agent_id(self, agent_id: str) -> None: + """Set the current agent ID for evidence records.""" + self.agent_id = agent_id + + def record_run_start( + self, + session_id: str = "", + cwd: str = "", + command_line: str = "", + config: dict[str, Any] | None = None, + environment: dict[str, str] | None = None, + ) -> None: + """Record the start of a run.""" + evidence = RunEvidence( + type="run_start", + run_id=self.run_id, + agent_id=self.agent_id, + session_id=session_id, + cwd=cwd, + command_line=command_line, + config=config or {}, + environment=environment or {}, + timestamp=self._start_time, + ) + self.store.store_evidence(evidence) + + def record_run_end(self, final_status: str = "completed") -> None: + """Record the end of a run.""" + evidence = RunEvidence( + type="run_end", + run_id=self.run_id, + agent_id=self.agent_id, + metadata={"final_status": final_status, "duration": time.time() - self._start_time}, + ) + self.store.store_evidence(evidence) + + def record_task_start(self, task: TaskRecord) -> None: + """Record the start of a task.""" + evidence = TaskEvidence( + type="task_start", + run_id=self.run_id, + agent_id=self.agent_id, + task_id=task.id, + task_type=task.type, + description=task.description, + status=task.status, + command=task.command, + cwd=task.cwd, + output_file=str(task.output_file), + metadata={"created_at": task.created_at, "started_at": task.started_at}, + ) + self.store.store_evidence(evidence) + + def record_task_progress(self, task_id: str, progress_data: dict[str, Any]) -> None: + """Record progress on a task.""" + evidence = 
TaskEvidence( + type="task_progress", + run_id=self.run_id, + agent_id=self.agent_id, + task_id=task_id, + metadata=progress_data, + ) + self.store.store_evidence(evidence) + + def record_task_end(self, task: TaskRecord) -> None: + """Record the end of a task.""" + duration = 0.0 + if task.started_at and task.ended_at: + duration = task.ended_at - task.started_at + + evidence = TaskEvidence( + type="task_end", + run_id=self.run_id, + agent_id=self.agent_id, + task_id=task.id, + status=task.status, + return_code=task.return_code, + duration=duration, + metadata={ + "ended_at": task.ended_at, + "return_code": task.return_code, + "duration": duration, + }, + ) + self.store.store_evidence(evidence) + + def record_conversation_message( + self, + message: ConversationMessage, + token_count: int = 0, + model: str = "", + ) -> None: + """Record a conversation message.""" + evidence = ConversationEvidence( + type="conversation_message", + run_id=self.run_id, + agent_id=self.agent_id, + message_type=message.message_type, + content=message.content, + role=getattr(message, "role", ""), + tool_calls=getattr(message, "tool_calls", []), + tool_results=getattr(message, "tool_results", []), + token_count=token_count, + model=model, + metadata={"message_id": getattr(message, "id", "")}, + ) + self.store.store_evidence(evidence) + + def record_tool_call( + self, + tool_name: str, + arguments: dict[str, Any], + tool_call_id: str = "", + ) -> None: + """Record a tool call.""" + evidence = ConversationEvidence( + type="tool_call", + run_id=self.run_id, + agent_id=self.agent_id, + metadata={ + "tool_name": tool_name, + "arguments": arguments, + "tool_call_id": tool_call_id, + }, + ) + self.store.store_evidence(evidence) + + def record_tool_result( + self, + tool_call_id: str, + result: Any, + success: bool = True, + error_message: str = "", + ) -> None: + """Record a tool result.""" + evidence = ConversationEvidence( + type="tool_result", + run_id=self.run_id, + agent_id=self.agent_id, + 
metadata={ + "tool_call_id": tool_call_id, + "result": str(result), + "success": success, + "error_message": error_message, + }, + ) + self.store.store_evidence(evidence) + + def record_hook_execution( + self, + event: str, + result: AggregatedHookResult, + duration: float = 0.0, + ) -> None: + """Record hook execution results.""" + for hook_result in result.results: + evidence = HookEvidence( + type="hook_execution", + run_id=self.run_id, + agent_id=self.agent_id, + event=event, + hook_type=hook_result.hook_type, + success=hook_result.success, + output=hook_result.output, + blocked=hook_result.blocked, + reason=hook_result.reason, + duration=duration, + metadata=hook_result.metadata, + ) + self.store.store_evidence(evidence) + + def record_state_change( + self, + state_type: str, + previous_state: dict[str, Any], + new_state: dict[str, Any], + change_reason: str = "", + ) -> None: + """Record a state change.""" + evidence = StateEvidence( + type="state_change", + run_id=self.run_id, + agent_id=self.agent_id, + state_type=state_type, + previous_state=previous_state, + new_state=new_state, + change_reason=change_reason, + ) + self.store.store_evidence(evidence) + + def record_performance_metric( + self, + metric_name: str, + value: float, + unit: str = "", + category: str = "", + context: dict[str, Any] | None = None, + ) -> None: + """Record a performance metric.""" + evidence = PerformanceEvidence( + type="performance_metric", + run_id=self.run_id, + agent_id=self.agent_id, + metric_name=metric_name, + value=value, + unit=unit, + category=category, + context=context or {}, + ) + self.store.store_evidence(evidence) + + def record_error( + self, + error_type: str, + error_message: str, + context: dict[str, Any] | None = None, + exc: Exception | None = None, + recoverable: bool = False, + ) -> None: + """Record an error or exception.""" + tb_str = "" + if exc: + tb_str = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__)) + + evidence = 
ErrorEvidence( + type="error", + run_id=self.run_id, + agent_id=self.agent_id, + error_type=error_type, + error_message=error_message, + traceback=tb_str, + context=context or {}, + recoverable=recoverable, + ) + self.store.store_evidence(evidence) + + @asynccontextmanager + async def collect_run_evidence( + self, + session_id: str = "", + cwd: str = "", + command_line: str = "", + config: dict[str, Any] | None = None, + environment: dict[str, str] | None = None, + ) -> AsyncIterator[EvidenceCollector]: + """Context manager for collecting evidence for an entire run.""" + try: + self.record_run_start( + session_id=session_id, + cwd=cwd, + command_line=command_line, + config=config, + environment=environment, + ) + yield self + except Exception as e: + self.record_error( + "run_execution_error", + str(e), + context={"phase": "run_execution"}, + exc=e, + ) + raise + finally: + self.record_run_end() + + def get_run_summary(self) -> dict[str, Any]: + """Get a summary of the current run's evidence.""" + return self.store.get_run_summary(self.run_id) \ No newline at end of file diff --git a/src/openharness/evidence/store.py b/src/openharness/evidence/store.py new file mode 100644 index 00000000..8b800b9d --- /dev/null +++ b/src/openharness/evidence/store.py @@ -0,0 +1,179 @@ +"""Evidence storage and retrieval system.""" + +from __future__ import annotations + +import json +import time +from pathlib import Path +from typing import Any, Iterator + +from openharness.evidence.types import EvidenceRecord + + +class EvidenceStore: + """Structured storage for run-level evidence.""" + + def __init__(self, base_dir: Path | None = None) -> None: + if base_dir is None: + # Lazy import to avoid dependency issues during testing + try: + from openharness.config.paths import get_data_dir + self.base_dir = get_data_dir() / "evidence" + except ImportError: + # Fallback for testing without full environment + self.base_dir = Path.home() / ".openharness" / "evidence" + else: + self.base_dir 
= base_dir + self.base_dir.mkdir(parents=True, exist_ok=True) + + def _get_run_dir(self, run_id: str) -> Path: + """Get the directory for a specific run.""" + return self.base_dir / run_id + + def _get_evidence_file(self, run_id: str, evidence_type: str) -> Path: + """Get the file path for evidence of a specific type.""" + run_dir = self._get_run_dir(run_id) + run_dir.mkdir(parents=True, exist_ok=True) + return run_dir / f"{evidence_type}.jsonl" + + def store_evidence(self, evidence: EvidenceRecord) -> None: + """Store an evidence record.""" + if not evidence.timestamp: + evidence.timestamp = time.time() + + file_path = self._get_evidence_file(evidence.run_id, evidence.type) + record_data = { + "id": evidence.id, + "timestamp": evidence.timestamp, + "type": evidence.type, + "run_id": evidence.run_id, + "agent_id": evidence.agent_id, + "metadata": evidence.metadata, + **{ + k: v for k, v in evidence.__dict__.items() + if k not in {"id", "timestamp", "type", "run_id", "agent_id", "metadata"} + and v is not None and v != "" and v != [] and v != {} + } + } + + with open(file_path, "a", encoding="utf-8") as f: + json.dump(record_data, f, ensure_ascii=False) + f.write("\n") + + def get_evidence( + self, + run_id: str, + evidence_type: str | None = None, + start_time: float | None = None, + end_time: float | None = None, + ) -> Iterator[EvidenceRecord]: + """Retrieve evidence records for a run.""" + if evidence_type: + files = [self._get_evidence_file(run_id, evidence_type)] + else: + run_dir = self._get_run_dir(run_id) + if not run_dir.exists(): + return + files = list(run_dir.glob("*.jsonl")) + + for file_path in files: + if not file_path.exists(): + continue + + with open(file_path, "r", encoding="utf-8") as f: + for line in f: + if not line.strip(): + continue + + try: + data = json.loads(line) + if start_time is not None and data["timestamp"] < start_time: + continue + if end_time is not None and data["timestamp"] > end_time: + continue + + # Create the appropriate evidence record +
evidence = EvidenceRecord( + id=data["id"], + timestamp=data["timestamp"], + type=data["type"], + run_id=data["run_id"], + agent_id=data.get("agent_id", ""), + metadata=data.get("metadata", {}), + ) + + # Add type-specific fields + for k, v in data.items(): + if k not in {"id", "timestamp", "type", "run_id", "agent_id", "metadata"}: + setattr(evidence, k, v) + + yield evidence + except (json.JSONDecodeError, KeyError): + continue + + def list_runs(self) -> list[str]: + """List all run IDs that have evidence.""" + if not self.base_dir.exists(): + return [] + + return [d.name for d in self.base_dir.iterdir() if d.is_dir()] + + def get_run_summary(self, run_id: str) -> dict[str, Any]: + """Get a summary of evidence for a run.""" + summary = { + "run_id": run_id, + "evidence_counts": {}, + "time_range": {"start": None, "end": None}, + "total_records": 0, + } + + for evidence in self.get_evidence(run_id): + summary["total_records"] += 1 + + # Count by type + summary["evidence_counts"][evidence.type] = ( + summary["evidence_counts"].get(evidence.type, 0) + 1 + ) + + # Track time range + if summary["time_range"]["start"] is None or evidence.timestamp < summary["time_range"]["start"]: + summary["time_range"]["start"] = evidence.timestamp + if summary["time_range"]["end"] is None or evidence.timestamp > summary["time_range"]["end"]: + summary["time_range"]["end"] = evidence.timestamp + + return summary + + def archive_run(self, run_id: str, archive_path: Path) -> None: + """Archive all evidence for a run to a compressed file.""" + import tarfile + + run_dir = self._get_run_dir(run_id) + if not run_dir.exists(): + raise FileNotFoundError(f"No evidence found for run {run_id}") + + with tarfile.open(archive_path, "w:gz") as tar: + tar.add(run_dir, arcname=run_id) + + def cleanup_old_runs(self, max_age_days: int) -> int: + """Remove evidence for runs older than the specified age.""" + import shutil + + cutoff_time = time.time() - (max_age_days * 24 * 60 * 60) + removed_count = 
0 + + for run_dir in self.base_dir.iterdir(): + if not run_dir.is_dir(): + continue + + # Check if any evidence file is older than cutoff + should_remove = True + for evidence_file in run_dir.glob("*.jsonl"): + if evidence_file.stat().st_mtime > cutoff_time: + should_remove = False + break + + if should_remove: + shutil.rmtree(run_dir) + removed_count += 1 + + return removed_count \ No newline at end of file diff --git a/src/openharness/evidence/types.py b/src/openharness/evidence/types.py new file mode 100644 index 00000000..0a89481b --- /dev/null +++ b/src/openharness/evidence/types.py @@ -0,0 +1,121 @@ +"""Evidence data models for run-level archiving.""" + +from __future__ import annotations + +from dataclasses import dataclass, field +from pathlib import Path +from typing import Any, Literal +from uuid import uuid4 + + +EvidenceType = Literal[ + "run_start", + "run_end", + "task_start", + "task_progress", + "task_end", + "conversation_message", + "tool_call", + "tool_result", + "hook_execution", + "state_change", + "performance_metric", + "error", +] + + +@dataclass +class EvidenceRecord: + """Base class for all evidence records.""" + + id: str = field(default_factory=lambda: str(uuid4())) + timestamp: float = 0.0 + type: EvidenceType = "run_start" + run_id: str = "" + agent_id: str = "" + metadata: dict[str, Any] = field(default_factory=dict) + + +@dataclass +class RunEvidence(EvidenceRecord): + """Evidence for run lifecycle events.""" + + session_id: str = "" + cwd: str = "" + command_line: str = "" + config: dict[str, Any] = field(default_factory=dict) + environment: dict[str, str] = field(default_factory=dict) + + +@dataclass +class TaskEvidence(EvidenceRecord): + """Evidence for task execution.""" + + task_id: str = "" + task_type: str = "" + description: str = "" + status: str = "" + command: str = "" + cwd: str = "" + output_file: str = "" + return_code: int | None = None + duration: float = 0.0 + error_message: str = "" + + +@dataclass +class 
ConversationEvidence(EvidenceRecord): + """Evidence for conversation messages.""" + + message_type: str = "" # "user", "assistant", "system", "tool" + content: str = "" + role: str = "" + tool_calls: list[dict[str, Any]] = field(default_factory=list) + tool_results: list[dict[str, Any]] = field(default_factory=list) + token_count: int = 0 + model: str = "" + + +@dataclass +class HookEvidence(EvidenceRecord): + """Evidence for hook executions.""" + + event: str = "" + hook_type: str = "" + success: bool = True + output: str = "" + blocked: bool = False + reason: str = "" + duration: float = 0.0 + + +@dataclass +class StateEvidence(EvidenceRecord): + """Evidence for application state changes.""" + + state_type: str = "" # "app_state", "task_state", "swarm_state" + previous_state: dict[str, Any] = field(default_factory=dict) + new_state: dict[str, Any] = field(default_factory=dict) + change_reason: str = "" + + +@dataclass +class PerformanceEvidence(EvidenceRecord): + """Evidence for performance metrics.""" + + metric_name: str = "" + value: float = 0.0 + unit: str = "" + category: str = "" # "cost", "latency", "throughput", "resource" + context: dict[str, Any] = field(default_factory=dict) + + +@dataclass +class ErrorEvidence(EvidenceRecord): + """Evidence for errors and exceptions.""" + + error_type: str = "" + error_message: str = "" + traceback: str = "" + context: dict[str, Any] = field(default_factory=dict) + recoverable: bool = False \ No newline at end of file diff --git a/src/openharness/platforms.py b/src/openharness/platforms.py index bfd66ad9..ccebf609 100644 --- a/src/openharness/platforms.py +++ b/src/openharness/platforms.py @@ -36,7 +36,7 @@ def detect_platform( if system == "darwin": return "macos" - if system == "windows": + if system in ("windows", "win32"): return "windows" if system == "linux": if "microsoft" in kernel_release or env_map.get("WSL_DISTRO_NAME") or env_map.get("WSL_INTEROP"): diff --git a/src/openharness/swarm/lockfile.py 
b/src/openharness/swarm/lockfile.py index 335696d9..0480eafe 100644 --- a/src/openharness/swarm/lockfile.py +++ b/src/openharness/swarm/lockfile.py @@ -40,7 +40,10 @@ def exclusive_file_lock( @contextmanager def _exclusive_posix_lock(lock_path: Path) -> Iterator[None]: - import fcntl + try: + import fcntl + except ImportError as e: + raise SwarmLockUnavailableError(f"fcntl not available: {e}") from e lock_path.parent.mkdir(parents=True, exist_ok=True) lock_path.touch(exist_ok=True) @@ -54,7 +57,10 @@ def _exclusive_posix_lock(lock_path: Path) -> Iterator[None]: @contextmanager def _exclusive_windows_lock(lock_path: Path) -> Iterator[None]: - import msvcrt + try: + import msvcrt + except ImportError as e: + raise SwarmLockUnavailableError(f"msvcrt not available: {e}") from e lock_path.parent.mkdir(parents=True, exist_ok=True) with lock_path.open("a+b") as lock_file: diff --git a/tests/test_evidence.py b/tests/test_evidence.py new file mode 100644 index 00000000..5253e63d --- /dev/null +++ b/tests/test_evidence.py @@ -0,0 +1,124 @@ +"""Tests for the evidence layer.""" + +from __future__ import annotations + +import tempfile +from pathlib import Path + +from openharness.evidence import EvidenceCollector, EvidenceStore, EvidenceArchiver +from openharness.evidence.types import RunEvidence, TaskEvidence + + +def test_evidence_store(): + """Test basic evidence storage and retrieval.""" + with tempfile.TemporaryDirectory() as temp_dir: + store = EvidenceStore(Path(temp_dir)) + + # Create and store evidence + evidence = RunEvidence( + type="run_start", + run_id="test-run-123", + agent_id="test-agent", + session_id="test-session", + cwd="/tmp", + command_line="test command", + ) + store.store_evidence(evidence) + + # Retrieve evidence + records = list(store.get_evidence("test-run-123")) + assert len(records) == 1 + assert records[0].run_id == "test-run-123" + assert records[0].type == "run_start" + + +def test_evidence_collector(): + """Test evidence collection.""" + with 
tempfile.TemporaryDirectory() as temp_dir: + store = EvidenceStore(Path(temp_dir)) + collector = EvidenceCollector("test-run-456", store) + + # Record run start + collector.record_run_start( + session_id="test-session", + cwd="/tmp", + command_line="test command", + ) + + # Record task (record_task_start expects a TaskRecord, not a TaskEvidence) + from openharness.tasks.types import TaskRecord + collector.record_task_start( + TaskRecord( + id="task-123", + type="local_agent", + description="Test task", + status="running", + cwd="/tmp", + output_file=Path("/tmp/task.log"), + command="echo hello", + ) + ) + + # Check evidence was stored + records = list(store.get_evidence("test-run-456")) + assert len(records) == 2 + + run_records = [r for r in records if r.type == "run_start"] + task_records = [r for r in records if r.type == "task_start"] + + assert len(run_records) == 1 + assert len(task_records) == 1 + assert task_records[0].task_id == "task-123" + + +def test_evidence_archiver(): + """Test evidence archiving.""" + with tempfile.TemporaryDirectory() as temp_dir: + temp_path = Path(temp_dir) + store = EvidenceStore(temp_path) + archiver = EvidenceArchiver(store) + + # Create some evidence + evidence = RunEvidence( + type="run_start", + run_id="archive-test-run", + agent_id="test-agent", + ) + store.store_evidence(evidence) + + # Export to JSON + json_path = archiver.export_run_to_json("archive-test-run") + assert json_path.exists() + + # Create archive + archive_path = archiver.create_run_archive("archive-test-run") + assert archive_path.exists() + + # Create report + report_path = archiver.create_run_report("archive-test-run") + assert report_path.exists() + assert "Run Evidence Report" in report_path.read_text() + + +def test_run_summary(): + """Test run summary generation.""" + with tempfile.TemporaryDirectory() as temp_dir: + store = EvidenceStore(Path(temp_dir)) + + # Create multiple evidence records + records = [ + RunEvidence(type="run_start", run_id="summary-test", agent_id="agent1"), + TaskEvidence(type="task_start", run_id="summary-test",
agent_id="agent1", task_id="task1"), + TaskEvidence(type="task_end", run_id="summary-test", agent_id="agent1", task_id="task1"), + RunEvidence(type="run_end", run_id="summary-test", agent_id="agent1"), + ] + + for record in records: + store.store_evidence(record) + + summary = store.get_run_summary("summary-test") + assert summary["run_id"] == "summary-test" + assert summary["total_records"] == 4 + assert summary["evidence_counts"]["run_start"] == 1 + assert summary["evidence_counts"]["run_end"] == 1 + assert summary["evidence_counts"]["task_start"] == 1 + assert summary["evidence_counts"]["task_end"] == 1 \ No newline at end of file
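For reviewers, the on-disk format `EvidenceStore` uses is one JSON object per line (JSONL), in one file per evidence type under a per-run directory. A minimal, self-contained sketch of that round-trip, with a simplified stand-in record rather than the real `EvidenceRecord`/`EvidenceStore` classes from this diff:

```python
import json
import tempfile
import time
from dataclasses import asdict, dataclass, field
from pathlib import Path
from uuid import uuid4


@dataclass
class Record:
    """Simplified stand-in for EvidenceRecord (subset of its fields)."""
    id: str = field(default_factory=lambda: str(uuid4()))
    timestamp: float = 0.0
    type: str = "run_start"
    run_id: str = ""


def store(base_dir: Path, rec: Record) -> None:
    # One JSONL file per (run_id, type); every write is a single appended line.
    run_dir = base_dir / rec.run_id
    run_dir.mkdir(parents=True, exist_ok=True)
    if not rec.timestamp:
        rec.timestamp = time.time()
    with open(run_dir / f"{rec.type}.jsonl", "a", encoding="utf-8") as f:
        json.dump(asdict(rec), f, ensure_ascii=False)
        f.write("\n")


def load(base_dir: Path, run_id: str) -> list[Record]:
    # Read every *.jsonl file in the run directory back into records.
    out = []
    for path in sorted((base_dir / run_id).glob("*.jsonl")):
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    out.append(Record(**json.loads(line)))
    return out


with tempfile.TemporaryDirectory() as d:
    base = Path(d)
    store(base, Record(type="run_start", run_id="r1"))
    store(base, Record(type="run_end", run_id="r1"))
    loaded = load(base, "r1")
    print(sorted(r.type for r in loaded))  # ['run_end', 'run_start']
```

Append-only JSONL is a reasonable fit here: a crash mid-write corrupts at most the last line, which is why `get_evidence` skips blank lines and swallows `json.JSONDecodeError` per line instead of failing the whole file.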