Merged

V5 #235

96 commits
ce625bf
init new environments
lorenss-m Dec 9, 2025
8ca286d
add tests and various output changes
lorenss-m Dec 9, 2025
15053ae
format
lorenss-m Dec 9, 2025
d6f9f18
simplify import structure
lorenss-m Dec 9, 2025
05a7fbc
rename and clean up files
lorenss-m Dec 9, 2025
478245e
cleanup and quality
lorenss-m Dec 9, 2025
d75d30e
deps
lorenss-m Dec 9, 2025
c9622e5
format and functionality adjustments
lorenss-m Dec 9, 2025
274e1c7
misc additions
lorenss-m Dec 9, 2025
84ee0d1
runner and docs
lorenss-m Dec 9, 2025
05c212c
typing
lorenss-m Dec 9, 2025
dcdf9f0
format and test fixes
lorenss-m Dec 9, 2025
6d68ea2
test adjustments
lorenss-m Dec 9, 2025
77e0d51
Merge branch 'main' of https://github.com/hud-evals/hud-python into v5
lorenss-m Dec 9, 2025
8a50864
mock tests
lorenss-m Dec 9, 2025
e7cd40a
small adjustment
lorenss-m Dec 9, 2025
3f02439
fix tests
lorenss-m Dec 9, 2025
8316abd
rewrite hud.eval
lorenss-m Dec 10, 2025
51ff459
docs updates
lorenss-m Dec 10, 2025
b6b0372
fixes to server and cli
lorenss-m Dec 10, 2025
8766595
docs update
lorenss-m Dec 10, 2025
62a553e
lazy mcp use
lorenss-m Dec 10, 2025
b396b13
docs update and deps
lorenss-m Dec 10, 2025
b3fe296
change deps and patches
lorenss-m Dec 11, 2025
b98b2fe
touchups
lorenss-m Dec 11, 2025
2061204
fix typing
lorenss-m Dec 11, 2025
dbe94a8
scripts
lorenss-m Dec 11, 2025
0685e68
tests
lorenss-m Dec 11, 2025
867e976
fix new langchain version
lorenss-m Dec 11, 2025
fb10f48
analyze includes scripts
lorenss-m Dec 11, 2025
b4f1ab6
update lowlevel server init
lorenss-m Dec 12, 2025
e06daa0
update docs
lorenss-m Dec 12, 2025
186e23b
analyze uses fastmcp
lorenss-m Dec 12, 2025
c6d8d75
add build analysis
lorenss-m Dec 12, 2025
6e026bc
docs update
lorenss-m Dec 12, 2025
079f739
update docs
lorenss-m Dec 12, 2025
0f98f23
adjust agent class and envs
lorenss-m Dec 12, 2025
9c4269b
docs
lorenss-m Dec 12, 2025
6041aee
small docs updates
lorenss-m Dec 12, 2025
6468aa9
updates to logic all round
lorenss-m Dec 13, 2025
f25f80f
misc docs updates
lorenss-m Dec 13, 2025
f10fa9b
add meta into analyze
lorenss-m Dec 13, 2025
5f30f18
update tests
lorenss-m Dec 13, 2025
f7b3c6c
fix types
lorenss-m Dec 13, 2025
dfdb94f
update a bunch of things
lorenss-m Dec 13, 2025
19b09e1
run task accepts old configs
lorenss-m Dec 14, 2025
a32964a
integration test warning
lorenss-m Dec 14, 2025
bea996e
Merge branch 'main' of https://github.com/hud-evals/hud-python into v5
lorenss-m Dec 14, 2025
a76e099
task loading improvements
lorenss-m Dec 14, 2025
1972d48
eval test
lorenss-m Dec 14, 2025
186c4b0
Huge cleanup, new telemetry and backwards compatibility
lorenss-m Dec 14, 2025
79c7a18
prelim small updates
lorenss-m Dec 14, 2025
5e37ea8
format and tests
lorenss-m Dec 14, 2025
91b9546
update tests
lorenss-m Dec 14, 2025
c87380b
tests
lorenss-m Dec 14, 2025
d6cde1f
naming changes
lorenss-m Dec 14, 2025
2c57f0d
test fixes
lorenss-m Dec 14, 2025
357cd19
update tests
lorenss-m Dec 14, 2025
3b1cda0
docs
lorenss-m Dec 14, 2025
0b6557d
test fixes
lorenss-m Dec 14, 2025
8d568ba
test fixes
lorenss-m Dec 14, 2025
74e26c6
Merge branch 'main' of https://github.com/hud-evals/hud-python into v5
lorenss-m Dec 14, 2025
28e290f
adjustments to instrumentation
lorenss-m Dec 14, 2025
cfab31f
type fix
lorenss-m Dec 14, 2025
5a49308
add prompt schema
lorenss-m Dec 15, 2025
3a5d3b2
format
lorenss-m Dec 15, 2025
1c718d0
scenarios
lorenss-m Dec 15, 2025
6cb83e9
switch var names
lorenss-m Dec 15, 2025
80deb86
changes to dev and tools
lorenss-m Dec 15, 2025
521ef17
save tool result
lorenss-m Dec 15, 2025
8a8e806
adjust tools for generic spec
lorenss-m Dec 15, 2025
b4d0cab
update schema resolution
lorenss-m Dec 15, 2025
aecb97b
update test
lorenss-m Dec 15, 2025
0acfa42
docs and tests
lorenss-m Dec 15, 2025
c707e5b
Remove environments folder - now in separate repos
lorenss-m Dec 15, 2025
76089ae
update eval edge cases
lorenss-m Dec 15, 2025
6b2bac3
docs updates
lorenss-m Dec 15, 2025
3b83558
agent init
lorenss-m Dec 15, 2025
1493a19
copy context
lorenss-m Dec 15, 2025
dbc045a
small changes
lorenss-m Dec 15, 2025
27439cc
tests
lorenss-m Dec 15, 2025
d8bbbc7
test adjust
lorenss-m Dec 15, 2025
1f7ef7f
Export QwenComputerTool from hud.tools
farrelmahaztra Dec 15, 2025
64c6b84
Add QwenComputerTool to lazy import
farrelmahaztra Dec 15, 2025
9dd0338
Merge branch 'main' of https://github.com/hud-evals/hud-python into v5
lorenss-m Dec 17, 2025
43b5f2a
update tests
lorenss-m Dec 17, 2025
cb95662
test adjust
lorenss-m Dec 17, 2025
a0f853f
test fixes
lorenss-m Dec 17, 2025
88f1b3c
magic gemini
lorenss-m Dec 17, 2025
dacba9d
add trace id to mock
lorenss-m Dec 17, 2025
2fdf2c4
final tests
lorenss-m Dec 17, 2025
e607695
test fix
lorenss-m Dec 17, 2025
e3388b5
format
lorenss-m Dec 17, 2025
da9fc9d
version bump
lorenss-m Dec 17, 2025
a0f241d
remove test
lorenss-m Dec 17, 2025
b35738b
parallel and pre release check update
lorenss-m Dec 17, 2025
2 changes: 1 addition & 1 deletion .github/workflows/ci.yml
Original file line number Diff line number Diff line change
@@ -59,4 +59,4 @@ jobs:
uses: astral-sh/setup-uv@v5

- name: Run pyright
run: uv run --with=".[rl,dev]" pyright
run: uv run --with=".[dev]" pyright
402 changes: 70 additions & 332 deletions README.md

Large diffs are not rendered by default.

105 changes: 105 additions & 0 deletions docs/advanced/testing-environments.mdx
@@ -0,0 +1,105 @@
---
title: "Testing Environments"
description: "Test scenarios, tools, and environment logic locally"
icon: "flask-vial"
---

Before deploying, test locally. See [Sandboxing](/guides/sandboxing) for Docker vs no-Docker patterns.

## Local Testing

| Environment | `local_test.py` |
|-------------|-----------------|
| No Docker | `from env import env` |
| Docker | `env.connect_url("http://localhost:8765/mcp")` |

Both use the same API after setup:

```python
async with env:
    tools = env.as_tools()                              # List available tools
    result = await env.call_tool("my_tool", arg="val")  # Call a tool
```

## Testing Scenarios Directly

Scenarios are async generators. `hud.eval()` drives them automatically, but you can test the logic directly—this is exactly what runs at the start and end of `hud.eval()`:

```python
async def checkout(user_id: str, amount: int = 100):
    # Setup + prompt (first yield) — runs at hud.eval() start
    answer = yield f"Complete checkout for {user_id}, ${amount}"

    # Evaluation (second yield) — runs after agent submits
    yield 1.0 if "success" in answer.lower() else 0.0

async def test():
    gen = checkout("alice", 50)
    prompt = await anext(gen)             # What hud.eval() does at start
    reward = await gen.asend("Success!")  # What hud.eval() does after submit
    assert reward == 1.0
```

If your scenario tests pass, `hud.eval()` will behave identically.

## Mocking

`env.mock()` intercepts at the tool layer—agents only see tools:

```python
env.mock()  # All tools return fake responses
env.mock_tool("send_email", {"status": "sent"})

# Check mock state
assert env.is_mock
```
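Under the hood, tool-layer mocking just swaps each tool's callable for a stub before the agent ever sees it. A self-contained sketch of that interception pattern (`ToyEnv` is illustrative, not the hud API):

```python
from typing import Any, Callable

class ToyEnv:
    """Minimal stand-in showing how mocking can intercept at the tool layer."""
    def __init__(self) -> None:
        self.tools: dict[str, Callable[..., Any]] = {
            "send_email": lambda **kw: {"status": "error", "detail": "no SMTP"},
        }
        self.is_mock = False

    def mock_tool(self, name: str, response: Any) -> None:
        # Replace the real callable with a stub returning a canned response
        self.tools[name] = lambda **kw: response
        self.is_mock = True

    def call_tool(self, name: str, **kwargs: Any) -> Any:
        return self.tools[name](**kwargs)

env = ToyEnv()
env.mock_tool("send_email", {"status": "sent"})
assert env.call_tool("send_email", to="a@example.com") == {"status": "sent"}
assert env.is_mock
```

Because the agent only ever sees tools, swapping the callable is enough — no network or model changes are needed.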

## Hot-Reload

For Docker environments, `hud dev -w path` reloads Python on save:

```bash
hud dev -w scenarios -w tools --port 8765
```

System services (postgres, VNC, browsers) persist across reloads.

## Debugging Build Failures

`hud build` runs the exact same pipeline as **New → Environment** on [hud.ai](https://hud.ai)—so if it passes locally, it'll work in production. If the build fails or the container crashes on startup, use `hud debug` to run a 5-phase compliance test:

```bash
hud debug my-env:latest
```

Output shows exactly which phase failed:
```
✓ Phase 1: Docker image exists
✓ Phase 2: MCP server responds to initialize
✗ Phase 3: Tool discovery failed
→ Error: Connection refused on port 8005
→ Hint: Backend service may not be starting
```

You can also debug a directory (builds first) or stop at a specific phase:

```bash
hud debug . # Build and debug current directory
hud debug . --max-phase 3 # Stop after phase 3
hud debug --config mcp.json # Debug from config file
```

## Useful Environment Properties

```python
# Check parallelization (for running multiple evals)
env.is_parallelizable # True if all connections are remote

# List what's connected
env.connections # Dict of connection names → connectors
env.is_connected # True if in async context

# Resources and prompts (beyond tools)
await env.list_resources() # MCP resources
await env.list_prompts() # MCP prompts
```
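The `is_parallelizable` flag is what gates fan-out: local connections share process state, so per the description above only all-remote environments are safe to run concurrently. A sketch of that gate (illustrative names, not the SDK):

```python
from dataclasses import dataclass

@dataclass
class Connection:
    name: str
    remote: bool

def is_parallelizable(connections: list[Connection]) -> bool:
    # Safe to fan out only when every connection is remote
    return all(c.remote for c in connections)

conns = [Connection("browser", remote=True), Connection("db", remote=False)]
assert not is_parallelizable(conns)  # a local db blocks parallel runs
assert is_parallelizable([Connection("browser", remote=True)])
```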
2 changes: 1 addition & 1 deletion docs/beta/index.mdx
Expand Up @@ -11,5 +11,5 @@ Beta features are experimental and may change in future releases.
## Available Beta Features

<Card title="Reinforcement Fine-Tuning (RFT)" icon="brain-circuit" href="/beta/rft">
Fine-tune models with reinforcement learning on your HUD tasks (invite-only)
Fine-tune models on your HUD tasks (invite-only)
</Card>
21 changes: 17 additions & 4 deletions docs/build-environments/index.mdx
@@ -66,9 +66,6 @@

# Deploy to registry
hud push

# Train agents on your tasks
hud rl tasks.json
```

---
@@ -83,7 +80,6 @@
| Troubleshoot | `hud debug my-env:dev` |
| Build image | `hud build` |
| Push to registry | `hud push` |
| RL training | `hud rl tasks.json` |

---

@@ -93,3 +89,20 @@ hud rl tasks.json
* **CLI reference**: [CLI Overview](/reference/cli/overview)

Have fun – and remember: *stderr for logs, stdout for MCP!*

---

## Available Environments

Browse ready-to-use environments and templates at **[hud.ai/environments](https://hud.ai/environments)**.

| Environment | Description |
|-------------|-------------|
| `hud-blank` | Minimal starter template |
| `hud-browser` | Browser automation with Playwright |
| `hud-remote-browser` | Cloud browser providers (Steel, Anchor, etc.) |
| `hud-deepresearch` | Deep research with web search |
| `hud-rubrics` | LLM-as-judge evaluations |
| `coding-template` | Full coding env with VNC, Postgres, Redis |

Each environment is available as a GitHub template you can fork and customize.
4 changes: 2 additions & 2 deletions docs/build-environments/spec.mdx
@@ -24,7 +24,7 @@ graph TD
- No non‑MCP output on stdout (all logging to stderr).
- No required file layout, framework, or endpoints.

Recommended (for HUD RL/evals): provide tools named `setup` and `evaluate`.
Recommended (for HUD evals): provide tools named `setup` and `evaluate`.

## Make it runnable remotely (mcp.hud.ai)

@@ -143,7 +143,7 @@ The same structure is used by `hud init`’s template and by programmatic tasks.
]
```

Switching this file to remote is as simple as replacing the `mcp_config` with the `hud` section shown above (or using `hud rl`, which will help convert it automatically).
Switching this file to remote is as simple as replacing the `mcp_config` with the `hud` section shown above (or using `hud convert`, which will help convert it automatically).

Run tasks with either the CLI or an agent:

84 changes: 73 additions & 11 deletions docs/docs.json
@@ -29,12 +29,81 @@
"navigation": {
"versions": [
{
"version": "0.4.74",
"version": "0.5.0",
"groups": [
{
"group": "Get Started",
"pages": [
"index",
"llm-quickstart"
]
},
{
"group": "Essentials",
"pages": [
"quick-links/gateway",
"quick-links/ab-testing",
"quick-links/environments",
"quick-links/deploy"
]
},
{
"group": "Guides",
"pages": [
"guides/integrations",
"guides/sandboxing",
"guides/best-practices",
"migration"
]
},
{
"group": "Advanced",
"pages": [
"advanced/testing-environments"
]
},
{
"group": "SDK Reference",
"pages": [
"reference/evals",
"reference/environments",
"reference/tools",
"reference/mcpserver",
"reference/agents",
"reference/types"
]
},
{
"group": "CLI Reference",
"pages": [
"reference/cli/overview",
"reference/cli/init",
"reference/cli/dev",
"reference/cli/build",
"reference/cli/push",
"reference/cli/analyze",
"reference/cli/debug",
"reference/cli/run",
"reference/cli/eval",
"reference/cli/rft",
"reference/cli/misc"
]
},
{
"group": "Community",
"pages": [
"contributing"
]
}
]
},
{
"version": "0.4.73",
"groups": [
{
"group": "Get Started",
"pages": [
"index-legacy",
"quickstart",
"llm-quickstart"
]
@@ -50,10 +119,11 @@
{
"group": "SDK Reference",
"pages": [
"reference/eval",
"reference/tools",
"reference/agents",
"reference/types",
"reference/environments",
"reference/mcpserver",
"reference/tasks"
]
},
@@ -64,17 +134,10 @@
"build-environments/spec"
]
},
{
"group": "Training (RL)",
"pages": [
"train-agents/quickstart",
"train-agents/tasks"
]
},
{
"group": "HUD Gateway",
"pages": [
"gateway/index"
"gateway/index-legacy"
]
},
{
@@ -103,7 +166,6 @@
"reference/cli/debug",
"reference/cli/run",
"reference/cli/eval",
"reference/cli/rl",
"reference/cli/rft",
"reference/cli/misc"
]
30 changes: 27 additions & 3 deletions docs/evaluate-agents/benchmarks.mdx
@@ -18,7 +18,30 @@ hud eval tasks.json
hud eval hud-evals/SheetBench-50 claude --full
```

- SDK
- SDK (Context Manager)

```python
import hud

# Single task evaluation
async with hud.eval("hud-evals/SheetBench-50:0") as ctx:
    agent = MyAgent()
    result = await agent.run(ctx)
    ctx.reward = result.reward

# All tasks with variants
async with hud.eval(
    "hud-evals/SheetBench-50:*",
    variants={"model": ["claude-sonnet", "gpt-4o"]},
    group=3,
    max_concurrent=50,
) as ctx:
    agent = create_agent(model=ctx.variants["model"])
    result = await agent.run(ctx)
    ctx.reward = result.reward
```

- SDK (Batch Execution)

```python
from hud.datasets import run_tasks
@@ -108,8 +131,9 @@

## See Also

- [`hud eval`](/reference/cli/eval)
- [`hud rl`](/reference/cli/rl)
- [Evaluation API](/reference/eval) - SDK reference for `hud.eval()`
- [`hud eval`](/reference/cli/eval) - CLI reference
- [`hud rft`](/reference/cli/rft)
- [Tasks](/reference/tasks)
- [Agents (SDK)](/reference/agents)

3 changes: 2 additions & 1 deletion docs/gateway/index.mdx → docs/gateway/index-legacy.mdx
@@ -1,5 +1,5 @@
---
title: "HUD Gateway"
title: "Gateway"
description: "Unified LLM inference service with built-in auth and credit management."
icon: "server"
---
@@ -128,3 +128,4 @@
- Automatic token usage and latency tracking

View your traces on the [HUD Dashboard](https://hud.ai/home).
