Merged

V5 #235

96 commits
ce625bf
init new environments
lorenss-m Dec 9, 2025
8ca286d
add tests and various output changes
lorenss-m Dec 9, 2025
15053ae
format
lorenss-m Dec 9, 2025
d6f9f18
simplify import structure
lorenss-m Dec 9, 2025
05a7fbc
rename and clean up files
lorenss-m Dec 9, 2025
478245e
cleanup and quality
lorenss-m Dec 9, 2025
d75d30e
deps
lorenss-m Dec 9, 2025
c9622e5
format and functionality adjustments
lorenss-m Dec 9, 2025
274e1c7
misc additions
lorenss-m Dec 9, 2025
84ee0d1
runner and docs
lorenss-m Dec 9, 2025
05c212c
typing
lorenss-m Dec 9, 2025
dcdf9f0
format and test fixes
lorenss-m Dec 9, 2025
6d68ea2
test adjustments
lorenss-m Dec 9, 2025
77e0d51
Merge branch 'main' of https://github.com/hud-evals/hud-python into v5
lorenss-m Dec 9, 2025
8a50864
mock tests
lorenss-m Dec 9, 2025
e7cd40a
small adjustment
lorenss-m Dec 9, 2025
3f02439
fix tests
lorenss-m Dec 9, 2025
8316abd
rewrite hud.eval
lorenss-m Dec 10, 2025
51ff459
docs updates
lorenss-m Dec 10, 2025
b6b0372
fixes to server and cli
lorenss-m Dec 10, 2025
8766595
docs update
lorenss-m Dec 10, 2025
62a553e
lazy mcp use
lorenss-m Dec 10, 2025
b396b13
docs update and deps
lorenss-m Dec 10, 2025
b3fe296
change deps and patches
lorenss-m Dec 11, 2025
b98b2fe
touchups
lorenss-m Dec 11, 2025
2061204
fix typing
lorenss-m Dec 11, 2025
dbe94a8
scripts
lorenss-m Dec 11, 2025
0685e68
tests
lorenss-m Dec 11, 2025
867e976
fix new langchain version
lorenss-m Dec 11, 2025
fb10f48
analyze includes scripts
lorenss-m Dec 11, 2025
b4f1ab6
update lowlevel server init
lorenss-m Dec 12, 2025
e06daa0
update docs
lorenss-m Dec 12, 2025
186e23b
analyze uses fastmcp
lorenss-m Dec 12, 2025
c6d8d75
add build analysis
lorenss-m Dec 12, 2025
6e026bc
docs update
lorenss-m Dec 12, 2025
079f739
update docs
lorenss-m Dec 12, 2025
0f98f23
adjust agent class and envs
lorenss-m Dec 12, 2025
9c4269b
docs
lorenss-m Dec 12, 2025
6041aee
small docs updates
lorenss-m Dec 12, 2025
6468aa9
updates to logic all round
lorenss-m Dec 13, 2025
f25f80f
misc docs updates
lorenss-m Dec 13, 2025
f10fa9b
add meta into analyze
lorenss-m Dec 13, 2025
5f30f18
update tests
lorenss-m Dec 13, 2025
f7b3c6c
fix types
lorenss-m Dec 13, 2025
dfdb94f
update a bunch of things
lorenss-m Dec 13, 2025
19b09e1
run task accepts old configs
lorenss-m Dec 14, 2025
a32964a
integration test warning
lorenss-m Dec 14, 2025
bea996e
Merge branch 'main' of https://github.com/hud-evals/hud-python into v5
lorenss-m Dec 14, 2025
a76e099
task loading improvements
lorenss-m Dec 14, 2025
1972d48
eval test
lorenss-m Dec 14, 2025
186c4b0
Huge cleanup, new telemetry and backwards compatibility
lorenss-m Dec 14, 2025
79c7a18
prelim small updates
lorenss-m Dec 14, 2025
5e37ea8
format and tests
lorenss-m Dec 14, 2025
91b9546
update tests
lorenss-m Dec 14, 2025
c87380b
tests
lorenss-m Dec 14, 2025
d6cde1f
naming changes
lorenss-m Dec 14, 2025
2c57f0d
test fixes
lorenss-m Dec 14, 2025
357cd19
update tests
lorenss-m Dec 14, 2025
3b1cda0
docs
lorenss-m Dec 14, 2025
0b6557d
test fixes
lorenss-m Dec 14, 2025
8d568ba
test fixes
lorenss-m Dec 14, 2025
74e26c6
Merge branch 'main' of https://github.com/hud-evals/hud-python into v5
lorenss-m Dec 14, 2025
28e290f
adjustments to instrumentation
lorenss-m Dec 14, 2025
cfab31f
type fix
lorenss-m Dec 14, 2025
5a49308
add prompt schema
lorenss-m Dec 15, 2025
3a5d3b2
format
lorenss-m Dec 15, 2025
1c718d0
scenarios
lorenss-m Dec 15, 2025
6cb83e9
switch var names
lorenss-m Dec 15, 2025
80deb86
changes to dev and tools
lorenss-m Dec 15, 2025
521ef17
save tool result
lorenss-m Dec 15, 2025
8a8e806
adjust tools for generic spec
lorenss-m Dec 15, 2025
b4d0cab
update schema resolution
lorenss-m Dec 15, 2025
aecb97b
update test
lorenss-m Dec 15, 2025
0acfa42
docs and tests
lorenss-m Dec 15, 2025
c707e5b
Remove environments folder - now in separate repos
lorenss-m Dec 15, 2025
76089ae
update eval edge cases
lorenss-m Dec 15, 2025
6b2bac3
docs updates
lorenss-m Dec 15, 2025
3b83558
agent init
lorenss-m Dec 15, 2025
1493a19
copy context
lorenss-m Dec 15, 2025
dbc045a
small changes
lorenss-m Dec 15, 2025
27439cc
tests
lorenss-m Dec 15, 2025
d8bbbc7
test adjust
lorenss-m Dec 15, 2025
1f7ef7f
Export QwenComputerTool from hud.tools
farrelmahaztra Dec 15, 2025
64c6b84
Add QwenComputerTool to lazy import
farrelmahaztra Dec 15, 2025
9dd0338
Merge branch 'main' of https://github.com/hud-evals/hud-python into v5
lorenss-m Dec 17, 2025
43b5f2a
update tests
lorenss-m Dec 17, 2025
cb95662
test adjust
lorenss-m Dec 17, 2025
a0f853f
test fixes
lorenss-m Dec 17, 2025
88f1b3c
magic gemini
lorenss-m Dec 17, 2025
dacba9d
add trace id to mock
lorenss-m Dec 17, 2025
2fdf2c4
final tests
lorenss-m Dec 17, 2025
e607695
test fix
lorenss-m Dec 17, 2025
e3388b5
format
lorenss-m Dec 17, 2025
da9fc9d
version bump
lorenss-m Dec 17, 2025
a0f241d
remove test
lorenss-m Dec 17, 2025
b35738b
parallel and pre release check update
lorenss-m Dec 17, 2025
2 changes: 1 addition & 1 deletion .github/workflows/ci.yml
Original file line number Diff line number Diff line change
@@ -59,4 +59,4 @@ jobs:
uses: astral-sh/setup-uv@v5

- name: Run pyright
run: uv run --with=".[rl,dev]" pyright
run: uv run --with=".[dev]" pyright
402 changes: 70 additions & 332 deletions README.md

Large diffs are not rendered by default.

105 changes: 105 additions & 0 deletions docs/advanced/testing-environments.mdx
@@ -0,0 +1,105 @@
---
title: "Testing Environments"
description: "Test scenarios, tools, and environment logic locally"
icon: "flask-vial"
---

Before deploying, test locally. See [Sandboxing](/guides/sandboxing) for Docker vs no-Docker patterns.

## Local Testing

| Environment | `local_test.py` |
|-------------|-----------------|
| No Docker | `from env import env` |
| Docker | `env.connect_url("http://localhost:8765/mcp")` |

Both use the same API after setup:

```python
async with env:
    tools = env.as_tools()                              # List available tools
    result = await env.call_tool("my_tool", arg="val")  # Call a tool
```

## Testing Scenarios Directly

Scenarios are async generators. `hud.eval()` drives them automatically, but you can test the logic directly—this is exactly what runs at the start and end of `hud.eval()`:

```python
async def checkout(user_id: str, amount: int = 100):
    # Setup + prompt (first yield) — runs at hud.eval() start
    answer = yield f"Complete checkout for {user_id}, ${amount}"

    # Evaluation (second yield) — runs after agent submits
    yield 1.0 if "success" in answer.lower() else 0.0

async def test():
    gen = checkout("alice", 50)
    prompt = await anext(gen)             # What hud.eval() does at start
    reward = await gen.asend("Success!")  # What hud.eval() does after submit
    assert reward == 1.0
```

If your scenario tests pass, `hud.eval()` will behave identically.

## Mocking

`env.mock()` intercepts at the tool layer—agents only see tools:

```python
env.mock()  # All tools return fake responses
env.mock_tool("send_email", {"status": "sent"})

# Check mock state
assert env.is_mock
```
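Under the hood, tool-layer mocking just swaps each tool's callable for a stub before the agent ever sees it. A self-contained sketch of that interception pattern (`ToyEnv` is illustrative, not the hud API):

```python
from typing import Any, Callable

class ToyEnv:
    """Minimal stand-in showing how mocking can intercept at the tool layer."""
    def __init__(self) -> None:
        self.tools: dict[str, Callable[..., Any]] = {
            "send_email": lambda **kw: {"status": "error", "detail": "no SMTP"},
        }
        self.is_mock = False

    def mock_tool(self, name: str, response: Any) -> None:
        # Replace the real callable with a stub returning a canned response
        self.tools[name] = lambda **kw: response
        self.is_mock = True

    def call_tool(self, name: str, **kwargs: Any) -> Any:
        return self.tools[name](**kwargs)

env = ToyEnv()
env.mock_tool("send_email", {"status": "sent"})
assert env.call_tool("send_email", to="a@example.com") == {"status": "sent"}
assert env.is_mock
```

Because the agent only ever sees tools, swapping the callable is enough — no network or model changes are needed.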

## Hot-Reload

For Docker environments, `hud dev -w path` reloads Python on save:

```bash
hud dev -w scenarios -w tools --port 8765
```

System services (postgres, VNC, browsers) persist across reloads.

## Debugging Build Failures

`hud build` runs the exact same pipeline as **New → Environment** on [hud.ai](https://hud.ai)—so if it passes locally, it'll work in production. If the build fails or the container crashes on startup, use `hud debug` to run a 5-phase compliance test:

```bash
hud debug my-env:latest
```

Output shows exactly which phase failed:
```
✓ Phase 1: Docker image exists
✓ Phase 2: MCP server responds to initialize
✗ Phase 3: Tool discovery failed
→ Error: Connection refused on port 8005
→ Hint: Backend service may not be starting
```

You can also debug a directory (builds first) or stop at a specific phase:

```bash
hud debug . # Build and debug current directory
hud debug . --max-phase 3 # Stop after phase 3
hud debug --config mcp.json # Debug from config file
```

## Useful Environment Properties

```python
# Check parallelization (for running multiple evals)
env.is_parallelizable # True if all connections are remote

# List what's connected
env.connections # Dict of connection names → connectors
env.is_connected # True if in async context

# Resources and prompts (beyond tools)
await env.list_resources() # MCP resources
await env.list_prompts() # MCP prompts
```
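The `is_parallelizable` flag is what gates fan-out: local connections share process state, so per the description above only all-remote environments are safe to run concurrently. A sketch of that gate (illustrative names, not the SDK):

```python
from dataclasses import dataclass

@dataclass
class Connection:
    name: str
    remote: bool

def is_parallelizable(connections: list[Connection]) -> bool:
    # Safe to fan out only when every connection is remote
    return all(c.remote for c in connections)

conns = [Connection("browser", remote=True), Connection("db", remote=False)]
assert not is_parallelizable(conns)  # a local db blocks parallel runs
assert is_parallelizable([Connection("browser", remote=True)])
```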
2 changes: 1 addition & 1 deletion docs/beta/index.mdx
Expand Up @@ -11,5 +11,5 @@ Beta features are experimental and may change in future releases.
## Available Beta Features

<Card title="Reinforcement Fine-Tuning (RFT)" icon="brain-circuit" href="/beta/rft">
Fine-tune models with reinforcement learning on your HUD tasks (invite-only)
Fine-tune models on your HUD tasks (invite-only)
</Card>
21 changes: 17 additions & 4 deletions docs/build-environments/index.mdx
@@ -66,9 +66,6 @@

# Deploy to registry
hud push

# Train agents on your tasks
hud rl tasks.json
```

---
@@ -83,7 +80,6 @@
| Troubleshoot | `hud debug my-env:dev` |
| Build image | `hud build` |
| Push to registry | `hud push` |
| RL training | `hud rl tasks.json` |

---

@@ -93,3 +89,20 @@ hud rl tasks.json
* **CLI reference**: [CLI Overview](/reference/cli/overview)

Have fun – and remember: *stderr for logs, stdout for MCP!*

---

## Available Environments

Browse ready-to-use environments and templates at **[hud.ai/environments](https://hud.ai/environments)**.

| Environment | Description |
|-------------|-------------|
| `hud-blank` | Minimal starter template |
| `hud-browser` | Browser automation with Playwright |
| `hud-remote-browser` | Cloud browser providers (Steel, Anchor, etc.) |
| `hud-deepresearch` | Deep research with web search |
| `hud-rubrics` | LLM-as-judge evaluations |
| `coding-template` | Full coding env with VNC, Postgres, Redis |

Each environment is available as a GitHub template you can fork and customize.
4 changes: 2 additions & 2 deletions docs/build-environments/spec.mdx
@@ -24,7 +24,7 @@ graph TD
- No non‑MCP output on stdout (all logging to stderr).
- No required file layout, framework, or endpoints.

Recommended (for HUD RL/evals): provide tools named `setup` and `evaluate`.
Recommended (for HUD evals): provide tools named `setup` and `evaluate`.

## Make it runnable remotely (mcp.hud.ai)

@@ -143,7 +143,7 @@ The same structure is used by `hud init`’s template and by programmatic tasks.
]
```

Switching this file to remote is as simple as replacing the `mcp_config` with the `hud` section shown above (or using `hud rl`, which will help convert it automatically).
Switching this file to remote is as simple as replacing the `mcp_config` with the `hud` section shown above (or using `hud convert`, which will help convert it automatically).

Run tasks with either the CLI or an agent:

84 changes: 73 additions & 11 deletions docs/docs.json
@@ -29,12 +29,81 @@
"navigation": {
"versions": [
{
"version": "0.4.74",
"version": "0.5.0",
"groups": [
{
"group": "Get Started",
"pages": [
"index",
"llm-quickstart"
]
},
{
"group": "Essentials",
"pages": [
"quick-links/gateway",
"quick-links/ab-testing",
"quick-links/environments",
"quick-links/deploy"
]
},
{
"group": "Guides",
"pages": [
"guides/integrations",
"guides/sandboxing",
"guides/best-practices",
"migration"
]
},
{
"group": "Advanced",
"pages": [
"advanced/testing-environments"
]
},
{
"group": "SDK Reference",
"pages": [
"reference/evals",
"reference/environments",
"reference/tools",
"reference/mcpserver",
"reference/agents",
"reference/types"
]
},
{
"group": "CLI Reference",
"pages": [
"reference/cli/overview",
"reference/cli/init",
"reference/cli/dev",
"reference/cli/build",
"reference/cli/push",
"reference/cli/analyze",
"reference/cli/debug",
"reference/cli/run",
"reference/cli/eval",
"reference/cli/rft",
"reference/cli/misc"
]
},
{
"group": "Community",
"pages": [
"contributing"
]
}
]
},
{
"version": "0.4.73",
"groups": [
{
"group": "Get Started",
"pages": [
"index-legacy",
"quickstart",
"llm-quickstart"
]
@@ -50,10 +119,11 @@
{
"group": "SDK Reference",
"pages": [
"reference/eval",
"reference/tools",
"reference/agents",
"reference/types",
"reference/environments",
"reference/mcpserver",
"reference/tasks"
]
},
@@ -64,17 +134,10 @@
"build-environments/spec"
]
},
{
"group": "Training (RL)",
"pages": [
"train-agents/quickstart",
"train-agents/tasks"
]
},
{
"group": "HUD Gateway",
"pages": [
"gateway/index"
"gateway/index-legacy"
]
},
{
@@ -103,7 +166,6 @@
"reference/cli/debug",
"reference/cli/run",
"reference/cli/eval",
"reference/cli/rl",
"reference/cli/rft",
"reference/cli/misc"
]
30 changes: 27 additions & 3 deletions docs/evaluate-agents/benchmarks.mdx
@@ -18,7 +18,30 @@ hud eval tasks.json
hud eval hud-evals/SheetBench-50 claude --full
```

- SDK
- SDK (Context Manager)

```python
import hud

# Single task evaluation
async with hud.eval("hud-evals/SheetBench-50:0") as ctx:
    agent = MyAgent()
    result = await agent.run(ctx)
    ctx.reward = result.reward

# All tasks with variants
async with hud.eval(
    "hud-evals/SheetBench-50:*",
    variants={"model": ["claude-sonnet", "gpt-4o"]},
    group=3,
    max_concurrent=50,
) as ctx:
    agent = create_agent(model=ctx.variants["model"])
    result = await agent.run(ctx)
    ctx.reward = result.reward
```

- SDK (Batch Execution)

```python
from hud.datasets import run_tasks
@@ -108,8 +131,9 @@

## See Also

- [`hud eval`](/reference/cli/eval)
- [`hud rl`](/reference/cli/rl)
- [Evaluation API](/reference/eval) - SDK reference for `hud.eval()`
- [`hud eval`](/reference/cli/eval) - CLI reference
- [`hud rft`](/reference/cli/rft)
- [Tasks](/reference/tasks)
- [Agents (SDK)](/reference/agents)

3 changes: 2 additions & 1 deletion docs/gateway/index.mdx → docs/gateway/index-legacy.mdx
@@ -1,5 +1,5 @@
---
title: "HUD Gateway"
title: "Gateway"
description: "Unified LLM inference service with built-in auth and credit management."
icon: "server"
---
@@ -128,3 +128,4 @@
- Automatic token usage and latency tracking

View your traces on the [HUD Dashboard](https://hud.ai/home).
