4 changes: 4 additions & 0 deletions Dockerfile
@@ -45,6 +45,10 @@ RUN uv pip install --system -r /app/requirements.txt
# Copy application code
COPY . /app/

# Pre-create /workspaces so named-volume mounts inherit correct permissions
# (without this, Docker creates it as root read-only on fresh deployments)
RUN mkdir -p /workspaces && chmod 777 /workspaces

EXPOSE 8003

ENV PORT=8003 \
179 changes: 179 additions & 0 deletions docs/deployment.md
@@ -0,0 +1,179 @@
# Deployment Guide

This guide covers deploying SWE-AF on a new server, including prerequisites, known issues, and quick-start instructions.

## Prerequisites

### Software

| Requirement | Minimum Version | Notes |
|---|---|---|
| Docker | 20.10+ | With BuildKit support |
| Docker Compose | 2.0+ | V2 plugin (`docker compose`, not `docker-compose`) |
| Git | 2.30+ | For cloning the repository |

### Environment Variables

Copy `.env.example` to `.env` and configure at least one authentication method:

```bash
cp .env.example .env
```

**Required (one of):**

| Variable | Purpose |
|---|---|
| `ANTHROPIC_API_KEY` | Anthropic API key for Claude models |
| `CLAUDE_CODE_OAUTH_TOKEN` | Claude Code subscription token (uses Pro/Max credits) |

**For open-source models (alternative to Claude):**

| Variable | Purpose |
|---|---|
| `OPENROUTER_API_KEY` | OpenRouter API key (200+ models) |
| `OPENAI_API_KEY` | OpenAI API key |
| `GOOGLE_API_KEY` | Google Gemini API key |

**Optional:**

| Variable | Purpose | Default |
|---|---|---|
| `GH_TOKEN` | GitHub PAT with `repo` scope for draft PRs | *(none)* |
| `AGENTFIELD_SERVER` | Control plane URL | `http://control-plane:8080` (Docker) |
| `NODE_ID` | Agent node identifier | `swe-planner` |
| `PORT` | Agent listen port | `8003` |
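
A minimal `.env` for a Claude-backed deployment might look like the fragment below. Values are placeholders; include only the variables relevant to your setup:

```bash
# Authentication — exactly one method is required
ANTHROPIC_API_KEY=sk-ant-xxxx

# Optional: enable draft PR creation (PAT needs `repo` scope)
GH_TOKEN=ghp_xxxx

# Optional overrides (defaults shown)
NODE_ID=swe-planner
PORT=8003
```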

### Package Versions

| Package | Minimum Version | Notes |
|---|---|---|
| `agentfield` | 0.1.67+ | Python SDK (includes opencode v1.4+ fix) |
| `claude-agent-sdk` | 0.1.20+ | Claude runtime |
| opencode CLI | 1.4+ | Only if using `open_code` runtime (see Known Issues) |

## Quick Start

### Full Stack (control plane + agent)

```bash
git clone https://github.com/Agent-Field/SWE-AF
cd SWE-AF
cp .env.example .env # fill in API keys
docker compose up -d
```

This starts:
- **control-plane** on `:8080` — AgentField orchestration server
- **swe-agent** on `:8003` — SWE-AF full pipeline (`swe-planner` node)
- **swe-fast** on `:8004` — SWE-AF fast mode (`swe-fast` node)

### Agent Only (connect to existing control plane)

If you already have an AgentField control plane running:

```bash
git clone https://github.com/Agent-Field/SWE-AF
cd SWE-AF
cp .env.example .env # fill in API keys

# Set AGENTFIELD_SERVER in .env to your control plane URL
docker compose -f docker-compose.local.yml up -d
```

### Verify Deployment

```bash
# Check agent health
curl http://localhost:8003/health

# Check control plane (full stack only)
curl http://localhost:8080/api/v1/health
```

## Known Issues and Fixes

### `/workspaces` read-only filesystem error

**Symptom:**
```
[Errno 30] Read-only file system: '/workspaces'
```

**Root cause:** The `/workspaces` directory was not pre-created in the Docker image. When Docker mounts a named volume, it creates the directory as root with restrictive permissions.

**Fix:** This is fixed in the current Dockerfile. If you're using an older image, rebuild:
```bash
docker compose build --no-cache
```

The fix adds `RUN mkdir -p /workspaces && chmod 777 /workspaces` to the Dockerfile before the volume mount point.

**Ref:** [#46](https://github.com/Agent-Field/SWE-AF/issues/46)

### `Product manager failed to produce a valid PRD` with `open_code` runtime

**Symptom:** Builds using the `open_code` runtime fail at the Product Manager step with a generic error. The agent completes in a few seconds (too fast for real work).

**Root cause:** opencode CLI v1.4+ changed its CLI interface:
- `-p` (prompt) flag was removed — prompt is now a positional arg to the `run` subcommand
- `-c` now means `--continue` (resume session), not project directory
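
As an illustration of the interface change (command shapes inferred from the notes above — consult `opencode --help` for your installed version):

```bash
# opencode < 1.4 (old interface): prompt via -p, project directory via -c
opencode -p "Write a PRD for feature X" -c /path/to/project

# opencode >= 1.4: prompt is a positional arg to `run`;
# -c now means --continue (resume session), so cd into the project instead
cd /path/to/project
opencode run "Write a PRD for feature X"
```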

**Fix:** Upgrade the `agentfield` Python SDK to a version that includes the opencode v1.4+ compatibility fix:
```bash
pip install --upgrade agentfield
```

**Ref:** [#45](https://github.com/Agent-Field/SWE-AF/issues/45)

### Fatal API errors silently retry

**Symptom:** A build with exhausted credits or an invalid API key retries multiple times before failing with a misleading error (e.g., "Product manager failed to produce a valid PRD").

**Root cause:** Non-retryable API errors (credit exhaustion, invalid key) were not distinguished from transient errors, causing all retry layers to fire.

**Fix:** This is fixed in the current version. Upgrade to get `FatalHarnessError` detection that immediately aborts on:
- Credit balance too low
- Invalid API key
- Authentication failed
- Account disabled
- Quota exceeded
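
Conceptually, the classification is substring matching on the provider's error message. The sketch below is illustrative only — the shipped implementation lives in `swe_af/execution/fatal_error.py` and its exact patterns may differ:

```python
# Sketch of fatal-vs-transient error classification (illustrative patterns).
FATAL_PATTERNS = (
    "credit balance is too low",
    "invalid api key",
    "authentication failed",
    "account disabled",
    "quota exceeded",
)


class FatalHarnessError(Exception):
    """Non-retryable harness failure — abort the build immediately."""


def is_fatal_error(message: str) -> bool:
    """Return True when the error should bypass every retry layer."""
    msg = message.lower()
    return any(pattern in msg for pattern in FATAL_PATTERNS)
```

Retry loops then re-raise `FatalHarnessError` instead of swallowing it, so the build fails fast with the real cause.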

**Ref:** [#49](https://github.com/Agent-Field/SWE-AF/issues/49)

### Parallel builds cross-contamination

**Symptom:** Running two builds simultaneously for the same repository causes agents to receive input from the wrong build.

**Root cause:** Both builds cloned to the same workspace path (`/workspaces/<repo-name>`), sharing git state and artifacts.

**Fix:** This is fixed in the current version. Each build now gets an isolated workspace: `/workspaces/<repo-name>-<build_id>`.
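
The isolation scheme can be sketched as follows (mirrors the `build_id` logic in `swe_af/app.py`; the helper name here is hypothetical):

```python
import uuid


def workspace_path(repo_name: str, root: str = "/workspaces") -> str:
    """Derive a build-scoped clone directory so concurrent builds never
    share git state, artifacts, or worktrees."""
    build_id = uuid.uuid4().hex[:8]  # short unique ID, generated per build
    return f"{root}/{repo_name}-{build_id}"
```

Because the `build_id` suffix is generated before workspace setup, two simultaneous builds of the same repository land in distinct directories.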

**Ref:** [#43](https://github.com/Agent-Field/SWE-AF/issues/43)

## Scaling

### Multiple concurrent builds

Each build automatically gets an isolated workspace. To run multiple builds concurrently:

```bash
# Scale the agent service
docker compose up --scale swe-agent=3 -d
```

### Resource considerations

Each build clones the target repository and runs multiple LLM calls. Plan for:
- **Disk:** ~500MB per concurrent build (repo clone + artifacts)
- **Memory:** ~512MB per agent container
- **Network:** LLM API calls are the bottleneck, not compute

## Troubleshooting

| Symptom | Check |
|---|---|
| Agent not registering with control plane | Verify `AGENTFIELD_SERVER` is reachable from the container |
| Builds timing out | Check API key validity and credit balance |
| `git clone` failures | Verify `GH_TOKEN` has `repo` scope for private repositories |
| Health check failing | Check container logs: `docker compose logs swe-agent` |
2 changes: 1 addition & 1 deletion requirements-docker.txt
@@ -2,6 +2,6 @@
#
# Same runtime dependencies as requirements.txt.

agentfield>=0.1.41
agentfield>=0.1.67
pydantic>=2.0
claude-agent-sdk==0.1.20
2 changes: 1 addition & 1 deletion requirements.txt
@@ -2,6 +2,6 @@
#
# Install: python -m pip install -r requirements.txt

agentfield>=0.1.9
agentfield>=0.1.67
pydantic>=2.0
claude-agent-sdk==0.1.20
28 changes: 17 additions & 11 deletions swe_af/app.py
@@ -195,18 +195,30 @@ async def build(
if repo_url:
cfg.repo_url = repo_url

# Auto-derive repo_path from repo_url when not specified
# Generate build_id BEFORE workspace setup so each concurrent build
# gets a fully isolated workspace (repo clone, artifacts, worktrees).
# Fixes cross-contamination when parallel builds target the same repo.
# Ref: https://github.com/Agent-Field/SWE-AF/issues/43
build_id = uuid.uuid4().hex[:8]

# Auto-derive repo_path from repo_url when not specified.
# Each build gets its own clone directory scoped by build_id to prevent
# concurrent builds from sharing git state, artifacts, or worktrees.
if cfg.repo_url and not repo_path:
repo_path = f"/workspaces/{_repo_name_from_url(cfg.repo_url)}"
repo_name = _repo_name_from_url(cfg.repo_url)
repo_path = f"/workspaces/{repo_name}-{build_id}"

# Multi-repo: derive repo_path from primary repo; _clone_repos handles cloning later
if not repo_path and len(cfg.repos) > 1:
primary = next((r for r in cfg.repos if r.role == "primary"), cfg.repos[0])
repo_path = f"/workspaces/{_repo_name_from_url(primary.repo_url)}"
repo_name = _repo_name_from_url(primary.repo_url)
repo_path = f"/workspaces/{repo_name}-{build_id}"

if not repo_path:
raise ValueError("Either repo_path or repo_url must be provided")

app.note(f"Build starting (build_id={build_id})", tags=["build", "start"])

# Clone if repo_url is set and target doesn't exist yet
git_dir = os.path.join(repo_path, ".git")
if cfg.repo_url and not os.path.exists(git_dir):
@@ -222,8 +234,8 @@
app.note(f"Clone failed (exit {clone_result.returncode}): {err}", tags=["build", "clone", "error"])
raise RuntimeError(f"git clone failed (exit {clone_result.returncode}): {err}")
elif cfg.repo_url and os.path.exists(git_dir):
# Repo already cloned by a prior build — reset to remote default branch
# so git_init creates the integration branch from a clean baseline.
# Repo already exists at this build-scoped path (unlikely but handle gracefully).
# Reset to remote default branch for a clean baseline.
default_branch = cfg.github_pr_base or "main"
app.note(
f"Repo already exists at {repo_path} — resetting to origin/{default_branch}",
@@ -290,12 +302,6 @@
# Resolve runtime + flat model config once for this build.
resolved = cfg.resolved_models()

# Unique ID for this build — namespaces git branches/worktrees to prevent
# collisions when multiple builds run concurrently on the same repository.
build_id = uuid.uuid4().hex[:8]

app.note(f"Build starting (build_id={build_id})", tags=["build", "start"])

# Compute absolute artifacts directory path for logging
abs_artifacts_dir = os.path.join(os.path.abspath(repo_path), artifacts_dir)

9 changes: 9 additions & 0 deletions swe_af/execution/coding_loop.py
@@ -22,6 +22,7 @@
from typing import Callable


from swe_af.execution.fatal_error import FatalHarnessError
from swe_af.execution.schemas import (
DAGState,
ExecutionConfig,
@@ -325,6 +326,8 @@ async def _run_default_path(
timeout=timeout,
label=f"review:{issue_name}:default",
)
except FatalHarnessError:
raise
except Exception as e:
if note_fn:
note_fn(
@@ -437,6 +440,8 @@ async def _run_flagged_path(
tags=["coding_loop", "review_error", issue_name],
)
review_result = {"approved": True, "blocking": False, "summary": f"Review unavailable: {review_result}"}
except FatalHarnessError:
raise
except Exception as e:
if note_fn:
note_fn(
@@ -479,6 +484,8 @@ async def _run_flagged_path(
timeout=timeout,
label=f"synthesizer:{issue_name}:iter{iteration}",
)
except FatalHarnessError:
raise
except Exception as e:
if note_fn:
note_fn(
@@ -625,6 +632,8 @@ async def run_coding_loop(
timeout=timeout,
label=f"coder:{issue_name}:iter{iteration}",
)
except FatalHarnessError:
raise
except Exception as e:
if note_fn:
note_fn(
9 changes: 9 additions & 0 deletions swe_af/execution/dag_executor.py
@@ -11,6 +11,7 @@

from swe_af.execution.dag_utils import apply_replan, find_downstream
from swe_af.execution.envelope import unwrap_call_result
from swe_af.execution.fatal_error import FatalHarnessError
from swe_af.execution.schemas import (
AdvisorAction,
DAGState,
@@ -576,6 +577,8 @@ async def _cleanup_single_repo(
f"cleaned={result.get('cleaned', [])}",
tags=["execution", "worktree_cleanup", "warning"],
)
except FatalHarnessError:
raise
except Exception as e:
if note_fn:
note_fn(
@@ -868,6 +871,8 @@ async def _execute_single_issue(
timeout=config.agent_timeout_seconds,
label=f"issue_advisor:{issue_name}:{advisor_round + 1}",
)
except FatalHarnessError:
raise
except Exception as e:
if note_fn:
note_fn(
@@ -1064,6 +1069,8 @@ async def _run_execute_fn(
attempts=attempt,
)

except FatalHarnessError:
raise
except Exception as e:
last_error = str(e)
last_context = traceback.format_exc()
@@ -1095,6 +1102,8 @@
"retry_diagnosis": advice.get("diagnosis", ""),
}
continue
except FatalHarnessError:
raise
except Exception:
continue
elif attempt <= config.max_retries_per_issue:
4 changes: 4 additions & 0 deletions swe_af/execution/envelope.py
@@ -12,6 +12,8 @@

from __future__ import annotations

from swe_af.execution.fatal_error import FatalHarnessError, is_fatal_error

# Keys present in the execution envelope returned by _build_execute_response.
_ENVELOPE_KEYS = frozenset({
"execution_id", "run_id", "node_id", "type", "target",
@@ -51,6 +53,8 @@ def unwrap_call_result(result, label: str = "call"):
status = str(result.get("status", "")).lower()
if status in ("failed", "error", "cancelled", "timeout"):
err = result.get("error_message") or result.get("error") or "unknown"
if is_fatal_error(str(err)):
raise FatalHarnessError(str(err))
raise RuntimeError(f"{label} failed (status={status}): {err}")

inner = result.get("result")