Two-Phase Threat Detection: Fast Binary Triage + Agentic Reasoning
Problem
The current threat detection job always invokes a full agentic call with reasoning for every workflow run, regardless of whether the content is benign or malicious. This is:
- Slow on the happy path: The vast majority of runs are benign, yet every run pays the full cost of a reasoning-heavy model call that reads files, analyzes patches, and produces structured JSON output.
- Expensive: Each detection invocation uses a full agentic context window with tool use (file reading, workspace analysis), even when the content is trivially safe.
- Fragile on output parsing: The current single-shot approach asks the model to produce a specific THREAT_DETECTION_RESULT:{...} JSON line within free-form reasoning output, leading to frequent parse failures (multi-line reasons, stream-json wrapping, duplicate entries).
Current State
The detection job (pkg/workflow/threat_detection.go) compiles into a GitHub Actions job with these key steps:
- Setup: Downloads agent artifacts (output, patch, prompt) to /tmp/gh-aw/threat-detection/
- Prompt construction (actions/setup/js/setup_threat_detection.cjs): Reads the template from actions/setup/md/threat_detection.md (or the embedded version at pkg/workflow/prompts/threat_detection.md), substitutes placeholders ({AGENT_OUTPUT_FILE}, {AGENT_PATCH_FILE}, etc.), and writes the final prompt to /tmp/gh-aw/aw-prompts/prompt.txt
- Engine execution (buildDetectionEngineExecutionStep): Runs the agentic engine (Copilot/Claude/Codex/Gemini) inside AWF (network-blocked sandbox) with the full prompt. The engine has file-reading tools (bash: ["*"]), reads the agent output and patch files, reasons about threats, and writes output to detection.log.
- Result parsing (actions/setup/js/parse_threat_detection_results.cjs): Scans detection.log for lines matching THREAT_DETECTION_RESULT:{...}, parses JSON, validates boolean fields, and sets job outputs (detection_success, detection_conclusion).
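The placeholder substitution in the prompt construction step can be pictured roughly as follows. This is a minimal sketch, not the actual setup_threat_detection.cjs; the function name renderPrompt and the rule of leaving unknown placeholders untouched are assumptions made for illustration.

```javascript
// Hypothetical sketch of {PLACEHOLDER} substitution in a prompt template.
// renderPrompt is an illustrative name, not a function from the repository.
function renderPrompt(template, values) {
  // Match tokens such as {AGENT_OUTPUT_FILE} or {AGENT_PATCH_FILE}.
  return template.replace(/\{([A-Z_]+)\}/g, (match, name) => {
    // Using a replacer function means the substituted value is inserted
    // literally, so sequences like "$&" in the value are not interpreted.
    return Object.prototype.hasOwnProperty.call(values, name)
      ? String(values[name])
      : match; // assumption: unknown placeholders are left as-is
  });
}
```

Using a replacer function rather than a replacement string matters if substituted values (for example a model response) may contain `$` sequences.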
The prompt template (actions/setup/md/threat_detection.md) asks the model to:
- Read the workflow prompt file, agent output file, and patch/bundle files
- Analyze for prompt injection, secret leaks, and malicious patches
- Output exactly one line:
THREAT_DETECTION_RESULT:{"prompt_injection":false,"secret_leak":false,"malicious_patch":false,"reasons":[]}
This is a single-phase, all-or-nothing approach where every invocation has the same cost.
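To make the parsing fragility concrete, here is a minimal sketch of what scanning a log for the result line involves. It is not the actual parse_threat_detection_results.cjs; the function name parseDetectionLog, the return shape, and the choice to reject anything other than exactly one result line are assumptions.

```javascript
// Hypothetical sketch of result-line parsing; names and return shape are illustrative.
const RESULT_PREFIX = "THREAT_DETECTION_RESULT:";
const BOOLEAN_FIELDS = ["prompt_injection", "secret_leak", "malicious_patch"];

function parseDetectionLog(logText) {
  // Collect every line that starts with the result prefix.
  const candidates = logText
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.startsWith(RESULT_PREFIX));

  // Assumption: zero or duplicate result lines both count as parse failures.
  if (candidates.length !== 1) {
    return { ok: false, error: `expected exactly 1 result line, found ${candidates.length}` };
  }

  let verdict;
  try {
    verdict = JSON.parse(candidates[0].slice(RESULT_PREFIX.length));
  } catch (err) {
    // A reason string that spans multiple lines truncates the JSON and lands here.
    return { ok: false, error: `invalid JSON: ${err.message}` };
  }
  if (verdict === null || typeof verdict !== "object") {
    return { ok: false, error: "result is not a JSON object" };
  }

  // Validate that each threat flag is a real boolean.
  for (const field of BOOLEAN_FIELDS) {
    if (typeof verdict[field] !== "boolean") {
      return { ok: false, error: `field "${field}" must be a boolean` };
    }
  }

  const threatDetected = BOOLEAN_FIELDS.some((field) => verdict[field]);
  return { ok: true, verdict, threatDetected };
}
```

Each early return corresponds to one of the failure modes listed under Problem, which is why a retry path for this step is worth having.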
Desired State
A two-phase detection architecture:
Phase 1: Binary Triage (fast, cheap)
- A lightweight, non-agentic model call with no tools (no file reading, no bash)
- The prompt includes the agent output and patch inline (or a summary) rather than as file references
- The model is asked to respond with a single token: yes or no
  - yes = threat suspected → proceed to Phase 2
  - no = no threat detected → conclude the job successfully
- Token limit: Set max_tokens=1 (or equivalent per engine) to force a single-token response
- Bias toward alerting: The prompt explicitly states "If you are unsure, respond yes"
- Retry logic: If the response is neither yes nor no:
  - Send the output back to the model with a correction prompt ("You must respond with exactly yes or no")
  - Retry up to N times (e.g., 3)
  - If still unparseable after retries, default to yes (proceed to Phase 2, i.e., fail-safe toward alerting)
- Normalization: Accept equivalent forms of the token by trimming whitespace and lowercasing the response before matching.
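The Phase 1 rules above (normalize, retry with a correction prompt, default to yes) can be sketched as below. This is an illustration under assumptions, not the proposed triage_threat_detection.cjs: callModel is an injected stand-in for the engine call, and the function names and return shape are hypothetical.

```javascript
// Hypothetical sketch of the Phase 1 decision logic; all names are illustrative.

// Trim and lowercase, then accept only the two valid tokens.
function normalizeTriageResponse(raw) {
  const token = String(raw ?? "").trim().toLowerCase();
  return token === "yes" || token === "no" ? token : null;
}

// callModel(prompt) is an injected async function standing in for the
// non-agentic engine call (no tools, output limited to one token).
async function runTriage(callModel, triagePrompt, maxRetries = 3) {
  let prompt = triagePrompt;
  // One initial attempt plus up to maxRetries corrections.
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const raw = await callModel(prompt);
    const result = normalizeTriageResponse(raw);
    if (result !== null) {
      return { result, attempts: attempt + 1, defaulted: false };
    }
    // Feed the invalid output back with an explicit correction.
    prompt =
      `${triagePrompt}\n\n` +
      `Your previous response was: ${JSON.stringify(String(raw ?? ""))}\n` +
      "You must respond with exactly yes or no.";
  }
  // Fail-safe toward alerting: unparseable after retries means escalate.
  return { result: "yes", attempts: maxRetries + 1, defaulted: true };
}
```

Injecting callModel keeps the decision logic testable without an engine; a caller would pass whatever function performs the real request.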
Phase 2: Agentic Reasoning (only on suspected threats)
- The full agentic call as it exists today
- Full file reading tools, workspace context, detailed analysis
- Produces the structured THREAT_DETECTION_RESULT:{...} JSON output
- If the output is not parseable, retry with a correction prompt asking the model to fix its output (potentially using structured output/JSON mode)
- The final verdict from this phase determines the job conclusion
- If the output is still not parseable after retries, error out (fail the workflow)
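A rough sketch of the Phase 2 retry behavior follows. The names runPhase2, invokeEngine, and parseResult are hypothetical, both dependencies are injected stand-ins, and the default of two retries is an assumption since the proposal does not fix a number for this phase.

```javascript
// Hypothetical sketch of the Phase 2 parse-and-retry loop; names are illustrative.
// invokeEngine(prompt) resolves to the engine's raw text output.
// parseResult(output) returns { ok: true, verdict } or { ok: false, error }.
async function runPhase2(invokeEngine, parseResult, basePrompt, maxRetries = 2) {
  let prompt = basePrompt;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const output = await invokeEngine(prompt);
    const parsed = parseResult(output);
    if (parsed.ok) {
      return parsed.verdict;
    }
    // Append the model's own output plus a correction request.
    prompt = [
      basePrompt,
      "Your previous output was:",
      output,
      `It could not be parsed (${parsed.error}).`,
      "Reply with exactly one line of the form THREAT_DETECTION_RESULT:{...} containing valid JSON.",
    ].join("\n\n");
  }
  // Unlike Phase 1, there is no safe default here: surface an error.
  throw new Error(`THREAT_DETECTION_RESULT not parseable after ${maxRetries + 1} attempts`);
}
```

The asymmetry with Phase 1 is deliberate in this sketch: triage can fall back to yes, but a missing Phase 2 verdict has no safe substitute, so the caller sees a failure.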
Approach
The implementation splits into Go compiler changes (generating the two-phase workflow steps) and JavaScript runtime changes (new scripts for triage and retry logic).
Architecture
Detection Job
├── [existing] Setup, artifact download, guard, etc.
├── [NEW] Phase 1: Binary Triage
│   ├── Inline the agent output + patch summary into the prompt
│   ├── Call engine with max_tokens=1, no tools
│   ├── Parse response: "yes" / "no" / unparseable
│   ├── Retry loop (up to 3 attempts) if unparseable
│   └── Output: triage_result = "yes" | "no"
├── [CONDITIONAL] Phase 2: Agentic Reasoning (if triage_result == "yes" OR retries exhausted)
│   ├── [existing] Setup threat detection prompt
│   ├── [existing] Engine execution with tools
│   ├── [NEW] Retry if THREAT_DETECTION_RESULT not parseable
│   └── [existing] Parse and conclude
└── [MODIFIED] Conclusion step
    ├── If Phase 1 said "no" → conclusion=success, success=true
    ├── If Phase 2 ran → use Phase 2 verdict as today
    └── If all retries failed → conclusion=warning or failure per continue-on-error
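The three branches of the modified conclusion step could be expressed as a small decision function. This is a sketch only: concludeDetection and its parameters are hypothetical names, and the mapping of a Phase 2 verdict to success or failure is an assumption about how the job concludes today.

```javascript
// Hypothetical sketch of the conclusion step; names and mappings are illustrative.
// triageResult: "yes" or "no" from Phase 1.
// phase2Verdict: parsed THREAT_DETECTION_RESULT object, or null if unparseable.
function concludeDetection({ triageResult, phase2Verdict = null, continueOnError = false }) {
  // Phase 1 cleared the run: Phase 2 never executed.
  if (triageResult === "no") {
    return { conclusion: "success", success: true };
  }
  // Phase 2 ran but never produced a parseable verdict.
  if (phase2Verdict === null) {
    return { conclusion: continueOnError ? "warning" : "failure", success: false };
  }
  // Assumption: any true threat flag fails the job, otherwise it succeeds.
  const threat =
    phase2Verdict.prompt_injection === true ||
    phase2Verdict.secret_leak === true ||
    phase2Verdict.malicious_patch === true;
  return threat
    ? { conclusion: "failure", success: false }
    : { conclusion: "success", success: true };
}
```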
Needed Changes
1. Go Compiler Changes (pkg/workflow/threat_detection.go)
- New function buildTriageStep(): Generates a GitHub Actions step that:
  - Reads agent output and patch files inline (or summarizes them)
  - Constructs a triage prompt asking for yes/no only
  - Calls the engine in a non-agentic mode (no tools, max_tokens=1)
  - Sets output triage_result to yes, no, or error
- New function buildTriageRetryStep(): Generates a step that:
  - Checks if triage_result is yes or no
  - If neither, retries with a correction prompt
  - After N retries, defaults to yes
- Modify buildDetectionJobSteps(): Insert the triage steps before the existing engine execution steps, and wrap the engine execution steps in a condition (if: steps.triage.outputs.triage_result == 'yes')
- Modify buildDetectionConclusionStep(): Handle the new case where Phase 1 returned no (skip Phase 2, set conclusion=success)
- New prompt template (pkg/workflow/prompts/threat_detection_triage.md and actions/setup/md/threat_detection_triage.md): A minimal prompt that includes the content inline and asks for a single yes/no token
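2. JavaScript Runtime Changes
- New file actions/setup/js/triage_threat_detection.cjs: The triage module, which sets the step output (triage_result)
- Modify actions/setup/js/parse_threat_detection_results.cjs:
  - Handle the case where Phase 1 returned no (short-circuit to success)
  - If THREAT_DETECTION_RESULT is not parseable, construct a correction prompt and re-invoke the engine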
3. Engine Interface Changes (pkg/workflow/agentic_engine.go)
- New method GetTriageExecutionSteps() (or extend GetExecutionSteps() with a mode parameter): Generates engine-specific steps for the triage call with:
  - No tools/MCP
  - max_tokens=1 (or engine equivalent)
  - No reasoning/thinking mode
  - Stdin/inline prompt instead of file-based prompt
- Each engine implementation (Copilot, Claude, Codex, Gemini) needs to support:
  - Copilot: copilot --max-tokens 1 --no-tools (or equivalent CLI flags)
  - Claude: claude --max-tokens 1 with --no-tools and --output-format text
  - Codex: OpenAI API call with max_tokens: 1 and no tools
  - Gemini: Equivalent Gemini API parameter
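4. Prompt Templates
- New triage prompt (threat_detection_triage.md): A minimal prompt that embeds the content inline, asks for a single yes/no token, and states "If you are unsure, respond yes"
- Retry correction prompt:
  Your previous response was: "{PREVIOUS_RESPONSE}"
  This is not valid. You must respond with EXACTLY one word: "yes" or "no".
  Respond "yes" if you suspect any threat, "no" if the content is safe.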
5. Test Changes
- New test file pkg/workflow/threat_detection_triage_test.go: Tests for the triage step generation, retry logic, and integration with the existing detection flow
- Update actions/setup/js/parse_threat_detection_results.test.cjs: Add tests for the new Phase 1 short-circuit path
- New test file actions/setup/js/triage_threat_detection.test.cjs: Tests for the triage JavaScript module
- Update pkg/workflow/safe_jobs_threat_detection_test.go: Verify the two-phase job structure compiles correctly
6. Configuration
- No frontmatter changes required: the two-phase approach is an internal optimization, transparent to workflow authors
- Optional: Add a threat-detection.triage: false frontmatter option to disable Phase 1 and always use the full agentic call (for debugging or when the triage model is unreliable)
Key Design Decisions
- Fail-safe toward alerting: Any unparseable triage response defaults to yes (proceed to full analysis). This ensures no threats are missed due to model output issues.
- Inline content in triage prompt: Phase 1 does not use tools/file reading; the content is embedded directly in the prompt. This avoids tool overhead and keeps the call simple.
- Engine-specific max_tokens: Each engine has different ways to limit output tokens. The engine interface must abstract this.
- Retry with correction: Rather than immediately escalating to Phase 2 on parse failure, we retry the triage call with explicit instructions. This is cheaper than a full agentic call.
- Phase 2 retry for structured output: If Phase 2's THREAT_DETECTION_RESULT JSON is malformed, retry with the model's output appended and a correction prompt, potentially using structured output mode where available.
Expected Impact
- Roughly 90% or more of runs (the benign ones) are expected to complete detection in Phase 1 alone: a single-token API call with no tools, finishing in seconds
- The remaining ~10% of runs (suspicious or unparseable) proceed to Phase 2, at the same cost as today plus the small triage call
- Net reduction in average detection time and cost of approximately 80-90%
- No intended change in security posture: fail-safe defaults ensure suspicious or unparseable content always gets full analysis
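The 80-90% figure can be sanity-checked with a small expected-cost calculation. In this sketch every run pays the triage cost and only escalated runs also pay the full agentic cost; the function name and the example cost ratios (triage at 2% and 10% of a full call) are illustrative assumptions, not measurements.

```javascript
// Hypothetical back-of-the-envelope model; all inputs are illustrative.
// benignFraction: share of runs that stop after Phase 1 (e.g., 0.9).
// triageCostRatio: cost of one triage call relative to one full agentic call.
function expectedCostReduction(benignFraction, triageCostRatio) {
  // Average cost relative to today's single-phase cost of 1.0:
  // every run pays triage, and escalated runs also pay the full call.
  const relativeCost = triageCostRatio + (1 - benignFraction) * 1.0;
  return 1 - relativeCost;
}

// With 90% of runs stopping at Phase 1:
//   triage at 2% of a full call  -> reduction of about 0.88
//   triage at 10% of a full call -> reduction of about 0.80
console.log(expectedCostReduction(0.9, 0.02).toFixed(2));
console.log(expectedCostReduction(0.9, 0.1).toFixed(2));
```

Under this model the reduction can never exceed the benign fraction itself, so the upper end of the estimate depends on at least 90% of runs being cleared in Phase 1.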