
[FEATURE]: LLM command-approval classifier ("auto mode") for permission gating #33585

Description

@thomaslwang

Feature hasn't been suggested before.

  • I have verified this feature I'm about to request hasn't been suggested before.

(Closest prior issues, and how this differs, are listed at the bottom — this proposal is distinct in mechanism.)

Describe the enhancement you want to request

Problem. opencode's permission model is static and per-rule: each rule resolves to allow (auto-approve), ask (prompt every time), or deny. In agentic sessions there is no middle ground: nothing inspects an actual bash command / webfetch / MCP call and decides "this specific one is safe, don't prompt." Users end up either approving constantly or allow-listing broadly and losing the guardrail.

Proposal. An opt-in classifier (modeled on Claude Code's "auto mode") that gates only the path that would otherwise auto-approve: when a rule resolves to allow, a model is consulted first. Its verdict maps to one of three outcomes:

  • allow → proceed silently (as today)
  • block → deny-and-continue: the call returns a tool error that the agent can react to (the session does not halt)
  • classifier error / escalation → fail closed (human prompt)

It never overrides an explicit user deny/ask.
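The decision flow above can be sketched roughly as follows. This is a minimal illustration, not the reference branch's code: every name here (`gate`, `Rule`, `Verdict`, `Outcome`, `Classifier`) is hypothetical, and the classifier is passed in as a plain async function so the sketch stays independent of any model backend.

```typescript
// Hypothetical names throughout; none of these are opencode's actual API.
type Rule = "allow" | "ask" | "deny";
type Verdict = "allow" | "block" | "escalate";

type Outcome =
  | { kind: "proceed" }                      // run the tool silently, as today
  | { kind: "tool-error"; message: string }  // deny-and-continue
  | { kind: "prompt-user" }                  // fail closed to a human prompt
  | { kind: "deny" };

interface ToolCall {
  tool: string;
  input: unknown;
}

type Classifier = (call: ToolCall) => Promise<Verdict>;

async function gate(
  rule: Rule,
  call: ToolCall,
  classify: Classifier | undefined,
): Promise<Outcome> {
  // Explicit user decisions are never overridden by the classifier.
  if (rule === "deny") return { kind: "deny" };
  if (rule === "ask") return { kind: "prompt-user" };

  // Classifier not configured (assumed to be the default): behave as today.
  if (!classify) return { kind: "proceed" };

  try {
    const verdict = await classify(call);
    if (verdict === "allow") return { kind: "proceed" };
    if (verdict === "block") {
      return { kind: "tool-error", message: `Blocked by classifier: ${call.tool}` };
    }
    return { kind: "prompt-user" }; // escalation
  } catch {
    return { kind: "prompt-user" }; // classifier error: fail closed
  }
}
```

Note that the classifier is only ever reached from the `allow` branch, so enabling it can make a session stricter but never more permissive than the configured rules.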

Design points (implemented in the reference branch):

  • Pluggable backend. Default uses the user's own configured model via the AI SDK (single-pass <block>yes/no</block>). Pluggable so a local/self-hosted or hosted guardrails model can be slotted in.
  • Reasoning-blind transcript — the classifier sees only user messages + the bare tool-call payload, never assistant prose or prior tool output (prompt-injection + anti-rationalization defense).
  • Full interception path via Permission.ask — covers edits, webfetch, MCP, task/subagent, external-dir; read-only tools short-circuit before any model call.
  • Escalation backstop — per-session denial counters (3 consecutive / 20 total), reset each user turn, so a false positive can't loop forever.
  • classifier config block (sibling of permission); off by default.
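Three of the design points above (fail-closed parsing of the `<block>yes/no</block>` reply, the reasoning-blind transcript, and the escalation backstop) can be illustrated with a small sketch. All identifiers are hypothetical and do not come from the reference branch. Two assumptions are made for illustration: the parser accepts only a reply that consists of exactly one tag, and the per-turn reset clears both counters.

```typescript
// Hypothetical sketch; only the <block>yes/no</block> format and the
// 3-consecutive / 20-total limits are taken from the proposal text.
type Verdict = "allow" | "block" | "escalate";

type Message =
  | { role: "user"; text: string }
  | { role: "assistant"; text: string }
  | { role: "tool"; output: string };

interface ToolCall {
  tool: string;
  input: unknown;
}

// Fail-closed parsing: anything but a bare, well-formed tag escalates.
function parseVerdict(raw: string): Verdict {
  const body = raw.trim().toLowerCase();
  if (body === "<block>yes</block>") return "block";
  if (body === "<block>no</block>") return "allow";
  return "escalate";
}

// Reasoning-blind transcript: user messages plus the bare tool-call payload.
// Assistant prose and prior tool output are dropped entirely.
function buildTranscript(history: Message[], call: ToolCall): string {
  const lines: string[] = [];
  for (const m of history) {
    if (m.role === "user") lines.push(`USER: ${m.text}`);
  }
  lines.push(`TOOL_CALL: ${JSON.stringify(call)}`);
  return lines.join("\n");
}

// Escalation backstop: counts denials and signals when a human should step in.
class DenialTracker {
  private consecutive = 0;
  private total = 0;
  private readonly maxConsecutive = 3;
  private readonly maxTotal = 20;

  // Returns true once either limit is reached.
  record(verdict: Verdict): boolean {
    if (verdict === "block") {
      this.consecutive += 1;
      this.total += 1;
    } else if (verdict === "allow") {
      this.consecutive = 0;
    }
    return this.consecutive >= this.maxConsecutive || this.total >= this.maxTotal;
  }

  // Called at the start of each user turn (assumed to clear both counters).
  resetForUserTurn(): void {
    this.consecutive = 0;
    this.total = 0;
  }
}
```

Because `buildTranscript` never reads assistant or tool messages, text injected through a fetched page or a tool result has no route into the classifier prompt, and the agent cannot argue its own case for a risky command.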

Reference implementation: openguardrails/opencode@feat/auto-mode-classifier (draft PR #33586). bun run typecheck passes cleanly; unit tests cover reasoning-blindness, fail-closed parsing, the allowlist, and the policy slots.

Closest prior issues (and how this differs):

Happy to align with any of these and to adapt to the team's preferred shape (naming, config location, experimental-flag gating).
