5 changes: 5 additions & 0 deletions _config.yml
@@ -37,5 +37,10 @@ callouts:
# Makes Aux links open in a new tab. Default is false
aux_links_new_tab: true

# Enable mermaid diagrams in fenced ```mermaid code blocks.
# https://just-the-docs.com/docs/ui-components/code/#mermaid-diagram-code-blocks
mermaid:
  version: "10.9.0"

kramdown:
  syntax_highlighter: coderay
159 changes: 159 additions & 0 deletions technical/guardrails.md
@@ -0,0 +1,159 @@
---
title: Guardrails
parent: Technical documentation
has_children: false
nav_order: 7
---

# LiteLLM guardrails

Currently shipped guardrails:

| Guardrail | Type | File | What it does |
|----------------------------|----------|------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `MessageTrimmingGuardrail` | Pre-call | [`message-trimming`](https://github.com/os2ai/helm-deployments/blob/develop/applications/litellm/templates/message-trimming-config.yaml) | Trims oversized message histories to fit the target model's context window, then sanitizes tool-call/tool-response pairings so the trimmed (or otherwise broken) history doesn't crash strict chat templates. |

Pre-call guardrails in the [LiteLLM](https://github.com/BerriAI/litellm) proxy apply to inbound chat requests before
they are forwarded to the upstream model.

## Usage

The message trimming guardrail is configured in the LiteLLM
[values file](https://github.com/os2ai/helm-deployments/blob/develop/applications/litellm/litellm-values.yaml#L108)
of the Helm chart.

__Note__: the default configuration includes an option to set max tokens for a named model, which overrides the
global default max tokens value. This is useful for models whose context window differs from the global default.

```yaml
model_list:
  - model_name: my-model
    litellm_params:
      model: openai/some-deployed-model
      api_base: https://...
      api_key: ""
      max_tokens: 8192
    guardrails:
      # attach the guardrail to this model
      - message_trimming

guardrails:
  - guardrail_name: message_trimming
    litellm_params:
      guardrail: /app/custom_guardrails/message_overflow.MessageTrimmingGuardrail
      mode: pre_call
      default_on: true
      default_config:
        trim_ratio: 0.75
        max_output_tokens: 2000
        safety_buffer: 500
        debug: false
        default_max_context_tokens: 8192
        max_context_tokens_by_model:
          openai/some-deployed-model: 32768
        pop_trailing_tool_messages: false
```
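
Since the guardrail is `default_on`, every chat completion routed through the proxy passes through it. For example,
with the OpenAI SDK pointed at the proxy (base URL and key below are placeholders):

```python
from openai import OpenAI

# The LiteLLM proxy exposes an OpenAI-compatible API; address and key
# here are illustrative placeholders for your deployment.
client = OpenAI(base_url="http://localhost:4000/v1", api_key="sk-anything")

resp = client.chat.completions.create(
    model="my-model",  # the friendly model_name from model_list above
    messages=[{"role": "user", "content": "Summarize our last conversation."}],
)
print(resp.choices[0].message.content)
```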

## How Message Trimming works

`async_pre_call_hook` runs on every chat completion request. The flow:

1. __Resolve context window__ for the target model (`_resolve_max_context_tokens`):
per-model override map → `litellm.get_max_tokens` → global default. Logs a warning if it falls through to the global
default.
2. __Compute a safe completion budget__ (`_calculate_safe_completion_tokens`) — leaves room for input + safety buffer +
a 25% headroom factor for tokens LiteLLM/the provider may add later.
3. __Update `max_tokens` / `max_completion_tokens`__ in the request so the model can't be asked for more than fits.
4. __Trim input messages__ (`litellm.trim_messages`) if `current_input_tokens > max_input_tokens`, dropping older
messages from the head until it fits.
5. __Sanitize__ (`_sanitize_messages`):
   - `_repair_tool_call_pairings` — strip orphan `role: tool` messages and orphan `tool_calls` entries that the
     trimmer may have created.
   - (Optional, opt-in via `pop_trailing_tool_messages`) pop trailing `role: tool` messages and re-run the repair,
     then append `"Please continue"` if the new terminus is an assistant message.
6. __Recount and re-budget__ completion tokens once more, since sanitize may have grown or shrunk the message list.
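
A condensed sketch of this flow, under stated assumptions: the hook name, config keys, and `litellm` calls come from
this page, while the helper logic is inlined and simplified rather than copied from the guardrail:

```python
import litellm

async def async_pre_call_hook(data: dict, model: str, cfg: dict) -> dict:
    """Illustrative sketch only -- not the guardrail's actual signature."""
    messages = data["messages"]

    # 1. Resolve the context window: override map -> litellm -> global default.
    max_context = cfg.get("max_context_tokens_by_model", {}).get(model)
    if not max_context:
        try:
            max_context = litellm.get_max_tokens(model)
        except Exception:
            max_context = None
    if not max_context:
        max_context = cfg["default_max_context_tokens"]  # warn: fell through

    # 2./3. Budget completion tokens and cap the request. Reserving ~25%
    # headroom covers tokens LiteLLM/the provider may add later.
    input_tokens = litellm.token_counter(model=model, messages=messages)
    requested = (data.get("max_tokens") or data.get("max_completion_tokens")
                 or cfg["max_output_tokens"])
    available = max_context - input_tokens - cfg["safety_buffer"]
    data["max_tokens"] = max(1, min(requested, int(available * 0.75)))

    # 4. Trim older messages from the head if the input overflows its budget.
    max_input = max_context - data["max_tokens"] - cfg["safety_buffer"]
    if input_tokens > max_input:
        messages = litellm.trim_messages(messages, model,
                                         trim_ratio=cfg["trim_ratio"])

    # 5. Sanitize tool-call pairings (sketched in the next section), then
    # 6. recount tokens and re-budget the completion once more.
    data["messages"] = messages
    return data
```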

### Why `_repair_tool_call_pairings` exists

LiteLLM's built-in `trim_messages` has __no tool-call awareness__ — it drops messages by token count from the head and
freely produces:

- Orphan `role: tool` messages (no surviving `assistant.tool_calls` advertised them).
- Orphan `tool_calls` entries on assistant messages (no surviving `role: tool` answered them).

Both shapes are rejected by strict chat templates (Mistral, vLLM, OpenAI strict mode). The repair pass enforces the
invariant: every surviving `tool_calls[].id` has a later matching `role: tool` message, and every surviving `role: tool`
was advertised by an earlier surviving `assistant.tool_calls` entry. See `_repair_tool_call_pairings` in [`message-trimming`](https://github.com/os2ai/helm-deployments/blob/develop/applications/litellm/templates/message-trimming-config.yaml).
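
A minimal sketch of that invariant, ignoring message ordering and the content-empty-assistant cleanup for brevity:

```python
def repair_tool_call_pairings(messages: list[dict]) -> list[dict]:
    """Sketch: drop orphan tool results and orphan tool_calls entries."""
    # IDs advertised by surviving assistant messages.
    advertised = {tc["id"]
                  for m in messages if m.get("role") == "assistant"
                  for tc in m.get("tool_calls") or []}
    # IDs answered by surviving tool messages.
    answered = {m["tool_call_id"]
                for m in messages if m.get("role") == "tool"}

    repaired = []
    for m in messages:
        # Orphan tool result: no surviving assistant advertised this id.
        if m.get("role") == "tool" and m["tool_call_id"] not in advertised:
            continue
        # Orphan tool_calls: keep only entries some tool message answers.
        if m.get("role") == "assistant" and m.get("tool_calls"):
            kept = [tc for tc in m["tool_calls"] if tc["id"] in answered]
            m = dict(m, tool_calls=kept) if kept else \
                {k: v for k, v in m.items() if k != "tool_calls"}
        repaired.append(m)
    return repaired
```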

### Why the trailing-tool pop is opt-in

The "normal" agent-loop shape ends on a `role: tool` message:

```mermaid
flowchart LR
U[User] --> A["Assistant{tool_calls}"]
A --> T["Tool{result}"]
T --> C([model is asked to continue here])
```

Most providers (OpenAI, Anthropic, Google, Mistral via the official APIs) __accept__ this shape — that's how tool
calling works. Popping the tool message and substituting `"Please continue"` deprives the model of the result it was
supposed to reason from, so the default is __off__.

Set `pop_trailing_tool_messages: true` only for upstream chat templates that explicitly reject `role: tool` messages —
notably the strict HuggingFace template that raises `"Only user and assistant roles are supported!"`. The per-model
override map lets you flip it for one model in a fleet without affecting the others.
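
For example, flipping the flag for a single strict-template deployment while the fleet default stays off (the model
key below is illustrative):

```yaml
default_config:
  pop_trailing_tool_messages: false            # fleet default
  pop_trailing_tool_messages_by_model:
    # keyed by the upstream `model:` value, not the friendly `model_name`
    huggingface/some-strict-model: true
```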

### Why both repairs run when pop is enabled

The order is `repair → pop → repair → maybe-append-continue`:

- The first repair cleans up orphans created by `trim_messages`.
- The pop may break a previously-valid `[Assistant{tool_calls=[X]}, Tool X]` pair, leaving the assistant holding orphan
`tool_calls`.
- The second repair restores the invariant — strips the now-orphan `tool_calls`, drops content-empty assistants
entirely.
- *Then* we decide whether to append `"Please continue"`, after seeing the post-repair terminus. (Appending before would
risk leaving a stale "user-continue" line after a now-deleted assistant.)
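
A worked example of the net effect, with an abbreviated, hypothetical message history:

```python
history = [
    {"role": "user", "content": "What's the weather?"},
    {"role": "assistant", "content": None,
     "tool_calls": [{"id": "X", "type": "function",
                     "function": {"name": "get_weather", "arguments": "{}"}}]},
    {"role": "tool", "tool_call_id": "X", "content": "Sunny, 21C"},
]

# First repair: no-op, the [assistant{tool_calls=[X]}, tool X] pair is valid.
# Pop: removes the trailing tool message, orphaning tool_calls=[X].
history.pop()
# Second repair: strips the orphan tool_calls; the assistant carries no
# content either, so it is dropped entirely. Net effect here:
history = [m for m in history if m["role"] != "assistant"]
# Terminus is now the user message, so no "Please continue" is appended.
assert history[-1]["role"] == "user"
```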

## Configuration reference

Settings are read from the `default_config` of the guardrail entry in `litellm_config.yaml`. All keys are optional.

| Key | Type | Default | Purpose |
|---------------------------------------|-------|---------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `trim_ratio` | float | `0.75` | Forwarded to `litellm.trim_messages`. Fraction of `max_tokens` that trimming aims for, leaving headroom for additions later in the pipeline. |
| `max_output_tokens` | int | `2000` | Default completion budget when the request specifies neither `max_tokens` nor `max_completion_tokens`. |
| `safety_buffer` | int | `500` | Reserved tokens carved out of the context window before computing input/output budgets — covers system prompts, function schemas, and other tokens added downstream. |
| `debug`                                | bool  | `false` | When `true`, the guardrail prints `[GUARDRAIL]`-prefixed traces to stdout; they show up in `task compose -- logs -f litellm`. |
| `default_max_context_tokens` | int | `8192` | Fallback context-window size when neither `max_context_tokens_by_model` nor `litellm.get_max_tokens` resolves the model. __Bump this if your fleet's smallest model is bigger than 8k.__ |
| `max_context_tokens_by_model` | dict | `{}` | Per-model overrides keyed by the upstream `model:` value LiteLLM forwards (NOT the friendly `model_name`). Wins over `litellm.get_max_tokens`. Use this for vLLM, Bedrock variants, custom deployments — anything not in [`litellm/model_prices_and_context_window.json`](https://github.com/BerriAI/litellm/blob/main/model_prices_and_context_window.json). |
| `pop_trailing_tool_messages` | bool | `false` | Strip trailing `role: tool` messages before forwarding. __Leave `false` unless the upstream chat template rejects them__ — popping loses tool-call results the model needs to reason from. |
| `pop_trailing_tool_messages_by_model` | dict | `{}` | Per-model override of the flag above, same key shape as `max_context_tokens_by_model`. |

### Resolution order, illustrated

__Context window__ — first hit wins:

```mermaid
flowchart TD
A["max_context_tokens_by_model[model]"] -->|miss| B["litellm.get_max_tokens(model)"]
B -->|raises / 0| C[default_max_context_tokens]
A -. hit .-> H((use value))
B -. hit .-> H
C --> H
```

__Pop trailing tools__ — first hit wins:

```mermaid
flowchart TD
A["pop_trailing_tool_messages_by_model[model]"] -->|miss| B[pop_trailing_tool_messages]
A -. hit .-> H((use value))
B --> H
```
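
Both lookups reduce to the same first-hit-wins pattern. A minimal sketch, assuming standalone function forms of the
resolvers:

```python
import litellm

def resolve_max_context_tokens(model: str, cfg: dict) -> int:
    """First hit wins: override map -> litellm lookup -> global default."""
    override = cfg.get("max_context_tokens_by_model", {}).get(model)
    if override:
        return override
    try:
        known = litellm.get_max_tokens(model)  # raises for unknown models
        if known:
            return known
    except Exception:
        pass
    # Fell through -- the guardrail logs a warning at this point.
    return cfg.get("default_max_context_tokens", 8192)

def resolve_pop_trailing_tool_messages(model: str, cfg: dict) -> bool:
    """Per-model override wins over the global flag."""
    per_model = cfg.get("pop_trailing_tool_messages_by_model", {})
    if model in per_model:
        return per_model[model]
    return cfg.get("pop_trailing_tool_messages", False)
```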

## References

- [LiteLLM custom guardrail docs](https://docs.litellm.ai/docs/proxy/guardrails/custom_guardrail)