Skip to content

Add LLM Auto Context Compaction#304

Draft
howard0su wants to merge 4 commits into
Luce-Org:mainfrom
howard0su:auto_compact
Draft

Add LLM Auto Context Compaction#304
howard0su wants to merge 4 commits into
Luce-Org:mainfrom
howard0su:auto_compact

Conversation

@howard0su
Copy link
Copy Markdown
Contributor

Summary

Server-side context compaction that automatically compresses conversation history when prompts overflow the context window. Uses the prefill-drafter (Qwen3-0.6B) as a dedicated summarization backend — fast and doesn't block the main model.

3-Layer Pipeline

Layer Strategy Cost
1 Edit compaction — strip blocks, truncate/dedupe tool outputs CPU, <1ms
2 Self-summarization — use prefill drafter model condenses older history into summary GPU (0.6B), ~1-3s
3 Hard truncation — progressive tail-keeping as last resort CPU

Key Design Decisions

  • Enabled by default — triggered by client context_management in request body
  • Prefill-drafter reused — when --prefill-drafter is present, Qwen3-0.6B handles summarization (avoids tying up the 27B+ target)
  • Summary length scales with max_ctx — max_ctx / 10, clamped [256, 2048]. No separate knob needed.
  • Stateless — no per-session state; compaction operates on the provided messages array only

Files Changed (9 files, +2162 / -39)

New:

  • server/src/server/compaction.h/.cpp — compaction logic
  • harness/test_compaction.py — integration tests
  • docs/llm-context-compaction.md — research report (38KB)
  • docs/small-model-compression.md — small model research (16KB)

Modified:

  • server/src/server/http_server.h/.cpp — threshold check in route_request(), pipeline in worker_loop(), usage reporting
  • server/src/server/server_main.cpp — CLI flags, compaction backend creation
  • server/CMakeLists.txt — added source file

API

Request body (Responses API):

 { "context_management": [{"type": "compaction", "compact_threshold": 4096}] }

Response includes:

 { "usage": { "compacted_tokens_saved": 1044 } }

Testing

 python harness/test_compaction.py --server-bin build/dflash_server \
   --model <target.gguf> --prefill-drafter server/models/Qwen3-0.6B-BF16.gguf

howard0su and others added 4 commits May 29, 2026 07:35
Implement server-side context compaction triggered when prompt tokens
exceed a configurable threshold (default 90% of max_ctx):

- Layer 1: Edit compaction (strip thinking blocks, truncate/dedupe tool
  outputs) — CPU only, <1ms
- Layer 2: Self-summarization via internal generate() pass — condenses
  older conversation history into a concise summary
- Layer 3: Hard truncation — progressive tail-keeping as last resort

New CLI flags: --compaction, --compaction-threshold,
--compaction-max-tokens, --compaction-keep-recent

API: context_management parameter in Responses API allows per-request
threshold override. Response includes usage.compacted_tokens_saved.

Includes integration test harness (harness/test_compaction.py) and
research documentation in docs/.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When --prefill-drafter is present and --compaction is enabled, create a
dedicated Qwen3Backend from the drafter GGUF for Layer 2 summarization.
This avoids tying up the main target model (27B+) for summary generation
and is much faster (~0.6B inference for short summaries).

The compaction backend shares the drafter_tokenizer already loaded for
pflash. If the backend fails to initialize, falls back gracefully to
using the main model for summarization.

Also adds --prefill-drafter flag support to harness/test_compaction.py.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Compaction is now always enabled server-side. Triggering is driven by
client HTTP request body (context_management parameter). Added
--no-compaction to explicitly disable if needed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Summary output length is always max_ctx/10 clamped to [256, 2048].
No reason for it to be independently configurable — it should always
scale with context size.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 29, 2026
Record the 2026-05-28 20:23 cron revalidation: upstream and carried PR heads remain current, draft Luce-Org#304 is excluded, fresh conflicted-PR probes were retained, and a tmux-driven Codex inspection keeps Luce-Org#135 as a designed current-layout port instead of a mechanical merge.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant