Add LLM Auto Context Compaction #304
Draft
howard0su wants to merge 4 commits into
Implement server-side context compaction triggered when prompt tokens exceed a configurable threshold (default 90% of max_ctx):
- Layer 1: Edit compaction (strip thinking blocks, truncate/dedupe tool outputs), CPU only, <1ms
- Layer 2: Self-summarization via an internal generate() pass that condenses older conversation history into a concise summary
- Layer 3: Hard truncation, with progressive tail-keeping as a last resort

New CLI flags: --compaction, --compaction-threshold, --compaction-max-tokens, --compaction-keep-recent

API: the context_management parameter in the Responses API allows a per-request threshold override. The response includes usage.compacted_tokens_saved.

Includes an integration test harness (harness/test_compaction.py) and research documentation in docs/.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
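To make the trigger and the two non-model layers concrete, here is a minimal Python sketch, not the PR's actual implementation. The message shape (`role`/`content` dicts), the `<think>...</think>` markup, the character-based truncation limit, and all function names are assumptions chosen for illustration. Layer 2 is omitted because it needs a model call.

```python
import re
from typing import Callable, Dict, List

Message = Dict[str, str]

# Assumption: thinking blocks are delimited by <think>...</think> tags.
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)


def needs_compaction(prompt_tokens: int, max_ctx: int, threshold: float = 0.9) -> bool:
    """True when the prompt exceeds the configured fraction of the context window."""
    return prompt_tokens > max_ctx * threshold


def edit_compact(messages: List[Message], max_tool_chars: int = 2000) -> List[Message]:
    """Layer 1 sketch: strip thinking blocks, dedupe and truncate tool outputs."""
    seen_tool_outputs = set()
    out: List[Message] = []
    for msg in messages:
        content = msg["content"]
        if msg["role"] == "assistant":
            content = THINK_RE.sub("", content).strip()
        elif msg["role"] == "tool":
            if content in seen_tool_outputs:
                content = "[duplicate tool output omitted]"
            else:
                seen_tool_outputs.add(content)
                if len(content) > max_tool_chars:
                    content = content[:max_tool_chars] + "\n[truncated]"
        out.append({**msg, "content": content})
    return out


def hard_truncate(messages: List[Message], budget: int,
                  count_tokens: Callable[[str], int]) -> List[Message]:
    """Layer 3 sketch: drop the oldest messages until the tail fits the budget."""
    kept = list(messages)
    while len(kept) > 1 and sum(count_tokens(m["content"]) for m in kept) > budget:
        kept.pop(0)
    return kept
```

A caller would typically run `edit_compact` first, re-count tokens, and only fall through to summarization or `hard_truncate` if the prompt is still over the threshold.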
When --prefill-drafter is present and --compaction is enabled, create a dedicated Qwen3Backend from the drafter GGUF for Layer 2 summarization. This avoids tying up the main target model (27B+) for summary generation and is much faster (~0.6B inference for short summaries).

The compaction backend shares the drafter_tokenizer already loaded for pflash. If the backend fails to initialize, compaction falls back gracefully to the main model for summarization.

Also adds --prefill-drafter flag support to harness/test_compaction.py.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
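The backend selection with fallback described in this commit can be sketched as follows. This is an illustrative Python sketch only: `pick_summarizer`, `load_backend`, and the parameter names are hypothetical and do not correspond to the PR's actual code.

```python
from typing import Callable, Optional


def pick_summarizer(main_backend: object,
                    drafter_path: Optional[str],
                    compaction_enabled: bool,
                    load_backend: Callable[[str], object]) -> object:
    """Prefer a dedicated drafter backend; fall back to the main model on failure."""
    if not compaction_enabled or drafter_path is None:
        return main_backend
    try:
        # Hypothetical loader: builds a small backend from the drafter GGUF path.
        return load_backend(drafter_path)
    except Exception:
        # Initialization failed, so summarization runs on the main model instead.
        return main_backend
```

Keeping the fallback inside one function means the rest of the compaction path never has to know which model produced the summary.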
Compaction is now always enabled server-side. Triggering is driven by the client's HTTP request body (the context_management parameter). Added --no-compaction to explicitly disable it if needed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
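A request-driven trigger of this kind could be resolved roughly as below. This is a sketch under stated assumptions: the request shape follows the context_management example in this PR's description, while `resolve_threshold`, the `disabled` flag (standing in for --no-compaction), and the behavior when an entry has no compact_threshold (use a default) are guesses for illustration.

```python
from typing import Any, Dict, Optional


def resolve_threshold(body: Dict[str, Any],
                      default_threshold: int,
                      disabled: bool = False) -> Optional[int]:
    """Return the token threshold for this request, or None if compaction is off."""
    if disabled:
        return None  # assumed effect of --no-compaction
    for entry in body.get("context_management") or []:
        if isinstance(entry, dict) and entry.get("type") == "compaction":
            value = entry.get("compact_threshold")
            # Accept only positive integers; bool is excluded explicitly.
            if isinstance(value, int) and not isinstance(value, bool) and value > 0:
                return value
            return default_threshold  # assumption: missing or invalid value uses the default
    return None  # the client did not ask for compaction
```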
Summary output length is always max_ctx/10 clamped to [256, 2048]. There is no reason for it to be independently configurable; it should always scale with context size.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
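The clamping rule stated in this commit works out as in the short Python sketch below. The function name is hypothetical, and integer division is an assumption since the commit does not say how max_ctx/10 is rounded.

```python
def summary_max_tokens(max_ctx: int) -> int:
    """Summary budget: max_ctx / 10, clamped to the range [256, 2048]."""
    return max(256, min(2048, max_ctx // 10))  # assumption: floor division
```

For example, an 8192-token context gives a budget of 819 tokens, while contexts of 20480 tokens or more all hit the 2048 cap.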
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 29, 2026
Record the 2026-05-28 20:23 cron revalidation: upstream and carried PR heads remain current, draft Luce-Org#304 is excluded, fresh conflicted-PR probes were retained, and a tmux-driven Codex inspection keeps Luce-Org#135 as a designed current-layout port instead of a mechanical merge.
Summary
Server-side context compaction that automatically compresses conversation history when the prompt exceeds the compaction threshold. Uses the prefill-drafter (Qwen3-0.6B) as a dedicated summarization backend, which is fast and doesn't block the main model.
3-Layer Pipeline
Key Design Decisions
Files Changed (9 files, +2162 / -39)
New:
Modified:
API
Request body (Responses API):
{ "context_management": [{"type": "compaction", "compact_threshold": 4096}] }
Response includes:
{ "usage": { "compacted_tokens_saved": 1044 } }
Testing
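As a rough illustration of exercising the request and response fields shown above from the client side, the Python sketch below builds a request body and reads the saved-token count from a response. It is not taken from harness/test_compaction.py. The helper names are hypothetical, and the `input` field is an assumption about the rest of the request body.

```python
from typing import Any, Dict


def build_request(input_text: str, compact_threshold: int) -> Dict[str, Any]:
    """Request body asking the server to compact above the given token count."""
    return {
        "input": input_text,  # assumption: prompt field of the request
        "context_management": [
            {"type": "compaction", "compact_threshold": compact_threshold}
        ],
    }


def tokens_saved(response: Dict[str, Any]) -> int:
    """Read usage.compacted_tokens_saved, treating a missing field as 0."""
    return int(response.get("usage", {}).get("compacted_tokens_saved", 0))
```

A check could then assert that `tokens_saved(response) > 0` for a prompt known to exceed the threshold, and that it is 0 for a short prompt.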