Add LLM Auto Context Compaction by howard0su · Pull Request #304 · Luce-Org/lucebox-hub

howard0su · 2026-05-29T00:15:11Z

Summary

Server-side context compaction that automatically compresses conversation history when prompts overflow the context window. Uses the prefill-drafter (Qwen3-0.6B) as a dedicated summarization backend — fast and doesn't block the main model.

3-Layer Pipeline

Layer	Strategy	Cost
1	Edit compaction — strip blocks, truncate/dedupe tool outputs	CPU, <1ms
2	Self-summarization — use prefill drafter model condenses older history into summary	GPU (0.6B), ~1-3s
3	Hard truncation — progressive tail-keeping as last resort	CPU

Key Design Decisions

Enabled by default — triggered by client context_management in request body
Prefill-drafter reused — when --prefill-drafter is present, Qwen3-0.6B handles summarization (avoids tying up the 27B+ target)
Summary length scales with max_ctx — max_ctx / 10, clamped [256, 2048]. No separate knob needed.
Stateless — no per-session state; compaction operates on the provided messages array only

Files Changed (9 files, +2162 / -39)

New:

server/src/server/compaction.h/.cpp — compaction logic
harness/test_compaction.py — integration tests
docs/llm-context-compaction.md — research report (38KB)
docs/small-model-compression.md — small model research (16KB)

Modified:

server/src/server/http_server.h/.cpp — threshold check in route_request(), pipeline in worker_loop(), usage reporting
server/src/server/server_main.cpp — CLI flags, compaction backend creation
server/CMakeLists.txt — added source file

API

Request body (Responses API):

 { "context_management": [{"type": "compaction", "compact_threshold": 4096}] }

Response includes:

 { "usage": { "compacted_tokens_saved": 1044 } }

Testing

 python harness/test_compaction.py --server-bin build/dflash_server \
   --model <target.gguf> --prefill-drafter server/models/Qwen3-0.6B-BF16.gguf

Implement server-side context compaction triggered when prompt tokens exceed a configurable threshold (default 90% of max_ctx): - Layer 1: Edit compaction (strip thinking blocks, truncate/dedupe tool outputs) — CPU only, <1ms - Layer 2: Self-summarization via internal generate() pass — condenses older conversation history into a concise summary - Layer 3: Hard truncation — progressive tail-keeping as last resort New CLI flags: --compaction, --compaction-threshold, --compaction-max-tokens, --compaction-keep-recent API: context_management parameter in Responses API allows per-request threshold override. Response includes usage.compacted_tokens_saved. Includes integration test harness (harness/test_compaction.py) and research documentation in docs/. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

When --prefill-drafter is present and --compaction is enabled, create a dedicated Qwen3Backend from the drafter GGUF for Layer 2 summarization. This avoids tying up the main target model (27B+) for summary generation and is much faster (~0.6B inference for short summaries). The compaction backend shares the drafter_tokenizer already loaded for pflash. If the backend fails to initialize, falls back gracefully to using the main model for summarization. Also adds --prefill-drafter flag support to harness/test_compaction.py. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Compaction is now always enabled server-side. Triggering is driven by client HTTP request body (context_management parameter). Added --no-compaction to explicitly disable if needed. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Summary output length is always max_ctx/10 clamped to [256, 2048]. No reason for it to be independently configurable — it should always scale with context size. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Record the 2026-05-28 20:23 cron revalidation: upstream and carried PR heads remain current, draft Luce-Org#304 is excluded, fresh conflicted-PR probes were retained, and a tmux-driven Codex inspection keeps Luce-Org#135 as a designed current-layout port instead of a mechanical merge.

howard0su and others added 4 commits May 29, 2026 07:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add LLM Auto Context Compaction#304

Add LLM Auto Context Compaction#304
howard0su wants to merge 4 commits into
Luce-Org:mainfrom
howard0su:auto_compact

howard0su commented May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

howard0su commented May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant