feat(server): passthrough proxy, piecewise keep-ratio curve, query survival check by smpurkis · Pull Request #294 · Luce-Org/lucebox-hub

smpurkis · 2026-05-28T15:58:14Z

What's in this PR

The C++ server can now act as a compression proxy in front of any OpenAI-compatible backend. This PR adds three features to the pflash compression pipeline: a passthrough proxy, a piecewise keep-ratio curve, and a query survival check.

`server/src/server/http_server.cpp` (+319)

Passthrough proxy — when --prefill-upstream-base is configured, the server forwards requests to an upstream backend instead of running local inference. Compressed requests go as raw prompt to /v1/completions (the compressed text already contains chat template markup, so hitting /v1/chat/completions would double-template). Uncompressed requests pass through to /v1/chat/completions with the original body. Response format is rewritten (completions → chat completions) for both streaming (SSE chunk-by-chunk rewrite) and non-streaming paths. Uses libcurl.

Piecewise keep-ratio curve — pflash_keep_ratio() replaces the single fixed --prefill-keep-ratio with linear interpolation over breakpoints. E.g. --prefill-curve 10000:0.5 40000:0.2 100000:0.1 gives 2x compression at 10K tokens, 5x at 40K, 10x at 100K+. Falls back to --prefill-keep-ratio when no curve is set. Bandit per-session override (#264) still takes precedence when session_id is present.

Query survival check — after compression, scans the compressed token IDs from the tail to check what fraction of the last user message survived. If < 80% and the query is under 1000 tokens, re-appends the full query text to the compressed output. The 1000-token cap prevents inflating output on single-message prompts where the entire prompt is the query.
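The re-append decision can be sketched in isolation. This covers only the thresholding step, assuming the survived-token count has already been computed by the tail scan; `QueryAction`, `query_survival_action`, and the constant names are hypothetical and do not come from the PR.

```cpp
#include <cstddef>

// Hypothetical outcome of the survival check.
enum class QueryAction { Keep, Reappend, TooLargeToReappend };

// Thresholds taken from the PR description; the names are illustrative.
constexpr double kMinSurvival = 0.80;             // at least 80% must survive
constexpr std::size_t kMaxReappendTokens = 1000;  // only re-append shorter queries

// `survived` is the number of last-user-message tokens still present after
// compression; `query_tokens` is the length of that message in tokens.
QueryAction query_survival_action(std::size_t survived,
                                  std::size_t query_tokens) {
    if (query_tokens == 0) return QueryAction::Keep;  // nothing to check

    const double frac =
        static_cast<double>(survived) / static_cast<double>(query_tokens);
    if (frac >= kMinSurvival) return QueryAction::Keep;

    // Below threshold: re-append only when the query is short enough.
    if (query_tokens < kMaxReappendTokens) return QueryAction::Reappend;
    return QueryAction::TooLargeToReappend;
}
```

For the benchmark run logged under Evidence (760 of 36494 query tokens surviving), this sketch returns `TooLargeToReappend`, matching the "too large to re-append" log line.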

`server/src/server/server_main.cpp` (+36)

CLI flag parsing for the new features + startup logging:

--prefill-upstream-base URL       Upstream OpenAI-compatible base URL
--prefill-upstream-key KEY        Bearer token for upstream
--prefill-upstream-model MODEL    Model name for forwarded requests
--prefill-curve T:R [T:R ...]    Piecewise keep-ratio curve

`server/src/server/http_server.h` (+9)

ServerConfig: pflash_upstream_base, pflash_upstream_key, pflash_upstream_model, pflash_curve. ParsedRequest: raw_body (preserves original JSON for proxy forwarding).

`server/CMakeLists.txt` (+5/-1)

find_package(CURL REQUIRED), link CURL::libcurl to dflash_server and test_server_unit.

`server/src/qwen3/qwen3_drafter.cpp` (+11/-8)

Auto-detect q35 drafter arch from GGUF filename in the 2-arg load_drafter() overload (checks for "qwen3.5" or "qwen35" in lowercased path).
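The filename check amounts to a case-insensitive substring test. A minimal sketch, assuming nothing about the surrounding `load_drafter()` code; the helper name `looks_like_qwen35` is hypothetical:

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Returns true when the lowercased path contains "qwen3.5" or "qwen35".
bool looks_like_qwen35(const std::string& gguf_path) {
    std::string lower = gguf_path;
    std::transform(lower.begin(), lower.end(), lower.begin(),
                   [](unsigned char c) {
                       return static_cast<char>(std::tolower(c));
                   });
    return lower.find("qwen3.5") != std::string::npos ||
           lower.find("qwen35") != std::string::npos;
}
```

Because the match runs over the whole path, a directory component containing either string would also trigger detection.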

`server/test/test_server_unit.cpp` (+73)

5 new tests: config upstream defaults, curve interpolation (below/at/between/above breakpoints), curve empty fallback, upstream proxy config, raw_body preservation.

Behavior matrix

| Compression triggers | `--prefill-upstream-base` set | Behavior |
| --- | --- | --- |
| Yes | Yes | Compress → forward to upstream `/v1/completions` |
| Yes | No | Compress → local inference (identical to `main`) |
| No | Yes | Passthrough to upstream `/v1/chat/completions` |
| No | No | Local inference (identical to `main`) |

No upstream flags = byte-identical to main.
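The matrix reduces to a two-input routing decision. The sketch below is illustrative only: `Route`, `choose_route`, and `route_path` are hypothetical names, and the real logic in `http_server.cpp` may be structured differently.

```cpp
#include <string>

// Hypothetical routing outcomes corresponding to the behavior matrix.
enum class Route { LocalInference, UpstreamCompletions, UpstreamChatCompletions };

// `compressed` is true when pflash compression triggered for this request.
// `upstream_base` mirrors --prefill-upstream-base (empty when not set).
Route choose_route(bool compressed, const std::string& upstream_base) {
    if (upstream_base.empty()) return Route::LocalInference;
    // Compressed text already carries chat template markup, so it is sent
    // as a raw prompt to avoid double-templating.
    return compressed ? Route::UpstreamCompletions
                      : Route::UpstreamChatCompletions;
}

// Upstream path for a route; empty for local inference.
std::string route_path(Route r) {
    switch (r) {
        case Route::UpstreamCompletions:     return "/v1/completions";
        case Route::UpstreamChatCompletions: return "/v1/chat/completions";
        case Route::LocalInference:          return "";
    }
    return "";
}
```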

Evidence

36K-token coding benchmark (5 novel coding tasks, 28 assertions) against Qwen3.6-35B-A3B via the proxy:

Null (direct):     28/28 (100%)  152s
Proxy (pflash):    15/28 (54%)   129s

Server logs confirm all three features:

[server] pflash curve: 10000:0.500 40000:0.200 100000:0.100
[pflash] query survival: 760/36494 (2%)
[pflash] query below 80% but too large to re-append (36494 tokens)
[pflash] 36506 -> 8570 -> 8564 tokens (23.5% kept)
[pflash-proxy] compressed forward → .../completions  prompt=8564 tokens

Unit tests: 1617 assertions, 0 failures.

cubic-dev-ai

2 issues found across 6 files

…rvival check

Add upstream proxy mode: when --prefill-upstream-base is configured, the server compresses via pflash then forwards to any OpenAI-compatible backend. Compressed requests go as raw prompt to /v1/completions (avoids double-templating); uncompressed requests pass through to /v1/chat/completions. Streaming and non-streaming supported with response format rewriting.

Piecewise keep-ratio curve (--prefill-curve T:R T:R ...) scales compression with context length instead of a single fixed ratio. Falls back to --prefill-keep-ratio when no curve is set. Bandit still overrides per-session.

Query survival check: after compression, verifies the last user message survived (>= 80% of tokens). Re-appends the full query if below threshold and the query is under 1000 tokens (avoids inflating output on single-message prompts).

Also: auto-detect q35 drafter arch from GGUF filename.

New CLI flags:

--prefill-upstream-base URL
--prefill-upstream-key KEY
--prefill-upstream-model MODEL
--prefill-curve TOKENS:RATIO [TOKENS:RATIO ...]

Tested: 36K-token coding benchmark, null=100% vs proxy=79% against Qwen3.6-35B-A3B. Unit tests: 1617 assertions, 0 failures.

Document the post-push re-enumeration that found PR Luce-Org#294 advanced to 48f6962, the conflict resolution used to integrate it, validation, and updated classification.

cubic-dev-ai Bot reviewed May 28, 2026

Comment thread server/src/server/http_server.cpp

Comment thread server/src/server/http_server.cpp Outdated

smpurkis force-pushed the feat/server-passthrough-proxy branch from 0883c2e to 48f6962 on May 29, 2026 08:22

smpurkis wants to merge 1 commit into Luce-Org:main from smpurkis:feat/server-passthrough-proxy
