feat(server): passthrough proxy, piecewise keep-ratio curve, query survival check #294
Open
smpurkis wants to merge 1 commit into
2 issues found across 6 files
…rvival check

Add upstream proxy mode: when --prefill-upstream-base is configured, the server compresses via pflash then forwards to any OpenAI-compatible backend. Compressed requests go as raw prompt to /v1/completions (avoids double-templating); uncompressed requests pass through to /v1/chat/completions. Streaming and non-streaming supported with response format rewriting.

Piecewise keep-ratio curve (--prefill-curve T:R T:R ...) scales compression with context length instead of a single fixed ratio. Falls back to --prefill-keep-ratio when no curve is set. Bandit still overrides per-session.

Query survival check: after compression, verifies the last user message survived (>= 80% of tokens). Re-appends the full query if below threshold and the query is under 1000 tokens (avoids inflating output on single-message prompts).

Also: auto-detect q35 drafter arch from GGUF filename.

New CLI flags:
  --prefill-upstream-base URL
  --prefill-upstream-key KEY
  --prefill-upstream-model MODEL
  --prefill-curve TOKENS:RATIO [TOKENS:RATIO ...]

Tested: 36K-token coding benchmark, null=100% vs proxy=79% against Qwen3.6-35B-A3B.
Unit tests: 1617 assertions, 0 failures.
0883c2e to 48f6962
easel pushed a commit to easel/lucebox-hub that referenced this pull request on May 29, 2026:
Document the post-push re-enumeration that found PR Luce-Org#294 advanced to 48f6962, the conflict resolution used to integrate it, validation, and updated classification.
What's in this PR
The C++ server can now act as a compression proxy in front of any OpenAI-compatible backend. This PR adds three features to the pflash compression pipeline.
server/src/server/http_server.cpp (+319)

Passthrough proxy: when --prefill-upstream-base is configured, the server forwards requests to an upstream backend instead of running local inference. Compressed requests go as raw prompt to /v1/completions (the compressed text already contains chat template markup, so hitting /v1/chat/completions would double-template). Uncompressed requests pass through to /v1/chat/completions with the original body. Response format is rewritten (completions → chat completions) for both streaming (SSE chunk-by-chunk rewrite) and non-streaming paths. Uses libcurl.

Piecewise keep-ratio curve: pflash_keep_ratio() replaces the single fixed --prefill-keep-ratio with linear interpolation over breakpoints. E.g. --prefill-curve 10000:0.5 40000:0.2 100000:0.1 gives 2x compression at 10K tokens, 5x at 40K, 10x at 100K+. Falls back to --prefill-keep-ratio when no curve is set. Bandit per-session override (#264) still takes precedence when session_id is present.

Query survival check: after compression, scans the compressed token IDs from the tail to check what fraction of the last user message survived. If < 80% and the query is under 1000 tokens, re-appends the full query text to the compressed output. The 1000-token cap prevents inflating output on single-message prompts where the entire prompt is the query.
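The query survival check can be sketched roughly as follows. This is not the code from http_server.cpp: the function names survival_fraction and should_reappend_query are hypothetical, and the matching rule (greedy in-order matching while walking both token sequences backwards from the tail) is an assumption about what "scans from the tail" means. Only the 80% threshold and the 1000-token cap come from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: fraction of `query` tokens that can be matched, in
// order, while walking both sequences backwards from their tails.
// The matching rule is an illustrative assumption, not the PR's algorithm.
double survival_fraction(const std::vector<int>& compressed,
                         const std::vector<int>& query) {
    if (query.empty()) return 1.0;
    std::size_t matched = 0;
    std::size_t limit = compressed.size();  // only search left of this index
    for (std::size_t q = query.size(); q > 0; --q) {
        // Scan leftwards in the compressed tokens for query[q - 1].
        std::size_t probe = limit;
        while (probe > 0 && compressed[probe - 1] != query[q - 1]) --probe;
        if (probe > 0) {        // found at index probe - 1
            ++matched;
            limit = probe - 1;  // later matches must lie further left
        }
        // Not found: the token counts as dropped and `limit` is unchanged.
    }
    assert(matched <= query.size());
    return static_cast<double>(matched) / static_cast<double>(query.size());
}

// Decision rule with the thresholds stated in the PR description:
// re-append when survival is below 80% and the query is under 1000 tokens.
bool should_reappend_query(const std::vector<int>& compressed,
                           const std::vector<int>& query) {
    constexpr double kMinSurvival = 0.80;
    constexpr std::size_t kMaxQueryTokens = 1000;
    return survival_fraction(compressed, query) < kMinSurvival &&
           query.size() < kMaxQueryTokens;
}
```

With this rule, a long single-message prompt that was heavily compressed is left alone (its query exceeds the cap), while a short trailing question that lost most of its tokens is re-appended.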
server/src/server/server_main.cpp (+36)
CLI flag parsing for the new features + startup logging:
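To illustrate the --prefill-curve flag, here is a minimal sketch of parsing TOKENS:RATIO arguments and evaluating the resulting curve by linear interpolation. The names Curve, parse_breakpoint, parse_curve and keep_ratio_sketch are hypothetical; the real pflash_keep_ratio() signature is not shown in this PR text. Clamping to the first ratio below the first breakpoint is an assumption, while holding the last ratio above the final breakpoint matches the "10x at 100K+" example.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// (token count, keep ratio) breakpoints, expected in ascending token order.
using Curve = std::vector<std::pair<int, double>>;

// Parse one "TOKENS:RATIO" argument such as "10000:0.5".
std::pair<int, double> parse_breakpoint(const std::string& arg) {
    const std::size_t colon = arg.find(':');
    if (colon == std::string::npos)
        throw std::invalid_argument("expected TOKENS:RATIO, got " + arg);
    return {std::stoi(arg.substr(0, colon)), std::stod(arg.substr(colon + 1))};
}

// Parse every argument that followed --prefill-curve on the command line.
Curve parse_curve(const std::vector<std::string>& args) {
    Curve curve;
    for (const auto& a : args) curve.push_back(parse_breakpoint(a));
    return curve;
}

// Hypothetical stand-in for pflash_keep_ratio(): `fallback` plays the role
// of --prefill-keep-ratio when no curve is configured.
double keep_ratio_sketch(const Curve& curve, int n_tokens, double fallback) {
    if (curve.empty()) return fallback;
    // Assumption: clamp to the end ratios outside the breakpoint range.
    if (n_tokens <= curve.front().first) return curve.front().second;
    if (n_tokens >= curve.back().first) return curve.back().second;
    for (std::size_t i = 1; i < curve.size(); ++i) {
        const int t0 = curve[i - 1].first, t1 = curve[i].first;
        const double r0 = curve[i - 1].second, r1 = curve[i].second;
        assert(t0 < t1 && "breakpoints must be strictly ascending");
        if (n_tokens <= t1) {
            // Linear interpolation between the two surrounding breakpoints.
            const double frac = static_cast<double>(n_tokens - t0) /
                                static_cast<double>(t1 - t0);
            return r0 + frac * (r1 - r0);
        }
    }
    return curve.back().second;  // not reached for an ascending curve
}
```

For the curve 10000:0.5 40000:0.2 100000:0.1, a 25,000-token prompt sits halfway between the first two breakpoints, so the interpolated keep ratio is 0.5 + 0.5 × (0.2 − 0.5) = 0.35.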
server/src/server/http_server.h (+9)
ServerConfig: pflash_upstream_base, pflash_upstream_key, pflash_upstream_model, pflash_curve. ParsedRequest: raw_body (preserves original JSON for proxy forwarding).

server/CMakeLists.txt (+5/-1)
find_package(CURL REQUIRED), link CURL::libcurl to dflash_server and test_server_unit.

server/src/qwen3/qwen3_drafter.cpp (+11/-8)
Auto-detect q35 drafter arch from GGUF filename in the 2-arg load_drafter() overload (checks for "qwen3.5" or "qwen35" in lowercased path).

server/test/test_server_unit.cpp (+73)
5 new tests: config upstream defaults, curve interpolation (below/at/between/above breakpoints), curve empty fallback, upstream proxy config, raw_body preservation.

Behavior matrix
--prefill-upstream-base set: compressed requests → /v1/completions; uncompressed requests → /v1/chat/completions.
No upstream flags = byte-identical to main.

Evidence
36K-token coding benchmark (5 novel coding tasks, 28 assertions) against Qwen3.6-35B-A3B via the proxy:
Server logs confirm all three features:
Unit tests: 1617 assertions, 0 failures.