feat(server): passthrough proxy, piecewise keep-ratio curve, query survival check #294

Open
smpurkis wants to merge 1 commit into Luce-Org:main from smpurkis:feat/server-passthrough-proxy

@smpurkis
Contributor

What's in this PR

The C++ server can now act as a compression proxy in front of any OpenAI-compatible backend. This PR adds three features to the pflash compression pipeline: a passthrough proxy, a piecewise keep-ratio curve, and a query survival check.

server/src/server/http_server.cpp (+319)

Passthrough proxy: when --prefill-upstream-base is configured, the server forwards requests to an upstream backend instead of running local inference. Compressed requests go as a raw prompt to /v1/completions (the compressed text already contains chat template markup, so hitting /v1/chat/completions would apply the template twice). Uncompressed requests pass through to /v1/chat/completions with the original body. The response format is rewritten (completions → chat completions) on both the streaming path (SSE chunk-by-chunk rewrite) and the non-streaming path. The proxy uses libcurl for upstream requests.

Piecewise keep-ratio curve: pflash_keep_ratio() replaces the single fixed --prefill-keep-ratio with linear interpolation over breakpoints. For example, --prefill-curve 10000:0.5 40000:0.2 100000:0.1 gives 2x compression at 10K tokens, 5x at 40K, and 10x at 100K and above. Falls back to --prefill-keep-ratio when no curve is set. The bandit per-session override (#264) still takes precedence when session_id is present.
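The interpolation described above can be sketched as follows. This is a minimal illustration, not the PR's code: the `Curve` alias and the name `keep_ratio_sketch` are hypothetical, the breakpoints are assumed to be sorted by token count, and clamping to the first ratio below the first breakpoint is an assumption (the description only states the behavior at and above the breakpoints).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical breakpoint list: (token count, keep ratio), sorted by token count.
using Curve = std::vector<std::pair<int, double>>;

// Illustrative stand-in for pflash_keep_ratio(); the real signature may differ.
double keep_ratio_sketch(const Curve& curve, int n_tokens, double fallback) {
    if (curve.empty()) return fallback;  // no curve: use the fixed --prefill-keep-ratio
    // Assumption: clamp to the end values outside the breakpoint range.
    if (n_tokens <= curve.front().first) return curve.front().second;
    if (n_tokens >= curve.back().first) return curve.back().second;
    for (std::size_t i = 1; i < curve.size(); ++i) {
        if (n_tokens <= curve[i].first) {
            const int t0 = curve[i - 1].first;
            const int t1 = curve[i].first;
            const double r0 = curve[i - 1].second;
            const double r1 = curve[i].second;
            // Linear interpolation between the two surrounding breakpoints.
            const double frac = static_cast<double>(n_tokens - t0) /
                                static_cast<double>(t1 - t0);
            return r0 + frac * (r1 - r0);
        }
    }
    return curve.back().second;  // not reached for a sorted curve
}
```

With the example curve 10000:0.5 40000:0.2 100000:0.1, a 25,000-token prompt sits halfway between the first two breakpoints, so the sketch returns a keep ratio of 0.35.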

Query survival check: after compression, the server scans the compressed token IDs from the tail to measure what fraction of the last user message survived. If less than 80% survived and the query is under 1000 tokens, it re-appends the full query text to the compressed output. The 1000-token cap prevents inflating the output on single-message prompts, where the entire prompt is the query.
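One way to picture the survival check is the sketch below. The matching strategy here (a backwards, in-order search for each query token) is an assumption chosen for illustration; the PR does not spell out its exact scan. The function names are hypothetical. Only the 80% threshold and the 1000-token cap come from the description.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counts query tokens that can still be found, in order, when walking both
// sequences from the tail. Tokens that were dropped are skipped.
// Assumption: this simple O(n*m) scan stands in for the PR's real check.
std::size_t count_surviving_sketch(const std::vector<int>& compressed,
                                   const std::vector<int>& query) {
    std::size_t matched = 0;
    std::size_t ci = compressed.size();  // exclusive upper bound of the unsearched region
    for (std::size_t qi = query.size(); qi > 0 && ci > 0; --qi) {
        std::size_t k = ci;
        while (k > 0 && compressed[k - 1] != query[qi - 1]) --k;
        if (k > 0) {  // found at index k - 1
            ++matched;
            ci = k - 1;
        }
    }
    return matched;
}

// Re-append only when survival is below 80% and the query is under 1000 tokens.
bool should_reappend_sketch(std::size_t survived, std::size_t query_len) {
    const double kMinSurvival = 0.80;
    const std::size_t kMaxQueryTokens = 1000;
    if (query_len == 0) return false;
    const double frac = static_cast<double>(survived) / static_cast<double>(query_len);
    return frac < kMinSurvival && query_len < kMaxQueryTokens;
}
```

Under this rule a query with 760 of 36494 tokens surviving is not re-appended, because it exceeds the 1000-token cap.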

server/src/server/server_main.cpp (+36)

CLI flag parsing for the new features, plus startup logging:

--prefill-upstream-base URL       Upstream OpenAI-compatible base URL
--prefill-upstream-key KEY        Bearer token for upstream
--prefill-upstream-model MODEL    Model name for forwarded requests
--prefill-curve T:R [T:R ...]    Piecewise keep-ratio curve
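A rough sketch of how the T:R arguments to --prefill-curve might be parsed is shown below. The helper names, the validation ranges (positive token count, ratio in (0, 1]), and the rule of stopping at the first argument that is not a valid T:R pair are all assumptions for illustration, not the PR's actual parser.

```cpp
#include <cassert>
#include <cstddef>
#include <exception>
#include <string>
#include <utility>
#include <vector>

// Parses one "TOKENS:RATIO" argument. Returns false on malformed input.
bool parse_breakpoint_sketch(const std::string& arg, int& tokens, double& ratio) {
    const std::size_t colon = arg.find(':');
    if (colon == std::string::npos || colon == 0 || colon + 1 >= arg.size()) return false;
    const std::string t = arg.substr(0, colon);
    const std::string r = arg.substr(colon + 1);
    try {
        std::size_t used_t = 0, used_r = 0;
        const int tk = std::stoi(t, &used_t);
        const double rt = std::stod(r, &used_r);
        if (used_t != t.size() || used_r != r.size()) return false;  // trailing junk
        if (tk <= 0 || rt <= 0.0 || rt > 1.0) return false;          // assumed valid ranges
        tokens = tk;
        ratio = rt;
        return true;
    } catch (const std::exception&) {
        return false;  // stoi/stod throw on non-numeric or out-of-range input
    }
}

// Collects consecutive T:R arguments starting at args[start].
std::vector<std::pair<int, double>> collect_curve_sketch(
        const std::vector<std::string>& args, std::size_t start) {
    std::vector<std::pair<int, double>> curve;
    for (std::size_t i = start; i < args.size(); ++i) {
        int tokens = 0;
        double ratio = 0.0;
        if (!parse_breakpoint_sketch(args[i], tokens, ratio)) break;  // next flag or bad value
        curve.emplace_back(tokens, ratio);
    }
    return curve;
}
```

Stopping at the first non-matching argument lets the flag take a variable number of values without a terminator, at the cost of silently ending the curve on a typo; a stricter parser could report the error instead.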

server/src/server/http_server.h (+9)

ServerConfig: pflash_upstream_base, pflash_upstream_key, pflash_upstream_model, pflash_curve. ParsedRequest: raw_body (preserves original JSON for proxy forwarding).

server/CMakeLists.txt (+5/-1)

find_package(CURL REQUIRED), link CURL::libcurl to dflash_server and test_server_unit.

server/src/qwen3/qwen3_drafter.cpp (+11/-8)

Auto-detect q35 drafter arch from GGUF filename in the 2-arg load_drafter() overload (checks for "qwen3.5" or "qwen35" in lowercased path).
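The filename check can be sketched as below. The function name and the example paths are hypothetical; only the two substrings and the lowercasing step come from the description.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Returns true when the lowercased path contains "qwen3.5" or "qwen35".
bool looks_like_q35_sketch(const std::string& gguf_path) {
    std::string lower = gguf_path;
    std::transform(lower.begin(), lower.end(), lower.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return lower.find("qwen3.5") != std::string::npos ||
           lower.find("qwen35") != std::string::npos;
}
```

Because the match runs over the whole path, a directory name containing either substring would also trigger detection.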

server/test/test_server_unit.cpp (+73)

5 new tests: config upstream defaults, curve interpolation (below/at/between/above breakpoints), curve empty fallback, upstream proxy config, raw_body preservation.

Behavior matrix

| Compression triggers | --prefill-upstream-base set | Behavior |
| --- | --- | --- |
| Yes | Yes | Compress → forward to upstream /v1/completions |
| Yes | No | Compress → local inference (identical to main) |
| No | Yes | Passthrough to upstream /v1/chat/completions |
| No | No | Local inference (identical to main) |

With no upstream flags set, behavior is byte-identical to main.
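The four cases of the behavior matrix reduce to a two-input decision, sketched here. The `Route` enum and both function names are hypothetical and exist only to illustrate the routing; the endpoint paths are the ones named in the description.

```cpp
#include <cassert>
#include <string>

// Hypothetical labels for the four rows of the behavior matrix.
enum class Route { CompressThenUpstream, CompressThenLocal, PassthroughUpstream, Local };

Route choose_route_sketch(bool compression_triggers, bool upstream_set) {
    if (compression_triggers) {
        return upstream_set ? Route::CompressThenUpstream : Route::CompressThenLocal;
    }
    return upstream_set ? Route::PassthroughUpstream : Route::Local;
}

// Upstream endpoint for a route; empty for local inference.
std::string upstream_path_sketch(Route r) {
    switch (r) {
        case Route::CompressThenUpstream: return "/v1/completions";       // raw prompt
        case Route::PassthroughUpstream:  return "/v1/chat/completions";  // original body
        default:                          return "";
    }
}
```

Compressed requests are sent to the completions endpoint because the compressed text already carries chat template markup.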

Evidence

36K-token coding benchmark (5 novel coding tasks, 28 assertions) against Qwen3.6-35B-A3B via the proxy:

Null (direct):     28/28 (100%)  152s
Proxy (pflash):    15/28 (54%)   129s

Server logs confirm all three features:

[server] pflash curve: 10000:0.500 40000:0.200 100000:0.100
[pflash] query survival: 760/36494 (2%)
[pflash] query below 80% but too large to re-append (36494 tokens)
[pflash] 36506 -> 8570 -> 8564 tokens (23.5% kept)
[pflash-proxy] compressed forward → .../completions  prompt=8564 tokens

Unit tests: 1617 assertions, 0 failures.

Contributor

@cubic-dev-ai (bot) left a comment

2 issues found across 6 files

Comment thread server/src/server/http_server.cpp
Comment thread server/src/server/http_server.cpp Outdated
…rvival check

Add upstream proxy mode: when --prefill-upstream-base is configured, the
server compresses via pflash then forwards to any OpenAI-compatible backend.
Compressed requests go as raw prompt to /v1/completions (avoids double-
templating); uncompressed requests pass through to /v1/chat/completions.
Streaming and non-streaming supported with response format rewriting.

Piecewise keep-ratio curve (--prefill-curve T:R T:R ...) scales compression
with context length instead of a single fixed ratio. Falls back to
--prefill-keep-ratio when no curve is set. Bandit still overrides per-session.

Query survival check: after compression, verifies the last user message
survived (>= 80% of tokens). Re-appends the full query if below threshold
and the query is under 1000 tokens (avoids inflating output on single-
message prompts).

Also: auto-detect q35 drafter arch from GGUF filename.

New CLI flags:
  --prefill-upstream-base URL
  --prefill-upstream-key KEY
  --prefill-upstream-model MODEL
  --prefill-curve TOKENS:RATIO [TOKENS:RATIO ...]

Tested: 36K-token coding benchmark, null=100% vs proxy=79% against
Qwen3.6-35B-A3B. Unit tests: 1617 assertions, 0 failures.
@smpurkis force-pushed the feat/server-passthrough-proxy branch from 0883c2e to 48f6962 on May 29, 2026 08:22
easel pushed a commit to easel/lucebox-hub that referenced this pull request on May 29, 2026
Document the post-push re-enumeration that found PR Luce-Org#294 advanced to 48f6962, the conflict resolution used to integrate it, validation, and updated classification.