backend fallback draft #50
# Native Framework Processing Fallback

**Status**: Draft

**Authors**: Alec, in consultation with Ryan and Graham

**Category**: Architecture

**Sponsor**: Alec

**Required Reviewers**: Ryan McCormick, Itay Neeman, Neelay Shah

**Review Date**: [Date for review]

**Pull Request**: [Link to Pull Request of the Proposal itself]

**Implementation PR / Tracking Issue**: [Link to Pull Request or Tracking Issue for Implementation]
# Summary

# Motivation

Dynamo's Rust-based frontend delivers excellent performance, but it has recently been a source of friction for users. The frontend implements the pre- and post-processing pipelines for requests (tokenization, chat-template application, reasoning parsers, etc.), so when a particular piece of functionality is missing for some model, adding it requires work in Dynamo, even if the functionality is already available in the underlying engine.
## Goals

Allow the user to select between dynamo processing and native engine processing with a command-line flag. The default will be the fast path of dynamo-based processing, with the option to add one of the following flags:

    --native-framework-processing
    --vllm-processing
    --sglang-processing
    --trtllm-processing
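
The exact wiring is still open; as a minimal sketch (argparse shown purely for illustration, the real launcher and flag spelling may differ), the flags could be made mutually exclusive, with dynamo processing as the default when none is given:

```python
import argparse

# Sketch only: hypothetical flag parsing; not the actual Dynamo launcher code.
parser = argparse.ArgumentParser()
group = parser.add_mutually_exclusive_group()
group.add_argument("--native-framework-processing", action="store_true")
group.add_argument("--vllm-processing", action="store_true")
group.add_argument("--sglang-processing", action="store_true")
group.add_argument("--trtllm-processing", action="store_true")
flags = parser.parse_args()

# No flag given: keep the default fast path (Rust/dynamo-based processing).
use_native_processing = any(
    [flags.native_framework_processing, flags.vllm_processing,
     flags.sglang_processing, flags.trtllm_processing]
)
```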
## Requirements

REQ 1: User SHOULD be able to select between dynamo processing and the backend engine processing.

REQ 2: Preprocessing MUST happen on the frontend.

**Review comment:** What does 'frontend' refer to?

- To enable KV Cache Routing, we need to have tokens on the frontend in order to calculate hashes to match against the indexer's hashes, which are created from KV events emitted from the backend (see the sketch below).
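
For illustration only, frontend-side block hashing might look roughly like the following; the block size and hashing scheme are assumptions for this sketch, not Dynamo's actual KV-indexer scheme:

```python
import hashlib
from typing import List

def block_hashes(token_ids: List[int], block_size: int = 64) -> List[str]:
    """Hash fixed-size token blocks, chaining each hash over the full preceding prefix."""
    hashes = []
    prefix = b""
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        block = token_ids[start:start + block_size]
        prefix += b"".join(t.to_bytes(4, "little") for t in block)
        hashes.append(hashlib.sha256(prefix).hexdigest())
    return hashes
```

The frontend would compare these hashes against the indexer's hashes built from backend KV events to pick a worker with matching cached prefixes.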
REQ 3: Post-processing SHOULD move to the backend engine.

- Post-processing is currently done on the engine side; however, the intention was to co-locate it with the backend, where there are many CPUs sitting idle alongside the GPUs.

**Review comment:** Is this meant to be "currently done on the frontend side"?

**Author:** Woops, yes.

- This also means that post-processing scales along with the number of engine instances.
# Proposal

## Pre-Processing

### Current Architecture

The Rust preprocessor handles: prompt templating → tokenization → parameter extraction

```
NvCreateChatCompletionRequest → (Rust) → PreprocessedRequest {token_ids, sampling_options, ...}
```

### Proposed: Python Tokenizer Adapter

**Goal**: Use vLLM's native chat templates/tokenization while keeping parameter extraction in Rust.

**Boundary**: Only pass `(messages, model, tools)` → receive `token_ids`

#### Python Interface
```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Optional


class TokenizerProtocol(ABC):
    @abstractmethod
    def tokenize(
        self,
        messages: List[Dict[str, Any]],
        model: str,
        tools: Optional[List[Dict]] = None,
    ) -> List[int]:
        """Combine chat template + tokenization in one call."""
        ...


class VllmTokenizer(TokenizerProtocol):
    def __init__(self, tokenizer):
        # Hugging Face-style tokenizer (see the factory under Initialization below).
        self.tokenizer = tokenizer

    def tokenize(self, messages, model, tools=None):
        prompt = self.tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True, tools=tools
        )
        return self.tokenizer.encode(prompt)
```

**Review comment** (on `class VllmTokenizer`): Would it conceptually allow the sglang tokenizer as well?

**Author:** Yes, I was just providing this as a first example.

**Review comment:** Where this implementation lives, you will need to import vLLM and all its tokenization utilities, correct?
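
As a quick usage sketch (the model name and messages below are placeholders for illustration, not part of the proposal):

```python
from transformers import AutoTokenizer

# Hypothetical example values.
hf_tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
tokenizer = VllmTokenizer(hf_tokenizer)

token_ids = tokenizer.tokenize(
    messages=[{"role": "user", "content": "Hello!"}],
    model="Qwen/Qwen2.5-0.5B-Instruct",
)
print(len(token_ids))
```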
#### Rust Adapter

```rust
pub trait TokenizerAdapter: Send + Sync {
    fn tokenize(
        &self,
        messages: &[Message],
        model: &str,
        tools: Option<&[Tool]>,
    ) -> anyhow::Result<Vec<u32>>;
}

pub struct PythonTokenizerAdapter {
    py_tokenizer: Py<PyAny>,
}

impl TokenizerAdapter for PythonTokenizerAdapter {
    fn tokenize(&self, messages: &[Message], model: &str, tools: Option<&[Tool]>)
        -> anyhow::Result<Vec<u32>>
    {
        Python::with_gil(|py| {
            let result = self.py_tokenizer.call_method1(
                py, "tokenize",
                (messages_to_pylist(py, messages)?, model, tools.map(|t| tools_to_pylist(py, t)).transpose()?)
            )?;
            Ok(result.extract(py)?)
        })
    }
}

// Helper functions convert Rust structs to Python dicts/lists
```
#### Integration

```rust
impl OpenAIPreprocessor {
    pub async fn preprocess_with_adapter(
        &self,
        request: &NvCreateChatCompletionRequest,
        tokenizer: &dyn TokenizerAdapter,
    ) -> Result<PreprocessedRequest> {
        let mut builder = self.builder(request)?;   // Rust: extract params
        let token_ids = tokenizer.tokenize(...)?;   // Python: tokenize
        builder.token_ids(token_ids);
        self.gather_multi_modal_data(request, &mut builder).await?;
        builder.build()
    }
}
```

#### Initialization
```rust
// Rust passes tokenizer path to Python factory
let tokenizer_path = card.get_tokenizer_path()?;

let adapter = Python::with_gil(|py| {
    let py_tokenizer = factory.call1(py, (tokenizer_path,))?;
    Ok(Arc::new(PythonTokenizerAdapter::new(py_tokenizer)))
})?;
```

**Review comment:** By passing only the path here, we assume the factory can figure out the type of tokenizer from that path?

**Review comment:** Should there be a protocol for the factory too?
```python
import os

from transformers import AutoTokenizer

# Python factory creates tokenizer from path
def create_vllm_tokenizer(tokenizer_path: str) -> TokenizerProtocol:
    tokenizer = AutoTokenizer.from_pretrained(os.path.dirname(tokenizer_path))
    return VllmTokenizer(tokenizer)


# Main
tokenizer_factory = create_vllm_tokenizer if flags.vllm_processing else None
await run_input(runtime, mode, engine, tokenizer_factory)
```
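
To address the review question about generalizing beyond vLLM, one illustrative option (names such as `create_sglang_tokenizer` and `create_trtllm_tokenizer` are hypothetical placeholders, not existing APIs) is a small registry keyed by the selected processing flag:

```python
# Sketch only: map the proposed CLI flags to tokenizer factories.
TOKENIZER_FACTORIES = {
    "vllm_processing": create_vllm_tokenizer,
    # "sglang_processing": create_sglang_tokenizer,   # hypothetical
    # "trtllm_processing": create_trtllm_tokenizer,   # hypothetical
}

def select_tokenizer_factory(flags):
    """Return the factory for the first enabled flag, or None for the default Rust path."""
    for flag_name, factory in TOKENIZER_FACTORIES.items():
        if getattr(flags, flag_name, False):
            return factory
    return None
```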
## Post-Processing

### Current Architecture

**Frontend**: Post-processing (detokenization, tool calling) happens in Rust via `backend.backward_edge()` and `preprocessor_op.backward_edge()`

**Backend** (`serve_endpoint`): Minimal - just wraps `PythonAsyncEngine` in ingress handler

```rust
// Frontend pipeline includes Rust post-processing
.link(service_backend)?
.link(backend.backward_edge())?          // Detokenization in Rust
.link(preprocessor_op.backward_edge())?  // Tool calling in Rust
```
### Two Processing Modes

#### Mode 1: Rust Post-Processing on Backend (Proposed)

Move Rust detokenization/tool calling to backend servers (better CPU utilization):

```rust
// Backend serve_endpoint builds larger pipeline
let pipeline = frontend
    .link(tool_calling.forward_edge())?   // passthrough
    .link(backend.forward_edge())?        // passthrough
    .link(service_backend)?               // PythonAsyncEngine
    .link(backend.backward_edge())?       // Detokenization in Rust (on backend)
    .link(tool_calling.backward_edge())?  // Tool calling in Rust (on backend)
    .link(frontend)?
```

**Benefits**: Offloads CPU work to backend servers, scales with engine instances
#### Mode 2: vLLM Native Processing (`--vllm-processing`)

**Review comment:** Will this conceptually work with SGL and vLLM paths as well?

**Author:** Yes.

**Review comment:** Can we think of this in terms of a "hello world backend" example, CLI args and all, and see if it generalizes to vllm, sglang, trtllm, etc., each implementing the "interface" we define out of the box for users of our officially supported backends? For example, maybe any backend would accept args like

**Author:** Yeah, I wasn't sure of the best UX, so I just tried to write something down. Open to what you are suggesting as well.
Let vLLM handle detokenization and tool calling internally:

**Python Handler:**
```python
# Enable vLLM's native processing
sampling_params.detokenize = True

# vLLM returns text + tool calls
yield postprocessor.process_tokens(
    token_ids=output.token_ids,
    delta_text=output.text,  # Already detokenized
    ...
)
```
**VllmPostProcessor** uses vLLM's tool parsers:
```python
tool_parser = ToolParserManager.get_tool_parser(tool_parser_name)(tokenizer)
tool_parser.extract_tool_calls_streaming(...)
```
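
A rough sketch of how those two calls could be combined on the backend is below. `VllmPostProcessor` and `process_tokens` are names from this proposal rather than an existing Dynamo API, and the parser call signature reflects recent vLLM versions (the import path and signature may differ in the version in use):

```python
from vllm.entrypoints.openai.tool_parsers import ToolParserManager

class VllmPostProcessor:
    """Sketch: accumulate streamed text/tokens and delegate tool-call parsing to vLLM."""

    def __init__(self, tokenizer, tool_parser_name, request):
        self.request = request            # the originating chat completion request
        self.previous_text = ""
        self.previous_token_ids = []
        self.tool_parser = ToolParserManager.get_tool_parser(tool_parser_name)(tokenizer)

    def process_tokens(self, token_ids, delta_text, **kwargs):
        current_text = self.previous_text + delta_text
        current_token_ids = self.previous_token_ids + list(token_ids)
        delta = self.tool_parser.extract_tool_calls_streaming(
            previous_text=self.previous_text,
            current_text=current_text,
            delta_text=delta_text,
            previous_token_ids=self.previous_token_ids,
            current_token_ids=current_token_ids,
            delta_token_ids=list(token_ids),
            request=self.request,
        )
        self.previous_text = current_text
        self.previous_token_ids = current_token_ids
        return delta
```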
**Frontend Pipeline** skips Rust post-processing:
```rust
// Note: backend.backward_edge() and preprocessor_op.backward_edge() removed
let engine = frontend
    .link(preprocessor_op.forward_edge())?
    .link(service_backend)?
    .link(frontend)?
```

**Review comment:** Suggestion: let's version it, e.g. `V.Alpha`.