feat(server): add native mixed-backend draft placement #246
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,45 @@ | ||
| // Standalone DFlash draft IPC daemon entry point. | ||
|
Contributor
This is not the proper place for this binary's main. Put it in the src/ipc folder.
Collaborator
Author
Agreed, let me move it.
||
|
|
||
| #include "dflash_draft_ipc.h" | ||
|
|
||
| #include <algorithm> | ||
| #include <cstdio> | ||
| #include <cstdlib> | ||
| #include <cstring> | ||
|
|
||
| using namespace dflash::common; | ||
|
|
||
| int main(int argc, char ** argv) { | ||
| if (argc < 3 || std::strcmp(argv[1], "--draft-ipc-daemon") != 0) { | ||
| std::fprintf(stderr, | ||
| "usage: %s --draft-ipc-daemon <draft.safetensors|draft.gguf> " | ||
| "--ring-cap=N --stream-fd=FD [--draft-gpu=N]\n", | ||
| argv[0]); | ||
| return 2; | ||
| } | ||
|
|
||
| const char * draft_path = argv[2]; | ||
| int ring_cap = 4096; | ||
| int draft_gpu = 0; | ||
| int stream_fd = -1; | ||
| for (int i = 3; i < argc; i++) { | ||
| if (std::strncmp(argv[i], "--ring-cap=", 11) == 0) { | ||
| ring_cap = std::atoi(argv[i] + 11); | ||
| } else if (std::strcmp(argv[i], "--ring-cap") == 0) { | ||
| if (i + 1 < argc) ring_cap = std::atoi(argv[++i]); | ||
| } else if (std::strncmp(argv[i], "--draft-gpu=", 12) == 0) { | ||
| draft_gpu = std::max(0, std::atoi(argv[i] + 12)); | ||
| } else if (std::strcmp(argv[i], "--draft-gpu") == 0) { | ||
| if (i + 1 < argc) draft_gpu = std::max(0, std::atoi(argv[++i])); | ||
| } else if (std::strncmp(argv[i], "--stream-fd=", 12) == 0) { | ||
| stream_fd = std::atoi(argv[i] + 12); | ||
| } else if (std::strcmp(argv[i], "--stream-fd") == 0) { | ||
| if (i + 1 < argc) stream_fd = std::atoi(argv[++i]); | ||
|
Contributor
Why not use mmap? A stream fd seems slow for sending big chunks of data.
Collaborator
Author
The stream fd is only used for small control/status messages, not for the large feature payload. The current first-pass transport writes feature/noise tensors to temporary files and sends their paths over the control channel. I agree that mmap/shared memory is the better next step for reducing host-copy and filesystem overhead; I'll add it to the follow-up VRAM optimization plan.
||
| } else { | ||
| std::fprintf(stderr, "[draft-ipc-daemon] unknown option: %s\n", argv[i]); | ||
| return 2; | ||
| } | ||
| } | ||
|
|
||
| return run_dflash_draft_ipc_daemon(draft_path, ring_cap, draft_gpu, stream_fd); | ||
| } | ||
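The review thread on `--stream-fd` above names mmap/shared memory as a possible follow-up to the temp-file transport. The following is a minimal sketch of that idea, assuming a POSIX system and a flat `float` payload. The helper names `write_tensor_mmap` and `read_tensor_mmap`, the `/tmp/draft_feat_XXXXXX` template, and the payload layout are all hypothetical and do not come from the PR.

```cpp
// Hypothetical sketch: pass a float tensor between two processes through a
// memory-mapped temporary file. None of these helper names come from the PR.
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <string>
#include <vector>

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Writer side: create a temp file, size it, and copy the payload in through
// a shared mapping. Returns the file path, or an empty string on failure.
static std::string write_tensor_mmap(const std::vector<float> & data) {
    if (data.empty()) return {};  // mmap rejects zero-length mappings
    char path[] = "/tmp/draft_feat_XXXXXX";  // assumed location
    int fd = mkstemp(path);
    if (fd < 0) return {};
    const size_t bytes = data.size() * sizeof(float);
    if (ftruncate(fd, (off_t) bytes) != 0) {
        close(fd);
        unlink(path);
        return {};
    }
    void * p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (p == MAP_FAILED) {
        unlink(path);
        return {};
    }
    std::memcpy(p, data.data(), bytes);
    munmap(p, bytes);
    return path;
}

// Reader side: map the file read-only and copy out n floats. A real consumer
// could keep the mapping alive instead of copying, to avoid the extra copy.
static bool read_tensor_mmap(const std::string & path, size_t n,
                             std::vector<float> & out) {
    if (n == 0) return false;
    int fd = open(path.c_str(), O_RDONLY);
    if (fd < 0) return false;
    const size_t bytes = n * sizeof(float);
    struct stat st;
    if (fstat(fd, &st) != 0 || (size_t) st.st_size < bytes) {
        close(fd);
        return false;  // file is shorter than the caller expects
    }
    void * p = mmap(nullptr, bytes, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);
    if (p == MAP_FAILED) return false;
    const float * src = static_cast<const float *>(p);
    out.assign(src, src + n);
    munmap(p, bytes);
    return true;
}
```

In this sketch the path would still travel over the control channel, as the author describes for the current transport; only the bulk read and write change from explicit file I/O to a mapping. The size check on the reader side guards against touching pages past the end of the file.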
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,18 @@ | ||
| // Remote draft execution configuration for mixed-backend target/draft placement. | ||
|
|
||
| #pragma once | ||
|
|
||
| #include <string> | ||
|
|
||
| namespace dflash::common { | ||
|
|
||
| struct RemoteDraftConfig { | ||
| std::string ipc_bin; | ||
| std::string work_dir; | ||
| int ring_cap = 0; | ||
|
|
||
| bool enabled() const { return !ipc_bin.empty(); } | ||
| bool has_aux_options() const { return !work_dir.empty() || ring_cap > 0; } | ||
| }; | ||
|
|
||
| } // namespace dflash::common |
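As an illustration of how the two predicates on `RemoteDraftConfig` might be combined, here is a small sketch. The struct is copied from the diff above so the example compiles on its own. The helper `check_remote_draft_config` and the rule it enforces (auxiliary options are rejected unless `ipc_bin` is set) are assumptions for illustration only; the PR does not show how the server uses these methods.

```cpp
#include <string>

// Copied from the diff above so that the sketch is self-contained.
namespace dflash::common {

struct RemoteDraftConfig {
    std::string ipc_bin;
    std::string work_dir;
    int         ring_cap = 0;

    bool enabled() const { return !ipc_bin.empty(); }
    bool has_aux_options() const { return !work_dir.empty() || ring_cap > 0; }
};

} // namespace dflash::common

// Hypothetical helper, not part of the PR: returns an error message when
// auxiliary options are set but no IPC binary is configured, else nullptr.
static const char * check_remote_draft_config(
        const dflash::common::RemoteDraftConfig & cfg) {
    if (!cfg.enabled() && cfg.has_aux_options()) {
        return "work_dir/ring_cap given without ipc_bin";
    }
    return nullptr;
}
```

Under this assumed rule, a default-constructed config passes the check, a config with only `ring_cap` set fails it, and setting `ipc_bin` makes it pass again.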
Why can't gemma4 support this?
I'll add support in another PR once the gemma4 and other model backend features are fixed and merged into main.