Add NVIDIA runtime/operators, GPU tests, server filters and sampling options, plus frontend sampling controls and build scripts.
- Add KV cache pool with prefix matching and reference counting
- Implement multi-user inference scheduler with queue and workers
- Add packed prefill and decode batch inference (Decoder::decodePacked)
- Support session forking and editing in the frontend
- Add continuous batching with PD separation
- Add segmented self-attention for packed sequences
- Include benchmark and integration tests
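The prefix-matching and reference-counting behavior of the KV cache pool can be sketched as follows. This is a minimal illustrative model, not the PR's actual implementation; `acquire`, `seal`, and `release` are hypothetical method names chosen for the sketch.

```python
from dataclasses import dataclass

@dataclass
class _Entry:
    tokens: tuple   # sealed token prefix this KV block covers
    refcount: int = 0  # live sequences currently pinned to this block

class KVCachePool:
    """Toy pool: longest-sealed-prefix matching with reference counting."""

    def __init__(self):
        self._entries = []

    def acquire(self, tokens):
        """Return (entry, matched_len) for the longest sealed prefix of tokens."""
        best, best_len = None, 0
        for e in self._entries:
            n = len(e.tokens)
            if n > best_len and tuple(tokens[:n]) == e.tokens:
                best, best_len = e, n
        if best is not None:
            best.refcount += 1  # pin the block while a sequence uses it
        return best, best_len

    def seal(self, tokens):
        """Register a finished prefix so later requests can reuse its KV."""
        entry = _Entry(tuple(tokens))
        self._entries.append(entry)
        return entry

    def release(self, entry):
        entry.refcount -= 1  # at zero, the block is eligible for eviction
```

A request whose prompt extends a sealed prefix only needs to prefill the suffix; the refcount keeps a shared block alive while any sequence still reads it.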
…itance, logging)
- Fix InfiniTensor#1: Replace the _session_worker dict with an OrderedDict LRU (max_sticky_sessions=10000)
- Fix InfiniTensor#2: Add a best-effort TOCTOU comment on KV-aware routing
- Fix InfiniTensor#3: Add logger.debug for tokenize failures; shallow-copy the payload in submit()
- Fix InfiniTensor#4: KVCachePool(IKVCachePool), ChatService(IInferenceService) explicit inheritance
- Fix InfiniTensor#5: Merge the double lock in request_stop()
- Fix InfiniTensor#6: Clean _prompt_tokens from the payload after routing
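The OrderedDict LRU from fix #1 can be sketched like this; `max_sticky_sessions` comes from the commit message, while the class and method names here are illustrative stand-ins for the real _session_worker mapping.

```python
from collections import OrderedDict

class StickySessions:
    """Toy LRU map from session id to worker id, bounded in size."""

    def __init__(self, max_sticky_sessions=10000):
        self.max = max_sticky_sessions
        self._map = OrderedDict()  # insertion/recency order = LRU order

    def get(self, session_id):
        worker = self._map.get(session_id)
        if worker is not None:
            self._map.move_to_end(session_id)  # refresh recency on hit
        return worker

    def put(self, session_id, worker_id):
        self._map[session_id] = worker_id
        self._map.move_to_end(session_id)
        while len(self._map) > self.max:
            self._map.popitem(last=False)  # evict the least-recent entry
```

Compared with a plain dict, the bounded OrderedDict prevents unbounded growth of sticky-session state while keeping hot sessions routed to the same worker.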
- Extract SessionManager (session_manager.py): session message history + cancel events
- Extract KVRuntimeBridge (kv_runtime_bridge.py): native C++ KV context lifecycle
- ChatService slimmed from ~726 to ~506 lines via a delegation pattern
- All IInferenceService interface signatures unchanged
- HTTP API and main() parameters unchanged
- Add test/test_chatservice_split.py with 19 tests covering all split modules
Previously, packed prefill/decode only handled greedy (argmax) requests; any request with temperature/top_k/top_p fell back to single-sequence processing. This adds per-sequence sampling params to the batch path via new C API bindings (PrefillPackedSampling/StepPackedSampling), with hasattr guards for backward compatibility with older DLLs.
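A minimal sketch of the hasattr-guard pattern described above, assuming a ctypes-style library handle; `_FakeLib`, `LIB`, and the `prefill` wrapper are stand-ins for the real bindings, and only the guard pattern itself is taken from the commit.

```python
class _FakeLib:
    """Stands in for the loaded ctypes CDLL of an older build,
    which lacks the PrefillPackedSampling entry point."""
    pass

LIB = _FakeLib()

def prefill(requests):
    # New DLLs export the packed-sampling entry point; old ones do not.
    if hasattr(LIB, "PrefillPackedSampling"):
        return LIB.PrefillPackedSampling(requests)  # batched path
    # Backward-compatible fallback: process sequences one by one.
    return [f"single:{r}" for r in requests]
```

Probing the handle with `hasattr` at call time means the same Python code runs against both old and new native libraries, degrading gracefully instead of raising `AttributeError`.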
Delete 3 outdated docs (new.md, UPDATE_PLAN.md, QA_REPORT.md) and create PROJECT_STATUS.md with progress summaries for all 6 project directions.
- server.py: add _wrap_completion/_wrap_chunk/_wrap_error helpers; generate/stream/generate_packed_non_stream return OpenAI format; SSE streams end with data: [DONE]
- scheduler.py: fix the continuous-batching worker to parse the new format (choices[0].finish_reason); convert the final chunk to chat.completion for the non-stream path
- frontend/app.js: switch to /v1/chat/completions and max_tokens; parse the new SSE format
- 5 test files: update mocks and assertions for the OpenAI format
- PROGRESS.md, docs/PROJECT_STATUS.md: document the changes
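The chunk wrapping and SSE framing can be sketched as below. The field layout follows the public OpenAI chat.completion.chunk schema and the data: [DONE] terminator named in the commit; the body of `_wrap_chunk` and the `sse_stream` helper are illustrative, not the server's actual code.

```python
import json
import time

def _wrap_chunk(model, delta_text, finish_reason=None):
    """Wrap one piece of generated text as a chat.completion.chunk dict."""
    return {
        "object": "chat.completion.chunk",
        "created": int(time.time()),
        "model": model,
        "choices": [{
            "index": 0,
            "delta": {"content": delta_text},
            "finish_reason": finish_reason,
        }],
    }

def sse_stream(model, pieces):
    """Yield SSE frames: one chunk per piece, a stop chunk, then [DONE]."""
    for piece in pieces:
        yield f"data: {json.dumps(_wrap_chunk(model, piece))}\n\n"
    yield f"data: {json.dumps(_wrap_chunk(model, '', 'stop'))}\n\n"
    yield "data: [DONE]\n\n"  # terminator that OpenAI-style clients key on
```

Clients stop reading on the literal `[DONE]` frame, so it must be the last frame and must not be JSON-encoded.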
Rewrite the scheduler in batch-driven mode so that multiple streaming requests share the model via prepare_batch/step_batch/finalize_sequence, with dynamic batch shrinking and automatic fallback to the legacy iterator path.
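A rough model of the batch-driven loop, under the assumption that the prepare/step/finalize phases behave as their names in the commit suggest; `FakeModel` and `run_batch_loop` are illustrative stand-ins, not the scheduler's real code.

```python
class FakeModel:
    """Stand-in model that emits a fixed number of tokens per sequence."""

    def __init__(self, lengths):
        self.left = dict(lengths)  # sequence id -> tokens still to emit

    def step_batch(self, batch):
        out = []
        for sid in batch:
            self.left[sid] -= 1
            out.append("t" if self.left[sid] > 0 else None)  # None = EOS
        return out

def run_batch_loop(model, active):
    """One decode step per iteration for all live sequences; the batch
    shrinks dynamically as sequences finish (finalize)."""
    outputs = {sid: [] for sid in active}
    while active:
        batch = sorted(active)            # prepare_batch: pick live seqs
        tokens = model.step_batch(batch)  # one forward step for the batch
        for sid, tok in zip(batch, tokens):
            if tok is None:
                active.remove(sid)        # finalize_sequence
            else:
                outputs[sid].append(tok)
    return outputs
```

The point of the structure is that every iteration advances all live sequences by one token through a single model call, instead of running one request to completion at a time.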
- ChatService supports a shared model_lock/kv_pool/kv_bridge across workers
- Add a --shared-model CLI flag for single-model multi-worker mode
- Add IKVCachePool.memory_pressure() and --kv-memory-threshold flow control
- Optimize KV-aware routing and the debug snapshot for shared-pool mode
- Add test/test_shared_model.py (14 tests)
The Iluvatar CoreX SDK is fully CUDA-compatible, so kernels are reused from the nvidia:: namespace without duplication. Adds a device enum, runtime dispatch, build scripts (clang++ -x cuda --cuda-gpu-arch=ivcore10), and --device iluvatar support across all test files.
The on_load hook runs too early: xmake injects cudadevrt after on_load when it detects CUDA dependencies. Use before_link to filter cudadevrt out of links, syslinks, and ldflags right before the linker runs.
Root cause: xmake detects .cu files and auto-injects nvcc toolchain + cudadevrt, completely ignoring our custom iluvatar_cu rule. Solution: use on_build() to fully control compilation with clang++, never registering .cu files via add_files(). This prevents xmake from detecting CUDA and injecting nvcc/cudadevrt.
The linker scans static libraries in a single pass. Since llaisys-ops calls nvidia:: symbols defined in llaisys-ops-iluvatar, --whole-archive is needed to force all symbols to be included.
add_ldflags was silently ignored by xmake. Use add_shflags with full .a file paths to force whole-archive inclusion of iluvatar static libraries into the shared library.
-lcudart was placed before the .a files by xmake, causing the linker to skip it (single-pass scanning). Move all iluvatar link flags into add_shflags to control exact order, and add rpath so libcudart.so is found at runtime.
All 9 GPU operators pass on Iluvatar CoreX (ivcore10). Runtime test detects 2 iluvatar devices and passes.
Added Iluvatar CoreX platform details: runtime, operators, build system, and test results. Updated summary table from 50% to 90%.
…Iluvatar test/test_infer.py --device iluvatar produces tokens identical to PyTorch reference output. Project InfiniTensor#2 now at 100%.
…sor parallelism
- Communication layer: C API (comm.h), C++ dispatcher, NCCL backend
- commInit accepts an external unique ID for multi-rank initialization
- llaisysCommGenerateUniqueId API for external ID generation
- Decoder AllReduce after the attn_o and mlp_down projections (Megatron-style)
- llaisysQwen2ModelSetTensorParallel C API
- Python weight splitting (column/row split for Megatron-style TP)
- Multi-process launcher (launch_tp.py + _tp_worker.py)
- Unit tests (test_comm_api.py) and integration tests (test_allreduce.py)
- Documentation: comm_design.md, PROGRESS.md, PROJECT_STATUS.md updated
When TP is enabled, nh is divided by world_size, so nh*dh != hs. The attn_out3d tensor has shape [len, tp_nh, dh] and must be viewed as [len, tp_nh*dh], not [len, hs].
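A quick NumPy illustration of the shape fix, with reshape standing in for the C++ view; the concrete sizes are made up for the example.

```python
import numpy as np

# With tensor parallelism, each rank holds nh // world_size heads,
# so the flattened attention output is tp_nh*dh wide, not hs wide.
nh, dh, world_size, seq_len = 8, 64, 2, 4
hs = nh * dh              # full hidden size: 512
tp_nh = nh // world_size  # heads on this rank: 4

attn_out3d = np.zeros((seq_len, tp_nh, dh), dtype=np.float32)
attn_out2d = attn_out3d.reshape(seq_len, tp_nh * dh)  # correct: [len, tp_nh*dh]

# Viewing as [len, hs] would require hs elements per row, but only
# tp_nh*dh exist on this rank:
assert tp_nh * dh != hs
```

The AllReduce after the attn_o projection is what recombines the per-rank partial results back into the full hidden size.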
Pull request overview
This PR extends LLAISYS toward broader GPU, distributed, and inference capability: it adds an Iluvatar (CoreX) GPU build and runtime, completes GPU implementations for several operators, introduces a communication layer (NCCL) with tensor-parallelism scripts and interfaces, and adds tokenizer, scheduler, and KV-reuse tests along with Python-side wrappers.
Changes:
- Add/extend xmake build targets, RuntimeAPI, and link strategies for NVIDIA and Iluvatar GPUs
- Complete CPU + CUDA implementations for several ops (add/argmax/embedding/linear/rearrange/rms_norm/rope/self_attention/swiglu) and add segmented self-attention C/Python bindings
- Add a SentencePiece tokenizer (C++/C API/Python wrapper), a communication-layer API (NCCL), multi-process allreduce/TP launch and test scripts, and frontend static pages
Reviewed changes
Copilot reviewed 146 out of 148 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| xmake/nvidia.lua | NVIDIA device and operator static-library build targets |
| xmake/iluvatar.lua | Iluvatar (CoreX) targets: manual compile/archive via the clang++ CUDA front end |
| xmake.lua | New sentencepiece/iluvatar options; aggregate-target dependency and link-strategy adjustments |
| test/test_utils.py | Type-annotation updates; iluvatar device mapping support |
| test/test_tokenizer.py | SentencePiece tokenizer ctypes smoke-test script |
| test/test_scheduler_inmemory.py | In-memory unit tests for InferenceScheduler (stream/timeout/CB/packed) |
| test/test_runtime.py | CLI device option extended to iluvatar |
| test/test_kv_cache_pool.py | KVCachePool behavior unit tests (sealed prefixes, reference counting, rollback, etc.) |
| test/test_infer.py | CLI device option extended to iluvatar |
| test/test_comm_api.py | Comm API (NCCL) ctypes unit test (single GPU, nranks=1) |
| test/test_chat_minimal.py | Minimal chat inference script (Tokenizer + Qwen2) |
| test/test_allreduce.py | Multi-process allreduce integration-test driver script |
| test/ops_gpu/add.py | GPU add parity test script |
| test/ops_gpu/argmax.py | GPU argmax parity test script |
| test/ops_gpu/embedding.py | GPU embedding parity test script |
| test/ops_gpu/linear.py | GPU linear parity test script |
| test/ops_gpu/rearrange.py | GPU rearrange parity test script |
| test/ops_gpu/rms_norm.py | GPU rms_norm parity test script |
| test/ops_gpu/rope.py | GPU rope parity test script |
| test/ops_gpu/self_attention.py | GPU self_attention parity test script |
| test/ops_gpu/swiglu.py | GPU swiglu parity test script |
| test/ops_gpu/run_all.py | One-shot runner for the GPU operator tests |
| test/ops_gpu/__init__.py | ops_gpu package placeholder |
| test/ops/add.py | iluvatar option added to the CPU/generic add test |
| test/ops/argmax.py | iluvatar option added to the CPU/generic argmax test |
| test/ops/embedding.py | iluvatar option added to the CPU/generic embedding test |
| test/ops/linear.py | iluvatar option added to the CPU/generic linear test |
| test/ops/rms_norm.py | iluvatar option added to the CPU/generic rms_norm test |
| test/ops/rope.py | iluvatar option added to the CPU/generic rope test |
| test/ops/self_attention.py | iluvatar option added to the CPU/generic self_attention test |
| test/ops/self_attention_segmented.py | Segmented self-attention reference implementation and parity test |
| test/ops/swiglu.py | iluvatar option added to the CPU/generic swiglu test |
| test/_allreduce_worker.py | allreduce worker (calls NCCL directly to generate/initialize the comm) |
| src/utils/types.hpp | Add an include guard (pragma once) |
| src/utils/check.hpp | Blank lines between macro definitions (formatting) |
| src/tokenizer/sentencepiece/sentencepiece.hpp | SentencePieceTokenizer C++ wrapper declaration |
| src/tokenizer/sentencepiece/sentencepiece.cpp | SentencePieceTokenizer implementation (guarded by a compile-time macro) |
| src/tensor/tensor.hpp | Tensor header comment and structure cleanup (Chinese comments) |
| src/tensor/tensor.cpp | Implement isContiguous/permute/view/slice/load/contiguous etc. (replacing TO_BE_IMPLEMENTED) |
| src/ops/add/op.cpp | add: new NVIDIA/ILUVATAR branches calling the CUDA implementation |
| src/ops/add/nvidia/add_nvidia.hpp | add CUDA declaration |
| src/ops/add/nvidia/add_nvidia.cu | add CUDA kernel implementation |
| src/ops/add/cpu/add_cpu.hpp | add CPU header formatting changes |
| src/ops/add/cpu/add_cpu.cpp | add CPU implementation formatting changes |
| src/ops/argmax/op.cpp | argmax: complete device/dtype validation and CPU/GPU dispatch |
| src/ops/argmax/nvidia/argmax_nvidia.hpp | argmax CUDA declaration |
| src/ops/argmax/nvidia/argmax_nvidia.cu | argmax CUDA kernel (single-threaded implementation) |
| src/ops/argmax/cpu/argmax_cpu.hpp | argmax CPU declaration |
| src/ops/argmax/cpu/argmax_cpu.cpp | argmax CPU implementation |
| src/ops/embedding/op.cpp | embedding: complete validation and CPU/GPU dispatch |
| src/ops/embedding/nvidia/embedding_nvidia.hpp | embedding CUDA declaration |
| src/ops/embedding/nvidia/embedding_nvidia.cu | embedding CUDA kernel |
| src/ops/embedding/cpu/embedding_cpu.hpp | embedding CPU declaration |
| src/ops/embedding/cpu/embedding_cpu.cpp | embedding CPU implementation |
| src/ops/linear/op.cpp | linear: optional bias, validation, and CPU/GPU dispatch |
| src/ops/linear/nvidia/linear_nvidia.hpp | linear CUDA declaration |
| src/ops/linear/nvidia/linear_nvidia.cu | linear CUDA kernel |
| src/ops/linear/cpu/linear_cpu.hpp | linear CPU declaration |
| src/ops/linear/cpu/linear_cpu.cpp | linear CPU implementation |
| src/ops/rearrange/op.cpp | rearrange: CPU/GPU dispatch; GPU side copies shape/stride to the device |
| src/ops/rearrange/nvidia/rearrange_nvidia.hpp | rearrange CUDA declaration |
| src/ops/rearrange/nvidia/rearrange_nvidia.cu | rearrange CUDA kernel |
| src/ops/rearrange/cpu/rearrange_cpu.hpp | rearrange CPU declaration |
| src/ops/rearrange/cpu/rearrange_cpu.cpp | rearrange CPU recursive implementation |
| src/ops/rms_norm/op.cpp | rms_norm: complete validation and CPU/GPU dispatch |
| src/ops/rms_norm/nvidia/rms_norm_nvidia.hpp | rms_norm CUDA declaration |
| src/ops/rms_norm/nvidia/rms_norm_nvidia.cu | rms_norm CUDA kernel |
| src/ops/rms_norm/cpu/rms_norm_cpu.hpp | rms_norm CPU declaration |
| src/ops/rms_norm/cpu/rms_norm_cpu.cpp | rms_norm CPU implementation |
| src/ops/rope/op.cpp | rope: complete validation and CPU/GPU dispatch (pos_ids int64) |
| src/ops/rope/nvidia/rope_nvidia.hpp | rope CUDA declaration |
| src/ops/rope/nvidia/rope_nvidia.cu | rope CUDA kernel |
| src/ops/rope/cpu/rope_cpu.hpp | rope CPU declaration |
| src/ops/rope/cpu/rope_cpu.cpp | rope CPU implementation |
| src/ops/self_attention/op.hpp | self_attention API extension: new segmented declaration |
| src/ops/self_attention/op.cpp | self_attention and segmented path implementation and dispatch |
| src/ops/self_attention/nvidia/self_attention_nvidia.hpp | self_attention CUDA declaration |
| src/ops/self_attention/nvidia/self_attention_nvidia.cu | self_attention CUDA kernel (naive implementation) |
| src/ops/self_attention/cpu/self_attention_cpu.hpp | self_attention CPU declaration (including segmented) |
| src/ops/swiglu/op.cpp | swiglu: complete validation and CPU/GPU dispatch |
| src/ops/swiglu/nvidia/swiglu_nvidia.hpp | swiglu CUDA declaration |
| src/ops/swiglu/nvidia/swiglu_nvidia.cu | swiglu CUDA kernel |
| src/ops/swiglu/cpu/swiglu_cpu.hpp | swiglu CPU declaration |
| src/ops/swiglu/cpu/swiglu_cpu.cpp | swiglu CPU implementation |
| src/models/transformer/decoder/decoder.hpp | Transformer decoder API declarations (prefill/packed/TP/KV ctx) |
| src/models/qwen2/qwen2.hpp | Qwen2 C++ model wrapper declaration (packed, sampling, TP, KV ctx) |
| src/llaisys/tokenizer.cc | Tokenizer C API implementation (SentencePiece) |
| src/llaisys/ops.cc | C API: linear supports bias=null; new segmented self-attention export |
| src/llaisys/models/qwen2_kv_internal.hpp | Qwen2 KV block/context internal structures (refcount etc.) |
| src/llaisys/comm.cc | comm C API glue (getCommAPI / generateUniqueId) |
| src/device/runtime_api.hpp | RuntimeAPI: add the iluvatar namespace declaration |
| src/device/runtime_api.cpp | RuntimeAPI dispatcher: LLAISYS_DEVICE_ILUVATAR support |
| src/device/nvidia/nvidia_runtime_api.cu | NVIDIA RuntimeAPI: complete the CUDA implementation (memcpy/stream/malloc etc.) |
| src/device/nvidia/nvidia_comm.cu | NCCL-backend comm API implementation (allreduce/bcast/send/recv etc.) |
| src/device/nvidia/devlink_stub.cu | CUDA devlink stub (triggers device linking) |
| src/device/nvidia/cuda_utils.hpp | CUDA error checking + ScalarOps (fp16/bf16/f32) |
| src/device/iluvatar/iluvatar_utils.hpp | Iluvatar CUDA-like utils + ScalarOps |
| src/device/iluvatar/iluvatar_runtime_api.cu | Iluvatar RuntimeAPI: CUDA runtime wrapper implementation |
| src/device/iluvatar/iluvatar_resource.cuh | Iluvatar DeviceResource declaration |
| src/device/iluvatar/iluvatar_resource.cu | Iluvatar DeviceResource constructor implementation |
| src/device/iluvatar/devlink_stub.cu | Iluvatar devlink stub |
| src/device/comm_api.hpp | Communication-layer abstract API declarations (with conditional NCCL/IXCCL declarations) |
| src/device/comm_api.cpp | Communication-layer dispatcher plus unsupported-default implementation |
| src/core/context/context.hpp | Comments added to Context |
| src/core/context/context.cpp | Comments added to Context lifecycle and device switching |
| scripts/run_gpu.ps1 | One-shot Windows GPU build/test/server script |
| scripts/launch_tp.py | TP multi-process launcher (via commGenerateUniqueId) |
| scripts/benchmark_chat_scheduler.py | Scheduler load-test script (HTTP chat) |
| python/llaisys/tokenizer.py | Python Tokenizer: SentencePiece + tokenizer.json (HF tokenizers) |
| python/llaisys/tensor_parallel.py | Qwen2 weight splitting (column/row split) |
| python/llaisys/session_manager.py | Session history and cancel-event management |
| python/llaisys/ops.py | Python Ops: new self_attention_segmented wrapper |
| python/llaisys/models/__init__.py | Export format_chat_prompt etc. |
| python/llaisys/libllaisys/tokenizer.py | ctypes tokenizer API binding loader |
| python/llaisys/libllaisys/ops.py | ctypes ops: optional loading of segmented self-attention |
| python/llaisys/libllaisys/llaisys_types.py | DeviceType: new ILUVATAR enum value |
| python/llaisys/libllaisys/__init__.py | Library loading: new models/comm/tokenizer loaders and exports |
| python/llaisys/kv_runtime_bridge.py | Python-side bridge for reusing native KV contexts |
| python/llaisys/interfaces.py | Abstract interface definitions for scheduler/service/KV pool |
| python/llaisys/__init__.py | Top-level Tokenizer export |
| include/llaisys/tokenizer.h | Tokenizer C API header |
| include/llaisys/ops.h | Ops C API: new segmented self-attention declaration |
| include/llaisys/models/qwen2.h | Qwen2 C API extensions: sampling/packed/TP/KV block+context etc. |
| include/llaisys/comm.h | comm C API header (backend/op/API struct) |
| include/llaisys.h | Device enum: new LLAISYS_DEVICE_ILUVATAR |
| frontend/style.css | Frontend styles |
| frontend/index.html | Frontend page skeleton |
| Untitled | New single-line command file |
Comment on lines +52 to +59:

```lua
-- Archive into static library
local targetfile = target:targetfile()
local targetdir = path.directory(targetfile)
if not os.isdir(targetdir) then
    os.mkdir(targetdir)
end
os.vrunv("ar", {"-cr", targetfile, table.unpack(objectfiles)})
end)
```
Comment on lines +203 to 208:

```cpp
TensorMeta new_meta{dtype(), new_shape, new_strides};
return tensor_t(new Tensor(new_meta, _storage, _offset)); // zero-copy

    return std::shared_ptr<Tensor>(new Tensor(_meta, _storage));
}
```
Comment on lines 210 to +218:

```cpp
tensor_t Tensor::view(const std::vector<size_t> &shape) const {
    TO_BE_IMPLEMENTED();
    return std::shared_ptr<Tensor>(new Tensor(_meta, _storage));
    if (isContiguous() == true) {
        tensor_t tmp = create(shape, this->dtype(), this->deviceType(), this->deviceId());
        tmp->_storage = this->_storage;
        return tmp;
    } else {
        // non-contiguous storage
        return contiguous()->view(shape);
    }
```
Comment on lines 255 to +284:

```cpp
tensor_t Tensor::contiguous() const {
    TO_BE_IMPLEMENTED();
    return std::shared_ptr<Tensor>(new Tensor(_meta, _storage));
    if (isContiguous()) {
        return std::shared_ptr<Tensor>(new Tensor(_meta, _storage));
    } else {
        // shape
        const auto &sh = shape();
        // number of dimensions
        const auto dim = sh.size();

        // build a new contiguous stride array
        std::vector<ptrdiff_t> c_str(dim, 1);
        for (size_t i = dim - 1; i-- > 0;) {
            c_str[i] = c_str[i + 1] * sh[i + 1];
        }

        // allocate new storage on the same device
        size_t bytes = numel() * elementSize();
        core::storage_t st = (deviceType() == LLAISYS_DEVICE_CPU)
                                 ? core::context().runtime().allocateHostStorage(bytes)
                                 : core::context().runtime().allocateDeviceStorage(bytes);

        // create the new contiguous tensor
        tensor_t dst(new Tensor(TensorMeta{dtype(), sh, c_str}, st, 0));

        // 4. copy the data (H2H or H2D depending on device)
        core::context().setDevice(deviceType(), deviceId());
        core::context().runtime().api()->memcpy_sync(
            dst->data(), data(), bytes,
            deviceType() == LLAISYS_DEVICE_CPU ? LLAISYS_MEMCPY_H2H : LLAISYS_MEMCPY_H2D);
```
Comment on lines +62 to +67:

```cpp
case LLAISYS_COMM_IXCCL:
#ifdef ENABLE_ILUVATAR_API
    return llaisys::device::ixccl::getCommAPI();
#else
    return getUnsupportedCommAPI();
#endif
```
Comment on lines +6 to +10:

```cpp
class Resource : public llaisys::device::DeviceResource {
public:
    Resource(int device_id);
    ~Resource();
};
```
Comment on lines +154 to +168:

```lua
elseif has_config("iluvatar-gpu") then
    -- No .cu files in this target, no CUDA toolchain
    -- Use add_shflags to control exact link order:
    -- 1. whole-archive iluvatar static libs (defines nvidia:: symbols)
    -- 2. -lcudart AFTER the .a files (so cudart symbols are resolved)
    add_shflags(
        "-Wl,--whole-archive",
        "build/linux/x86_64/release/libllaisys-ops-iluvatar.a",
        "build/linux/x86_64/release/libllaisys-device-iluvatar.a",
        "-Wl,--no-whole-archive",
        "-L/usr/local/corex/lib64",
        "-Wl,-rpath,/usr/local/corex/lib64",
        "-lcudart",
        {force = true}
    )
```
I. Completed projects
Project 2: NVIDIA + Iluvatar
Project 3: server + frontend + streaming output + session management + KV reuse
Project 4: scheduler + continuous batching + shared model pool + KV-aware routing
Project 5: communication layer + NCCL backend + tensor parallelism
See REPORT.md in the repository root for detailed completion status.