Add NVIDIA runtime/operators, GPU tests, server filters and sampling options, plus frontend sampling controls and build scripts.
- Add KV cache pool with prefix matching and reference counting
- Implement multi-user inference scheduler with queue and workers
- Add packed prefill and decode batch inference (Decoder::decodePacked)
- Support session forking and editing in the frontend
- Add continuous batching with PD separation
- Add segmented self-attention for packed sequences
- Include benchmark and integration tests
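The prefix-matching and reference-counting behavior of the KV cache pool can be sketched as follows. This is a minimal illustrative model, not the PR's actual implementation; `acquire`, `seal`, and `release` are hypothetical method names chosen for the sketch.

```python
from dataclasses import dataclass

@dataclass
class _Entry:
    tokens: tuple   # sealed token prefix this KV block covers
    refcount: int = 0  # live sequences currently pinned to this block

class KVCachePool:
    """Toy pool: longest-sealed-prefix matching with reference counting."""

    def __init__(self):
        self._entries = []

    def acquire(self, tokens):
        """Return (entry, matched_len) for the longest sealed prefix of tokens."""
        best, best_len = None, 0
        for e in self._entries:
            n = len(e.tokens)
            if n > best_len and tuple(tokens[:n]) == e.tokens:
                best, best_len = e, n
        if best is not None:
            best.refcount += 1  # pin the block while a sequence uses it
        return best, best_len

    def seal(self, tokens):
        """Register a finished prefix so later requests can reuse its KV."""
        entry = _Entry(tuple(tokens))
        self._entries.append(entry)
        return entry

    def release(self, entry):
        entry.refcount -= 1  # at zero, the block is eligible for eviction
```

A request whose prompt extends a sealed prefix only needs to prefill the suffix; the refcount keeps a shared block alive while any sequence still reads it.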
…itance, logging)
- Fix InfiniTensor#1: Replace the _session_worker dict with an OrderedDict LRU (max_sticky_sessions=10000)
- Fix InfiniTensor#2: Add a best-effort TOCTOU comment on KV-aware routing
- Fix InfiniTensor#3: Add logger.debug for tokenize failures; shallow-copy the payload in submit()
- Fix InfiniTensor#4: KVCachePool(IKVCachePool), ChatService(IInferenceService) explicit inheritance
- Fix InfiniTensor#5: Merge the double lock in request_stop()
- Fix InfiniTensor#6: Clean _prompt_tokens from the payload after routing
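The OrderedDict LRU from fix #1 can be sketched like this; `max_sticky_sessions` comes from the commit message, while the class and method names here are illustrative stand-ins for the real _session_worker mapping.

```python
from collections import OrderedDict

class StickySessions:
    """Toy LRU map from session id to worker id, bounded in size."""

    def __init__(self, max_sticky_sessions=10000):
        self.max = max_sticky_sessions
        self._map = OrderedDict()  # insertion/recency order = LRU order

    def get(self, session_id):
        worker = self._map.get(session_id)
        if worker is not None:
            self._map.move_to_end(session_id)  # refresh recency on hit
        return worker

    def put(self, session_id, worker_id):
        self._map[session_id] = worker_id
        self._map.move_to_end(session_id)
        while len(self._map) > self.max:
            self._map.popitem(last=False)  # evict the least-recent entry
```

Compared with a plain dict, the bounded OrderedDict prevents unbounded growth of sticky-session state while keeping hot sessions routed to the same worker.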
- Extract SessionManager (session_manager.py): session message history + cancel events
- Extract KVRuntimeBridge (kv_runtime_bridge.py): native C++ KV context lifecycle
- ChatService slimmed from ~726 to ~506 lines via a delegation pattern
- All IInferenceService interface signatures unchanged
- HTTP API and main() parameters unchanged
- Add test/test_chatservice_split.py with 19 tests covering all split modules
Previously, packed prefill/decode only handled greedy (argmax) requests; any request with temperature/top_k/top_p fell back to single-sequence processing. This adds per-sequence sampling params to the batch path via new C API bindings (PrefillPackedSampling/StepPackedSampling), with hasattr guards for backward compatibility with older DLLs.
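A minimal sketch of the hasattr-guard pattern described above, assuming a ctypes-style library handle; `_FakeLib`, `LIB`, and the `prefill` wrapper are stand-ins for the real bindings, and only the guard pattern itself is taken from the commit.

```python
class _FakeLib:
    """Stands in for the loaded ctypes CDLL of an older build,
    which lacks the PrefillPackedSampling entry point."""
    pass

LIB = _FakeLib()

def prefill(requests):
    # New DLLs export the packed-sampling entry point; old ones do not.
    if hasattr(LIB, "PrefillPackedSampling"):
        return LIB.PrefillPackedSampling(requests)  # batched path
    # Backward-compatible fallback: process sequences one by one.
    return [f"single:{r}" for r in requests]
```

Probing the handle with `hasattr` at call time means the same Python code runs against both old and new native libraries, degrading gracefully instead of raising `AttributeError`.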
Delete 3 outdated docs (new.md, UPDATE_PLAN.md, QA_REPORT.md) and create PROJECT_STATUS.md with progress summaries for all 6 project directions.
- server.py: add _wrap_completion/_wrap_chunk/_wrap_error helpers; generate/stream/generate_packed_non_stream return OpenAI format; SSE streams end with data: [DONE]
- scheduler.py: fix the continuous-batching worker to parse the new format (choices[0].finish_reason); convert the final chunk to chat.completion for the non-stream path
- frontend/app.js: switch to /v1/chat/completions and max_tokens; parse the new SSE format
- 5 test files: update mocks and assertions for the OpenAI format
- PROGRESS.md, docs/PROJECT_STATUS.md: document the changes
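The chunk wrapping and SSE framing can be sketched as below. The field layout follows the public OpenAI chat.completion.chunk schema and the data: [DONE] terminator named in the commit; the body of `_wrap_chunk` and the `sse_stream` helper are illustrative, not the server's actual code.

```python
import json
import time

def _wrap_chunk(model, delta_text, finish_reason=None):
    """Wrap one piece of generated text as a chat.completion.chunk dict."""
    return {
        "object": "chat.completion.chunk",
        "created": int(time.time()),
        "model": model,
        "choices": [{
            "index": 0,
            "delta": {"content": delta_text},
            "finish_reason": finish_reason,
        }],
    }

def sse_stream(model, pieces):
    """Yield SSE frames: one chunk per piece, a stop chunk, then [DONE]."""
    for piece in pieces:
        yield f"data: {json.dumps(_wrap_chunk(model, piece))}\n\n"
    yield f"data: {json.dumps(_wrap_chunk(model, '', 'stop'))}\n\n"
    yield "data: [DONE]\n\n"  # terminator that OpenAI-style clients key on
```

Clients stop reading on the literal `[DONE]` frame, so it must be the last frame and must not be JSON-encoded.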
Rewrite the scheduler in batch-driven mode so that multiple streaming requests share the model via prepare_batch/step_batch/finalize_sequence, with dynamic batch shrinking and automatic fallback to the legacy iterator path.
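A rough model of the batch-driven loop, under the assumption that the prepare/step/finalize phases behave as their names in the commit suggest; `FakeModel` and `run_batch_loop` are illustrative stand-ins, not the scheduler's real code.

```python
class FakeModel:
    """Stand-in model that emits a fixed number of tokens per sequence."""

    def __init__(self, lengths):
        self.left = dict(lengths)  # sequence id -> tokens still to emit

    def step_batch(self, batch):
        out = []
        for sid in batch:
            self.left[sid] -= 1
            out.append("t" if self.left[sid] > 0 else None)  # None = EOS
        return out

def run_batch_loop(model, active):
    """One decode step per iteration for all live sequences; the batch
    shrinks dynamically as sequences finish (finalize)."""
    outputs = {sid: [] for sid in active}
    while active:
        batch = sorted(active)            # prepare_batch: pick live seqs
        tokens = model.step_batch(batch)  # one forward step for the batch
        for sid, tok in zip(batch, tokens):
            if tok is None:
                active.remove(sid)        # finalize_sequence
            else:
                outputs[sid].append(tok)
    return outputs
```

The point of the structure is that every iteration advances all live sequences by one token through a single model call, instead of running one request to completion at a time.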
- ChatService supports a shared model_lock/kv_pool/kv_bridge across workers
- Add a --shared-model CLI flag for single-model multi-worker mode
- Add IKVCachePool.memory_pressure() and --kv-memory-threshold flow control
- Optimize KV-aware routing and the debug snapshot for shared-pool mode
- Add test/test_shared_model.py (14 tests)
The Iluvatar CoreX SDK is fully CUDA-compatible, so kernels are reused from the nvidia:: namespace without duplication. Adds a device enum, runtime dispatch, build scripts (clang++ -x cuda --cuda-gpu-arch=ivcore10), and --device iluvatar support across all test files.
The on_load hook runs too early: xmake injects cudadevrt after on_load when it detects CUDA dependencies. Use before_link to filter cudadevrt out of links, syslinks, and ldflags right before the linker runs.
Root cause: xmake detects .cu files and auto-injects nvcc toolchain + cudadevrt, completely ignoring our custom iluvatar_cu rule. Solution: use on_build() to fully control compilation with clang++, never registering .cu files via add_files(). This prevents xmake from detecting CUDA and injecting nvcc/cudadevrt.
The linker scans static libraries in a single pass. Since llaisys-ops calls nvidia:: symbols defined in llaisys-ops-iluvatar, --whole-archive is needed to force all symbols to be included.
add_ldflags was silently ignored by xmake. Use add_shflags with full .a file paths to force whole-archive inclusion of iluvatar static libraries into the shared library.
-lcudart was placed before the .a files by xmake, causing the linker to skip it (single-pass scanning). Move all iluvatar link flags into add_shflags to control exact order, and add rpath so libcudart.so is found at runtime.
All 9 GPU operators pass on Iluvatar CoreX (ivcore10). Runtime test detects 2 iluvatar devices and passes.
Added Iluvatar CoreX platform details: runtime, operators, build system, and test results. Updated summary table from 50% to 90%.
…Iluvatar test/test_infer.py --device iluvatar produces tokens identical to PyTorch reference output. Project InfiniTensor#2 now at 100%.
…sor parallelism
- Communication layer: C API (comm.h), C++ dispatcher, NCCL backend
- commInit accepts an external unique ID for multi-rank initialization
- llaisysCommGenerateUniqueId API for external ID generation
- Decoder AllReduce after the attn_o and mlp_down projections (Megatron-style)
- llaisysQwen2ModelSetTensorParallel C API
- Python weight splitting (column/row split for Megatron-style TP)
- Multi-process launcher (launch_tp.py + _tp_worker.py)
- Unit tests (test_comm_api.py) and integration tests (test_allreduce.py)
- Documentation: comm_design.md, PROGRESS.md, PROJECT_STATUS.md updated
When TP is enabled, nh is divided by world_size, so nh*dh != hs. The attn_out3d tensor has shape [len, tp_nh, dh] and must be viewed as [len, tp_nh*dh], not [len, hs].
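A quick NumPy illustration of the shape fix, with reshape standing in for the C++ view; the concrete sizes are made up for the example.

```python
import numpy as np

# With tensor parallelism, each rank holds nh // world_size heads,
# so the flattened attention output is tp_nh*dh wide, not hs wide.
nh, dh, world_size, seq_len = 8, 64, 2, 4
hs = nh * dh              # full hidden size: 512
tp_nh = nh // world_size  # heads on this rank: 4

attn_out3d = np.zeros((seq_len, tp_nh, dh), dtype=np.float32)
attn_out2d = attn_out3d.reshape(seq_len, tp_nh * dh)  # correct: [len, tp_nh*dh]

# Viewing as [len, hs] would require hs elements per row, but only
# tp_nh*dh exist on this rank:
assert tp_nh * dh != hs
```

The AllReduce after the attn_o projection is what recombines the per-rank partial results back into the full hidden size.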
Pull request overview
This PR extends LLAISYS toward broader GPU, distributed, and inference capability: it adds an Iluvatar (CoreX) GPU build and runtime, completes GPU implementations for several operators, introduces a communication layer (NCCL) with tensor-parallelism scripts and interfaces, and adds tokenizer, scheduler, and KV-reuse tests along with Python-side wrappers.
Changes:
- Add/extend xmake build targets, RuntimeAPI, and link strategies for NVIDIA and Iluvatar GPUs
- Complete CPU + CUDA implementations for several ops (add/argmax/embedding/linear/rearrange/rms_norm/rope/self_attention/swiglu) and add segmented self-attention C/Python bindings
- Add a SentencePiece tokenizer (C++/C API/Python wrapper), a communication-layer API (NCCL), multi-process allreduce/TP launch and test scripts, and frontend static pages
Reviewed changes
Copilot reviewed 146 out of 148 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| xmake/nvidia.lua | NVIDIA device and operator static-library build targets |
| xmake/iluvatar.lua | Iluvatar (CoreX) targets: manual compile/archive via the clang++ CUDA front end |
| xmake.lua | New sentencepiece/iluvatar options; aggregate-target dependency and link-strategy adjustments |
| test/test_utils.py | Type-annotation updates; iluvatar device mapping support |
| test/test_tokenizer.py | SentencePiece tokenizer ctypes smoke-test script |
| test/test_scheduler_inmemory.py | In-memory unit tests for InferenceScheduler (stream/timeout/CB/packed) |
| test/test_runtime.py | CLI device option extended to iluvatar |
| test/test_kv_cache_pool.py | KVCachePool behavior unit tests (sealed prefixes, reference counting, rollback, etc.) |
| test/test_infer.py | CLI device option extended to iluvatar |
| test/test_comm_api.py | Comm API (NCCL) ctypes unit test (single GPU, nranks=1) |
| test/test_chat_minimal.py | Minimal chat inference script (Tokenizer + Qwen2) |
| test/test_allreduce.py | Multi-process allreduce integration-test driver script |
| test/ops_gpu/add.py | GPU add parity test script |
| test/ops_gpu/argmax.py | GPU argmax parity test script |
| test/ops_gpu/embedding.py | GPU embedding parity test script |
| test/ops_gpu/linear.py | GPU linear parity test script |
| test/ops_gpu/rearrange.py | GPU rearrange parity test script |
| test/ops_gpu/rms_norm.py | GPU rms_norm parity test script |
| test/ops_gpu/rope.py | GPU rope parity test script |
| test/ops_gpu/self_attention.py | GPU self_attention parity test script |
| test/ops_gpu/swiglu.py | GPU swiglu parity test script |
| test/ops_gpu/run_all.py | One-shot runner for the GPU operator tests |
| test/ops_gpu/__init__.py | ops_gpu package placeholder |
| test/ops/add.py | iluvatar option added to the CPU/generic add test |
| test/ops/argmax.py | iluvatar option added to the CPU/generic argmax test |
| test/ops/embedding.py | iluvatar option added to the CPU/generic embedding test |
| test/ops/linear.py | iluvatar option added to the CPU/generic linear test |
| test/ops/rms_norm.py | iluvatar option added to the CPU/generic rms_norm test |
| test/ops/rope.py | iluvatar option added to the CPU/generic rope test |
| test/ops/self_attention.py | iluvatar option added to the CPU/generic self_attention test |
| test/ops/self_attention_segmented.py | Segmented self-attention reference implementation and parity test |
| test/ops/swiglu.py | iluvatar option added to the CPU/generic swiglu test |
| test/_allreduce_worker.py | allreduce worker (calls NCCL directly to generate/initialize the comm) |
| src/utils/types.hpp | Add an include guard (pragma once) |
| src/utils/check.hpp | Blank lines between macro definitions (formatting) |
| src/tokenizer/sentencepiece/sentencepiece.hpp | SentencePieceTokenizer C++ wrapper declaration |
| src/tokenizer/sentencepiece/sentencepiece.cpp | SentencePieceTokenizer implementation (guarded by a compile-time macro) |
| src/tensor/tensor.hpp | Tensor header comment and structure cleanup (Chinese comments) |
| src/tensor/tensor.cpp | Implement isContiguous/permute/view/slice/load/contiguous etc. (replacing TO_BE_IMPLEMENTED) |
| src/ops/add/op.cpp | add: new NVIDIA/ILUVATAR branches calling the CUDA implementation |
| src/ops/add/nvidia/add_nvidia.hpp | add CUDA declaration |
| src/ops/add/nvidia/add_nvidia.cu | add CUDA kernel implementation |
| src/ops/add/cpu/add_cpu.hpp | add CPU header formatting changes |
| src/ops/add/cpu/add_cpu.cpp | add CPU implementation formatting changes |
| src/ops/argmax/op.cpp | argmax: complete device/dtype validation and CPU/GPU dispatch |
| src/ops/argmax/nvidia/argmax_nvidia.hpp | argmax CUDA declaration |
| src/ops/argmax/nvidia/argmax_nvidia.cu | argmax CUDA kernel (single-threaded implementation) |
| src/ops/argmax/cpu/argmax_cpu.hpp | argmax CPU declaration |
| src/ops/argmax/cpu/argmax_cpu.cpp | argmax CPU implementation |
| src/ops/embedding/op.cpp | embedding: complete validation and CPU/GPU dispatch |
| src/ops/embedding/nvidia/embedding_nvidia.hpp | embedding CUDA declaration |
| src/ops/embedding/nvidia/embedding_nvidia.cu | embedding CUDA kernel |
| src/ops/embedding/cpu/embedding_cpu.hpp | embedding CPU declaration |
| src/ops/embedding/cpu/embedding_cpu.cpp | embedding CPU implementation |
| src/ops/linear/op.cpp | linear: optional bias, validation, and CPU/GPU dispatch |
| src/ops/linear/nvidia/linear_nvidia.hpp | linear CUDA declaration |
| src/ops/linear/nvidia/linear_nvidia.cu | linear CUDA kernel |
| src/ops/linear/cpu/linear_cpu.hpp | linear CPU declaration |
| src/ops/linear/cpu/linear_cpu.cpp | linear CPU implementation |
| src/ops/rearrange/op.cpp | rearrange: CPU/GPU dispatch; GPU side copies shape/stride to the device |
| src/ops/rearrange/nvidia/rearrange_nvidia.hpp | rearrange CUDA declaration |
| src/ops/rearrange/nvidia/rearrange_nvidia.cu | rearrange CUDA kernel |
| src/ops/rearrange/cpu/rearrange_cpu.hpp | rearrange CPU declaration |
| src/ops/rearrange/cpu/rearrange_cpu.cpp | rearrange CPU recursive implementation |
| src/ops/rms_norm/op.cpp | rms_norm: complete validation and CPU/GPU dispatch |
| src/ops/rms_norm/nvidia/rms_norm_nvidia.hpp | rms_norm CUDA declaration |
| src/ops/rms_norm/nvidia/rms_norm_nvidia.cu | rms_norm CUDA kernel |
| src/ops/rms_norm/cpu/rms_norm_cpu.hpp | rms_norm CPU declaration |
| src/ops/rms_norm/cpu/rms_norm_cpu.cpp | rms_norm CPU implementation |
| src/ops/rope/op.cpp | rope: complete validation and CPU/GPU dispatch (pos_ids int64) |
| src/ops/rope/nvidia/rope_nvidia.hpp | rope CUDA declaration |
| src/ops/rope/nvidia/rope_nvidia.cu | rope CUDA kernel |
| src/ops/rope/cpu/rope_cpu.hpp | rope CPU declaration |
| src/ops/rope/cpu/rope_cpu.cpp | rope CPU implementation |
| src/ops/self_attention/op.hpp | self_attention API extension: new segmented declaration |
| src/ops/self_attention/op.cpp | self_attention and segmented path implementation and dispatch |
| src/ops/self_attention/nvidia/self_attention_nvidia.hpp | self_attention CUDA declaration |
| src/ops/self_attention/nvidia/self_attention_nvidia.cu | self_attention CUDA kernel (naive implementation) |
| src/ops/self_attention/cpu/self_attention_cpu.hpp | self_attention CPU declaration (including segmented) |
| src/ops/swiglu/op.cpp | swiglu: complete validation and CPU/GPU dispatch |
| src/ops/swiglu/nvidia/swiglu_nvidia.hpp | swiglu CUDA declaration |
| src/ops/swiglu/nvidia/swiglu_nvidia.cu | swiglu CUDA kernel |
| src/ops/swiglu/cpu/swiglu_cpu.hpp | swiglu CPU declaration |
| src/ops/swiglu/cpu/swiglu_cpu.cpp | swiglu CPU implementation |
| src/models/transformer/decoder/decoder.hpp | Transformer decoder API declarations (prefill/packed/TP/KV ctx) |
| src/models/qwen2/qwen2.hpp | Qwen2 C++ model wrapper declaration (packed, sampling, TP, KV ctx) |
| src/llaisys/tokenizer.cc | Tokenizer C API implementation (SentencePiece) |
| src/llaisys/ops.cc | C API: linear supports bias=null; new segmented self-attention export |
| src/llaisys/models/qwen2_kv_internal.hpp | Qwen2 KV block/context internal structures (refcount etc.) |
| src/llaisys/comm.cc | comm C API glue (getCommAPI / generateUniqueId) |
| src/device/runtime_api.hpp | RuntimeAPI: add the iluvatar namespace declaration |
| src/device/runtime_api.cpp | RuntimeAPI dispatcher: LLAISYS_DEVICE_ILUVATAR support |
| src/device/nvidia/nvidia_runtime_api.cu | NVIDIA RuntimeAPI: complete the CUDA implementation (memcpy/stream/malloc etc.) |
| src/device/nvidia/nvidia_comm.cu | NCCL-backend comm API implementation (allreduce/bcast/send/recv etc.) |
| src/device/nvidia/devlink_stub.cu | CUDA devlink stub (triggers device linking) |
| src/device/nvidia/cuda_utils.hpp | CUDA error checking + ScalarOps (fp16/bf16/f32) |
| src/device/iluvatar/iluvatar_utils.hpp | Iluvatar CUDA-like utils + ScalarOps |
| src/device/iluvatar/iluvatar_runtime_api.cu | Iluvatar RuntimeAPI: CUDA runtime wrapper implementation |
| src/device/iluvatar/iluvatar_resource.cuh | Iluvatar DeviceResource declaration |
| src/device/iluvatar/iluvatar_resource.cu | Iluvatar DeviceResource constructor implementation |
| src/device/iluvatar/devlink_stub.cu | Iluvatar devlink stub |
| src/device/comm_api.hpp | Communication-layer abstract API declarations (with conditional NCCL/IXCCL declarations) |
| src/device/comm_api.cpp | Communication-layer dispatcher plus unsupported-default implementation |
| src/core/context/context.hpp | Comments added to Context |
| src/core/context/context.cpp | Comments added to Context lifecycle and device switching |
| scripts/run_gpu.ps1 | One-shot Windows GPU build/test/server script |
| scripts/launch_tp.py | TP multi-process launcher (via commGenerateUniqueId) |
| scripts/benchmark_chat_scheduler.py | Scheduler load-test script (HTTP chat) |
| python/llaisys/tokenizer.py | Python Tokenizer: SentencePiece + tokenizer.json (HF tokenizers) |
| python/llaisys/tensor_parallel.py | Qwen2 weight splitting (column/row split) |
| python/llaisys/session_manager.py | Session history and cancel-event management |
| python/llaisys/ops.py | Python Ops: new self_attention_segmented wrapper |
| python/llaisys/models/__init__.py | Export format_chat_prompt etc. |
| python/llaisys/libllaisys/tokenizer.py | ctypes tokenizer API binding loader |
| python/llaisys/libllaisys/ops.py | ctypes ops: optional loading of segmented self-attention |
| python/llaisys/libllaisys/llaisys_types.py | DeviceType: new ILUVATAR enum value |
| python/llaisys/libllaisys/__init__.py | Library loading: new models/comm/tokenizer loaders and exports |
| python/llaisys/kv_runtime_bridge.py | Python-side bridge for reusing native KV contexts |
| python/llaisys/interfaces.py | Abstract interface definitions for scheduler/service/KV pool |
| python/llaisys/__init__.py | Top-level Tokenizer export |
| include/llaisys/tokenizer.h | Tokenizer C API header |
| include/llaisys/ops.h | Ops C API: new segmented self-attention declaration |
| include/llaisys/models/qwen2.h | Qwen2 C API extensions: sampling/packed/TP/KV block+context etc. |
| include/llaisys/comm.h | comm C API header (backend/op/API struct) |
| include/llaisys.h | Device enum: new LLAISYS_DEVICE_ILUVATAR |
| frontend/style.css | Frontend styles |
| frontend/index.html | Frontend page skeleton |
| Untitled | New single-line command file |
Comment on lines +52 to +59:

```lua
-- Archive into static library
local targetfile = target:targetfile()
local targetdir = path.directory(targetfile)
if not os.isdir(targetdir) then
    os.mkdir(targetdir)
end
os.vrunv("ar", {"-cr", targetfile, table.unpack(objectfiles)})
end)
```
Comment on lines +203 to 208:

```cpp
TensorMeta new_meta{dtype(), new_shape, new_strides};
return tensor_t(new Tensor(new_meta, _storage, _offset)); // zero-copy

    return std::shared_ptr<Tensor>(new Tensor(_meta, _storage));
}
```
Comment on lines 210 to +218:

```cpp
tensor_t Tensor::view(const std::vector<size_t> &shape) const {
    TO_BE_IMPLEMENTED();
    return std::shared_ptr<Tensor>(new Tensor(_meta, _storage));
    if (isContiguous() == true) {
        tensor_t tmp = create(shape, this->dtype(), this->deviceType(), this->deviceId());
        tmp->_storage = this->_storage;
        return tmp;
    } else {
        // non-contiguous storage
        return contiguous()->view(shape);
    }
```
Comment on lines 255 to +284:

```cpp
tensor_t Tensor::contiguous() const {
    TO_BE_IMPLEMENTED();
    return std::shared_ptr<Tensor>(new Tensor(_meta, _storage));
    if (isContiguous()) {
        return std::shared_ptr<Tensor>(new Tensor(_meta, _storage));
    } else {
        // shape
        const auto &sh = shape();
        // number of dimensions
        const auto dim = sh.size();

        // build a new contiguous stride array
        std::vector<ptrdiff_t> c_str(dim, 1);
        for (size_t i = dim - 1; i-- > 0;) {
            c_str[i] = c_str[i + 1] * sh[i + 1];
        }

        // allocate new storage on the same device
        size_t bytes = numel() * elementSize();
        core::storage_t st = (deviceType() == LLAISYS_DEVICE_CPU)
                                 ? core::context().runtime().allocateHostStorage(bytes)
                                 : core::context().runtime().allocateDeviceStorage(bytes);

        // create the new contiguous tensor
        tensor_t dst(new Tensor(TensorMeta{dtype(), sh, c_str}, st, 0));

        // 4. copy the data (H2H or H2D depending on device)
        core::context().setDevice(deviceType(), deviceId());
        core::context().runtime().api()->memcpy_sync(
            dst->data(), data(), bytes,
            deviceType() == LLAISYS_DEVICE_CPU ? LLAISYS_MEMCPY_H2H : LLAISYS_MEMCPY_H2D);
```
Comment on lines +62 to +67:

```cpp
case LLAISYS_COMM_IXCCL:
#ifdef ENABLE_ILUVATAR_API
    return llaisys::device::ixccl::getCommAPI();
#else
    return getUnsupportedCommAPI();
#endif
```
Comment on lines +6 to +10:

```cpp
class Resource : public llaisys::device::DeviceResource {
public:
    Resource(int device_id);
    ~Resource();
};
```
Comment on lines +154 to +168:

```lua
elseif has_config("iluvatar-gpu") then
    -- No .cu files in this target, no CUDA toolchain
    -- Use add_shflags to control exact link order:
    -- 1. whole-archive iluvatar static libs (defines nvidia:: symbols)
    -- 2. -lcudart AFTER the .a files (so cudart symbols are resolved)
    add_shflags(
        "-Wl,--whole-archive",
        "build/linux/x86_64/release/libllaisys-ops-iluvatar.a",
        "build/linux/x86_64/release/libllaisys-device-iluvatar.a",
        "-Wl,--no-whole-archive",
        "-L/usr/local/corex/lib64",
        "-Wl,-rpath,/usr/local/corex/lib64",
        "-lcudart",
        {force = true}
    )
```
I. Completed projects
Project 2: NVIDIA + Iluvatar
Project 3: server + frontend + streaming output + session management + KV reuse
Project 4: scheduler + continuous batching + shared model pool + KV-aware routing
Project 5: communication layer + NCCL backend + tensor parallelism
See REPORT.md in the repository root for detailed completion status.