
add AMX INT8 distance kernels with multi-query batch search API #1316

Open
xtangxtang wants to merge 7 commits into volcengine:main from epeshared:int8-amx-only

Conversation

@xtangxtang

Summary

Add Intel AMX (Advanced Matrix Extensions) INT8 acceleration for vector distance computation in OpenViking's native engine backend. This includes two new AMX kernels for inner-product distance, a full-stack multi-query batch search API, and a new MultiAspectRetriever module that leverages batch search for multi-perspective recall.

Motivation

INT8 quantization reduces memory footprint by 4× compared to FP32 while maintaining search quality. Intel AMX provides hardware-accelerated matrix multiply for INT8 data via dedicated tile registers, enabling significant throughput improvements — especially when processing multiple queries simultaneously against the same database vectors.

Changes

1. AMX INT8 Distance Kernels (space_int8.h)

  • batch_inner_product_int8_amx: Single-query AMX kernel using TDPBSSD (signed int8 × signed int8 → int32). Processes up to 16 database vectors simultaneously against one query vector using tile registers.
  • batch_inner_product_int8_amx_multi_query: Multi-query AMX kernel. Computes dot(db[i], query[q]) for i=0..15, q=0..15 in a single tile operation per 64-dim chunk. Both DB vectors and query vectors share the AMX tile pipeline within each chunk iteration.
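The multi-query kernel's contract can be pinned down with a plain-Python reference (illustrative only; the real kernel tiles 64-dim int8 chunks through TDPBSSD and accumulates in int32 tile registers):

```python
def multi_query_inner_product_ref(db, queries):
    """Reference semantics of the multi-query AMX kernel: compute
    dot(db[i], query[q]) for every (i, q) pair. Python ints stand in
    for the kernel's int32 accumulators."""
    return [[sum(d * q for d, q in zip(db_vec, q_vec)) for q_vec in queries]
            for db_vec in db]

# Tiny example: 2 DB vectors x 2 queries, dim 3.
db = [[1, -2, 3], [0, 4, -1]]
queries = [[2, 1, 0], [-1, 0, 2]]
scores = multi_query_inner_product_ref(db, queries)
# scores[i][q] == dot(db[i], queries[q]) -> [[0, 5], [4, -2]]
```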

2. Multi-Query Batch Search API (6-layer stack)

Full search_knn_batch API propagated across all abstraction layers:

| Layer | File | Addition |
| --- | --- | --- |
| C++ Kernel | bruteforce.h | brute_force_knn_batch_int8() with 16-vector tile blocking |
| Vector Base | vector_base.h | search_knn_batch() virtual interface |
| Index Adapter | vector_index_adapter.h | Forwarding to HNSW/flat implementation |
| Index Manager | index_manager_impl.{h,cpp} | Thread-safe batch dispatch |
| Engine | index_engine.{h,cpp} | Public C++ batch API |
| Python Binding | abi3_engine_backend.cpp | pybind11 _index_engine_search_batch |
| Python API | _python_api.py | IndexEngine.search_batch() |
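At the top of the stack, the batch call is semantically N independent top-k searches over the same database. A plain-Python sketch of what the brute-force batch layer returns (names and the (label, score) tuple shape are illustrative, not the actual C++ signature):

```python
def brute_force_knn_batch_ref(db, queries, topk):
    """For each query, score every database vector by inner product
    and keep the topk (label, score) pairs, best score first."""
    out = []
    for q in queries:
        scored = [(label, sum(d * x for d, x in zip(vec, q)))
                  for label, vec in enumerate(db)]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        out.append(scored[:topk])
    return out

db = [[1, 0], [0, 1], [1, 1]]
hits = brute_force_knn_batch_ref(db, queries=[[1, 0]], topk=2)
# hits -> [[(0, 1), (2, 1)]] (labels 0 and 2 tie at score 1;
# the stable sort keeps label order among ties)
```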

3. AMX Build Variant

  • Added amx to x86 build profiles (build_support/x86_profiles.py, CMakeLists.txt)
  • Added AMX/VNNI CPUID detection in abi3_x86_caps.cpp
  • AMX variant compiles with -mamx-tile -mamx-int8 -mavx512vnni flags
  • Runtime variant selection: OV_ENGINE_VARIANT=amx or auto-detected

4. MultiAspectRetriever (openviking/retrieve/multi_aspect_retriever.py)

New production module for multi-perspective recall that naturally utilizes search_batch:

Query → N AspectPrompts → N Embeddings → search_batch(N) → RRF Fusion → top-k
  • AspectPrompt: frozen dataclass defining instruction prompt per aspect
  • embed_multi_aspect(): generates N embedding vectors by prepending different instruction prompts to the same query
  • reciprocal_rank_fusion(): merges N ranked result lists into a single diverse ranking
  • MultiAspectRetriever.retrieve(): end-to-end retrieve with mode="batch"|"serial"
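The fusion step is standard reciprocal rank fusion; a minimal sketch (k=60 is the conventional RRF constant, assumed here rather than read from the module):

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge N ranked lists of ids into one ranking ordered by the
    summed reciprocal-rank score 1 / (k + rank)."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = reciprocal_rank_fusion([["a", "b", "c"], ["b", "c", "a"]])
# "b" wins: its ranks (2, 1) beat "a"'s (1, 3) and "c"'s (3, 2).
```

Documents ranked highly by several aspects accumulate score from each list, which is what produces the "single diverse ranking" described above.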

5. Python Engine Loader Updates

  • Added x86_amx variant to engine module mappings, priority order, and display order
  • Auto-selection prefers AMX when hardware supports it

Performance

Benchmarked on Intel Xeon 6983P-C (Granite Rapids), dim=256, nb=100K, single-thread:

| Scenario (N=16) | AMX Batch Latency | Speedup vs Serial |
| --- | --- | --- |
| Micro-benchmark (kernel only) | 640 µs/query | 3.64× |
| Rerank pipeline (e2e) | 10,172 µs | 3.50× |
| Multi-embedding (e2e) | 10,007 µs | 3.62× |
| Multi-prompt retrieval (e2e) | 10,088 µs | 3.53× |

AMX batch achieves a consistent ~3.5× speedup at N=16 across all scenarios.
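For context, the implied serial baselines follow directly from latency × speedup (assuming the speedup column is serial/batch time on the same workload):

```python
# Implied serial latency (µs) = reported AMX batch latency * reported speedup.
reported = {
    "micro_benchmark": (640, 3.64),
    "rerank_pipeline": (10_172, 3.50),
    "multi_embedding": (10_007, 3.62),
    "multi_prompt": (10_088, 3.53),
}
implied_serial = {name: round(lat * sp) for name, (lat, sp) in reported.items()}
# e.g. the kernel micro-benchmark implies roughly 2,330 µs/query serially.
```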

Platform Requirements

  • Intel CPUs with AMX-INT8 support (Sapphire Rapids / Granite Rapids or later)
  • Linux kernel 5.16+ (for AMX XFEATURE_XTILEDATA permission via arch_prctl)
  • Falls back gracefully to AVX/scalar on unsupported hardware

Files Changed (18 files, +965/-24)

  • build_support/x86_profiles.py — add amx variant
  • src/CMakeLists.txt — AMX build flags
  • src/abi3_x86_caps.cpp — CPUID AMX detection
  • src/abi3_engine_backend.cpp — pybind11 batch binding
  • src/index/detail/vector/common/space_int8.h — AMX kernels
  • src/index/detail/vector/common/bruteforce.h — batch brute-force search
  • src/index/detail/vector/common/vector_base.h — AMX SIMD define
  • src/index/detail/vector/vector_index_adapter.h — adapter forwarding
  • src/index/detail/index_manager_impl.{h,cpp} — manager batch API
  • src/index/index_engine.{h,cpp} — engine batch API
  • src/index/index_manager.h — abstract batch interface
  • openviking/storage/vectordb/engine/__init__.py — variant mappings
  • openviking/storage/vectordb/engine/_python_api.py — Python batch API
  • openviking/retrieve/multi_aspect_retriever.py — new module
  • setup.py, pyproject.toml — build config


xtangxtang and others added 7 commits April 3, 2026 15:49
# Conflicts:
#	openviking/models/embedder/openai_embedders.py
#	openviking/server/config.py
#	openviking/server/routers/sessions.py
#	openviking/storage/collection_schemas.py
#	openviking/telemetry/resource_summary.py
#	openviking_cli/utils/config/embedding_config.py
#	tests/server/test_auth.py
… API

- Add 3 new INT8 inner-product kernels: AVX-512 VNNI, AMX single-query, AMX multi-query batch
- Implement search_knn_batch across C++/pybind/Python layers (6-layer API)
- Add MultiAspectRetriever for multi-prompt embedding with RRF fusion
- Add avx512_vnni and amx x86 build profiles
- AMX batch achieves ~3.5x speedup over serial at N=16 across all scenarios
…ation

- Remove avx512_vnni build variant and inner_product_int8_avx512_vnni kernel
- Single-query INT8 dispatch now uses batch_inner_product_int8_amx with num_vecs=1
- Remove _x86_avx512_vnni from Python engine variant mappings
- Build variants: sse3, avx2, avx512, amx
@CLAassistant

CLAassistant commented Apr 9, 2026

CLA assistant check
All committers have signed the CLA.

@github-actions

github-actions bot commented Apr 9, 2026

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🏅 Score: 73
🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: AMX INT8 kernels + multi-query batch search backend

Relevant files:

  • src/abi3_engine_backend.cpp
  • src/abi3_x86_caps.cpp
  • src/CMakeLists.txt
  • src/index/detail/index_manager_impl.cpp
  • src/index/detail/index_manager_impl.h
  • src/index/detail/vector/common/bruteforce.h
  • src/index/detail/vector/common/space_int8.h
  • src/index/detail/vector/common/vector_base.h
  • src/index/detail/vector/vector_index_adapter.h
  • src/index/index_engine.cpp
  • src/index/index_engine.h
  • src/index/index_manager.h
  • openviking/storage/vectordb/engine/__init__.py
  • openviking/storage/vectordb/engine/_python_api.py

Sub-PR theme: MultiAspectRetriever Python module

Relevant files:

  • openviking/retrieve/multi_aspect_retriever.py

Sub-PR theme: Telemetry, tests, and server warning updates

Relevant files:

  • openviking/telemetry/operation.py
  • tests/unit/test_embedding_config_dimension.py
  • openviking/server/app.py

⚡ Recommended focus areas for review

Batch Search Ignores Per-Request DSL & TopK

The search_batch implementation only uses the first request's dsl filter and topk value, silently ignoring these parameters for all other requests in the batch. This causes incorrect results when different requests in the batch have distinct filters or top-k counts.

int IndexManagerImpl::search_batch(const std::vector<SearchRequest>& reqs,
                                   std::vector<SearchResult>& results) {
  if (reqs.empty()) return 0;

  // Use first request's DSL for all queries (batch assumes shared filter)
  const auto& dsl = reqs[0].dsl;
  SearchContext ctx;
  if (!dsl.empty()) {
    if (int ret = parse_dsl_query(dsl, ctx); ret != 0) {
      SPDLOG_ERROR("IndexManagerImpl::search_batch DSL parse fail: {}", dsl);
      return ret;
    }
  }

  // Sorter queries are not supported in batch mode
  if (ctx.sorter_op) {
    return IndexManager::search_batch(reqs, results);
  }

  std::shared_lock<std::shared_mutex> lock(rw_mutex_);

  BitmapPtr bitmap = nullptr;
  if (ctx.filter_op) {
    bitmap = calculate_filter_bitmap(ctx, dsl);
    if (!bitmap) {
      SPDLOG_DEBUG("search_batch: calculate_filter_bitmap returned null");
      return -1;
    }
  }

  // Extract query vectors
  std::vector<const float*> query_ptrs(reqs.size());
  for (size_t i = 0; i < reqs.size(); ++i) {
    query_ptrs[i] = reqs[i].query.data();
  }
  uint32_t topk = reqs[0].topk;

  std::vector<VectorRecallResult> recall_results;
  int ret = vector_index_->recall_batch(query_ptrs, topk, bitmap.get(),
                                        recall_results);
  if (ret != 0) {
    SPDLOG_ERROR("search_batch: recall_batch failed, ret={}", ret);
    return ret;
  }

  results.resize(reqs.size());
  for (size_t i = 0; i < reqs.size(); ++i) {
    std::swap(results[i].labels, recall_results[i].labels);
    std::swap(results[i].scores, recall_results[i].scores);
    results[i].result_num = results[i].labels.size();
  }

  return 0;
}
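One possible mitigation, sketched here in Python for clarity (illustrative only, not the project's C++ API): verify the batch's shared-parameter precondition up front and route non-uniform batches to the per-request serial path instead of silently using the first request's values.

```python
def batch_is_uniform(requests):
    """True only when every request shares the same dsl filter and topk,
    the precondition the batched fast path silently assumes."""
    if not requests:
        return True
    first = (requests[0]["dsl"], requests[0]["topk"])
    return all((r["dsl"], r["topk"]) == first for r in requests)

reqs = [{"dsl": "", "topk": 10}, {"dsl": "", "topk": 5}]
# batch_is_uniform(reqs) -> False: topk differs, so this batch should
# fall back to serial per-request search.
```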
AMX Permission Requested Once Per Process, Not Per Thread

The ensure_amx_permission function uses a static flag to request AMX tile permissions only once per process. However, AMX permissions must be requested per thread on Linux. This will cause crashes or undefined behavior in multi-threaded environments when threads other than the first use AMX instructions.

inline void ensure_amx_permission() {
  static bool requested = false;
  if (!requested) {
#if defined(__linux__)
    // ARCH_REQ_XCOMP_PERM = 0x1023, XFEATURE_XTILEDATA = 18
    syscall(SYS_arch_prctl, 0x1023, 18);
#endif
    requested = true;
  }
}
Batch AMX Path Skips Sparse Index Blending

The AMX-accelerated batch search path computes scores using only dense distance, skipping the finalize_score function that adds sparse index blending. This leads to incorrect results when sparse indexes are enabled, as sparse contributions are completely omitted.

float score =
    reverse_query_score_ ? (1.0f - raw_dist) : raw_dist;

@github-actions

github-actions bot commented Apr 9, 2026

PR Code Suggestions ✨

No code suggestions found for the PR.

@zhoujh01 zhoujh01 self-assigned this Apr 9, 2026
@qin-ctx qin-ctx requested a review from zhoujh01 April 9, 2026 03:08
@zhoujh01
Collaborator

zhoujh01 commented Apr 9, 2026

Thank you for your code contribution. I'll review it.


Projects

Status: Backlog


3 participants