Skip to content

add embedding timing instrumentation and fix dimension auto-detection#1299

Open
xtangxtang wants to merge 7 commits intovolcengine:mainfrom
epeshared:main
Open

add embedding timing instrumentation and fix dimension auto-detection#1299
xtangxtang wants to merge 7 commits intovolcengine:mainfrom
epeshared:main

Conversation

@xtangxtang
Copy link
Copy Markdown

@xtangxtang xtangxtang commented Apr 8, 2026

Summary

This PR adds fine-grained embedding telemetry instrumentation across the request pipeline and fixes a dimension auto-detection bug when the embedding model dimension is not explicitly configured.

Changes

1. Embedding Telemetry Instrumentation (openai_embedders.py)

  • Added _record_embedding_duration() helper to track embedding.duration_ms, embedding.requests, and embedding.error_count via the telemetry system.
  • Instrumented both _embed_single() and embed_batch() with time.perf_counter() timing, including error paths.

2. Session Extract with Synchronous Wait (sessions.py)

  • Added ExtractSessionRequest model with wait: bool and timeout: Optional[float] parameters.
  • When wait=True, the /extract endpoint blocks until the embedding queue drains, then reports queue.wait.duration_ms and session.extract.request.duration_ms in telemetry.

3. Queue Stats Enhancement (collection_schemas.py)

  • Extended RequestQueueStats with duration_ms, wall_start_ms, and wall_end_ms fields for accurate wall-clock duration tracking with concurrent embedding requests.

4. Telemetry Summary Enrichment (operation.py, resource_summary.py)

  • Added embedding section with duration_ms, wall_duration_ms, requests, error_count, avg_duration_ms, and share_of_total_pct.
  • Queue stats now include duration_ms and wall_duration_ms.

5. Server Config: Relaxed Non-Localhost Check (config.py, app.py)

  • Changed validate_server_config() from sys.exit(1) to logger.warning when no root_api_key is configured with a non-localhost bind address.
  • Allows flexible deployment scenarios (Docker, internal networks) while still warning about security implications.

6. Embedding Dimension Auto-Detection (embedding_config.py)

  • Bug fix: get_dimension() previously returned hardcoded 2048 when embedding.dense.dimension was omitted. This caused dimension mismatch with models like Qwen3-Embedding-0.6B (1024-dim) served via SGLang.
  • Now uses PrivateAttr _resolved_dimension for lazy detection via get_embedder().get_dimension() with caching. The config's dense.dimension field is not mutated.

7. Tests

  • Updated test_validate_no_key_non_localhost_warns to match new warning behavior.
  • Added test_embedding_config_dimension.py with 2 regression tests for dimension auto-detection.

Motivation

When benchmarking OpenViking with different embedding backends (SGLang, Ollama) on different hardware (Intel GNR with AMX, AMD Turin), we needed fine-grained telemetry to identify performance bottlenecks. The existing telemetry only tracked token counts but not embedding latency. Additionally, the hardcoded 2048 dimension fallback caused silent failures with non-standard embedding models.

Testing

  • All existing tests pass.
  • New unit tests added for dimension auto-detection.
  • Auth test updated to match new behavior.
  • Manually verified telemetry output in production-like benchmarks.

@CLAassistant
Copy link
Copy Markdown

CLAassistant commented Apr 8, 2026

CLA assistant check
All committers have signed the CLA.

@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 8, 2026

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🏅 Score: 85
🧪 PR contains tests
🔒 Security concerns

Security Configuration Change:
The server no longer prevents binding to non-loopback addresses without authentication, exposing an unauthenticated ROOT endpoint to the network.

✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: Update server security warning behavior for non-localhost binds

Relevant files:

  • openviking/server/config.py
  • openviking/server/app.py
  • tests/server/test_auth.py

Sub-PR theme: Fix embedding dimension auto-detection and caching

Relevant files:

  • openviking_cli/utils/config/embedding_config.py
  • tests/unit/test_embedding_config_dimension.py

Sub-PR theme: Add embedding timing instrumentation and queue wait telemetry

Relevant files:

  • openviking/models/embedder/openai_embedders.py
  • openviking/storage/collection_schemas.py
  • openviking/telemetry/operation.py
  • openviking/telemetry/resource_summary.py
  • openviking/server/routers/sessions.py

⚡ Recommended focus areas for review

Security Configuration Change

The server will now start when bound to non-loopback addresses without a root_api_key (removed sys.exit(1)). This exposes an unauthenticated ROOT endpoint to the network, which was previously blocked. The change is not mentioned in the PR title/description.

if not _is_localhost(config.host):
    logger.warning(
        "SECURITY WARNING: server.root_api_key is not configured and "
        "server.host is '%s' (non-localhost). This exposes an unauthenticated "
        "ROOT endpoint to the network.",
        config.host,
    )
    logger.warning(
        "Set server.root_api_key in ov.conf if you need authentication for "
        "network-accessible deployments."
    )
Division by Zero Guard

The embedding avg_duration_ms calculation uses embedding_duration_ms / embedding_requests. While there's a check for embedding_requests > 0, if embedding_requests is zero but embedding_duration_ms is non-zero, avg_duration_ms will be 0.0 which might be misleading, though not a crash.

"avg_duration_ms": round(embedding_duration_ms / embedding_requests, 3)
if embedding_requests
else 0.0,
Telemetry Metric Usage

The _record_embedding_duration function uses telemetry.count("embedding.duration_ms", duration_ms). If count is intended for incrementing by integer counts (not accumulating durations), this could misuse the telemetry API. Verify telemetry.count supports float values for duration accumulation.

def _record_embedding_duration(duration_ms: float, request_count: int = 1) -> None:
    telemetry = get_current_telemetry()
    telemetry.count("embedding.duration_ms", duration_ms)
    telemetry.count("embedding.requests", request_count)

@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 8, 2026

PR Code Suggestions ✨

No code suggestions found for the PR.

xtangxtang and others added 2 commits April 8, 2026 15:24
# Conflicts:
#	openviking/models/embedder/openai_embedders.py
#	openviking/server/config.py
#	openviking/server/routers/sessions.py
#	openviking/storage/collection_schemas.py
#	openviking/telemetry/resource_summary.py
#	openviking_cli/utils/config/embedding_config.py
#	tests/server/test_auth.py
@qin-ctx qin-ctx requested a review from zhoujh01 April 8, 2026 11:17
xtangxtang and others added 3 commits April 9, 2026 10:23
… API

- Add 3 new INT8 inner-product kernels: AVX-512 VNNI, AMX single-query, AMX multi-query batch
- Implement search_knn_batch across C++/pybind/Python layers (6-layer API)
- Add MultiAspectRetriever for multi-prompt embedding with RRF fusion
- Add avx512_vnni and amx x86 build profiles
- AMX batch achieves ~3.5x speedup over serial at N=16 across all scenarios
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: Backlog

Development

Successfully merging this pull request may close these issues.

3 participants