add embedding timing instrumentation and fix dimension auto-detection #1299
xtangxtang wants to merge 7 commits into volcengine:main from
# Conflicts:
# openviking/models/embedder/openai_embedders.py
# openviking/server/config.py
# openviking/server/routers/sessions.py
# openviking/storage/collection_schemas.py
# openviking/telemetry/resource_summary.py
# openviking_cli/utils/config/embedding_config.py
# tests/server/test_auth.py
… API
- Add 3 new INT8 inner-product kernels: AVX-512 VNNI, AMX single-query, AMX multi-query batch
- Implement search_knn_batch across C++/pybind/Python layers (6-layer API)
- Add MultiAspectRetriever for multi-prompt embedding with RRF fusion
- Add avx512_vnni and amx x86 build profiles
- AMX batch achieves ~3.5x speedup over serial at N=16 across all scenarios
Summary
This PR adds fine-grained embedding telemetry instrumentation across the request pipeline and fixes a dimension auto-detection bug when the embedding model dimension is not explicitly configured.
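As a rough illustration of the timing instrumentation summarized above, the sketch below wraps an embedding call with `time.perf_counter()` and records the duration on both the success and error paths. The metric names are taken from this PR description; the `METRICS` store, `record_embedding_duration()`, and `timed_embed()` are hypothetical stand-ins and do not reproduce the actual code in `openai_embedders.py`.

```python
import time
from collections import defaultdict

# Hypothetical in-memory stand-in for the telemetry system. Only the metric
# names come from the PR description; the rest is illustrative.
METRICS = defaultdict(float)


def record_embedding_duration(elapsed_ms: float, *, failed: bool = False) -> None:
    """Accumulate one embedding request into the counters."""
    METRICS["embedding.duration_ms"] += elapsed_ms
    METRICS["embedding.requests"] += 1
    if failed:
        METRICS["embedding.error_count"] += 1


def timed_embed(embed_fn, text: str):
    """Call embed_fn(text) and record its duration, even when it raises."""
    start = time.perf_counter()
    failed = False
    try:
        return embed_fn(text)
    except Exception:
        failed = True
        raise  # the caller still sees the original error
    finally:
        # Runs on both the success and the error path.
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        record_embedding_duration(elapsed_ms, failed=failed)
```

Recording in a `finally` block is one way to make sure failed requests still contribute to `embedding.requests` and `embedding.duration_ms`, so slow failures are not hidden from the totals.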
Changes
1. Embedding Telemetry Instrumentation (`openai_embedders.py`)
- Add `_record_embedding_duration()` helper to track `embedding.duration_ms`, `embedding.requests`, and `embedding.error_count` via the telemetry system.
- Wrap `_embed_single()` and `embed_batch()` with `time.perf_counter()` timing, including error paths.
2. Session Extract with Synchronous Wait (`sessions.py`)
- Add `ExtractSessionRequest` model with `wait: bool` and `timeout: Optional[float]` parameters.
- When `wait=True`, the `/extract` endpoint blocks until the embedding queue drains, then reports `queue.wait.duration_ms` and `session.extract.request.duration_ms` in telemetry.
3. Queue Stats Enhancement (`collection_schemas.py`)
- Extend `RequestQueueStats` with `duration_ms`, `wall_start_ms`, and `wall_end_ms` fields for accurate wall-clock duration tracking with concurrent embedding requests.
4. Telemetry Summary Enrichment (`operation.py`, `resource_summary.py`)
- Add `embedding` section with `duration_ms`, `wall_duration_ms`, `requests`, `error_count`, `avg_duration_ms`, and `share_of_total_pct`.
- Also includes `duration_ms` and `wall_duration_ms`.
5. Server Config: Relaxed Non-Localhost Check (`config.py`, `app.py`)
- Change `validate_server_config()` from `sys.exit(1)` to `logger.warning` when no `root_api_key` is configured with a non-localhost bind address.
6. Embedding Dimension Auto-Detection (`embedding_config.py`)
- `get_dimension()` previously returned a hardcoded `2048` when `embedding.dense.dimension` was omitted. This caused a dimension mismatch with models like Qwen3-Embedding-0.6B (1024-dim) served via SGLang.
- Add `PrivateAttr _resolved_dimension` for lazy detection via `get_embedder().get_dimension()` with caching. The config's `dense.dimension` field is not mutated.
7. Tests
- Update `test_validate_no_key_non_localhost_warns` to match the new warning behavior.
- Add `test_embedding_config_dimension.py` with 2 regression tests for dimension auto-detection.
Motivation
When benchmarking OpenViking with different embedding backends (SGLang, Ollama) on different hardware (Intel GNR with AMX, AMD Turin), we needed fine-grained telemetry to identify performance bottlenecks. The existing telemetry tracked only token counts, not embedding latency. Additionally, the hardcoded 2048 dimension fallback caused silent failures with embedding models whose output dimension is not 2048.
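To make the dimension fallback problem concrete, here is a minimal sketch of lazy dimension resolution with caching. It is not the code in `embedding_config.py`: it uses a plain dataclass instead of a Pydantic `PrivateAttr`, and `DenseConfig`, `EmbeddingConfig`, and `probe_dimension` are hypothetical names chosen for illustration. The idea is to ask the embedder for its real output size once when no dimension is configured, cache the answer privately, and leave the user-facing `dimension` field untouched.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class DenseConfig:
    model: str
    dimension: Optional[int] = None  # None means "not configured"


@dataclass
class EmbeddingConfig:
    dense: DenseConfig
    # Hypothetical probe that asks the live embedder for its output size.
    probe_dimension: Callable[[], int]
    # Private cache; plays the role that `_resolved_dimension` has in the PR.
    _resolved_dimension: Optional[int] = field(default=None, init=False, repr=False)

    def get_dimension(self) -> int:
        # An explicitly configured dimension always wins.
        if self.dense.dimension is not None:
            return self.dense.dimension
        # Otherwise detect once and cache, without mutating dense.dimension.
        if self._resolved_dimension is None:
            self._resolved_dimension = self.probe_dimension()
        return self._resolved_dimension
```

With a probe that reports 1024 for a model such as Qwen3-Embedding-0.6B, `get_dimension()` would return 1024 rather than a fixed 2048, and repeated calls would reuse the cached value instead of contacting the backend again.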
Testing