⚡️ Speed up method VectorIndexAutoRetriever.generate_retrieval_spec by 12%
#135
📄 12% (0.12x) speedup for `VectorIndexAutoRetriever.generate_retrieval_spec` in `llama-index-core/llama_index/core/indices/vector_store/retrievers/auto_retriever/auto_retriever.py`

⏱️ Runtime: 149 milliseconds → 133 milliseconds (best of 26 runs)

📝 Explanation and details
The optimization achieves a 12% speedup by implementing a simple but effective caching strategy for JSON schema generation in the `VectorIndexAutoRetriever` class.

Key Optimization Applied:
A `@lru_cache(maxsize=8)` decorator is applied to a new static method `_cached_schema_json()` that caches the result of `VectorStoreQuerySpec.schema_json(indent=4)`. This eliminates repeated, expensive Pydantic schema serialization calls.

Why This Works:
The line profiler shows that `VectorStoreQuerySpec.schema_json(indent=4)` in `generate_retrieval_spec` consumed 10.7% of total runtime (79.96 ms out of 750 ms) in the original code. That call regenerates the same JSON schema string for every query, which is pure computational waste since the schema is static. With the result cached, the optimized version reduces this overhead to just 0.4% (3.12 ms), saving roughly 77 ms per batch of queries.

Performance Impact:
The optimization is particularly valuable because `generate_retrieval_spec` is likely called frequently in production vector search scenarios, where many queries are processed against the same vector store schema. The LRU cache with `maxsize=8` provides good coverage for different indentation levels while keeping memory overhead minimal.

✅ Correctness verification report:
🌀 Generated Regression Tests and Runtime
from dataclasses import dataclass, field
from typing import Any, List, Optional

# imports
import pytest
from llama_index.core.indices.vector_store.retrievers.auto_retriever.auto_retriever import (
    VectorIndexAutoRetriever,
)

# --- Minimal stubs and helpers for dependencies ---

# Simulate QueryBundle
@dataclass
class QueryBundle:
    query_str: str

# Simulate VectorStoreInfo
@dataclass
class VectorStoreInfo:
    description: str
    supported_filters: List[str] = field(default_factory=list)

# Simulate LLM
class DummyLLM:
    def __init__(self, responses):
        self.responses = responses
        self.call_args = []

# --- Unit Tests ---

# 1. Basic Test Cases
# ------------------------------------------------
from dataclasses import dataclass, field
from typing import Any, List, Optional

# imports
import pytest
from llama_index.core.indices.vector_store.retrievers.auto_retriever.auto_retriever import (
    VectorIndexAutoRetriever,
)

# Simulate VectorStoreInfo (pydantic model)
@dataclass
class VectorStoreInfo:
    description: str = "A vector store for testing."
    supported_metadata_filters: List[str] = field(default_factory=list)

# Simulate QueryBundle
@dataclass
class QueryBundle:
    query_str: str

# Simulate LLM
class DummyLLM:
    def __init__(self, responses=None):
        # responses: dict mapping query_str to output string
        self.responses = responses or {}

# Dummy index for constructor
class DummyIndex:
    service_context = None
    _object_map = {}

# ------------------- UNIT TESTS -------------------

# Basic Test Cases
def test_basic_valid_json_output():
"""Test with a simple valid query and well-formed JSON output."""
llm = DummyLLM(responses={
"hello": '{"query": "hello", "filters": [], "top_k": 7}'
})
retriever = VectorIndexAutoRetriever(
index=DummyIndex(),
vector_store_info=VectorStoreInfo(),
llm=llm
)
codeflash_output = retriever.generate_retrieval_spec(QueryBundle(query_str="hello")); spec = codeflash_output # 104μs -> 25.8μs (306% faster)
def test_basic_with_filters_and_top_k():
"""Test with filters and top_k provided."""
llm = DummyLLM(responses={
"search": '{"query": "search", "filters": [{"key": "author", "value": "Alice"}], "top_k": 3}'
})
retriever = VectorIndexAutoRetriever(
index=DummyIndex(),
vector_store_info=VectorStoreInfo(),
llm=llm
)
codeflash_output = retriever.generate_retrieval_spec(QueryBundle(query_str="search")); spec = codeflash_output # 134μs -> 54.6μs (146% faster)
# Edge Test Cases
def test_missing_top_k_in_output():
"""Test with missing top_k in the output JSON."""
llm = DummyLLM(responses={
"foo": '{"query": "foo", "filters": []}'
})
retriever = VectorIndexAutoRetriever(
index=DummyIndex(),
vector_store_info=VectorStoreInfo(),
llm=llm
)
codeflash_output = retriever.generate_retrieval_spec(QueryBundle(query_str="foo")); spec = codeflash_output # 92.4μs -> 19.3μs (379% faster)
def test_extra_fields_in_json_output():
"""Test with extra fields in the output JSON (should ignore extras)."""
llm = DummyLLM(responses={
"extra": '{"query": "extra", "filters": [], "top_k": 2, "unused": "value"}'
})
retriever = VectorIndexAutoRetriever(
index=DummyIndex(),
vector_store_info=VectorStoreInfo(),
llm=llm
)
codeflash_output = retriever.generate_retrieval_spec(QueryBundle(query_str="extra")); spec = codeflash_output # 106μs -> 25.9μs (310% faster)
def test_large_filters_list():
"""Test with a large number of filters (edge of allowed size)."""
filters = [{"key": f"k{i}", "value": f"v{i}"} for i in range(1000)]
import json
llm = DummyLLM(responses={
"largefilters": json.dumps({"query": "largefilters", "filters": filters, "top_k": 999})
})
retriever = VectorIndexAutoRetriever(
index=DummyIndex(),
vector_store_info=VectorStoreInfo(),
llm=llm
)
codeflash_output = retriever.generate_retrieval_spec(QueryBundle(query_str="largefilters")); spec = codeflash_output # 11.4ms -> 11.0ms (2.73% faster)
def test_top_k_greater_than_max_top_k():
"""Test with top_k greater than max_top_k (should not clamp, just return)."""
llm = DummyLLM(responses={
"bigk": '{"query": "bigk", "filters": [], "top_k": 100}'
})
retriever = VectorIndexAutoRetriever(
index=DummyIndex(),
vector_store_info=VectorStoreInfo(),
llm=llm,
max_top_k=10
)
codeflash_output = retriever.generate_retrieval_spec(QueryBundle(query_str="bigk")); spec = codeflash_output # 105μs -> 25.1μs (322% faster)
# Large Scale Test Cases
def test_many_queries_sequential():
"""Test generating specs for many queries in sequence (scalability)."""
llm = DummyLLM(responses={
f"q{i}": f'{{"query": "q{i}", "filters": [], "top_k": {i % 10}}}' for i in range(100)
})
retriever = VectorIndexAutoRetriever(
index=DummyIndex(),
vector_store_info=VectorStoreInfo(),
llm=llm
)
for i in range(100):
codeflash_output = retriever.generate_retrieval_spec(QueryBundle(query_str=f"q{i}")); spec = codeflash_output # 6.31ms -> 746μs (745% faster)
def test_large_output_json():
"""Test with very large output JSON (filters list near the limit)."""
filters = [{"key": f"k{i}", "value": f"v{i}"} for i in range(999)]
import json
llm = DummyLLM(responses={
"huge": json.dumps({"query": "huge", "filters": filters, "top_k": 999})
})
retriever = VectorIndexAutoRetriever(
index=DummyIndex(),
vector_store_info=VectorStoreInfo(),
llm=llm
)
codeflash_output = retriever.generate_retrieval_spec(QueryBundle(query_str="huge")); spec = codeflash_output # 11.2ms -> 10.9ms (3.13% faster)
def test_performance_many_large_specs():
"""Test performance with many large specs (no more than 100)."""
filters = [{"key": f"k{i}", "value": f"v{i}"} for i in range(100)]
import json
llm = DummyLLM(responses={
f"big{i}": json.dumps({"query": f"big{i}", "filters": filters, "top_k": 100}) for i in range(100)
})
retriever = VectorIndexAutoRetriever(
index=DummyIndex(),
vector_store_info=VectorStoreInfo(),
llm=llm
)
for i in range(100):
codeflash_output = retriever.generate_retrieval_spec(QueryBundle(query_str=f"big{i}")); spec = codeflash_output # 119ms -> 109ms (8.81% faster)
`codeflash_output` is used to check that the output of the original code is the same as that of the optimized code.
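The equivalence check described above can be sketched as a small harness. Both functions below are hypothetical stand-ins, not the library's API: `schema_json_original` mimics the pre-optimization path that reserializes on every call, while `schema_json_optimized` applies the cached pattern from the PR; the test asserts that both yield byte-identical output.

```python
import json
from functools import lru_cache

# Hypothetical static schema standing in for VectorStoreQuerySpec's schema.
_SCHEMA = {"title": "VectorStoreQuerySpec", "type": "object"}


def schema_json_original(indent: int = 4) -> str:
    # Pre-optimization behavior: serialize the schema on every call.
    return json.dumps(_SCHEMA, indent=indent)


@lru_cache(maxsize=8)
def schema_json_optimized(indent: int = 4) -> str:
    # Optimized behavior: serialize once per indent level, then reuse.
    return json.dumps(_SCHEMA, indent=indent)


# The correctness criterion: identical strings for every indent tested.
for indent in (2, 4):
    assert schema_json_original(indent) == schema_json_optimized(indent)
```

An optimization like this only survives review because the cached function is deterministic and its input (the schema) never changes between calls, so caching cannot alter observable behavior.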
To edit these changes, run `git checkout codeflash/optimize-VectorIndexAutoRetriever.generate_retrieval_spec-mhveosh3` and push.