
Conversation

@codeflash-ai codeflash-ai bot commented Nov 12, 2025

📄 12% (0.12x) speedup for VectorIndexAutoRetriever.generate_retrieval_spec in llama-index-core/llama_index/core/indices/vector_store/retrievers/auto_retriever/auto_retriever.py

⏱️ Runtime : 149 milliseconds → 133 milliseconds (best of 26 runs)

📝 Explanation and details

The optimization achieves a 12% speedup by implementing a simple but effective caching strategy for JSON schema generation in the VectorIndexAutoRetriever class.

Key Optimization Applied:

  1. LRU Cache for Schema Generation: Added @lru_cache(maxsize=8) decorator to a new static method _cached_schema_json() that caches the result of VectorStoreQuerySpec.schema_json(indent=4). This eliminates repeated expensive Pydantic schema serialization calls (see the sketch below).
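
For illustration, a minimal sketch of the cached helper and how generate_retrieval_spec would use it. The method name _cached_schema_json is taken from the description above; the import path for VectorStoreQuerySpec and the elided surrounding code are assumptions, not the exact diff.

from functools import lru_cache

from llama_index.core.vector_stores.types import VectorStoreQuerySpec


class VectorIndexAutoRetriever:  # base classes and other methods elided
    @staticmethod
    @lru_cache(maxsize=8)
    def _cached_schema_json(indent: int = 4) -> str:
        # The VectorStoreQuerySpec schema is static, so the serialized JSON
        # string only needs to be computed once per indent value.
        return VectorStoreQuerySpec.schema_json(indent=indent)

    def generate_retrieval_spec(self, query_bundle, **kwargs):
        # Reuse the cached schema string instead of re-serializing it on
        # every query.
        schema_str = self._cached_schema_json(4)
        ...  # build the prompt with schema_str and call the LLM as before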

Why This Works:

The line profiler reveals that VectorStoreQuerySpec.schema_json(indent=4) in generate_retrieval_spec was consuming 10.7% of total runtime (79.96ms out of 750ms) in the original code. This call generates the same JSON schema string repeatedly for every query, which is pure computational waste since the schema is static.

By caching this result, the optimized version reduces this overhead to just 0.4% (3.12ms), saving approximately 77ms per batch of queries.
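
For a rough local reproduction of that cost, the schema serialization call can be timed in isolation. This is a sketch, not part of the change; the import path is assumed, and absolute timings depend on the machine and the installed Pydantic version.

import timeit

from llama_index.core.vector_stores.types import VectorStoreQuerySpec

# Every call re-serializes the same static schema from scratch.
elapsed = timeit.timeit(lambda: VectorStoreQuerySpec.schema_json(indent=4), number=1000)
print(f"1000 schema_json calls took {elapsed:.3f}s")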

Performance Impact:

  • Best for repeated queries: Test results show dramatic improvements for sequential operations - up to 745% faster for 100 sequential queries and 379% faster for simple repeated operations
  • Minimal impact on single queries: Even single queries benefit from the reduced schema generation overhead
  • Scalable benefit: The more queries processed in sequence, the greater the relative speedup due to cache hits

The optimization is particularly valuable since generate_retrieval_spec is likely called frequently in production vector search scenarios where multiple queries are processed against the same vector store schema. The LRU cache with maxsize=8 provides good coverage for different indentation levels while keeping memory overhead minimal.
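
As a quick way to confirm the cache is being hit, functools.lru_cache exposes cache_info() on the wrapped function. This is hypothetical usage of the helper sketched above, not output from the PR.

# First call is a miss; repeated calls with the same indent are hits.
VectorIndexAutoRetriever._cached_schema_json(4)
VectorIndexAutoRetriever._cached_schema_json(4)
print(VectorIndexAutoRetriever._cached_schema_json.cache_info())
# e.g. CacheInfo(hits=1, misses=1, maxsize=8, currsize=1)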

Correctness verification report:

Test Status
⚙️ Existing Unit Tests 🔘 None Found
🌀 Generated Regression Tests 221 Passed
⏪ Replay Tests 🔘 None Found
🔎 Concolic Coverage Tests 🔘 None Found
📊 Tests Coverage 100.0%
🌀 Generated Regression Tests and Runtime

from dataclasses import dataclass, field
from typing import Any, List, Optional

# imports

import pytest
from llama_index.core.indices.vector_store.retrievers.auto_retriever.auto_retriever import (
    VectorIndexAutoRetriever,
)

# --- Minimal stubs and helpers for dependencies ---

# Simulate QueryBundle
@dataclass
class QueryBundle:
    query_str: str

# Simulate VectorStoreInfo
@dataclass
class VectorStoreInfo:
    description: str
    supported_filters: List[str] = field(default_factory=list)

    def json(self, indent=4):
        return (
            '{\n'
            f'    "description": "{self.description}",\n'
            f'    "supported_filters": {self.supported_filters}\n'
            '}'
        )

# Simulate LLM
class DummyLLM:
    def __init__(self, responses):
        self.responses = responses
        self.call_args = []

    def predict(self, prompt, **prompt_args):
        # Save call args for inspection
        self.call_args.append((prompt, prompt_args))
        # Return the next response in the list, or the last one repeatedly
        if len(self.responses) > 1:
            return self.responses.pop(0)
        return self.responses[0]

# --- Unit Tests ---

# 1. Basic Test Cases

#------------------------------------------------
from dataclasses import dataclass, field
from typing import Any, List, Optional

# imports

import pytest
from llama_index.core.indices.vector_store.retrievers.auto_retriever.auto_retriever import (
    VectorIndexAutoRetriever,
)

# Simulate VectorStoreInfo (pydantic model)
@dataclass
class VectorStoreInfo:
    description: str = "A vector store for testing."
    supported_metadata_filters: List[str] = field(default_factory=list)

    def json(self, indent=4):
        return '{"description": "%s"}' % self.description

# Simulate QueryBundle
@dataclass
class QueryBundle:
    query_str: str

# Simulate LLM
class DummyLLM:
    def __init__(self, responses=None):
        # responses: dict mapping query_str to output string
        self.responses = responses or {}

    def predict(self, prompt, schema_str, info_str, query_str):
        # Return the output string for the given query_str
        if query_str in self.responses:
            return self.responses[query_str]
        # Default: return a valid JSON string
        import json
        return json.dumps({
            "query": query_str,
            "filters": [],
            "top_k": 5
        })

# Dummy index for constructor
class DummyIndex:
    service_context = None
    _object_map = {}

# ------------------- UNIT TESTS -------------------

# Basic Test Cases

def test_basic_valid_json_output():
    """Test with a simple valid query and well-formed JSON output."""
    llm = DummyLLM(responses={
        "hello": '{"query": "hello", "filters": [], "top_k": 7}'
    })
    retriever = VectorIndexAutoRetriever(
        index=DummyIndex(),
        vector_store_info=VectorStoreInfo(),
        llm=llm
    )
    codeflash_output = retriever.generate_retrieval_spec(QueryBundle(query_str="hello")); spec = codeflash_output  # 104μs -> 25.8μs (306% faster)

def test_basic_with_filters_and_top_k():
    """Test with filters and top_k provided."""
    llm = DummyLLM(responses={
        "search": '{"query": "search", "filters": [{"key": "author", "value": "Alice"}], "top_k": 3}'
    })
    retriever = VectorIndexAutoRetriever(
        index=DummyIndex(),
        vector_store_info=VectorStoreInfo(),
        llm=llm
    )
    codeflash_output = retriever.generate_retrieval_spec(QueryBundle(query_str="search")); spec = codeflash_output  # 134μs -> 54.6μs (146% faster)

# Edge Test Cases

def test_missing_top_k_in_output():
    """Test with missing top_k in the output JSON."""
    llm = DummyLLM(responses={
        "foo": '{"query": "foo", "filters": []}'
    })
    retriever = VectorIndexAutoRetriever(
        index=DummyIndex(),
        vector_store_info=VectorStoreInfo(),
        llm=llm
    )
    codeflash_output = retriever.generate_retrieval_spec(QueryBundle(query_str="foo")); spec = codeflash_output  # 92.4μs -> 19.3μs (379% faster)

def test_extra_fields_in_json_output():
    """Test with extra fields in the output JSON (should ignore extras)."""
    llm = DummyLLM(responses={
        "extra": '{"query": "extra", "filters": [], "top_k": 2, "unused": "value"}'
    })
    retriever = VectorIndexAutoRetriever(
        index=DummyIndex(),
        vector_store_info=VectorStoreInfo(),
        llm=llm
    )
    codeflash_output = retriever.generate_retrieval_spec(QueryBundle(query_str="extra")); spec = codeflash_output  # 106μs -> 25.9μs (310% faster)

def test_large_filters_list():
    """Test with a large number of filters (edge of allowed size)."""
    filters = [{"key": f"k{i}", "value": f"v{i}"} for i in range(1000)]
    import json
    llm = DummyLLM(responses={
        "largefilters": json.dumps({"query": "largefilters", "filters": filters, "top_k": 999})
    })
    retriever = VectorIndexAutoRetriever(
        index=DummyIndex(),
        vector_store_info=VectorStoreInfo(),
        llm=llm
    )
    codeflash_output = retriever.generate_retrieval_spec(QueryBundle(query_str="largefilters")); spec = codeflash_output  # 11.4ms -> 11.0ms (2.73% faster)

def test_top_k_greater_than_max_top_k():
    """Test with top_k greater than max_top_k (should not clamp, just return)."""
    llm = DummyLLM(responses={
        "bigk": '{"query": "bigk", "filters": [], "top_k": 100}'
    })
    retriever = VectorIndexAutoRetriever(
        index=DummyIndex(),
        vector_store_info=VectorStoreInfo(),
        llm=llm,
        max_top_k=10
    )
    codeflash_output = retriever.generate_retrieval_spec(QueryBundle(query_str="bigk")); spec = codeflash_output  # 105μs -> 25.1μs (322% faster)

# Large Scale Test Cases

def test_many_queries_sequential():
    """Test generating specs for many queries in sequence (scalability)."""
    llm = DummyLLM(responses={
        f"q{i}": f'{{"query": "q{i}", "filters": [], "top_k": {i % 10}}}' for i in range(100)
    })
    retriever = VectorIndexAutoRetriever(
        index=DummyIndex(),
        vector_store_info=VectorStoreInfo(),
        llm=llm
    )
    for i in range(100):
        codeflash_output = retriever.generate_retrieval_spec(QueryBundle(query_str=f"q{i}")); spec = codeflash_output  # 6.31ms -> 746μs (745% faster)

def test_large_output_json():
    """Test with very large output JSON (filters list near the limit)."""
    filters = [{"key": f"k{i}", "value": f"v{i}"} for i in range(999)]
    import json
    llm = DummyLLM(responses={
        "huge": json.dumps({"query": "huge", "filters": filters, "top_k": 999})
    })
    retriever = VectorIndexAutoRetriever(
        index=DummyIndex(),
        vector_store_info=VectorStoreInfo(),
        llm=llm
    )
    codeflash_output = retriever.generate_retrieval_spec(QueryBundle(query_str="huge")); spec = codeflash_output  # 11.2ms -> 10.9ms (3.13% faster)

def test_performance_many_large_specs():
    """Test performance with many large specs (no more than 100)."""
    filters = [{"key": f"k{i}", "value": f"v{i}"} for i in range(100)]
    import json
    llm = DummyLLM(responses={
        f"big{i}": json.dumps({"query": f"big{i}", "filters": filters, "top_k": 100}) for i in range(100)
    })
    retriever = VectorIndexAutoRetriever(
        index=DummyIndex(),
        vector_store_info=VectorStoreInfo(),
        llm=llm
    )
    for i in range(100):
        codeflash_output = retriever.generate_retrieval_spec(QueryBundle(query_str=f"big{i}")); spec = codeflash_output  # 119ms -> 109ms (8.81% faster)

codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, git checkout codeflash/optimize-VectorIndexAutoRetriever.generate_retrieval_spec-mhveosh3 and push.


@codeflash-ai codeflash-ai bot requested a review from mashraf-222 November 12, 2025 02:54
@codeflash-ai codeflash-ai bot added the ⚡️ codeflash (Optimization PR opened by Codeflash AI) and 🎯 Quality: High (Optimization Quality according to Codeflash) labels Nov 12, 2025