@codeflash-ai codeflash-ai bot commented Aug 22, 2025

⚡️ This pull request contains optimizations for PR #1504

If you approve this dependent PR, these changes will be merged into the original PR branch feature/try-to-beat-the-limitation-of-ee-in-terms-of-singular-elements-pushed-into-batch-inputs.

This PR will be automatically closed if the original PR is merged.


📄 19% (0.19x) speedup for retrieve_selectors_from_schema in inference/core/workflows/execution_engine/introspection/schema_parser.py

⏱️ Runtime: 186 microseconds → 157 microseconds (best of 86 runs)
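The headline percentage can be reproduced from the two runtimes. The sketch below assumes the speedup is computed as time saved divided by the optimized runtime; that formula is inferred from the figures reported in this comment rather than stated by it, and `relative_speedup` is a hypothetical helper name. Because the displayed runtimes are rounded, the result can differ from the reported percentage by a fraction of a point.

```python
def relative_speedup(original_us: float, optimized_us: float) -> float:
    """Time saved, expressed as a fraction of the optimized runtime (assumed formula)."""
    return (original_us - optimized_us) / optimized_us


# Overall figure reported above: 186 microseconds -> 157 microseconds.
overall = relative_speedup(186, 157)

# One per-test annotation from the generated tests: 5.65 microseconds -> 4.43 microseconds.
single_test = relative_speedup(5.65, 4.43)

print(f"overall: {overall:.1%}, single test: {single_test:.1%}")
```

With the rounded inputs the overall value comes out at about 18.5%, which sits between the 18% and 19% figures quoted in this comment.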

📝 Explanation and details

The optimized code achieves its roughly 18-19% speedup (186 microseconds down to 157 microseconds) through several targeted micro-optimizations:

1. Direct OrderedDict Construction
The most significant improvement eliminates the intermediate list allocation in retrieve_selectors_from_schema. Instead of building a list and then converting it to an OrderedDict with a generator expression, selectors are added directly to the OrderedDict during iteration. This saves memory allocation and reduces the final conversion overhead.
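A minimal sketch of the before/after shape described here. It is not the code from `schema_parser.py`: `parse_property` is a hypothetical stand-in that returns a string instead of a real selector definition, and the function names are illustrative only.

```python
from collections import OrderedDict
from typing import Dict, Optional


def parse_property(name: str, definition: dict) -> Optional[str]:
    """Hypothetical stand-in parser: keep only properties that carry a "$ref"."""
    return f"selector:{name}" if "$ref" in definition else None


def collect_via_list(properties: Dict[str, dict]) -> "OrderedDict[str, str]":
    # Before (as described): accumulate pairs in a list, then convert.
    pairs = []
    for name, definition in properties.items():
        selector = parse_property(name, definition)
        if selector is not None:
            pairs.append((name, selector))
    return OrderedDict((name, selector) for name, selector in pairs)


def collect_directly(properties: Dict[str, dict]) -> "OrderedDict[str, str]":
    # After: write into the OrderedDict during iteration, with no intermediate list.
    result: "OrderedDict[str, str]" = OrderedDict()
    for name, definition in properties.items():
        selector = parse_property(name, definition)
        if selector is not None:
            result[name] = selector
    return result


properties = {
    "input1": {"$ref": "#/definitions/foo"},
    "input2": {"type": "string"},
    "input3": {"$ref": "#/definitions/bar"},
}
before = collect_via_list(properties)
after = collect_directly(properties)
```

Both variants produce the same mapping in the same order; the second skips one list allocation and one pass over the collected pairs.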

2. Reduced Dictionary Access Overhead
In retrieve_selectors_from_simple_property, the property_definition parameter is aliased to the short local name pd, and the dictionary accesses in the function's hot path go through that alias. The effect is minor compared with the other changes, and it concerns name lookups rather than attribute resolution.

3. Optimized Set Membership Testing
The dynamic points-to-batch logic now caches set membership results in local variables (in_batches_and_scalars, in_batches, in_auto_cast) rather than performing the same set membership tests multiple times.
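The caching pattern can be sketched as follows. The three local variable names come from the description above; the function name and the branch outcomes are assumptions, chosen to match the `points_to_batch` values that the generated regression tests expect for the "dynamic" case, and may not mirror the real implementation.

```python
from typing import Set


def resolve_dynamic_points_to_batch(
    property_name: str,
    inputs_accepting_batches: Set[str],
    inputs_accepting_batches_and_scalars: Set[str],
    inputs_enforcing_auto_batch_casting: Set[str],
) -> Set[bool]:
    # Each membership test runs once; later branches reuse the cached booleans.
    in_batches_and_scalars = property_name in inputs_accepting_batches_and_scalars
    in_batches = property_name in inputs_accepting_batches
    in_auto_cast = property_name in inputs_enforcing_auto_batch_casting

    if in_batches_and_scalars:
        return {True, False}  # may point to a batch or to a scalar
    if in_batches or in_auto_cast:
        return {True}
    return {False}


both = resolve_dynamic_points_to_batch("input1", set(), {"input1"}, set())
batch_only = resolve_dynamic_points_to_batch("input1", {"input1"}, set(), set())
auto_cast = resolve_dynamic_points_to_batch("input1", set(), set(), {"input1"})
neither = resolve_dynamic_points_to_batch("input1", set(), set(), set())
```

One trade-off of this style is that all three tests are evaluated up front, so it pays off only when the results are consulted more than once.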

4. Conditional List Comprehension
When processing KIND_KEY values, the code now checks if the list is empty before creating the list comprehension, avoiding unnecessary iterator creation for empty cases.
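A small sketch of the empty-list guard. `KindStub` is a hypothetical stand-in for the real kind model, and `parse_kinds` is an illustrative name rather than a function from the codebase.

```python
from dataclasses import dataclass
from typing import List

KIND_KEY = "kind"


@dataclass(frozen=True)
class KindStub:
    """Hypothetical stand-in for the real kind model."""

    name: str

    @classmethod
    def model_validate(cls, value: str) -> "KindStub":
        return cls(name=value)


def parse_kinds(property_definition: dict) -> List[KindStub]:
    raw_kinds = property_definition.get(KIND_KEY, [])
    if not raw_kinds:
        # Nothing to validate: skip building the comprehension entirely.
        return []
    return [KindStub.model_validate(kind) for kind in raw_kinds]


no_key = parse_kinds({})
empty = parse_kinds({KIND_KEY: []})
single = parse_kinds({KIND_KEY: ["foo_kind"]})
```

The guard changes nothing about the result for non-empty input; it only short-circuits the empty case.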

Performance Analysis from Tests:
The optimizations show consistent improvements across all test scenarios, with particularly strong gains (20-30%) on simpler schemas and smaller but meaningful gains (6-11%) on complex union cases. The optimizations are most effective for schemas with many properties, where the direct dictionary construction and reduced lookups compound their benefits. Edge cases like empty schemas show the highest relative improvements (50%+) due to reduced overhead in the main loop structure.

Correctness verification report:

| Test | Status |
|---|---|
| ⏪ Replay Tests | 🔘 None Found |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 35 Passed |
| 📊 Tests Coverage | 93.3% |

🌀 Generated Regression Tests and Runtime
import itertools
from collections import OrderedDict, defaultdict
from typing import Dict, List, Optional, Set

# imports
import pytest
from inference.core.workflows.execution_engine.introspection.schema_parser import \
    retrieve_selectors_from_schema

# --- Minimal stubs for entities and constants used in the tested code ---

# Constants
KIND_KEY = "kind"
REFERENCE_KEY = "$ref"
SELECTED_ELEMENT_KEY = "selected_element"
SELECTOR_POINTS_TO_BATCH_KEY = "points_to_batch"

# Kind enum stub
class Kind:
    # For test purposes, just store the string
    def __init__(self, kind_str):
        self.kind_str = kind_str

    @classmethod
    def model_validate(cls, val):
        return cls(val)

    def __eq__(self, other):
        if not isinstance(other, Kind):
            return False
        return self.kind_str == other.kind_str

    def __hash__(self):
        return hash(self.kind_str)

    def __repr__(self):
        return f"Kind({self.kind_str!r})"

# ReferenceDefinition stub
class ReferenceDefinition:
    def __init__(self, selected_element, kind, points_to_batch):
        self.selected_element = selected_element
        self.kind = list(kind)
        self.points_to_batch = set(points_to_batch)

    def __eq__(self, other):
        if not isinstance(other, ReferenceDefinition):
            return False
        return (
            self.selected_element == other.selected_element
            and set(self.kind) == set(other.kind)
            and set(self.points_to_batch) == set(other.points_to_batch)
        )

    def __repr__(self):
        return (
            f"ReferenceDefinition(selected_element={self.selected_element!r}, "
            f"kind={self.kind!r}, points_to_batch={self.points_to_batch!r})"
        )

# SelectorDefinition stub
class SelectorDefinition:
    def __init__(
        self,
        property_name,
        property_description,
        allowed_references,
        is_list_element,
        is_dict_element,
        dimensionality_offset,
        is_dimensionality_reference_property,
    ):
        self.property_name = property_name
        self.property_description = property_description
        self.allowed_references = list(allowed_references)
        self.is_list_element = is_list_element
        self.is_dict_element = is_dict_element
        self.dimensionality_offset = dimensionality_offset
        self.is_dimensionality_reference_property = is_dimensionality_reference_property

    def __eq__(self, other):
        if not isinstance(other, SelectorDefinition):
            return False
        return (
            self.property_name == other.property_name
            and self.property_description == other.property_description
            and self.allowed_references == other.allowed_references
            and self.is_list_element == other.is_list_element
            and self.is_dict_element == other.is_dict_element
            and self.dimensionality_offset == other.dimensionality_offset
            and self.is_dimensionality_reference_property == other.is_dimensionality_reference_property
        )

    def __repr__(self):
        return (
            f"SelectorDefinition(property_name={self.property_name!r}, "
            f"property_description={self.property_description!r}, "
            f"allowed_references={self.allowed_references!r}, "
            f"is_list_element={self.is_list_element!r}, "
            f"is_dict_element={self.is_dict_element!r}, "
            f"dimensionality_offset={self.dimensionality_offset!r}, "
            f"is_dimensionality_reference_property={self.is_dimensionality_reference_property!r})"
        )

# --- Insert the retrieve_selectors_from_schema and helpers here (as in the prompt) ---


EXCLUDED_PROPERTIES = {"type"}

ITEMS_KEY = "items"
TYPE_KEY = "type"
ADDITIONAL_PROPERTIES_KEY = "additionalProperties"
PROPERTIES_KEY = "properties"
DESCRIPTION_KEY = "description"
OBJECT_TYPE = "object"

# --- UNIT TESTS ---

# Helper for expected SelectorDefinition
def make_selector(
    property_name,
    property_description,
    selected_element,
    kind,
    points_to_batch,
    is_list_element=False,
    is_dict_element=False,
    dimensionality_offset=0,
    is_dimensionality_reference_property=False,
):
    return SelectorDefinition(
        property_name=property_name,
        property_description=property_description,
        allowed_references=[
            ReferenceDefinition(
                selected_element=selected_element,
                kind=[Kind.model_validate(k) for k in kind],
                points_to_batch=points_to_batch,
            )
        ],
        is_list_element=is_list_element,
        is_dict_element=is_dict_element,
        dimensionality_offset=dimensionality_offset,
        is_dimensionality_reference_property=is_dimensionality_reference_property,
    )

# ---------------- Basic Test Cases ----------------

def test_basic_single_reference_property():
    # Test a schema with a single reference property, no batch, no special flags
    schema = {
        "properties": {
            "input1": {
                "description": "A reference input",
                REFERENCE_KEY: "#/definitions/foo",
                SELECTED_ELEMENT_KEY: "foo",
                KIND_KEY: ["foo_kind"],
            }
        }
    }
    codeflash_output = retrieve_selectors_from_schema(
        schema=schema,
        inputs_dimensionality_offsets={},
        dimensionality_reference_property=None,
        inputs_accepting_batches=set(),
        inputs_accepting_batches_and_scalars=set(),
        inputs_enforcing_auto_batch_casting=set(),
    ); result = codeflash_output # 5.65μs -> 4.43μs (27.6% faster)
    expected = OrderedDict([
        ("input1", make_selector(
            property_name="input1",
            property_description="A reference input",
            selected_element="foo",
            kind=["foo_kind"],
            points_to_batch={False},
        ))
    ])

def test_basic_multiple_properties_some_non_reference():
    # Test schema with one reference and one non-reference property
    schema = {
        "properties": {
            "input1": {
                "description": "A reference input",
                REFERENCE_KEY: "#/definitions/foo",
                SELECTED_ELEMENT_KEY: "foo",
                KIND_KEY: ["foo_kind"],
            },
            "input2": {
                "description": "A non-reference input",
                "type": "string"
            }
        }
    }
    codeflash_output = retrieve_selectors_from_schema(
        schema=schema,
        inputs_dimensionality_offsets={},
        dimensionality_reference_property=None,
        inputs_accepting_batches=set(),
        inputs_accepting_batches_and_scalars=set(),
        inputs_enforcing_auto_batch_casting=set(),
    ); result = codeflash_output # 6.10μs -> 5.23μs (16.7% faster)
    expected = OrderedDict([
        ("input1", make_selector(
            property_name="input1",
            property_description="A reference input",
            selected_element="foo",
            kind=["foo_kind"],
            points_to_batch={False},
        ))
    ])

def test_basic_list_of_reference():
    # Test a property which is a list of references
    schema = {
        "properties": {
            "input_list": {
                "description": "A list of references",
                "items": {
                    REFERENCE_KEY: "#/definitions/bar",
                    SELECTED_ELEMENT_KEY: "bar",
                    KIND_KEY: ["bar_kind"],
                }
            }
        }
    }
    codeflash_output = retrieve_selectors_from_schema(
        schema=schema,
        inputs_dimensionality_offsets={},
        dimensionality_reference_property=None,
        inputs_accepting_batches=set(),
        inputs_accepting_batches_and_scalars=set(),
        inputs_enforcing_auto_batch_casting=set(),
    ); result = codeflash_output # 4.82μs -> 3.95μs (22.1% faster)
    expected = OrderedDict([
        ("input_list", make_selector(
            property_name="input_list",
            property_description="A list of references",
            selected_element="bar",
            kind=["bar_kind"],
            points_to_batch={False},
            is_list_element=True,
        ))
    ])

def test_basic_dict_of_reference():
    # Test a property which is a dict of references
    schema = {
        "properties": {
            "input_dict": {
                "description": "A dict of references",
                "type": "object",
                "additionalProperties": {
                    REFERENCE_KEY: "#/definitions/baz",
                    SELECTED_ELEMENT_KEY: "baz",
                    KIND_KEY: ["baz_kind"],
                }
            }
        }
    }
    codeflash_output = retrieve_selectors_from_schema(
        schema=schema,
        inputs_dimensionality_offsets={},
        dimensionality_reference_property=None,
        inputs_accepting_batches=set(),
        inputs_accepting_batches_and_scalars=set(),
        inputs_enforcing_auto_batch_casting=set(),
    ); result = codeflash_output # 4.89μs -> 4.10μs (19.3% faster)
    expected = OrderedDict([
        ("input_dict", make_selector(
            property_name="input_dict",
            property_description="A dict of references",
            selected_element="baz",
            kind=["baz_kind"],
            points_to_batch={False},
            is_dict_element=True,
        ))
    ])

def test_basic_dimensionality_offset_and_reference_property():
    # Test dimensionality offset and is_dimensionality_reference_property
    schema = {
        "properties": {
            "input1": {
                "description": "A reference input",
                REFERENCE_KEY: "#/definitions/foo",
                SELECTED_ELEMENT_KEY: "foo",
                KIND_KEY: ["foo_kind"],
            }
        }
    }
    codeflash_output = retrieve_selectors_from_schema(
        schema=schema,
        inputs_dimensionality_offsets={"input1": 2},
        dimensionality_reference_property="input1",
        inputs_accepting_batches=set(),
        inputs_accepting_batches_and_scalars=set(),
        inputs_enforcing_auto_batch_casting=set(),
    ); result = codeflash_output # 4.57μs -> 3.70μs (23.6% faster)
    expected = OrderedDict([
        ("input1", SelectorDefinition(
            property_name="input1",
            property_description="A reference input",
            allowed_references=[
                ReferenceDefinition(
                    selected_element="foo",
                    kind=[Kind.model_validate("foo_kind")],
                    points_to_batch={False},
                )
            ],
            is_list_element=False,
            is_dict_element=False,
            dimensionality_offset=2,
            is_dimensionality_reference_property=True,
        ))
    ])

def test_basic_points_to_batch_true():
    # Test a reference with points_to_batch True
    schema = {
        "properties": {
            "input1": {
                "description": "A ref input",
                REFERENCE_KEY: "#/definitions/foo",
                SELECTED_ELEMENT_KEY: "foo",
                KIND_KEY: ["foo_kind"],
                SELECTOR_POINTS_TO_BATCH_KEY: True,
            }
        }
    }
    codeflash_output = retrieve_selectors_from_schema(
        schema=schema,
        inputs_dimensionality_offsets={},
        dimensionality_reference_property=None,
        inputs_accepting_batches=set(),
        inputs_accepting_batches_and_scalars=set(),
        inputs_enforcing_auto_batch_casting=set(),
    ); result = codeflash_output # 4.62μs -> 3.62μs (27.7% faster)
    expected = OrderedDict([
        ("input1", make_selector(
            property_name="input1",
            property_description="A ref input",
            selected_element="foo",
            kind=["foo_kind"],
            points_to_batch={True},
        ))
    ])

# ---------------- Edge Test Cases ----------------

def test_edge_empty_schema():
    # Test empty schema
    schema = {"properties": {}}
    codeflash_output = retrieve_selectors_from_schema(
        schema=schema,
        inputs_dimensionality_offsets={},
        dimensionality_reference_property=None,
        inputs_accepting_batches=set(),
        inputs_accepting_batches_and_scalars=set(),
        inputs_enforcing_auto_batch_casting=set(),
    ); result = codeflash_output # 2.73μs -> 1.79μs (51.9% faster)

def test_edge_excluded_properties():
    # Test that excluded properties (e.g. "type") are ignored
    schema = {
        "properties": {
            "type": {
                "description": "Should be excluded",
                REFERENCE_KEY: "#/definitions/foo",
                SELECTED_ELEMENT_KEY: "foo",
                KIND_KEY: ["foo_kind"],
            },
            "input1": {
                "description": "Should be included",
                REFERENCE_KEY: "#/definitions/bar",
                SELECTED_ELEMENT_KEY: "bar",
                KIND_KEY: ["bar_kind"],
            }
        }
    }
    codeflash_output = retrieve_selectors_from_schema(
        schema=schema,
        inputs_dimensionality_offsets={},
        dimensionality_reference_property=None,
        inputs_accepting_batches=set(),
        inputs_accepting_batches_and_scalars=set(),
        inputs_enforcing_auto_batch_casting=set(),
    ); result = codeflash_output # 4.68μs -> 3.82μs (22.6% faster)
    expected = OrderedDict([
        ("input1", make_selector(
            property_name="input1",
            property_description="Should be included",
            selected_element="bar",
            kind=["bar_kind"],
            points_to_batch={False},
        ))
    ])

def test_edge_missing_description():
    # Test property with no description (should default to "not available")
    schema = {
        "properties": {
            "input1": {
                REFERENCE_KEY: "#/definitions/foo",
                SELECTED_ELEMENT_KEY: "foo",
                KIND_KEY: ["foo_kind"],
            }
        }
    }
    codeflash_output = retrieve_selectors_from_schema(
        schema=schema,
        inputs_dimensionality_offsets={},
        dimensionality_reference_property=None,
        inputs_accepting_batches=set(),
        inputs_accepting_batches_and_scalars=set(),
        inputs_enforcing_auto_batch_casting=set(),
    ); result = codeflash_output # 4.39μs -> 3.49μs (25.9% faster)

def test_edge_union_anyof():
    # Test property with anyOf union of two references
    schema = {
        "properties": {
            "union_input": {
                "description": "Union input",
                "anyOf": [
                    {
                        REFERENCE_KEY: "#/definitions/a",
                        SELECTED_ELEMENT_KEY: "a",
                        KIND_KEY: ["kind_a"],
                    },
                    {
                        REFERENCE_KEY: "#/definitions/b",
                        SELECTED_ELEMENT_KEY: "b",
                        KIND_KEY: ["kind_b"],
                        SELECTOR_POINTS_TO_BATCH_KEY: True,
                    }
                ]
            }
        }
    }
    codeflash_output = retrieve_selectors_from_schema(
        schema=schema,
        inputs_dimensionality_offsets={},
        dimensionality_reference_property=None,
        inputs_accepting_batches=set(),
        inputs_accepting_batches_and_scalars=set(),
        inputs_enforcing_auto_batch_casting=set(),
    ); result = codeflash_output # 11.2μs -> 10.3μs (7.85% faster)
    # Should merge references by selected_element
    expected = OrderedDict([
        ("union_input", SelectorDefinition(
            property_name="union_input",
            property_description="Union input",
            allowed_references=[
                ReferenceDefinition(
                    selected_element="a",
                    kind=[Kind.model_validate("kind_a")],
                    points_to_batch={False},
                ),
                ReferenceDefinition(
                    selected_element="b",
                    kind=[Kind.model_validate("kind_b")],
                    points_to_batch={True},
                ),
            ],
            is_list_element=False,
            is_dict_element=False,
            dimensionality_offset=0,
            is_dimensionality_reference_property=False,
        ))
    ])

def test_edge_union_merges_kinds_and_points_to_batch():
    # Test union merging kinds and points_to_batch for same selected_element
    schema = {
        "properties": {
            "union_input": {
                "description": "Union input",
                "anyOf": [
                    {
                        REFERENCE_KEY: "#/definitions/a",
                        SELECTED_ELEMENT_KEY: "a",
                        KIND_KEY: ["kind_a"],
                        SELECTOR_POINTS_TO_BATCH_KEY: True,
                    },
                    {
                        REFERENCE_KEY: "#/definitions/a",
                        SELECTED_ELEMENT_KEY: "a",
                        KIND_KEY: ["kind_b"],
                        SELECTOR_POINTS_TO_BATCH_KEY: False,
                    }
                ]
            }
        }
    }
    codeflash_output = retrieve_selectors_from_schema(
        schema=schema,
        inputs_dimensionality_offsets={},
        dimensionality_reference_property=None,
        inputs_accepting_batches=set(),
        inputs_accepting_batches_and_scalars=set(),
        inputs_enforcing_auto_batch_casting=set(),
    ); result = codeflash_output # 10.4μs -> 9.28μs (11.7% faster)
    expected = OrderedDict([
        ("union_input", SelectorDefinition(
            property_name="union_input",
            property_description="Union input",
            allowed_references=[
                ReferenceDefinition(
                    selected_element="a",
                    kind=[Kind.model_validate("kind_a"), Kind.model_validate("kind_b")],
                    points_to_batch={True, False},
                ),
            ],
            is_list_element=False,
            is_dict_element=False,
            dimensionality_offset=0,
            is_dimensionality_reference_property=False,
        ))
    ])

def test_edge_dynamic_points_to_batch_accepting_batches_and_scalars():
    # Test "dynamic" points_to_batch with property in inputs_accepting_batches_and_scalars
    schema = {
        "properties": {
            "input1": {
                "description": "Dynamic points_to_batch",
                REFERENCE_KEY: "#/definitions/foo",
                SELECTED_ELEMENT_KEY: "foo",
                KIND_KEY: ["foo_kind"],
                SELECTOR_POINTS_TO_BATCH_KEY: "dynamic",
            }
        }
    }
    codeflash_output = retrieve_selectors_from_schema(
        schema=schema,
        inputs_dimensionality_offsets={},
        dimensionality_reference_property=None,
        inputs_accepting_batches=set(),
        inputs_accepting_batches_and_scalars={"input1"},
        inputs_enforcing_auto_batch_casting=set(),
    ); result = codeflash_output # 4.33μs -> 3.58μs (21.0% faster)
    expected = OrderedDict([
        ("input1", make_selector(
            property_name="input1",
            property_description="Dynamic points_to_batch",
            selected_element="foo",
            kind=["foo_kind"],
            points_to_batch={True, False},
        ))
    ])

def test_edge_dynamic_points_to_batch_accepting_batches():
    # Test "dynamic" points_to_batch with property in inputs_accepting_batches only
    schema = {
        "properties": {
            "input1": {
                "description": "Dynamic points_to_batch",
                REFERENCE_KEY: "#/definitions/foo",
                SELECTED_ELEMENT_KEY: "foo",
                KIND_KEY: ["foo_kind"],
                SELECTOR_POINTS_TO_BATCH_KEY: "dynamic",
            }
        }
    }
    codeflash_output = retrieve_selectors_from_schema(
        schema=schema,
        inputs_dimensionality_offsets={},
        dimensionality_reference_property=None,
        inputs_accepting_batches={"input1"},
        inputs_accepting_batches_and_scalars=set(),
        inputs_enforcing_auto_batch_casting=set(),
    ); result = codeflash_output # 4.12μs -> 3.48μs (18.4% faster)
    expected = OrderedDict([
        ("input1", make_selector(
            property_name="input1",
            property_description="Dynamic points_to_batch",
            selected_element="foo",
            kind=["foo_kind"],
            points_to_batch={True},
        ))
    ])

def test_edge_dynamic_points_to_batch_enforcing_auto_batch_casting():
    # Test "dynamic" points_to_batch with property in inputs_enforcing_auto_batch_casting only
    schema = {
        "properties": {
            "input1": {
                "description": "Dynamic points_to_batch",
                REFERENCE_KEY: "#/definitions/foo",
                SELECTED_ELEMENT_KEY: "foo",
                KIND_KEY: ["foo_kind"],
                SELECTOR_POINTS_TO_BATCH_KEY: "dynamic",
            }
        }
    }
    codeflash_output = retrieve_selectors_from_schema(
        schema=schema,
        inputs_dimensionality_offsets={},
        dimensionality_reference_property=None,
        inputs_accepting_batches=set(),
        inputs_accepting_batches_and_scalars=set(),
        inputs_enforcing_auto_batch_casting={"input1"},
    ); result = codeflash_output # 4.19μs -> 3.33μs (25.9% faster)
    expected = OrderedDict([
        ("input1", make_selector(
            property_name="input1",
            property_description="Dynamic points_to_batch",
            selected_element="foo",
            kind=["foo_kind"],
            points_to_batch={True},
        ))
    ])

def test_edge_dynamic_points_to_batch_none_of_the_sets():
    # Test "dynamic" points_to_batch with property in none of the sets
    schema = {
        "properties": {
            "input1": {
                "description": "Dynamic points_to_batch",
                REFERENCE_KEY: "#/definitions/foo",
                SELECTED_ELEMENT_KEY: "foo",
                KIND_KEY: ["foo_kind"],
                SELECTOR_POINTS_TO_BATCH_KEY: "dynamic",
            }
        }
    }
    codeflash_output = retrieve_selectors_from_schema(
        schema=schema,
        inputs_dimensionality_offsets={},
        dimensionality_reference_property=None,
        inputs_accepting_batches=set(),
        inputs_accepting_batches_and_scalars=set(),
        inputs_enforcing_auto_batch_casting=set(),
    ); result = codeflash_output # 4.21μs -> 3.54μs (19.0% faster)
    expected = OrderedDict([
        ("input1", make_selector(
            property_name="input1",
            property_description="Dynamic points_to_batch",
            selected_element="foo",
            kind=["foo_kind"],
            points_to_batch={False},
        ))
    ])

def test_edge_nested_list_of_list_reference():
    # Should ignore nested references above first level of depth
    schema = {
        "properties": {
            "input1": {
                "description": "List of lists of references",
                "items": {
                    "items": {
                        REFERENCE_KEY: "#/definitions/foo",
                        SELECTED_ELEMENT_KEY: "foo",
                        KIND_KEY: ["foo_kind"],
                    }
                }
            }
        }
    }
    codeflash_output = retrieve_selectors_from_schema(
        schema=schema,
        inputs_dimensionality_offsets={},
        dimensionality_reference_property=None,
        inputs_accepting_batches=set(),
        inputs_accepting_batches_and_scalars=set(),
        inputs_enforcing_auto_batch_casting=set(),
    ); result = codeflash_output # 4.05μs -> 3.24μs (25.1% faster)

def test_edge_nested_dict_of_dict_reference():
    # Should ignore nested references above first level of depth
    schema = {
        "properties": {
            "input1": {
                "description": "Dict of dicts of references",
                "type": "object",
                "additionalProperties": {
                    "additionalProperties": {
                        REFERENCE_KEY: "#/definitions/foo",
                        SELECTED_ELEMENT_KEY: "foo",
                        KIND_KEY: ["foo_kind"],
                    }
                }
            }
        }
    }
    codeflash_output = retrieve_selectors_from_schema(
        schema=schema,
        inputs_dimensionality_offsets={},
        dimensionality_reference_property=None,
        inputs_accepting_batches=set(),
        inputs_accepting_batches_and_scalars=set(),
        inputs_enforcing_auto_batch_casting=set(),
    ); result = codeflash_output # 4.70μs -> 3.99μs (17.9% faster)

# ---------------- Large Scale Test Cases ----------------




#------------------------------------------------
import itertools
from collections import OrderedDict, defaultdict
from typing import Dict, List, Optional, Set

# imports
import pytest
from inference.core.workflows.execution_engine.introspection.schema_parser import \
    retrieve_selectors_from_schema


# SelectorDefinition and ReferenceDefinition for testing
class ReferenceDefinition:
    def __init__(self, selected_element, kind, points_to_batch):
        self.selected_element = selected_element
        self.kind = list(kind)
        self.points_to_batch = set(points_to_batch)

    def __eq__(self, other):
        return (
            isinstance(other, ReferenceDefinition)
            and self.selected_element == other.selected_element
            and set(self.kind) == set(other.kind)
            and set(self.points_to_batch) == set(other.points_to_batch)
        )

    def __repr__(self):
        return f"ReferenceDefinition(selected_element={self.selected_element!r}, kind={self.kind!r}, points_to_batch={self.points_to_batch!r})"

class SelectorDefinition:
    def __init__(
        self,
        property_name,
        property_description,
        allowed_references,
        is_list_element,
        is_dict_element,
        dimensionality_offset,
        is_dimensionality_reference_property,
    ):
        self.property_name = property_name
        self.property_description = property_description
        self.allowed_references = allowed_references
        self.is_list_element = is_list_element
        self.is_dict_element = is_dict_element
        self.dimensionality_offset = dimensionality_offset
        self.is_dimensionality_reference_property = is_dimensionality_reference_property

    def __eq__(self, other):
        return (
            isinstance(other, SelectorDefinition)
            and self.property_name == other.property_name
            and self.property_description == other.property_description
            and self.allowed_references == other.allowed_references
            and self.is_list_element == other.is_list_element
            and self.is_dict_element == other.is_dict_element
            and self.dimensionality_offset == other.dimensionality_offset
            and self.is_dimensionality_reference_property == other.is_dimensionality_reference_property
        )

    def __repr__(self):
        return (
            f"SelectorDefinition(property_name={self.property_name!r}, "
            f"property_description={self.property_description!r}, "
            f"allowed_references={self.allowed_references!r}, "
            f"is_list_element={self.is_list_element!r}, "
            f"is_dict_element={self.is_dict_element!r}, "
            f"dimensionality_offset={self.dimensionality_offset!r}, "
            f"is_dimensionality_reference_property={self.is_dimensionality_reference_property!r})"
        )


EXCLUDED_PROPERTIES = {"type"}

ITEMS_KEY = "items"
TYPE_KEY = "type"
ADDITIONAL_PROPERTIES_KEY = "additionalProperties"
PROPERTIES_KEY = "properties"
DESCRIPTION_KEY = "description"
OBJECT_TYPE = "object"

# --- Unit Tests ---

# ------------------------------------
# 1. BASIC TEST CASES
# ------------------------------------

def test_empty_schema_returns_empty():
    # Test with an empty schema (no properties)
    schema = {"properties": {}}
    codeflash_output = retrieve_selectors_from_schema(
        schema,
        inputs_dimensionality_offsets={},
        dimensionality_reference_property=None,
        inputs_accepting_batches=set(),
        inputs_accepting_batches_and_scalars=set(),
        inputs_enforcing_auto_batch_casting=set(),
    ); result = codeflash_output # 3.26μs -> 2.10μs (54.8% faster)

def test_schema_with_non_reference_property():
    # Property does not contain a reference, should be excluded
    schema = {
        "properties": {
            "foo": {"type": "string", "description": "foo desc"},
        }
    }
    codeflash_output = retrieve_selectors_from_schema(
        schema,
        inputs_dimensionality_offsets={},
        dimensionality_reference_property=None,
        inputs_accepting_batches=set(),
        inputs_accepting_batches_and_scalars=set(),
        inputs_enforcing_auto_batch_casting=set(),
    ); result = codeflash_output # 5.07μs -> 4.23μs (19.9% faster)

def test_schema_with_simple_reference_property():
    # Property contains a reference, should be included
    schema = {
        "properties": {
            "bar": {
                "description": "bar desc",
                "$ref": "#/definitions/Bar",
                "selected_element": "Bar",
                "kind": ["foo"],
                "points_to_batch": False,
            }
        }
    }
    codeflash_output = retrieve_selectors_from_schema(
        schema,
        inputs_dimensionality_offsets={},
        dimensionality_reference_property=None,
        inputs_accepting_batches=set(),
        inputs_accepting_batches_and_scalars=set(),
        inputs_enforcing_auto_batch_casting=set(),
    )
    result = codeflash_output  # 4.75μs -> 3.83μs (24.1% faster)
    expected = OrderedDict(
        [
            (
                "bar",
                SelectorDefinition(
                    property_name="bar",
                    property_description="bar desc",
                    allowed_references=[
                        ReferenceDefinition(
                            selected_element="Bar",
                            kind=["foo"],
                            points_to_batch={False},
                        )
                    ],
                    is_list_element=False,
                    is_dict_element=False,
                    dimensionality_offset=0,
                    is_dimensionality_reference_property=False,
                ),
            )
        ]
    )

def test_schema_with_reference_and_dimensionality_offset():
    # Property contains a reference and a nonzero dimensionality offset
    schema = {
        "properties": {
            "baz": {
                "description": "baz desc",
                "$ref": "#/definitions/Baz",
                "selected_element": "Baz",
                "kind": ["bar"],
                "points_to_batch": True,
            }
        }
    }
    codeflash_output = retrieve_selectors_from_schema(
        schema,
        inputs_dimensionality_offsets={"baz": 2},
        dimensionality_reference_property=None,
        inputs_accepting_batches=set(),
        inputs_accepting_batches_and_scalars=set(),
        inputs_enforcing_auto_batch_casting=set(),
    )
    result = codeflash_output  # 4.61μs -> 3.73μs (23.7% faster)
    expected = OrderedDict(
        [
            (
                "baz",
                SelectorDefinition(
                    property_name="baz",
                    property_description="baz desc",
                    allowed_references=[
                        ReferenceDefinition(
                            selected_element="Baz",
                            kind=["bar"],
                            points_to_batch={True},
                        )
                    ],
                    is_list_element=False,
                    is_dict_element=False,
                    dimensionality_offset=2,
                    is_dimensionality_reference_property=False,
                ),
            )
        ]
    )

def test_schema_with_dimensionality_reference_property_flag():
    # Property is the dimensionality reference property
    schema = {
        "properties": {
            "qux": {
                "description": "qux desc",
                "$ref": "#/definitions/Qux",
                "selected_element": "Qux",
                "kind": ["quux"],
                "points_to_batch": False,
            }
        }
    }
    codeflash_output = retrieve_selectors_from_schema(
        schema,
        inputs_dimensionality_offsets={},
        dimensionality_reference_property="qux",
        inputs_accepting_batches=set(),
        inputs_accepting_batches_and_scalars=set(),
        inputs_enforcing_auto_batch_casting=set(),
    )
    result = codeflash_output  # 4.42μs -> 3.66μs (20.8% faster)
    expected = OrderedDict(
        [
            (
                "qux",
                SelectorDefinition(
                    property_name="qux",
                    property_description="qux desc",
                    allowed_references=[
                        ReferenceDefinition(
                            selected_element="Qux",
                            kind=["quux"],
                            points_to_batch={False},
                        )
                    ],
                    is_list_element=False,
                    is_dict_element=False,
                    dimensionality_offset=0,
                    is_dimensionality_reference_property=True,
                ),
            )
        ]
    )

def test_schema_with_list_of_references():
    # Property is a list of references
    schema = {
        "properties": {
            "arr": {
                "description": "arr desc",
                "items": {
                    "$ref": "#/definitions/Arr",
                    "selected_element": "Arr",
                    "kind": ["foo"],
                    "points_to_batch": True,
                }
            }
        }
    }
    codeflash_output = retrieve_selectors_from_schema(
        schema,
        inputs_dimensionality_offsets={},
        dimensionality_reference_property=None,
        inputs_accepting_batches=set(),
        inputs_accepting_batches_and_scalars=set(),
        inputs_enforcing_auto_batch_casting=set(),
    )
    result = codeflash_output  # 4.54μs -> 3.66μs (24.1% faster)
    expected = OrderedDict(
        [
            (
                "arr",
                SelectorDefinition(
                    property_name="arr",
                    property_description="arr desc",
                    allowed_references=[
                        ReferenceDefinition(
                            selected_element="Arr",
                            kind=["foo"],
                            points_to_batch={True},
                        )
                    ],
                    is_list_element=True,
                    is_dict_element=False,
                    dimensionality_offset=0,
                    is_dimensionality_reference_property=False,
                ),
            )
        ]
    )

def test_schema_with_dict_of_references():
    # Property is a dict of references
    schema = {
        "properties": {
            "dict_prop": {
                "description": "dict desc",
                "type": "object",
                "additionalProperties": {
                    "$ref": "#/definitions/DictRef",
                    "selected_element": "DictRef",
                    "kind": ["foo"],
                    "points_to_batch": False,
                },
            }
        }
    }
    codeflash_output = retrieve_selectors_from_schema(
        schema,
        inputs_dimensionality_offsets={},
        dimensionality_reference_property=None,
        inputs_accepting_batches=set(),
        inputs_accepting_batches_and_scalars=set(),
        inputs_enforcing_auto_batch_casting=set(),
    )
    result = codeflash_output  # 4.80μs -> 3.96μs (21.3% faster)
    expected = OrderedDict(
        [
            (
                "dict_prop",
                SelectorDefinition(
                    property_name="dict_prop",
                    property_description="dict desc",
                    allowed_references=[
                        ReferenceDefinition(
                            selected_element="DictRef",
                            kind=["foo"],
                            points_to_batch={False},
                        )
                    ],
                    is_list_element=False,
                    is_dict_element=True,
                    dimensionality_offset=0,
                    is_dimensionality_reference_property=False,
                ),
            )
        ]
    )

# ------------------------------------
# 2. EDGE TEST CASES
# ------------------------------------

def test_schema_with_excluded_type_property():
    # Property named "type" should be excluded
    schema = {
        "properties": {
            "type": {
                "description": "should be excluded",
                "$ref": "#/definitions/Type",
                "selected_element": "Type",
                "kind": ["foo"],
                "points_to_batch": True,
            },
            "real": {
                "description": "real desc",
                "$ref": "#/definitions/Real",
                "selected_element": "Real",
                "kind": ["bar"],
                "points_to_batch": False,
            },
        }
    }
    codeflash_output = retrieve_selectors_from_schema(
        schema,
        inputs_dimensionality_offsets={},
        dimensionality_reference_property=None,
        inputs_accepting_batches=set(),
        inputs_accepting_batches_and_scalars=set(),
        inputs_enforcing_auto_batch_casting=set(),
    )
    result = codeflash_output  # 4.41μs -> 3.76μs (17.4% faster)

def test_schema_with_reference_and_dynamic_points_to_batch_true():
    # points_to_batch = "dynamic", property in inputs_accepting_batches_and_scalars
    schema = {
        "properties": {
            "foo": {
                "description": "foo desc",
                "$ref": "#/definitions/Foo",
                "selected_element": "Foo",
                "kind": ["foo"],
                "points_to_batch": "dynamic",
            }
        }
    }
    codeflash_output = retrieve_selectors_from_schema(
        schema,
        inputs_dimensionality_offsets={},
        dimensionality_reference_property=None,
        inputs_accepting_batches={"foo"},
        inputs_accepting_batches_and_scalars={"foo"},
        inputs_enforcing_auto_batch_casting=set(),
    )
    result = codeflash_output  # 4.17μs -> 3.57μs (16.8% faster)
    expected = OrderedDict(
        [
            (
                "foo",
                SelectorDefinition(
                    property_name="foo",
                    property_description="foo desc",
                    allowed_references=[
                        ReferenceDefinition(
                            selected_element="Foo",
                            kind=["foo"],
                            points_to_batch={True, False},
                        )
                    ],
                    is_list_element=False,
                    is_dict_element=False,
                    dimensionality_offset=0,
                    is_dimensionality_reference_property=False,
                ),
            )
        ]
    )

def test_schema_with_reference_and_dynamic_points_to_batch_false():
    # points_to_batch = "dynamic", property NOT in inputs_accepting_batches_and_scalars
    schema = {
        "properties": {
            "foo": {
                "description": "foo desc",
                "$ref": "#/definitions/Foo",
                "selected_element": "Foo",
                "kind": ["foo"],
                "points_to_batch": "dynamic",
            }
        }
    }
    # foo is in inputs_accepting_batches, so points_to_batch should be {True}
    codeflash_output = retrieve_selectors_from_schema(
        schema,
        inputs_dimensionality_offsets={},
        dimensionality_reference_property=None,
        inputs_accepting_batches={"foo"},
        inputs_accepting_batches_and_scalars=set(),
        inputs_enforcing_auto_batch_casting=set(),
    )
    result = codeflash_output  # 4.29μs -> 3.57μs (20.2% faster)
    expected = OrderedDict(
        [
            (
                "foo",
                SelectorDefinition(
                    property_name="foo",
                    property_description="foo desc",
                    allowed_references=[
                        ReferenceDefinition(
                            selected_element="Foo",
                            kind=["foo"],
                            points_to_batch={True},
                        )
                    ],
                    is_list_element=False,
                    is_dict_element=False,
                    dimensionality_offset=0,
                    is_dimensionality_reference_property=False,
                ),
            )
        ]
    )

def test_schema_with_reference_and_dynamic_points_to_batch_auto_batch():
    # points_to_batch = "dynamic", property in inputs_enforcing_auto_batch_casting
    schema = {
        "properties": {
            "foo": {
                "description": "foo desc",
                "$ref": "#/definitions/Foo",
                "selected_element": "Foo",
                "kind": ["foo"],
                "points_to_batch": "dynamic",
            }
        }
    }
    codeflash_output = retrieve_selectors_from_schema(
        schema,
        inputs_dimensionality_offsets={},
        dimensionality_reference_property=None,
        inputs_accepting_batches=set(),
        inputs_accepting_batches_and_scalars=set(),
        inputs_enforcing_auto_batch_casting={"foo"},
    )
    result = codeflash_output  # 4.11μs -> 3.34μs (23.1% faster)
    expected = OrderedDict(
        [
            (
                "foo",
                SelectorDefinition(
                    property_name="foo",
                    property_description="foo desc",
                    allowed_references=[
                        ReferenceDefinition(
                            selected_element="Foo",
                            kind=["foo"],
                            points_to_batch={True},
                        )
                    ],
                    is_list_element=False,
                    is_dict_element=False,
                    dimensionality_offset=0,
                    is_dimensionality_reference_property=False,
                ),
            )
        ]
    )

def test_schema_with_reference_and_dynamic_points_to_batch_none():
    # points_to_batch = "dynamic", property not in any batch sets
    schema = {
        "properties": {
            "foo": {
                "description": "foo desc",
                "$ref": "#/definitions/Foo",
                "selected_element": "Foo",
                "kind": ["foo"],
                "points_to_batch": "dynamic",
            }
        }
    }
    codeflash_output = retrieve_selectors_from_schema(
        schema,
        inputs_dimensionality_offsets={},
        dimensionality_reference_property=None,
        inputs_accepting_batches=set(),
        inputs_accepting_batches_and_scalars=set(),
        inputs_enforcing_auto_batch_casting=set(),
    )
    result = codeflash_output  # 4.13μs -> 3.36μs (23.0% faster)
    expected = OrderedDict(
        [
            (
                "foo",
                SelectorDefinition(
                    property_name="foo",
                    property_description="foo desc",
                    allowed_references=[
                        ReferenceDefinition(
                            selected_element="Foo",
                            kind=["foo"],
                            points_to_batch={False},
                        )
                    ],
                    is_list_element=False,
                    is_dict_element=False,
                    dimensionality_offset=0,
                    is_dimensionality_reference_property=False,
                ),
            )
        ]
    )

def test_schema_with_union_anyof():
    # Property is a union (anyOf) of two references
    schema = {
        "properties": {
            "union": {
                "description": "union desc",
                "anyOf": [
                    {
                        "$ref": "#/definitions/Foo",
                        "selected_element": "Foo",
                        "kind": ["foo"],
                        "points_to_batch": True,
                    },
                    {
                        "$ref": "#/definitions/Bar",
                        "selected_element": "Bar",
                        "kind": ["bar"],
                        "points_to_batch": False,
                    },
                ],
            }
        }
    }
    codeflash_output = retrieve_selectors_from_schema(
        schema,
        inputs_dimensionality_offsets={},
        dimensionality_reference_property=None,
        inputs_accepting_batches=set(),
        inputs_accepting_batches_and_scalars=set(),
        inputs_enforcing_auto_batch_casting=set(),
    )
    result = codeflash_output  # 11.0μs -> 10.4μs (6.28% faster)
    expected = OrderedDict(
        [
            (
                "union",
                SelectorDefinition(
                    property_name="union",
                    property_description="union desc",
                    allowed_references=[
                        ReferenceDefinition(
                            selected_element="Foo",
                            kind=["foo"],
                            points_to_batch={True},
                        ),
                        ReferenceDefinition(
                            selected_element="Bar",
                            kind=["bar"],
                            points_to_batch={False},
                        ),
                    ],
                    is_list_element=False,
                    is_dict_element=False,
                    dimensionality_offset=0,
                    is_dimensionality_reference_property=False,
                ),
            )
        ]
    )

def test_schema_with_union_merges_kinds_and_points_to_batch():
    # Union with repeated selected_element, should merge kind and points_to_batch
    schema = {
        "properties": {
            "union": {
                "description": "union desc",
                "anyOf": [
                    {
                        "$ref": "#/definitions/Foo",
                        "selected_element": "Foo",
                        "kind": ["foo"],
                        "points_to_batch": True,
                    },
                    {
                        "$ref": "#/definitions/Foo",
                        "selected_element": "Foo",
                        "kind": ["bar"],
                        "points_to_batch": False,
                    },
                ],
            }
        }
    }
    codeflash_output = retrieve_selectors_from_schema(
        schema,
        inputs_dimensionality_offsets={},
        dimensionality_reference_property=None,
        inputs_accepting_batches=set(),
        inputs_accepting_batches_and_scalars=set(),
        inputs_enforcing_auto_batch_casting=set(),
    )
    result = codeflash_output  # 10.1μs -> 9.30μs (9.15% faster)
    expected = OrderedDict(
        [
            (
                "union",
                SelectorDefinition(
                    property_name="union",
                    property_description="union desc",
                    allowed_references=[
                        ReferenceDefinition(
                            selected_element="Foo",
                            kind=["foo", "bar"],
                            points_to_batch={True, False},
                        ),
                    ],
                    is_list_element=False,
                    is_dict_element=False,
                    dimensionality_offset=0,
                    is_dimensionality_reference_property=False,
                ),
            )
        ]
    )

def test_schema_with_missing_description():
    # Property missing description gets default "not available"
    schema = {
        "properties": {
            "foo": {
                "$ref": "#/definitions/Foo",
                "selected_element": "Foo",
                "kind": ["foo"],
                "points_to_batch": True,
            }
        }
    }
    codeflash_output = retrieve_selectors_from_schema(
        schema,
        inputs_dimensionality_offsets={},
        dimensionality_reference_property=None,
        inputs_accepting_batches=set(),
        inputs_accepting_batches_and_scalars=set(),
        inputs_enforcing_auto_batch_casting=set(),
    )
    result = codeflash_output  # 4.16μs -> 3.36μs (23.9% faster)

def test_schema_with_nested_list_of_references():
    # Property is a list of list of references, should ignore nested reference
    schema = {
        "properties": {
            "nested": {
                "description": "nested list",
                "items": {
                    "items": {
                        "$ref": "#/definitions/Nested",
                        "selected_element": "Nested",
                        "kind": ["foo"],
                        "points_to_batch": True,
                    }
                }
            }
        }
    }
    codeflash_output = retrieve_selectors_from_schema(
        schema,
        inputs_dimensionality_offsets={},
        dimensionality_reference_property=None,
        inputs_accepting_batches=set(),
        inputs_accepting_batches_and_scalars=set(),
        inputs_enforcing_auto_batch_casting=set(),
    )
    result = codeflash_output  # 4.05μs -> 3.11μs (30.3% faster)

def test_schema_with_nested_dict_of_references():
    # Property is a dict of dict of references, should ignore nested reference
    schema = {
        "properties": {
            "nested_dict": {
                "description": "nested dict",
                "type": "object",
                "additionalProperties": {
                    "additionalProperties": {
                        "$ref": "#/definitions/NestedDict",
                        "selected_element": "NestedDict",
                        "kind": ["foo"],
                        "points_to_batch": False,
                    }
                }
            }
        }
    }
    codeflash_output = retrieve_selectors_from_schema(
        schema,
        inputs_dimensionality_offsets={},
        dimensionality_reference_property=None,
        inputs_accepting_batches=set(),
        inputs_accepting_batches_and_scalars=set(),
        inputs_enforcing_auto_batch_casting=set(),
    )
    result = codeflash_output  # 4.64μs -> 3.89μs (19.3% faster)

def test_schema_with_union_and_no_reference():
    # Union with no reference types, should be ignored
    schema = {
        "properties": {
            "union": {
                "description": "union desc",
                "anyOf": [
                    {"type": "string"},
                    {"type": "integer"},
                ]
            }
        }
    }
    codeflash_output = retrieve_selectors_from_schema(
        schema,
        inputs_dimensionality_offsets={},
        dimensionality_reference_property=None,
        inputs_accepting_batches=set(),
        inputs_accepting_batches_and_scalars=set(),
        inputs_enforcing_auto_batch_casting=set(),
    )
    result = codeflash_output  # 10.3μs -> 9.65μs (7.28% faster)

# ------------------------------------
# 3. LARGE SCALE TEST CASES
# ------------------------------------

To edit these changes, run `git checkout codeflash/optimize-pr1504-2025-08-22T08.27.46` and push.


The optimized code achieves an 18% speedup through several targeted micro-optimizations:

**1. Direct OrderedDict Construction**
The most significant improvement eliminates the intermediate list allocation in `retrieve_selectors_from_schema`. Instead of building a list and then converting it to an OrderedDict with a generator expression, selectors are added directly to the OrderedDict during iteration. This saves memory allocation and reduces the final conversion overhead.
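The difference between the two construction styles can be sketched as follows. This is a minimal illustration, not the code from `schema_parser.py`: `parse_property`, `collect_via_list`, and `collect_directly` are hypothetical names, and the `"$ref"` check is only a stand-in for the real per-property parsing.

```python
from collections import OrderedDict
from typing import Dict, Optional


def parse_property(name: str, definition: dict) -> Optional[str]:
    """Hypothetical stand-in for the per-property parser.

    Returns a placeholder selector for properties that carry a reference
    and None for everything else.
    """
    return f"selector:{name}" if "$ref" in definition else None


def collect_via_list(properties: Dict[str, dict]) -> "OrderedDict[str, str]":
    # Earlier shape: accumulate pairs in a list, then convert at the end.
    found = []
    for name, definition in properties.items():
        selector = parse_property(name, definition)
        if selector is not None:
            found.append((name, selector))
    return OrderedDict((name, selector) for name, selector in found)


def collect_directly(properties: Dict[str, dict]) -> "OrderedDict[str, str]":
    # Optimized shape: write into the OrderedDict while iterating,
    # so no intermediate list or final conversion pass is needed.
    result: "OrderedDict[str, str]" = OrderedDict()
    for name, definition in properties.items():
        selector = parse_property(name, definition)
        if selector is not None:
            result[name] = selector
    return result
```

Both functions return the same mapping in the same insertion order; the second simply skips the temporary list and the generator pass over it.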

**2. Reduced Dictionary Access Overhead**
In `retrieve_selectors_from_simple_property`, the `property_definition` parameter is aliased to the shorter local name `pd`. Because both names are local variables, this mainly shortens the code; any effect on lookup cost is expected to be small compared with the other changes.

**3. Optimized Set Membership Testing**
The dynamic points-to-batch logic now caches set membership results in local variables (`in_batches_and_scalars`, `in_batches`, `in_auto_cast`) rather than performing the same set membership tests multiple times.
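A rough sketch of that caching pattern is shown below. The three local names come from the description above; the function name `resolve_dynamic_points_to_batch` and the branch order are assumptions chosen to match the expected values written in the generated tests, and the real implementation may be structured differently.

```python
from typing import Set


def resolve_dynamic_points_to_batch(
    property_name: str,
    inputs_accepting_batches: Set[str],
    inputs_accepting_batches_and_scalars: Set[str],
    inputs_enforcing_auto_batch_casting: Set[str],
) -> Set[bool]:
    # Hypothetical helper: each membership test runs once and the
    # result is kept in a local variable for the branches below.
    in_batches_and_scalars = property_name in inputs_accepting_batches_and_scalars
    in_batches = property_name in inputs_accepting_batches
    in_auto_cast = property_name in inputs_enforcing_auto_batch_casting

    if in_batches_and_scalars:
        return {True, False}
    if in_batches or in_auto_cast:
        return {True}
    return {False}
```

The four return paths correspond to the four `"dynamic"` test cases: batches and scalars, batches only, auto batch casting, and none of the sets.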

**4. Conditional List Comprehension**
When processing `KIND_KEY` values, the code now checks whether the list is empty before building the list comprehension, which avoids creating an iterator in the empty case.
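The guard can be sketched like this. The function `extract_kinds` and the `str(kind)` conversion are placeholders for illustration; the actual per-element conversion in `schema_parser.py` is not shown in this PR description.

```python
from typing import Any, Dict, List


def extract_kinds(definition: Dict[str, Any]) -> List[str]:
    # "kind" mirrors the key used in the test schemas above.
    kinds = definition.get("kind")
    if not kinds:
        # Missing or empty list: return early without building a comprehension.
        return []
    # Placeholder conversion; the real code may build richer objects here.
    return [str(kind) for kind in kinds]
```

The early return only matters when many properties have no kinds, which is why the gain from this change is likely to be small on its own.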

**Performance Analysis from Tests:**
The optimizations show consistent improvements across all test scenarios, with particularly strong gains (20-30%) on simpler schemas and smaller but meaningful gains (6-11%) on complex union cases. The optimizations are most effective for schemas with many properties, where the direct dictionary construction and reduced lookups compound their benefits. Edge cases like empty schemas show the highest relative improvements (50%+) due to reduced overhead in the main loop structure.
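The percentages in the test comments appear consistent with `(old - new) / new`, that is, the time saved relative to the optimized runtime. This is inferred from the reported numbers rather than taken from Codeflash documentation. The sketch below reproduces that arithmetic and adds a hypothetical `best_of` helper that takes the minimum over repeated runs, in the spirit of the "best of N runs" figure.

```python
import timeit
from typing import Callable


def percent_faster(old: float, new: float) -> float:
    # Assumed formula: time saved relative to the optimized runtime.
    return (old - new) / new * 100


def best_of(func: Callable[[], object], repeats: int = 5, number: int = 1000) -> float:
    # Hypothetical helper: run `func` `number` times per repeat and
    # report the fastest per-call time in seconds.
    timings = timeit.repeat(func, repeat=repeats, number=number)
    return min(timings) / number


# 5.07μs -> 4.23μs, reported as 19.9% faster
print(round(percent_faster(5.07, 4.23), 1))
# Overall runtime 186μs -> 157μs
print(round(percent_faster(186, 157), 1))
```

Under this formula the overall 186μs to 157μs change works out to about 18.5%, which sits between the 18% and 19% figures quoted in the PR.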
@codeflash-ai (bot) added the `⚡️ codeflash` label (Optimization PR opened by Codeflash AI) on Aug 22, 2025
@codeflash-ai (bot) deleted the `codeflash/optimize-pr1504-2025-08-22T08.27.46` branch on August 22, 2025 15:14