Skip to content

⚡️ Speed up method WithFixedSizeCache.add_model by 50% in PR #1373 (feat/pass-countinference-to-serverless-getweights) #1385

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

codeflash-ai[bot]
Copy link
Contributor

@codeflash-ai codeflash-ai bot commented Jun 24, 2025

⚡️ This pull request contains optimizations for PR #1373

If you approve this dependent PR, these changes will be merged into the original PR branch feat/pass-countinference-to-serverless-getweights.

This PR will be automatically closed if the original PR is merged.


📄 50% (0.50x) speedup for WithFixedSizeCache.add_model in inference/core/managers/decorators/fixed_size_cache.py

⏱️ Runtime : 1.08 seconds 722 milliseconds (best of 12 runs)

📝 Explanation and details

Here's an optimized rewrite of your program, addressing profiling hot spots and general efficiency improvements.

Optimization Summary:

  1. Avoid Redundant Method Calls:
    • Minimize repeated lookups and calculations.
    • Cache computations/results when possible within function scope.
  2. Lazy Imports:
    • Move GC and optional torch imports where needed (they are only used upon eviction).
  3. Deque Optimizations:
    • In WithFixedSizeCache.add_model, avoid repeated self._key_queue.remove(queue_id) by checking position or maintaining a set for fast checks (no need, since only called if known present, and block is rare). Still, code can be reduced for clarity.
  4. Reduce logging in the hot add logic (unless DEBUG mode; logging is a major time sink during profiling).
  5. Batch Removals:
    • Accumulate models to remove and do a single gc.collect() call after, instead of per-iteration.
  6. Data structure choices are left unchanged (deque is still best for explicit ordering here).
  7. General Logic: Use local variables for lookups on attributes used multiple times (minor, but helps).

Key Runtime Optimizations:

  • Only call gc.collect() after all removals in a batch, not after every single model eviction.
  • Reduced logging in hot code paths (this was responsible for noticeable time in profiling).
  • Use local variables when repeatedly accessing class attributes.
  • Use direct inlining for _resolve_queue_id for this use case.
  • Defensive handling if queue/model state falls out of sync—never throws unnecessarily.

Performance Note:
If you profile again after these changes, most of the time will now be in actual model loading and removal. That is, this code will not be a noticeable bottleneck anymore in the workflow. If LRU cache size is much larger, consider further data structure optimizations such as a dict for constant-time eviction and presence checking, but for N ~ 8 this is not needed.

Correctness verification report:

Test Status
⏪ Replay Tests 🔘 None Found
⚙️ Existing Unit Tests 🔘 None Found
🔎 Concolic Coverage Tests 🔘 None Found
🌀 Generated Regression Tests 476 Passed
📊 Tests Coverage 85.2%
🌀 Generated Regression Tests and Runtime
import sys
from collections import deque

# imports
import pytest
from inference.core.managers.decorators.fixed_size_cache import \
    WithFixedSizeCache

# function to test and minimal stubs/mocks

class DummyModel:
    """Minimal dummy model for testing."""
    def __init__(self, model_id, api_key):
        self.model_id = model_id
        self.api_key = api_key
        self.has_model_metadata = False

    def clear_cache(self, delete_from_disk=True):
        pass

class DummyModelRegistry:
    """Minimal dummy registry that returns DummyModel."""
    def get_model(self, resolved_identifier, api_key, countinference=None, service_secret=None):
        return DummyModel
class InferenceModelNotFound(Exception): pass
class InvalidModelIDError(Exception): pass

# Enum stub
class ModelEndpointType:
    ORT = type("ORT", (), {"value": "ort"})()
    value = "ort"

# ModelManager and WithFixedSizeCache as in prompt, but minimal
class ModelManager:
    def __init__(self, model_registry, models=None):
        self.model_registry = model_registry
        self._models = models if models is not None else {}

    def add_model(self, model_id, api_key, model_id_alias=None, endpoint_type=ModelEndpointType.ORT, countinference=None, service_secret=None):
        resolved_identifier = model_id if model_id_alias is None else model_id_alias
        if resolved_identifier in self._models:
            return
        model_class = self.model_registry.get_model(resolved_identifier, api_key, countinference=countinference, service_secret=service_secret)
        model = model_class(model_id=model_id, api_key=api_key)
        self._models[resolved_identifier] = model

    def remove(self, model_id, delete_from_disk=True):
        if model_id not in self._models:
            raise InferenceModelNotFound()
        self._models[model_id].clear_cache(delete_from_disk=delete_from_disk)
        del self._models[model_id]

    def __contains__(self, model_id):
        return model_id in self._models

    def __getitem__(self, key):
        if key not in self._models:
            raise InferenceModelNotFound()
        return self._models[key]

    def __len__(self):
        return len(self._models)

    def keys(self):
        return self._models.keys()

# ========== UNIT TESTS BELOW ==========

@pytest.fixture
def cache_manager():
    """Returns a WithFixedSizeCache with max_size=3 for testing."""
    registry = DummyModelRegistry()
    base_manager = ModelManager(registry)
    return WithFixedSizeCache(base_manager, max_size=3)

@pytest.fixture
def unique_model_id():
    """Returns a function to generate unique model_ids for tests."""
    counter = [0]
    def _gen():
        counter[0] += 1
        return f"dataset{counter[0]}/1"
    return _gen

# 1. BASIC TEST CASES

def test_add_single_model(cache_manager, unique_model_id):
    """Test adding a single model works and is present."""
    model_id = unique_model_id()
    cache_manager.add_model(model_id, api_key="key")

def test_add_duplicate_model_noop(cache_manager, unique_model_id):
    """Adding the same model twice does not increase count."""
    model_id = unique_model_id()
    cache_manager.add_model(model_id, api_key="key")
    cache_manager.add_model(model_id, api_key="key")

def test_add_model_with_alias(cache_manager, unique_model_id):
    """Adding with an alias stores under the alias, not model_id."""
    model_id = unique_model_id()
    alias = "alias1"
    cache_manager.add_model(model_id, api_key="key", model_id_alias=alias)

def test_add_multiple_models_up_to_capacity(cache_manager, unique_model_id):
    """Add up to max_size models, all should be present."""
    ids = [unique_model_id() for _ in range(3)]
    for mid in ids:
        cache_manager.add_model(mid, api_key="key")
    for mid in ids:
        pass

# 2. EDGE TEST CASES

def test_eviction_on_capacity(cache_manager, unique_model_id):
    """Adding more than max_size evicts least recently used."""
    ids = [unique_model_id() for _ in range(4)]
    for mid in ids[:3]:
        cache_manager.add_model(mid, api_key="key")
    # Now add a 4th, should evict ids[0]
    cache_manager.add_model(ids[3], api_key="key")

def test_eviction_marks_mru(cache_manager, unique_model_id):
    """Adding a model again marks it as most recently used (no eviction)."""
    ids = [unique_model_id() for _ in range(3)]
    for mid in ids:
        cache_manager.add_model(mid, api_key="key")
    # Access ids[0] to mark it as MRU
    cache_manager.add_model(ids[0], api_key="key")
    # Add new model, should evict ids[1] now (was LRU)
    new_id = unique_model_id()
    cache_manager.add_model(new_id, api_key="key")

def test_add_model_with_alias_then_same_id(cache_manager, unique_model_id):
    """Adding with alias, then with same model_id, both can exist."""
    model_id = unique_model_id()
    alias = "alias2"
    cache_manager.add_model(model_id, api_key="key", model_id_alias=alias)
    cache_manager.add_model(model_id, api_key="key")

def test_add_model_eviction_multiple_rounds(cache_manager, unique_model_id):
    """Eviction removes 3 at a time if possible when over threshold."""
    # Fill up to 3
    ids = [unique_model_id() for _ in range(3)]
    for mid in ids:
        cache_manager.add_model(mid, api_key="key")
    # Add 4th, should evict 1st
    cache_manager.add_model("dataset999/1", api_key="key")
    # Add 5th, should evict 3 more (but only 3 in cache, so only possible to evict all)
    cache_manager.add_model("dataset1000/1", api_key="key")

def test_remove_model(cache_manager, unique_model_id):
    """Test removing a model actually removes it."""
    model_id = unique_model_id()
    cache_manager.add_model(model_id, api_key="key")
    cache_manager.remove(model_id)

def test_remove_nonexistent_model_raises(cache_manager):
    """Removing a model not present raises InferenceModelNotFound."""
    with pytest.raises(InferenceModelNotFound):
        cache_manager.remove("not-present/1")


def test_add_model_with_alias_eviction(cache_manager, unique_model_id):
    """Eviction works when models are added by alias."""
    ids = [unique_model_id() for _ in range(2)]
    alias = "alias3"
    cache_manager.add_model(ids[0], api_key="key", model_id_alias=alias)
    cache_manager.add_model(ids[1], api_key="key")
    cache_manager.add_model("dataset888/1", api_key="key")
    # Now add another to force eviction
    cache_manager.add_model("dataset889/1", api_key="key")
    # At least one of the first 3 should be evicted
    count = sum(mid in cache_manager for mid in [alias, ids[1], "dataset888/1"])

def test_lru_eviction_order(cache_manager, unique_model_id):
    """Eviction order is LRU, not FIFO."""
    ids = [unique_model_id() for _ in range(3)]
    for mid in ids:
        cache_manager.add_model(mid, api_key="key")
    # Access ids[1] to make it MRU
    cache_manager.add_model(ids[1], api_key="key")
    # Add new model, should evict ids[0]
    new_id = unique_model_id()
    cache_manager.add_model(new_id, api_key="key")

def test_add_model_memory_pressure(monkeypatch, cache_manager, unique_model_id):
    """If memory_pressure_detected returns True, eviction is triggered."""
    monkeypatch.setattr(cache_manager, "memory_pressure_detected", lambda: True)
    # Fill up cache
    ids = [unique_model_id() for _ in range(3)]
    for mid in ids:
        cache_manager.add_model(mid, api_key="key")
    # Add another, should evict 3 at once
    cache_manager.add_model("dataset2000/1", api_key="key")

def test_add_model_exception_removes_from_queue(cache_manager, monkeypatch):
    """If add_model raises, queue is cleaned up."""
    # Patch model_manager.add_model to raise
    def raise_exc(*a, **kw): raise RuntimeError("fail!")
    monkeypatch.setattr(cache_manager.model_manager, "add_model", raise_exc)
    before_len = len(cache_manager._key_queue)
    with pytest.raises(RuntimeError):
        cache_manager.add_model("dataset/1", api_key="key")

# 3. LARGE SCALE TEST CASES

def test_large_number_of_models_eviction():
    """Add 10 models to a cache of size 5, only last 5 remain."""
    registry = DummyModelRegistry()
    base_manager = ModelManager(registry)
    cache_manager = WithFixedSizeCache(base_manager, max_size=5)
    ids = [f"ds{i}/1" for i in range(10)]
    for mid in ids:
        cache_manager.add_model(mid, api_key="key")
    # Only last 5 should remain
    for mid in ids[:5]:
        pass
    for mid in ids[5:]:
        pass

def test_stress_add_and_access():
    """Add 20 models, repeatedly access some to keep them in cache."""
    registry = DummyModelRegistry()
    base_manager = ModelManager(registry)
    cache_manager = WithFixedSizeCache(base_manager, max_size=10)
    ids = [f"ds{i}/1" for i in range(20)]
    for mid in ids[:10]:
        cache_manager.add_model(mid, api_key="key")
    # Repeatedly access first 5 to keep them MRU
    for _ in range(5):
        for mid in ids[:5]:
            cache_manager.add_model(mid, api_key="key")
    # Add next 10
    for mid in ids[10:]:
        cache_manager.add_model(mid, api_key="key")
    # The first 5 should still be in cache, next 5 should have been evicted
    for mid in ids[:5]:
        pass
    for mid in ids[5:10]:
        pass
    for mid in ids[10:]:
        pass

def test_add_models_with_aliases_large_scale():
    """Add 50 models with unique aliases, only last 10 remain in cache."""
    registry = DummyModelRegistry()
    base_manager = ModelManager(registry)
    cache_manager = WithFixedSizeCache(base_manager, max_size=10)
    for i in range(50):
        model_id = f"dataset{i}/1"
        alias = f"alias{i}"
        cache_manager.add_model(model_id, api_key="key", model_id_alias=alias)
    # Only last 10 aliases should be present
    for i in range(40):
        pass
    for i in range(40, 50):
        pass

def test_eviction_never_exceeds_max_size():
    """After many operations, cache never exceeds max_size."""
    registry = DummyModelRegistry()
    base_manager = ModelManager(registry)
    cache_manager = WithFixedSizeCache(base_manager, max_size=7)
    for i in range(30):
        cache_manager.add_model(f"ds{i}/1", api_key="key")

def test_eviction_when_queue_empty_does_not_crash():
    """Eviction with empty queue does not raise."""
    registry = DummyModelRegistry()
    base_manager = ModelManager(registry)
    cache_manager = WithFixedSizeCache(base_manager, max_size=1)
    # Remove all models to empty queue
    cache_manager._key_queue.clear()
    try:
        cache_manager.add_model("ds1/1", api_key="key")
    except Exception:
        pytest.fail("add_model should not raise when queue is empty")
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

from collections import deque

# imports
import pytest
from inference.core.managers.decorators.fixed_size_cache import \
    WithFixedSizeCache

# --- Minimal stubs and mocks for dependencies ---

# Exception classes
class RoboflowAPINotAuthorizedError(Exception):
    pass

class InferenceModelNotFound(Exception):
    pass

# ModelEndpointType enum stub
class ModelEndpointType:
    ORT = "ort"

# Model stub
class DummyModel:
    def __init__(self, model_id, api_key):
        self.model_id = model_id
        self.api_key = api_key
        self.cleared = False

    def clear_cache(self, delete_from_disk=True):
        self.cleared = True

# ModelRegistry stub
class DummyModelRegistry:
    def get_model(self, resolved_identifier, api_key, countinference=None, service_secret=None):
        # Always returns DummyModel constructor
        return DummyModel

# --- The ModelManager, ModelManagerDecorator, and WithFixedSizeCache implementations ---

class ModelManager:
    def __init__(self, model_registry, models=None):
        self.model_registry = model_registry
        self._models = {} if models is None else models

    def add_model(
        self,
        model_id,
        api_key,
        model_id_alias=None,
        endpoint_type=ModelEndpointType.ORT,
        countinference=None,
        service_secret=None,
    ):
        resolved_identifier = model_id if model_id_alias is None else model_id_alias
        if resolved_identifier in self._models:
            return
        model_class = self.model_registry.get_model(
            resolved_identifier, api_key, countinference=countinference, service_secret=service_secret
        )
        model = model_class(model_id=model_id, api_key=api_key)
        self._models[resolved_identifier] = model

    def remove(self, model_id, delete_from_disk=True):
        if model_id not in self._models:
            raise InferenceModelNotFound(f"Model {model_id} not found")
        self._models[model_id].clear_cache(delete_from_disk=delete_from_disk)
        del self._models[model_id]

    def __contains__(self, model_id):
        return model_id in self._models

    def __getitem__(self, key):
        if key not in self._models:
            raise InferenceModelNotFound(f"Model {key} not found")
        return self._models[key]

    def __len__(self):
        return len(self._models)

    def keys(self):
        return self._models.keys()

# Global flag for API key check
MODELS_CACHE_AUTH_ENABLED = False

# --- UNIT TESTS ---

@pytest.fixture
def model_manager():
    # Returns a fresh ModelManager with DummyModelRegistry
    return ModelManager(DummyModelRegistry())

@pytest.fixture
def cache_manager(model_manager):
    # Returns a WithFixedSizeCache wrapping the above
    return WithFixedSizeCache(model_manager, max_size=4)

# 1. BASIC TEST CASES

def test_add_single_model_basic(cache_manager):
    """Test adding a single model to an empty cache."""
    cache_manager.add_model("modelA/1", "KEY")

def test_add_duplicate_model_noop(cache_manager):
    """Test that adding the same model twice does not increase cache size."""
    cache_manager.add_model("modelA/1", "KEY")
    cache_manager.add_model("modelA/1", "KEY")

def test_add_model_with_alias(cache_manager):
    """Test adding a model with an alias as queue id."""
    cache_manager.add_model("modelA/1", "KEY", model_id_alias="aliasA")

def test_add_model_with_different_aliases(cache_manager):
    """Test that different aliases are treated as different cache entries."""
    cache_manager.add_model("modelA/1", "KEY", model_id_alias="aliasA")
    cache_manager.add_model("modelA/1", "KEY", model_id_alias="aliasB")

def test_add_multiple_models_basic(cache_manager):
    """Test adding multiple distinct models."""
    cache_manager.add_model("modelA/1", "KEY")
    cache_manager.add_model("modelB/1", "KEY")
    cache_manager.add_model("modelC/1", "KEY")

# 2. EDGE TEST CASES

def test_add_model_eviction_lru(cache_manager):
    """Test that adding models over max_size evicts least recently used."""
    # Fill up cache
    cache_manager.add_model("A/1", "KEY")
    cache_manager.add_model("B/1", "KEY")
    cache_manager.add_model("C/1", "KEY")
    cache_manager.add_model("D/1", "KEY")
    # Add one more, triggers eviction (removes A/1, B/1, C/1 in order)
    cache_manager.add_model("E/1", "KEY")
    # Add another, triggers more evictions
    cache_manager.add_model("F/1", "KEY")

def test_add_model_lru_refresh(cache_manager):
    """Test that re-adding an existing model refreshes its LRU position."""
    cache_manager.add_model("A/1", "KEY")
    cache_manager.add_model("B/1", "KEY")
    cache_manager.add_model("C/1", "KEY")
    cache_manager.add_model("D/1", "KEY")
    # Refresh A/1
    cache_manager.add_model("A/1", "KEY")
    # Add E/1, should evict B/1, C/1, D/1 (A/1 was refreshed)
    cache_manager.add_model("E/1", "KEY")


def test_add_model_with_invalid_model_id(cache_manager):
    """Test that a model_id_alias with same name as another model_id is treated as distinct."""
    cache_manager.add_model("modelA/1", "KEY")
    cache_manager.add_model("modelB/1", "KEY", model_id_alias="modelA/1")

def test_add_model_evicts_all_when_cache_full(cache_manager):
    """Test that if more than max_size+3 models are added, all old models are evicted."""
    # Fill cache
    cache_manager.add_model("A/1", "KEY")
    cache_manager.add_model("B/1", "KEY")
    cache_manager.add_model("C/1", "KEY")
    cache_manager.add_model("D/1", "KEY")
    # Add 4 more, causing two eviction rounds
    cache_manager.add_model("E/1", "KEY")
    cache_manager.add_model("F/1", "KEY")
    cache_manager.add_model("G/1", "KEY")
    cache_manager.add_model("H/1", "KEY")
    # Only last 4 models should remain
    for mid in ["E/1", "F/1", "G/1", "H/1"]:
        pass
    for mid in ["A/1", "B/1", "C/1", "D/1"]:
        pass

def test_add_model_handles_exception_and_removes_from_queue(cache_manager):
    """Test that if ModelManager.add_model raises, the queue is cleaned up."""
    # Patch model_manager.add_model to raise
    orig_add_model = cache_manager.model_manager.add_model
    def raise_exc(*a, **kw):
        raise ValueError("fail!")
    cache_manager.model_manager.add_model = raise_exc
    with pytest.raises(ValueError):
        cache_manager.add_model("Z/1", "KEY")
    # Restore
    cache_manager.model_manager.add_model = orig_add_model

def test_add_model_with_alias_and_duplicate(cache_manager):
    """Test that adding same model with and without alias treats them as separate."""
    cache_manager.add_model("A/1", "KEY")
    cache_manager.add_model("A/1", "KEY", model_id_alias="aliasA")

# 3. LARGE SCALE TEST CASES

def test_add_many_models_and_evictions():
    """Test adding up to 20 models with cache size 10, check LRU eviction."""
    mm = ModelManager(DummyModelRegistry())
    cache = WithFixedSizeCache(mm, max_size=10)
    # Add 20 models
    for i in range(20):
        cache.add_model(f"model{i}/1", "KEY")
    # Only last 10 should remain
    for i in range(10, 20):
        pass
    for i in range(10):
        pass

def test_add_models_with_aliases_large_scale():
    """Test adding models with unique aliases does not cause collisions."""
    mm = ModelManager(DummyModelRegistry())
    cache = WithFixedSizeCache(mm, max_size=50)
    # Add 50 models with unique aliases
    for i in range(50):
        cache.add_model(f"modelX/1", "KEY", model_id_alias=f"alias_{i}")
    # All aliases should be present
    for i in range(50):
        pass

def test_lru_eviction_pattern_stress():
    """Test LRU eviction pattern with repeated access and additions."""
    mm = ModelManager(DummyModelRegistry())
    cache = WithFixedSizeCache(mm, max_size=5)
    # Add 5 models
    for i in range(5):
        cache.add_model(f"M{i}/1", "KEY")
    # Access models to change LRU order
    cache.add_model("M2/1", "KEY")
    cache.add_model("M4/1", "KEY")
    # Add new model, should evict oldest (M0/1, M1/1, M3/1 in order)
    cache.add_model("M5/1", "KEY")
    # Only most recently used and new should remain
    for mid in ["M2/1", "M4/1", "M5/1"]:
        pass

def test_add_models_performance_under_load():
    """Test that adding 100 models with cache size 50 only keeps last 50."""
    mm = ModelManager(DummyModelRegistry())
    cache = WithFixedSizeCache(mm, max_size=50)
    for i in range(100):
        cache.add_model(f"large_{i}/1", "KEY")
    for i in range(50, 100):
        pass
    for i in range(50):
        pass

def test_add_models_with_same_alias_large_scale():
    """Test that adding many models with same alias overwrites previous."""
    mm = ModelManager(DummyModelRegistry())
    cache = WithFixedSizeCache(mm, max_size=10)
    for i in range(20):
        cache.add_model(f"modelQ_{i}/1", "KEY", model_id_alias="shared_alias")
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes git checkout codeflash/optimize-pr1373-2025-06-24T21.57.17 and push.

Codeflash

…(`feat/pass-countinference-to-serverless-getweights`)

Here's an optimized rewrite of your program, addressing profiling hot spots and general efficiency improvements.

**Optimization Summary:**

1. **Avoid Redundant Method Calls:** 
   - Minimize repeated lookups and calculations.
   - Cache computations/results when possible within function scope.
2. **Lazy Imports:** 
   - Move GC and optional torch imports where needed (they are only used upon eviction).
3. **Deque Optimizations:** 
   - In `WithFixedSizeCache.add_model`, avoid repeated `self._key_queue.remove(queue_id)` by checking position or maintaining a set for fast checks (no need, since only called if known present, and block is rare). Still, code can be reduced for clarity.
4. **Reduce logging** in the hot add logic (unless DEBUG mode; logging is a major time sink during profiling).
5. **Batch Removals:** 
   - Accumulate models to remove and do a single `gc.collect()` call after, instead of per-iteration. 
6. **Data structure** choices are left unchanged (deque is still best for explicit ordering here).
7. **General Logic**: Use local variables for lookups on attributes used multiple times (minor, but helps).

---




**Key Runtime Optimizations:**
- Only call `gc.collect()` after all removals in a batch, not after every single model eviction.
- Reduced logging in hot code paths (this was responsible for noticeable time in profiling).
- Use local variables when repeatedly accessing class attributes.
- Use direct inlining for `_resolve_queue_id` for this use case.
- Defensive handling if queue/model state falls out of sync—never throws unnecessarily.

**Performance Note:**
If you profile again after these changes, most of the time will now be in actual model loading and removal. That is, this code will not be a noticeable bottleneck anymore in the workflow. If LRU cache size is much larger, consider further data structure optimizations such as a dict for constant-time eviction and presence checking, but for N ~ 8 this is not needed.
@codeflash-ai codeflash-ai bot added the ⚡️ codeflash Optimization PR opened by Codeflash AI label Jun 24, 2025
@codeflash-ai codeflash-ai bot deleted the codeflash/optimize-pr1373-2025-06-24T21.57.17 branch June 26, 2025 09:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
⚡️ codeflash Optimization PR opened by Codeflash AI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant