Skip to content

Conversation

@codeflash-ai
Copy link

@codeflash-ai codeflash-ai bot commented Nov 12, 2025

📄 17% (0.17x) speedup for pred_lines in invokeai/backend/image_util/mlsd/utils.py

⏱️ Runtime : 347 milliseconds 296 milliseconds (best of 17 runs)

📝 Explanation and details

The optimized code achieves a 17% speedup by targeting key performance bottlenecks in tensor operations and memory management:

Key Optimizations:

  1. Reduced Memory Allocations: In deccode_output_score_and_ptss, replaced heat = heat * keep with in-place heat.mul_(keep), eliminating temporary tensor creation. This saves both memory and computation time.

  2. More Efficient Tensor Indexing: Changed tpMap[:, 1:5, :, :][0] to direct tpMap[0, 1:5], avoiding intermediate tensor creation and reducing memory overhead.

  3. Optimized Image Preprocessing: Replaced np.concatenate with np.dstack for channel stacking, which is faster for axis=-1 operations. Used in-place division with np.divide(..., out=batch_image) to avoid creating temporary arrays during normalization.

  4. Vectorized Line Detection: The most significant improvement replaces the Python loop over candidate points with vectorized NumPy operations. Instead of iterating through 8,000+ points individually, the code now uses boolean masking to filter valid points in batch operations, dramatically reducing per-iteration overhead.

  5. Pre-allocated Result Arrays: Uses np.empty to allocate the final segments array directly rather than building a list and converting, eliminating list append operations and final array conversion.

  6. Optimized Distance Calculation: Replaced np.sum((start - end) ** 2, axis=-1) with np.einsum('ijk,ijk->ij', diff, diff), which is more efficient for element-wise dot products.

Performance Impact: The line profiler shows the vectorized approach eliminates the expensive loop (originally 17% of runtime in pred_lines). The optimizations are particularly effective for larger models and images, with test cases showing 20-90% improvements on large-scale scenarios while maintaining smaller but consistent gains across all test cases.

Device Optimization: Minor improvement in get_effective_device by checking buffers before parameters, as buffers are typically fewer and checking non-CPU devices early can short-circuit the iteration.

Correctness verification report:

Test Status
⚙️ Existing Unit Tests 🔘 None Found
🌀 Generated Regression Tests 44 Passed
⏪ Replay Tests 🔘 None Found
🔎 Concolic Coverage Tests 🔘 None Found
📊 Tests Coverage 66.7%
🌀 Generated Regression Tests and Runtime

import numpy as np

imports

import pytest
import torch
from invokeai.backend.image_util.mlsd.utils import pred_lines

--- Dummy model for testing ---

class DummyModel(torch.nn.Module):
def init(self, out_shape=(1, 5, 8, 8)):
super().init()
self.out_shape = out_shape
def forward(self, x):
# Return a tensor of required shape filled with ones
return torch.ones(self.out_shape, dtype=torch.float32)

--- Unit tests ---

1. Basic Test Cases

def test_edge_zero_input_shape():
"""Test with input_shape containing zeros (should raise error)."""
image = np.ones((8, 8, 3), dtype=np.uint8)
model = DummyModel(out_shape=(1, 5, 8, 8))
with pytest.raises(ZeroDivisionError):
pred_lines(image, model, input_shape=[0, 8], score_thr=0.1, dist_thr=0.5) # 15.3μs -> 16.0μs (4.60% slower)

3. Large Scale Test Cases

def test_large_scale_image_and_model():
"""Test with a large image and model output."""
image = np.ones((512, 512, 3), dtype=np.uint8)
model = DummyModel(out_shape=(1, 5, 32, 32)) # Output shape is 32x32
codeflash_output = pred_lines(image, model, input_shape=[32, 32], score_thr=0.1, dist_thr=0.5); lines = codeflash_output # 1.04ms -> 545μs (89.9% faster)

def test_large_scale_many_lines():
"""Test with a model outputting a large number of points."""
class ManyLinesModel(DummyModel):
def forward(self, x):
# Output a larger tensor
return torch.ones((1, 5, 64, 64), dtype=torch.float32)
image = np.ones((256, 256, 3), dtype=np.uint8)
model = ManyLinesModel(out_shape=(1, 5, 64, 64))
codeflash_output = pred_lines(image, model, input_shape=[64, 64], score_thr=0.1, dist_thr=0.5); lines = codeflash_output # 997μs -> 501μs (98.8% faster)

def test_large_scale_performance():
"""Test that function runs efficiently on large but reasonable data."""
image = np.ones((512, 512, 3), dtype=np.uint8)
model = DummyModel(out_shape=(1, 5, 128, 128))
codeflash_output = pred_lines(image, model, input_shape=[128, 128], score_thr=0.1, dist_thr=0.5); lines = codeflash_output # 1.60ms -> 1.08ms (48.3% faster)

def test_large_scale_empty_result():
"""Test with large image and thresholds that prevent any detection."""
image = np.ones((512, 512, 3), dtype=np.uint8)
model = DummyModel(out_shape=(1, 5, 128, 128))
codeflash_output = pred_lines(image, model, input_shape=[128, 128], score_thr=1000, dist_thr=1000); lines = codeflash_output # 1.48ms -> 1.07ms (38.0% faster)

codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

#------------------------------------------------
import itertools

import cv2
import numpy as np

imports

import pytest
import torch
from invokeai.backend.image_util.mlsd.utils import pred_lines
from torch.nn import functional as F

Helper: Dummy model for testing

class DummyModel(torch.nn.Module):
def init(self, output_shape, device="cpu", center_val=0.5, disp_val=30.0):
super().init()
self.output_shape = output_shape
self.device = torch.device(device)
self.center_val = center_val
self.disp_val = disp_val

def forward(self, x):
    # output shape: (1, 5, H, W)
    b, c, h, w = 1, 5, self.output_shape[2], self.output_shape[3]
    out = torch.zeros((b, c, h, w), device=self.device)
    # Center (channel 0): fill with center_val
    out[:, 0, :, :] = self.center_val
    # Displacement (channels 1-4): fill with disp_val
    out[:, 1:5, :, :] = self.disp_val
    return out

Helper: Model that produces no lines (low score, low disp)

class ZeroModel(torch.nn.Module):
def init(self, output_shape, device="cpu"):
super().init()
self.output_shape = output_shape
self.device = torch.device(device)

def forward(self, x):
    b, c, h, w = 1, 5, self.output_shape[2], self.output_shape[3]
    out = torch.zeros((b, c, h, w), device=self.device)
    # Center (channel 0): fill with -10 (sigmoid(-10) ~ 0)
    out[:, 0, :, :] = -10.0
    # Displacement (channels 1-4): fill with 0
    out[:, 1:5, :, :] = 0.0
    return out

Helper: Model with random output

class RandomModel(torch.nn.Module):
def init(self, output_shape, device="cpu", seed=None):
super().init()
self.output_shape = output_shape
self.device = torch.device(device)
self.seed = seed

def forward(self, x):
    if self.seed is not None:
        torch.manual_seed(self.seed)
    b, c, h, w = 1, 5, self.output_shape[2], self.output_shape[3]
    out = torch.randn((b, c, h, w), device=self.device)
    return out

Basic Test Cases

def test_basic_detects_lines():
# Test that lines are detected for a simple image and model
image = np.ones((512, 512, 3), dtype=np.uint8) * 127 # mid-gray
model = DummyModel((1, 5, 512, 512), center_val=10.0, disp_val=30.0) # high score, high disp
codeflash_output = pred_lines(image, model); lines = codeflash_output # 9.90ms -> 8.80ms (12.6% faster)

def test_basic_no_lines_low_score():
# Test that no lines are detected if scores are low
image = np.ones((512, 512, 3), dtype=np.uint8) * 127
model = DummyModel((1, 5, 512, 512), center_val=-10.0, disp_val=30.0) # low score
codeflash_output = pred_lines(image, model); lines = codeflash_output # 9.35ms -> 8.20ms (14.0% faster)

def test_basic_no_lines_low_disp():
# Test that no lines are detected if displacement is too small
image = np.ones((512, 512, 3), dtype=np.uint8) * 127
model = DummyModel((1, 5, 512, 512), center_val=10.0, disp_val=1.0) # high score, low disp
codeflash_output = pred_lines(image, model); lines = codeflash_output # 9.52ms -> 8.21ms (16.0% faster)

def test_basic_output_shape_and_type():
# Test output shape and type for a valid detection
image = np.ones((512, 512, 3), dtype=np.uint8) * 127
model = DummyModel((1, 5, 512, 512), center_val=10.0, disp_val=30.0)
codeflash_output = pred_lines(image, model); lines = codeflash_output # 9.51ms -> 8.10ms (17.5% faster)

Edge Test Cases

def test_edge_small_image():
# Test with a very small image
image = np.ones((10, 10, 3), dtype=np.uint8) * 127
model = DummyModel((1, 5, 512, 512), center_val=10.0, disp_val=30.0)
codeflash_output = pred_lines(image, model); lines = codeflash_output # 9.75ms -> 8.35ms (16.7% faster)

def test_edge_different_input_shape():
# Test with custom input_shape argument
image = np.ones((256, 256, 3), dtype=np.uint8) * 127
model = DummyModel((1, 5, 256, 256), center_val=10.0, disp_val=30.0)
codeflash_output = pred_lines(image, model, input_shape=[256, 256]); lines = codeflash_output # 2.96ms -> 2.28ms (30.3% faster)

def test_edge_score_thr_and_dist_thr():
# Test with higher score_thr and dist_thr
image = np.ones((512, 512, 3), dtype=np.uint8) * 127
model = DummyModel((1, 5, 512, 512), center_val=0.5, disp_val=30.0) # sigmoid(0.5) ~ 0.62
codeflash_output = pred_lines(image, model, score_thr=0.7, dist_thr=40.0); lines = codeflash_output # 9.37ms -> 8.11ms (15.5% faster)

def test_edge_no_lines_model():
# Test with a model that always outputs low scores and low displacement
image = np.ones((512, 512, 3), dtype=np.uint8) * 127
model = ZeroModel((1, 5, 512, 512))
codeflash_output = pred_lines(image, model); lines = codeflash_output # 9.36ms -> 8.06ms (16.2% faster)

def test_edge_invalid_image_shape():
# Test with invalid image shape (should raise error)
image = np.ones((512, 512), dtype=np.uint8) # missing channel
model = DummyModel((1, 5, 512, 512))
with pytest.raises(ValueError):
pred_lines(image, model) # 3.55μs -> 4.32μs (17.7% slower)

def test_edge_invalid_model_output_shape():
# Test with model output shape not matching expected (should raise error)
class BadShapeModel(torch.nn.Module):
def forward(self, x):
# Wrong shape
return torch.zeros((2, 5, 512, 512))
image = np.ones((512, 512, 3), dtype=np.uint8) * 127
model = BadShapeModel()
with pytest.raises(AssertionError):
pred_lines(image, model) # 3.04ms -> 2.23ms (36.1% faster)

def test_edge_device_gpu_cpu():
# Test with model on GPU if available, otherwise skip
if torch.cuda.is_available():
image = np.ones((512, 512, 3), dtype=np.uint8) * 127
model = DummyModel((1, 5, 512, 512), device="cuda", center_val=10.0, disp_val=30.0)
codeflash_output = pred_lines(image, model); lines = codeflash_output

Large Scale Test Cases

def test_large_many_lines():
# Test with a large image and model output, but not exceeding 100MB
# 512x512x3 (image) and 1x5x512x512 (output) is ~5MB
image = np.ones((512, 512, 3), dtype=np.uint8) * 127
model = DummyModel((1, 5, 512, 512), center_val=10.0, disp_val=30.0)
codeflash_output = pred_lines(image, model); lines = codeflash_output # 10.1ms -> 8.33ms (21.0% faster)

def test_large_random_model():
# Test with random values, check that output is valid and deterministic with seed
image = np.ones((512, 512, 3), dtype=np.uint8) * 127
model = RandomModel((1, 5, 512, 512), seed=42)
codeflash_output = pred_lines(image, model); lines1 = codeflash_output # 18.6ms -> 17.2ms (7.70% faster)
model = RandomModel((1, 5, 512, 512), seed=42)
codeflash_output = pred_lines(image, model); lines2 = codeflash_output # 18.6ms -> 17.2ms (7.98% faster)

def test_large_different_input_shapes():
# Test with several input shapes up to 512x512
for shape in [(128, 128), (256, 256), (512, 512)]:
image = np.ones((shape[0], shape[1], 3), dtype=np.uint8) * 127
model = DummyModel((1, 5, shape[0], shape[1]), center_val=10.0, disp_val=30.0)
codeflash_output = pred_lines(image, model, input_shape=[shape[0], shape[1]]); lines = codeflash_output # 13.7ms -> 11.2ms (22.9% faster)

def test_large_batch_performance():
# Test performance with max allowed size (not exceeding 100MB)
# 1x5x512x512 is ~5MB, so we can do 20 times safely
image = np.ones((512, 512, 3), dtype=np.uint8) * 127
model = DummyModel((1, 5, 512, 512), center_val=10.0, disp_val=30.0)
for _ in range(20):
codeflash_output = pred_lines(image, model); lines = codeflash_output # 189ms -> 160ms (18.1% faster)

codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes git checkout codeflash/optimize-pred_lines-mhvp5t7q and push.

Codeflash Static Badge

The optimized code achieves a **17% speedup** by targeting key performance bottlenecks in tensor operations and memory management:

**Key Optimizations:**

1. **Reduced Memory Allocations**: In `deccode_output_score_and_ptss`, replaced `heat = heat * keep` with in-place `heat.mul_(keep)`, eliminating temporary tensor creation. This saves both memory and computation time.

2. **More Efficient Tensor Indexing**: Changed `tpMap[:, 1:5, :, :][0]` to direct `tpMap[0, 1:5]`, avoiding intermediate tensor creation and reducing memory overhead.

3. **Optimized Image Preprocessing**: Replaced `np.concatenate` with `np.dstack` for channel stacking, which is faster for axis=-1 operations. Used in-place division with `np.divide(..., out=batch_image)` to avoid creating temporary arrays during normalization.

4. **Vectorized Line Detection**: The most significant improvement replaces the Python loop over candidate points with vectorized NumPy operations. Instead of iterating through 8,000+ points individually, the code now uses boolean masking to filter valid points in batch operations, dramatically reducing per-iteration overhead.

5. **Pre-allocated Result Arrays**: Uses `np.empty` to allocate the final segments array directly rather than building a list and converting, eliminating list append operations and final array conversion.

6. **Optimized Distance Calculation**: Replaced `np.sum((start - end) ** 2, axis=-1)` with `np.einsum('ijk,ijk->ij', diff, diff)`, which is more efficient for element-wise dot products.

**Performance Impact**: The line profiler shows the vectorized approach eliminates the expensive loop (originally 17% of runtime in `pred_lines`). The optimizations are particularly effective for larger models and images, with test cases showing 20-90% improvements on large-scale scenarios while maintaining smaller but consistent gains across all test cases.

**Device Optimization**: Minor improvement in `get_effective_device` by checking buffers before parameters, as buffers are typically fewer and checking non-CPU devices early can short-circuit the iteration.
@codeflash-ai codeflash-ai bot requested a review from mashraf-222 November 12, 2025 07:47
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Nov 12, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant