codeflash-ai bot commented Nov 12, 2025

📄 **12% (0.12x) speedup** for `NoiseOutput.build` in `invokeai/app/invocations/noise.py`

⏱️ Runtime: 627 microseconds → 562 microseconds (best of 114 runs)

📝 Explanation and details

The optimization replaces `latents.size()` method calls with `latents.shape` attribute access and caches the shape in a local variable so the tensor dimensions are looked up only once.

**Key changes:**

- Replaced `latents.size()[3]` and `latents.size()[2]` with `shape[3]` and `shape[2]`
- Added `shape = latents.shape` to cache the tensor dimensions (see the sketch after this list)
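For reference, here is a minimal sketch of what the optimized method plausibly looks like, reconstructed from the description above and the `LatentsField`/`LATENT_SCALE_FACTOR` names exercised by the regression tests below; the actual class body in `invokeai/app/invocations/noise.py` may differ in detail:

```python
@classmethod
def build(cls, latents_name: str, latents: torch.Tensor, seed: int) -> "NoiseOutput":
    # Before: width=latents.size()[3] * LATENT_SCALE_FACTOR,
    #         height=latents.size()[2] * LATENT_SCALE_FACTOR
    # Cache the shape once; .shape is a plain attribute lookup, so the two
    # dimension reads below avoid a second .size() method call.
    shape = latents.shape
    return cls(
        noise=LatentsField(latents_name=latents_name, seed=seed),
        width=shape[3] * LATENT_SCALE_FACTOR,   # last dimension is width
        height=shape[2] * LATENT_SCALE_FACTOR,  # second-to-last is height
    )
```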

**Why this is faster:**

1. **Method call elimination**: `latents.shape` is a direct attribute access, while `latents.size()` is a method call that carries function-call overhead in Python (a minimal timing sketch follows this list)
2. **Single shape computation**: the tensor shape is fetched once and reused, rather than calling `latents.size()` twice
3. **Reduced indexing operations**: per the line profiler, dimension-access time dropped significantly: the width calculation went from 208.6μs to 72.3μs (65% faster) and the height calculation from 110μs to 51.4μs (53% faster)
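The first point is easy to observe in isolation. Below is a minimal timing sketch, assuming only `torch` is installed; the absolute numbers are machine-dependent and are not the profiler figures quoted above:

```python
import timeit

import torch

latents = torch.zeros((1, 4, 32, 32))

# .size() goes through a bound-method call on every access;
# .shape resolves as a plain attribute and skips that overhead.
t_size = timeit.timeit(lambda: latents.size()[3], number=1_000_000)
t_shape = timeit.timeit(lambda: latents.shape[3], number=1_000_000)
print(f".size()[3]: {t_size:.3f}s   .shape[3]: {t_shape:.3f}s")
```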

**Performance impact:**
The optimization delivers an 11% speedup (627μs → 562μs) and shows consistent improvements across all test cases (5-17% faster per test). This is particularly valuable since the function appears to be called frequently during noise-generation workflows in the InvokeAI inference pipeline; even small per-call improvements compound when the function is invoked hundreds of times during image generation.

**Test case benefits:**
The optimization performs well across all tensor sizes, with particularly strong gains on smaller tensors (up to 17% faster on minimal shapes), where the relative overhead of the method call is highest.

**Correctness verification report:**

| Test | Status |
|------|--------|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 411 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime

```python
import pytest  # used for our unit tests

# function to test

import torch
from invokeai.app.invocations.noise import NoiseOutput

# Dummy constants and classes to allow isolated testing

LATENT_SCALE_FACTOR = 8  # Typical value in diffusers

class FieldDescriptions:
    noise = "Noise tensor"
    width = "Width of output"
    height = "Height of output"

class LatentsField:
    def __init__(self, latents_name, seed):
        self.latents_name = latents_name
        self.seed = seed

class OutputField:
    def __init__(self, description):
        self.description = description

def invocation_output(name):
    # Dummy decorator for our tests
    def decorator(cls):
        return cls
    return decorator

class BaseInvocationOutput:
    pass

from invokeai.app.invocations.noise import NoiseOutput

# unit tests

# 1. Basic Test Cases

def test_basic_build_returns_correct_width_height_and_noise():
    # Test with a typical latent tensor shape: (batch, channels, height, width)
    latents = torch.zeros((1, 4, 32, 32))  # 32x32 latent
    codeflash_output = NoiseOutput.build("latent1", latents, 123); result = codeflash_output  # 9.14μs -> 8.13μs (12.4% faster)

def test_build_with_different_latent_name_and_seed():
    latents = torch.ones((2, 8, 16, 16))
    codeflash_output = NoiseOutput.build("foo", latents, 999); result = codeflash_output  # 7.82μs -> 7.16μs (9.12% faster)

def test_build_with_non_square_latents():
    latents = torch.ones((1, 4, 24, 36))
    codeflash_output = NoiseOutput.build("bar", latents, 42); result = codeflash_output  # 8.01μs -> 7.56μs (5.94% faster)

# 2. Edge Test Cases

def test_build_with_minimum_size_latents():
    # Smallest possible tensor with 1x1 spatial dimensions
    latents = torch.randn((1, 1, 1, 1))
    codeflash_output = NoiseOutput.build("edgecase", latents, 0); result = codeflash_output  # 8.53μs -> 7.83μs (8.96% faster)

def test_build_with_large_seed_value():
    latents = torch.zeros((1, 4, 8, 8))
    seed = 2**31 - 1  # Maximum 32-bit signed int
    codeflash_output = NoiseOutput.build("maxseed", latents, seed); result = codeflash_output  # 8.38μs -> 7.72μs (8.48% faster)

def test_build_with_negative_seed_value():
    latents = torch.zeros((1, 4, 8, 8))
    seed = -12345
    codeflash_output = NoiseOutput.build("negseed", latents, seed); result = codeflash_output  # 8.30μs -> 7.34μs (13.1% faster)

def test_build_with_different_batch_and_channel_sizes():
    latents = torch.ones((3, 7, 10, 12))
    codeflash_output = NoiseOutput.build("batchchan", latents, 5); result = codeflash_output  # 8.38μs -> 7.57μs (10.7% faster)

def test_build_with_zero_width_or_height_raises():
    # Should raise IndexError when accessing size()[2] or size()[3] if tensor is too small
    latents = torch.ones((1, 4, 0, 32))
    with pytest.raises(IndexError):
        NoiseOutput.build("zerowidth", latents, 1)
    latents = torch.ones((1, 4, 32, 0))
    with pytest.raises(IndexError):
        NoiseOutput.build("zeroheight", latents, 1)

def test_build_with_invalid_shape_raises():
    # Tensor missing spatial dims
    latents = torch.ones((1, 4, 32))  # Only 3 dims
    with pytest.raises(IndexError):
        NoiseOutput.build("invalidshape", latents, 1)
    # Tensor with too many dims
    latents = torch.ones((1, 4, 32, 32, 1))
    with pytest.raises(IndexError):
        NoiseOutput.build("toomanydims", latents, 1)

def test_build_with_large_latent_tensor():
    # Size: (2, 8, 128, 128) -- total elements: 2*8*128*128 = 262144
    latents = torch.randn((2, 8, 128, 128))
    codeflash_output = NoiseOutput.build("large", latents, 123456); result = codeflash_output  # 14.0μs -> 13.5μs (4.25% faster)

def test_build_with_maximum_allowed_tensor_size():
    # Stay under 100MB: float32 = 4 bytes, so max elements = 25,000,000
    # Let's use (1, 4, 250, 250): 1*4*250*250 = 250,000 elements (1MB)
    latents = torch.randn((1, 4, 250, 250))
    codeflash_output = NoiseOutput.build("maxsize", latents, 9999); result = codeflash_output  # 12.4μs -> 11.3μs (9.82% faster)

def test_build_with_many_batches_and_channels():
    # Large batch and channel count, but small spatial dims
    latents = torch.randn((100, 50, 8, 8))
    codeflash_output = NoiseOutput.build("manybatchchan", latents, 7); result = codeflash_output  # 12.5μs -> 11.4μs (9.67% faster)

def test_build_with_randomized_inputs():
    # Test random sizes within reasonable bounds
    for batch in [1, 5, 10]:
        for channels in [1, 4, 16]:
            for h in [8, 32, 64]:
                for w in [8, 32, 64]:
                    latents = torch.randn((batch, channels, h, w))
                    seed = batch * channels * h * w
                    codeflash_output = NoiseOutput.build("rand", latents, seed); result = codeflash_output
```
codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

```python
#------------------------------------------------
import pytest  # used for our unit tests
import torch
from invokeai.app.invocations.noise import NoiseOutput

# function to test

LATENT_SCALE_FACTOR = 8  # For testing purposes; in real code, import from constants

class FieldDescriptions:
    noise = "Noise tensor field"
    width = "Width of noise"
    height = "Height of noise"

class LatentsField:
    def __init__(self, latents_name, seed):
        self.latents_name = latents_name
        self.seed = seed

class OutputField:
    def __init__(self, description):
        self.description = description

def invocation_output(name):
    def decorator(cls):
        cls._output_name = name
        return cls
    return decorator

class BaseInvocationOutput:
    pass

from invokeai.app.invocations.noise import NoiseOutput

# unit tests

# -------------- Basic Test Cases --------------

def test_build_basic_shape_and_values():
    # Test with a standard 4D tensor shape
    latents = torch.zeros((1, 4, 32, 64))
    codeflash_output = NoiseOutput.build("test_latents", latents, 42); result = codeflash_output  # 11.8μs -> 11.7μs (1.24% faster)

def test_build_with_different_seed_and_name():
    # Test with different latents_name and seed
    latents = torch.ones((2, 8, 16, 32))
    codeflash_output = NoiseOutput.build("other_latents", latents, 123456); result = codeflash_output  # 9.27μs -> 8.71μs (6.46% faster)

def test_build_with_minimal_valid_shape():
    # Minimal valid shape is (1,1,1,1)
    latents = torch.rand((1, 1, 1, 1))
    codeflash_output = NoiseOutput.build("min_latents", latents, 0); result = codeflash_output  # 8.98μs -> 7.66μs (17.2% faster)

# -------------- Edge Test Cases --------------

def test_build_raises_on_zero_height():
    # Pass a tensor with zero height (size(2))
    latents = torch.zeros((1, 4, 0, 64))
    with pytest.raises(ValueError):
        NoiseOutput.build("fail", latents, 1)

def test_build_raises_on_zero_width():
    # Pass a tensor with zero width (size(3))
    latents = torch.zeros((1, 4, 32, 0))
    with pytest.raises(ValueError):
        NoiseOutput.build("fail", latents, 1)

def test_build_with_large_seed_and_name():
    # Very large seed and long name
    latents = torch.ones((1, 1, 2, 2))
    codeflash_output = NoiseOutput.build("X"*1000, latents, 2**62); result = codeflash_output  # 11.8μs -> 11.2μs (5.67% faster)

def test_build_with_negative_seed():
    # Negative seed should still be accepted
    latents = torch.ones((1, 1, 2, 2))
    codeflash_output = NoiseOutput.build("neg_seed", latents, -12345); result = codeflash_output  # 9.25μs -> 8.44μs (9.59% faster)

def test_build_with_single_channel_and_batch():
    # Single batch, single channel, normal height/width
    latents = torch.rand((1, 1, 10, 10))
    codeflash_output = NoiseOutput.build("single", latents, 7); result = codeflash_output  # 8.82μs -> 7.90μs (11.6% faster)

# -------------- Large Scale Test Cases --------------

def test_build_large_tensor_shape():
    # Test with a large tensor, but <100MB
    # float32: 4 bytes, so (1,4,128,128) = 1*4*128*128*4 = 262144 bytes = ~0.25MB
    latents = torch.rand((1, 4, 128, 128))
    codeflash_output = NoiseOutput.build("large_latents", latents, 999); result = codeflash_output  # 9.79μs -> 8.79μs (11.3% faster)

def test_build_tensor_with_max_accepted_dimensions():
    # Test with a tensor at the upper limit of allowed shape
    # (1, 4, 256, 256) = 1*4*256*256*4 = 1,048,576 bytes = ~1MB
    latents = torch.rand((1, 4, 256, 256))
    codeflash_output = NoiseOutput.build("max_latents", latents, 2024); result = codeflash_output  # 11.8μs -> 10.5μs (11.6% faster)

def test_build_many_invocations():
    # Test building many outputs in a loop (scalability, determinism)
    for i in range(100):
        latents = torch.ones((1, 1, i+1, i+2))
        codeflash_output = NoiseOutput.build(f"name_{i}", latents, i); result = codeflash_output  # 227μs -> 201μs (12.7% faster)

def test_build_tensor_with_high_batch_and_channel():
    # Test with high batch and channel, but reasonable width/height
    latents = torch.rand((16, 32, 8, 8))
    codeflash_output = NoiseOutput.build("high_batch_channel", latents, 555); result = codeflash_output  # 8.86μs -> 7.90μs (12.2% faster)

def test_build_tensor_with_maximum_elements_under_100MB():
    # Calculate maximum shape under 100MB for float32
    # 100MB = 100*1024*1024 = 104857600 bytes
    # Each element = 4 bytes, so max elements = 104857600 // 4 = 26214400
    # Let's use shape (1, 4, 256, 256) as above, which is well below the limit
    latents = torch.rand((1, 4, 256, 256))
    codeflash_output = NoiseOutput.build("max_size", latents, 888); result = codeflash_output  # 10.5μs -> 9.82μs (7.00% faster)
```

codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, `git checkout codeflash/optimize-NoiseOutput.build-mhvu1r2b` and push.

