Conversation


codeflash-ai bot commented on Nov 12, 2025

📄 9% (0.09x) speedup for `eager_attention_forward` in `src/transformers/models/vit_mae/modeling_vit_mae.py`

⏱️ Runtime: 2.73 milliseconds → 2.50 milliseconds (best of 250 runs)

📝 Explanation and details

The optimized code achieves a 9% speedup through three key optimizations that reduce memory allocations and computational overhead:

**1. In-place tensor operations**: The most significant change replaces expensive tensor arithmetic with in-place operations:

- `attn_weights = torch.matmul(query, key.transpose(2, 3)) * scaling` becomes two steps: first the matmul, then `attn_weights.mul_(scaling)`
- `attn_weights = attn_weights + attention_mask` becomes `attn_weights.add_(attention_mask)`

This eliminates intermediate tensor allocations. The line profiler shows the matmul operation time dropped from 2.20ms to 1.45ms (34% reduction), and the mask addition dropped from 244μs to 203μs (17% reduction).
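
As a standalone illustration of the pattern (shapes here are arbitrary, chosen only for the example):

```python
import torch

# Arbitrary example shapes: (batch, heads, seq_len, head_dim)
query = torch.randn(1, 2, 4, 8)
key = torch.randn(1, 2, 4, 8)
attention_mask = torch.zeros(1, 2, 4, 4)
scaling = 8 ** -0.5

# Before: each arithmetic step materializes a new tensor.
weights_before = torch.matmul(query, key.transpose(2, 3)) * scaling
weights_before = weights_before + attention_mask

# After: one allocation from the matmul, then scale and add the mask in place.
weights_after = torch.matmul(query, key.transpose(2, 3))
weights_after.mul_(scaling)
weights_after.add_(attention_mask)

assert torch.allclose(weights_before, weights_after)
```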

**2. Conditional dropout check**: Adding `if dropout > 0.0:` before calling `nn.functional.dropout` avoids unnecessary function-call overhead when dropout is disabled. The profiler shows this particularly benefits the test cases where dropout=0.0: the dropout line is now executed only in the 7 calls that actually need it, instead of all 30.
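
In isolation, the guard is numerically a no-op when `p == 0.0`; it only removes the call overhead. A quick, self-contained check (tensor values are illustrative):

```python
import torch
from torch import nn

attn_weights = torch.softmax(torch.randn(2, 4, 8, 8), dim=-1)
dropout = 0.0

# Guarded path: skip the call entirely when dropout would do nothing.
guarded = attn_weights if dropout <= 0.0 else nn.functional.dropout(attn_weights, p=dropout, training=True)
# Unguarded path: dropout with p=0.0 keeps every element and scales by 1/(1-0) = 1.
unguarded = nn.functional.dropout(attn_weights, p=dropout, training=True)

assert torch.allclose(guarded, unguarded)
```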

**3. Memory efficiency**: By performing operations in-place on the attention-weights tensor rather than creating new tensors for each arithmetic operation, the code reduces memory pressure and improves cache locality.

**Performance impact**: The test results show consistent 8-18% speedups across various scenarios, with particularly strong gains in edge cases (zero-length sequences, singleton dimensions) and large-scale tests. The optimization is most effective when attention masks are used and when dropout is zero, making it beneficial for both training and inference workloads in transformer attention mechanisms.
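
Putting the three changes together, a minimal sketch of the optimized forward pass (simplified; the exact signature and the `scaling=None` default handling in `modeling_vit_mae.py` may differ):

```python
import torch
from torch import nn


def eager_attention_forward_sketch(module, query, key, value, attention_mask,
                                   scaling=None, dropout=0.0):
    # Sketch of the optimization pattern described above, not the verbatim library code.
    if scaling is None:
        scaling = query.size(-1) ** -0.5  # assumed default when no scaling is given

    # Single allocation for the attention scores, then scale in place.
    attn_weights = torch.matmul(query, key.transpose(2, 3))
    attn_weights.mul_(scaling)

    if attention_mask is not None:
        # Truncate the mask to the key length (see the mask-truncation tests) and add in place.
        attn_weights.add_(attention_mask[:, :, :, : key.shape[-2]])

    attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query.dtype)

    # Only pay for dropout when it can actually change something.
    if dropout > 0.0:
        attn_weights = nn.functional.dropout(attn_weights, p=dropout, training=module.training)

    attn_output = torch.matmul(attn_weights, value)
    attn_output = attn_output.transpose(1, 2).contiguous()
    return attn_output, attn_weights
```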

Correctness verification report:

| Test | Status |
|------|--------|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 30 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime

```python
import math

# imports
import pytest  # used for our unit tests
import torch
from torch import nn
from transformers.models.vit_mae.modeling_vit_mae import \
    eager_attention_forward

# unit tests

# Helper class to mock training/eval mode
class DummyModule(nn.Module):
    def __init__(self, training=True):
        super().__init__()
        self.training = training

#############################################
# 1. Basic Test Cases
#############################################

def test_basic_shapes_and_output():
    # Test with small, valid shapes and no mask
    batch, num_heads, seq_len, head_dim = 2, 4, 5, 8
    module = DummyModule(training=False)
    query = torch.randn(batch, num_heads, seq_len, head_dim)
    key = torch.randn(batch, num_heads, seq_len, head_dim)
    value = torch.randn(batch, num_heads, seq_len, head_dim)
    # No mask
    attn_output, attn_weights = eager_attention_forward(module, query, key, value, None)  # 82.6μs -> 74.2μs (11.4% faster)
    # Softmax along last dim should sum to 1 (within tolerance)
    sums = attn_weights.sum(dim=-1)

def test_basic_with_attention_mask():
    # Test with attention mask
    batch, num_heads, tgt_len, src_len, head_dim = 1, 2, 3, 3, 4
    module = DummyModule(training=False)
    query = torch.randn(batch, num_heads, tgt_len, head_dim)
    key = torch.randn(batch, num_heads, src_len, head_dim)
    value = torch.randn(batch, num_heads, src_len, head_dim)
    # Mask shape: (batch, num_heads, tgt_len, src_len)
    attention_mask = torch.zeros(batch, num_heads, tgt_len, src_len)
    # Set one position to -inf to mask it out
    attention_mask[:, :, :, 1] = float('-inf')
    attn_output, attn_weights = eager_attention_forward(module, query, key, value, attention_mask)  # 74.6μs -> 66.2μs (12.7% faster)

def test_basic_scaling_and_dropout():
    # Test with custom scaling and dropout
    batch, num_heads, seq_len, head_dim = 1, 1, 2, 2
    module = DummyModule(training=True)
    query = torch.ones(batch, num_heads, seq_len, head_dim)
    key = torch.ones(batch, num_heads, seq_len, head_dim)
    value = torch.ones(batch, num_heads, seq_len, head_dim)
    scaling = 0.5
    dropout = 0.5
    attn_output, attn_weights = eager_attention_forward(module, query, key, value, None, scaling, dropout)  # 79.4μs -> 78.6μs (0.993% faster)
    sums = attn_weights.sum(dim=-1)

#############################################
# 2. Edge Test Cases
#############################################

def test_zero_length_sequences():
    # Zero-length sequence for query, key, value
    batch, num_heads, seq_len, head_dim = 1, 1, 0, 4
    module = DummyModule()
    query = torch.empty(batch, num_heads, seq_len, head_dim)
    key = torch.empty(batch, num_heads, seq_len, head_dim)
    value = torch.empty(batch, num_heads, seq_len, head_dim)
    attn_output, attn_weights = eager_attention_forward(module, query, key, value, None)  # 59.5μs -> 52.1μs (14.3% faster)

def test_singleton_dimensions():
    # Singleton batch, head, and sequence
    batch, num_heads, seq_len, head_dim = 1, 1, 1, 1
    module = DummyModule()
    query = torch.tensor([[[[1.0]]]])
    key = torch.tensor([[[[1.0]]]])
    value = torch.tensor([[[[2.0]]]])
    attn_output, attn_weights = eager_attention_forward(module, query, key, value, None)  # 65.3μs -> 57.6μs (13.2% faster)

def test_mask_truncation():
    # Mask longer than key sequence
    batch, num_heads, tgt_len, src_len, head_dim = 1, 1, 2, 2, 2
    module = DummyModule()
    query = torch.randn(batch, num_heads, tgt_len, head_dim)
    key = torch.randn(batch, num_heads, src_len, head_dim)
    value = torch.randn(batch, num_heads, src_len, head_dim)
    # Mask with extra columns
    attention_mask = torch.zeros(batch, num_heads, tgt_len, src_len + 2)
    # Should not error, should truncate mask
    attn_output, attn_weights = eager_attention_forward(module, query, key, value, attention_mask)  # 84.1μs -> 74.1μs (13.5% faster)

def test_all_positions_masked():
    # All positions masked (should result in NaN/0 weights after softmax)
    batch, num_heads, tgt_len, src_len, head_dim = 1, 1, 2, 2, 2
    module = DummyModule()
    query = torch.randn(batch, num_heads, tgt_len, head_dim)
    key = torch.randn(batch, num_heads, src_len, head_dim)
    value = torch.randn(batch, num_heads, src_len, head_dim)
    attention_mask = torch.full((batch, num_heads, tgt_len, src_len), float('-inf'))
    attn_output, attn_weights = eager_attention_forward(module, query, key, value, attention_mask)  # 79.9μs -> 70.7μs (13.0% faster)

def test_non_contiguous_input():
    # Test with non-contiguous tensors
    batch, num_heads, seq_len, head_dim = 2, 2, 3, 4
    module = DummyModule()
    base = torch.randn(batch, num_heads, seq_len, head_dim * 2)
    query = base[..., ::2].contiguous()  # make a view, then make contiguous
    key = base[..., 1::2]
    value = base[..., ::2]
    attn_output, attn_weights = eager_attention_forward(module, query, key, value, None)  # 65.3μs -> 57.6μs (13.3% faster)

def test_invalid_shapes_raise():
    # Mismatched shapes should raise
    batch, num_heads, seq_len, head_dim = 1, 2, 3, 4
    module = DummyModule()
    query = torch.randn(batch, num_heads, seq_len, head_dim)
    key = torch.randn(batch, num_heads, seq_len + 1, head_dim)
    value = torch.randn(batch, num_heads, seq_len, head_dim)
    with pytest.raises(RuntimeError):
        eager_attention_forward(module, query, key, value, None)  # 122μs -> 115μs (6.74% faster)

def test_invalid_mask_shape_raises():
    # Mask shape not matching batch/heads should raise
    batch, num_heads, seq_len, head_dim = 1, 2, 3, 4
    module = DummyModule()
    query = torch.randn(batch, num_heads, seq_len, head_dim)
    key = torch.randn(batch, num_heads, seq_len, head_dim)
    value = torch.randn(batch, num_heads, seq_len, head_dim)
    attention_mask = torch.zeros(batch, num_heads + 1, seq_len, seq_len)
    with pytest.raises(RuntimeError):
        eager_attention_forward(module, query, key, value, attention_mask)  # 112μs -> 110μs (0.982% faster)

#############################################
# 3. Large Scale Test Cases
#############################################

def test_large_batch_and_heads():
    # Large batch and heads, but under memory limit
    batch, num_heads, seq_len, head_dim = 8, 16, 32, 16
    module = DummyModule()
    query = torch.randn(batch, num_heads, seq_len, head_dim)
    key = torch.randn(batch, num_heads, seq_len, head_dim)
    value = torch.randn(batch, num_heads, seq_len, head_dim)
    attn_output, attn_weights = eager_attention_forward(module, query, key, value, None)  # 308μs -> 284μs (8.34% faster)
    # Check softmax sums
    sums = attn_weights.sum(dim=-1)

def test_large_seq_len():
    # Large sequence length, but under 100MB
    batch, num_heads, seq_len, head_dim = 2, 2, 128, 32
    module = DummyModule()
    query = torch.randn(batch, num_heads, seq_len, head_dim)
    key = torch.randn(batch, num_heads, seq_len, head_dim)
    value = torch.randn(batch, num_heads, seq_len, head_dim)
    attn_output, attn_weights = eager_attention_forward(module, query, key, value, None)  # 178μs -> 161μs (10.6% faster)
    # Softmax sums
    sums = attn_weights.sum(dim=-1)

def test_large_masked_attention():
    # Large mask with random masking
    batch, num_heads, seq_len, head_dim = 2, 4, 64, 16
    module = DummyModule()
    query = torch.randn(batch, num_heads, seq_len, head_dim)
    key = torch.randn(batch, num_heads, seq_len, head_dim)
    value = torch.randn(batch, num_heads, seq_len, head_dim)
    attention_mask = torch.zeros(batch, num_heads, seq_len, seq_len)
    # Randomly mask out half of the positions
    mask_indices = torch.rand(batch, num_heads, seq_len, seq_len) < 0.5
    attention_mask[mask_indices] = float('-inf')
    attn_output, attn_weights = eager_attention_forward(module, query, key, value, attention_mask)  # 128μs -> 113μs (12.4% faster)
    # Each row should sum to 1 or 0 (if all masked)
    sums = attn_weights.sum(dim=-1)
```

```python
#------------------------------------------------
import math

# imports
import pytest  # used for our unit tests
import torch
from torch import nn
from transformers.models.vit_mae.modeling_vit_mae import \
    eager_attention_forward

# unit tests

# Helper: Dummy module for dropout
class DummyModule(nn.Module):
    def __init__(self, training=False):
        super().__init__()
        self.training = training

##############
# BASIC TESTS
##############

def test_basic_identity_attention():
    # Test attention with identity key/value and query
    # Should produce output close to value when attention is uniform
    batch, heads, seq_len, dim = 2, 1, 3, 4
    module = DummyModule(training=False)
    query = torch.ones(batch, heads, seq_len, dim)
    key = torch.ones(batch, heads, seq_len, dim)
    value = torch.arange(batch * heads * seq_len * dim, dtype=torch.float32).reshape(batch, heads, seq_len, dim)
    attn_output, attn_weights = eager_attention_forward(module, query, key, value, None)  # 71.4μs -> 63.0μs (13.4% faster)
    # All weights should sum to 1 along last axis (softmax)
    sums = attn_weights.sum(dim=-1)

def test_basic_attention_mask():
    # Test with an attention mask that blocks one position
    batch, heads, seq_len, dim = 1, 1, 2, 2
    module = DummyModule(training=False)
    query = torch.tensor([[[[1.0, 0.0], [0.0, 1.0]]]])  # (1,1,2,2)
    key = torch.tensor([[[[1.0, 0.0], [0.0, 1.0]]]])
    value = torch.tensor([[[[10.0, 20.0], [30.0, 40.0]]]])
    # Mask: block attending to position 1
    attention_mask = torch.tensor([[[[0.0, -1e9], [0.0, 0.0]]]])
    attn_output, attn_weights = eager_attention_forward(module, query, key, value, attention_mask)  # 87.7μs -> 78.5μs (11.7% faster)

def test_basic_scaling():
    # Test that scaling changes the softmax output
    batch, heads, seq_len, dim = 1, 1, 2, 2
    module = DummyModule(training=False)
    query = torch.tensor([[[[1.0, 0.0], [0.0, 1.0]]]])
    key = torch.tensor([[[[1.0, 0.0], [0.0, 1.0]]]])
    value = torch.tensor([[[[1.0, 2.0], [3.0, 4.0]]]])
    # No scaling
    attn_output1, attn_weights1 = eager_attention_forward(module, query, key, value, None, scaling=1.0)  # 70.7μs -> 62.4μs (13.3% faster)
    # With default scaling (should soften the softmax)
    attn_output2, attn_weights2 = eager_attention_forward(module, query, key, value, None, scaling=None)  # 22.2μs -> 18.8μs (18.0% faster)

def test_basic_dropout_behavior():
    # Test dropout is only applied during training
    batch, heads, seq_len, dim = 1, 1, 3, 2
    query = torch.randn(batch, heads, seq_len, dim)
    key = torch.randn(batch, heads, seq_len, dim)
    value = torch.randn(batch, heads, seq_len, dim)
    module_train = DummyModule(training=True)
    module_eval = DummyModule(training=False)
    # Use high dropout for clear effect
    attn_output_train1, attn_weights_train1 = eager_attention_forward(module_train, query, key, value, None, dropout=0.9)  # 81.6μs -> 81.3μs (0.395% faster)
    attn_output_train2, attn_weights_train2 = eager_attention_forward(module_train, query, key, value, None, dropout=0.9)  # 25.9μs -> 25.9μs (0.120% faster)
    attn_output_eval, attn_weights_eval = eager_attention_forward(module_eval, query, key, value, None, dropout=0.9)  # 17.3μs -> 17.1μs (1.07% faster)

##############
# EDGE CASES
##############

def test_edge_zero_length_sequence():
    # Test with zero-length sequence
    batch, heads, seq_len, dim = 1, 1, 0, 4
    module = DummyModule(training=False)
    query = torch.empty(batch, heads, seq_len, dim)
    key = torch.empty(batch, heads, seq_len, dim)
    value = torch.empty(batch, heads, seq_len, dim)
    # Should not throw, but output shapes should be correct
    attn_output, attn_weights = eager_attention_forward(module, query, key, value, None)  # 57.9μs -> 51.2μs (13.0% faster)

def test_edge_singleton_batch_head():
    # Test with batch size 1, head size 1, sequence length 1
    batch, heads, seq_len, dim = 1, 1, 1, 2
    module = DummyModule(training=False)
    query = torch.ones(batch, heads, seq_len, dim)
    key = torch.ones(batch, heads, seq_len, dim)
    value = torch.ones(batch, heads, seq_len, dim)
    attn_output, attn_weights = eager_attention_forward(module, query, key, value, None)  # 62.5μs -> 53.6μs (16.6% faster)

def test_edge_attention_mask_shape_mismatch():
    # Test that attention_mask longer than key seq is truncated
    batch, heads, tgt_len, src_len, dim = 1, 1, 2, 2, 2
    module = DummyModule(training=False)
    query = torch.randn(batch, heads, tgt_len, dim)
    key = torch.randn(batch, heads, src_len, dim)
    value = torch.randn(batch, heads, src_len, dim)
    # Mask has extra keys
    attention_mask = torch.zeros(batch, heads, tgt_len, src_len + 2)
    # Should not raise, mask should be truncated to src_len
    attn_output, attn_weights = eager_attention_forward(module, query, key, value, attention_mask)  # 82.6μs -> 75.1μs (10.0% faster)

def test_edge_non_float_inputs():
    # Test with integer inputs (should cast to float internally)
    batch, heads, seq_len, dim = 1, 1, 2, 2
    module = DummyModule(training=False)
    query = torch.ones(batch, heads, seq_len, dim, dtype=torch.int32)
    key = torch.ones(batch, heads, seq_len, dim, dtype=torch.int32)
    value = torch.ones(batch, heads, seq_len, dim, dtype=torch.int32)
    # Should not raise
    attn_output, attn_weights = eager_attention_forward(module, query.float(), key.float(), value.float(), None)  # 62.3μs -> 52.9μs (17.8% faster)

def test_edge_large_negative_mask():
    # Test that large negative mask leads to zero attention
    batch, heads, seq_len, dim = 1, 1, 2, 2
    module = DummyModule(training=False)
    query = torch.ones(batch, heads, seq_len, dim)
    key = torch.ones(batch, heads, seq_len, dim)
    value = torch.ones(batch, heads, seq_len, dim)
    # Mask all positions with large negative
    attention_mask = torch.full((batch, heads, seq_len, seq_len), -1e9)
    attn_output, attn_weights = eager_attention_forward(module, query, key, value, attention_mask)  # 80.4μs -> 71.2μs (13.0% faster)
    # Softmax(-inf) = uniform (since all are -inf), so sum to 1
    sums = attn_weights.sum(dim=-1)

##############
# LARGE SCALE TESTS
##############

def test_large_scale_attention():
    # Test with large but manageable sizes (<100MB)
    batch, heads, seq_len, dim = 2, 4, 64, 32  # 2*4*64*32*4 = 65536 bytes per tensor, safely under 100MB
    module = DummyModule(training=False)
    query = torch.randn(batch, heads, seq_len, dim)
    key = torch.randn(batch, heads, seq_len, dim)
    value = torch.randn(batch, heads, seq_len, dim)
    attn_output, attn_weights = eager_attention_forward(module, query, key, value, None)  # 147μs -> 132μs (10.8% faster)
    # Check sum of weights
    sums = attn_weights.sum(dim=-1)

def test_large_scale_attention_mask():
    # Large-scale with attention mask
    batch, heads, seq_len, dim = 2, 2, 32, 16
    module = DummyModule(training=False)
    query = torch.randn(batch, heads, seq_len, dim)
    key = torch.randn(batch, heads, seq_len, dim)
    value = torch.randn(batch, heads, seq_len, dim)
    # Random mask (some positions blocked)
    attention_mask = torch.zeros(batch, heads, seq_len, seq_len)
    attention_mask[:, :, :, seq_len // 2:] = -1e9  # block half
    attn_output, attn_weights = eager_attention_forward(module, query, key, value, attention_mask)  # 97.4μs -> 85.7μs (13.7% faster)

def test_large_scale_dropout():
    # Large-scale with dropout
    batch, heads, seq_len, dim = 2, 2, 32, 16
    module = DummyModule(training=True)
    query = torch.randn(batch, heads, seq_len, dim)
    key = torch.randn(batch, heads, seq_len, dim)
    value = torch.randn(batch, heads, seq_len, dim)
    attn_output1, attn_weights1 = eager_attention_forward(module, query, key, value, None, dropout=0.5)  # 150μs -> 148μs (1.07% faster)
    attn_output2, attn_weights2 = eager_attention_forward(module, query, key, value, None, dropout=0.5)  # 81.5μs -> 81.1μs (0.541% faster)

def test_large_scale_non_square_attention():
    # Test with different query and key lengths
    batch, heads, q_len, k_len, dim = 2, 2, 16, 32, 8
    module = DummyModule(training=False)
    query = torch.randn(batch, heads, q_len, dim)
    key = torch.randn(batch, heads, k_len, dim)
    value = torch.randn(batch, heads, k_len, dim)
    attn_output, attn_weights = eager_attention_forward(module, query, key, value, None)  # 88.2μs -> 79.7μs (10.7% faster)
    # Softmax sums to 1 along last axis
    sums = attn_weights.sum(dim=-1)
```

`codeflash_output` is used to check that the output of the original code is the same as that of the optimized code.
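
Conceptually, that check amounts to running both revisions on identical inputs and comparing the returned tensors, along the lines of this hypothetical helper (`original_fn` / `optimized_fn` are placeholders for the two versions of `eager_attention_forward`):

```python
import torch

def assert_same_outputs(original_fn, optimized_fn, *args, **kwargs):
    # Hypothetical equivalence check: both revisions must return numerically matching tensors.
    for old, new in zip(original_fn(*args, **kwargs), optimized_fn(*args, **kwargs)):
        assert torch.allclose(old, new, atol=1e-6, equal_nan=True)
```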

To edit these changes, `git checkout codeflash/optimize-eager_attention_forward-mhvqcaqf` and push.

Codeflash Static Badge

codeflash-ai bot requested a review from mashraf-222 on November 12, 2025 08:20
codeflash-ai bot added the labels ⚡️ codeflash (Optimization PR opened by Codeflash AI) and 🎯 Quality: High (Optimization Quality according to Codeflash) on Nov 12, 2025