Conversation


@codeflash-ai codeflash-ai bot commented Nov 12, 2025

📄 26% (0.26x) speedup for MarkupLMEmbeddings.create_position_ids_from_input_ids in src/transformers/models/markuplm/modeling_markuplm.py

⏱️ Runtime : 1.88 milliseconds → 1.49 milliseconds (best of 191 runs)

📝 Explanation and details

The optimized code achieves a 26% speedup by eliminating unnecessary tensor operations and type conversions in the create_position_ids_from_input_ids method.

Key optimizations applied (a before/after sketch follows this list):

  1. Eliminated redundant .int() cast: The original code converted the boolean mask to int unnecessarily. The optimized version keeps the mask as a boolean tensor, which PyTorch can work with directly in torch.cumsum().

  2. Removed .type_as() operation: The original code used .type_as(mask) to match tensor types, but this is redundant since torch.cumsum() on boolean tensors already returns the appropriate integer type (long).

  3. Simplified conditional addition: Instead of always adding past_key_values_length in the expression, the optimized code only performs the addition when past_key_values_length != 0, avoiding unnecessary operations in the common case where it's zero.

  4. Eliminated final .long() cast: The cumsum operation already produces long tensors, making the explicit cast redundant.
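
For reference, here is a minimal before/after sketch reconstructed from the description above. It is not the literal PR diff, and the actual method in modeling_markuplm.py may differ in naming and signature.

import torch

def create_position_ids_original(input_ids, padding_idx, past_key_values_length=0):
    # Original pattern: cast the boolean mask to int, then type_as() and a final .long()
    mask = input_ids.ne(padding_idx).int()
    incremental_indices = (torch.cumsum(mask, dim=1).type_as(mask) + past_key_values_length) * mask
    return incremental_indices.long() + padding_idx

def create_position_ids_optimized(input_ids, padding_idx, past_key_values_length=0):
    # Optimized pattern: keep the mask boolean; cumsum on a bool tensor already yields int64
    mask = input_ids.ne(padding_idx)
    incremental_indices = torch.cumsum(mask, dim=1)
    if past_key_values_length != 0:  # skip the addition in the common zero case
        incremental_indices = incremental_indices + past_key_values_length
    return incremental_indices * mask + padding_idx

Both variants produce the same long-dtype position ids: padded positions stay at padding_idx, and non-padded positions count up from padding_idx + 1.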

Why this leads to speedup (a quick dtype check follows these bullets):

  • Fewer tensor allocations: Each avoided type conversion (.int(), .type_as(), .long()) eliminates temporary tensor creation
  • Reduced memory bandwidth: Boolean tensors are more memory-efficient than integer tensors for masks
  • Conditional optimization: The if check for past_key_values_length != 0 avoids arithmetic operations in ~87.5% of test cases (35 out of 40 calls in the profiler)
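
A quick, illustrative check of the dtype behavior described above (not part of the PR):

import torch

ids = torch.tensor([[5, 0, 6, 7]])
mask = ids.ne(0)                       # torch.bool mask, 1 byte per element
positions = torch.cumsum(mask, dim=1)  # automatically promoted to torch.int64
print(mask.dtype, positions.dtype)     # torch.bool torch.int64
print(positions * mask)                # tensor([[1, 0, 2, 3]])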

Performance characteristics by test case:

  • Best gains (30-40% faster): Cases with all padding or simple patterns where the boolean mask operations shine
  • Consistent gains (15-28% faster): All other test cases benefit from reduced allocations
  • Larger sequences: The optimization scales well with sequence length, maintaining ~20-28% improvements

The optimization is particularly valuable since this function is likely called frequently in transformer model forward passes, making even small per-call improvements significant for overall model performance.
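
As a usage sketch mirroring how the generated tests below call the method (the token ids here are illustrative):

import torch
from transformers.models.markuplm.modeling_markuplm import MarkupLMEmbeddings

input_ids = torch.tensor([[7, 0, 8, 9]])  # 0 acts as the padding id in this example
padding_idx = 0
position_ids = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx)
print(position_ids)  # tensor([[1, 0, 2, 3]]): padding stays at padding_idx, other tokens count up from it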

Correctness verification report:

Test                           Status
⚙️ Existing Unit Tests         🔘 None Found
🌀 Generated Regression Tests  40 Passed
⏪ Replay Tests                🔘 None Found
🔎 Concolic Coverage Tests     🔘 None Found
📊 Tests Coverage              100.0%
🌀 Generated Regression Tests and Runtime

import pytest # used for our unit tests
import torch # used for tensor operations
from transformers.models.markuplm.modeling_markuplm import MarkupLMEmbeddings

# unit tests

# ----------- Basic Test Cases -----------

def test_basic_no_padding():
    # All tokens are non-padding, padding_idx=0
    input_ids = torch.tensor([[1, 2, 3, 4]])
    padding_idx = 0
    # Positions should be [1,2,3,4] + padding_idx = [1,2,3,4]
    expected = torch.tensor([[1,2,3,4]])
    codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 48.2μs -> 38.2μs (26.2% faster)

def test_basic_with_padding_middle():
    # Padding in the middle
    input_ids = torch.tensor([[5, 0, 6, 7]])
    padding_idx = 0
    # Positions: [1,0,2,3] + padding_idx = [1,0,2,3]
    expected = torch.tensor([[1,0,2,3]])
    codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 46.5μs -> 35.0μs (32.8% faster)

def test_basic_all_padding():
    # All tokens are padding
    input_ids = torch.tensor([[0,0,0,0]])
    padding_idx = 0
    # All positions should be 0
    expected = torch.tensor([[0,0,0,0]])
    codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 45.0μs -> 32.0μs (40.3% faster)

def test_basic_batch_multiple_rows():
    # Batch of 2 sentences
    input_ids = torch.tensor([[1,2,0,3], [0,4,5,0]])
    padding_idx = 0
    # First row: [1,2,0,3] -> positions [1,2,0,3]
    # Second row: [0,4,5,0] -> positions [0,1,2,0]
    expected = torch.tensor([[1,2,0,3],[0,1,2,0]])
    codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 43.9μs -> 33.8μs (30.0% faster)

def test_basic_nonzero_padding_idx():
    # Padding index is not zero
    input_ids = torch.tensor([[3,3,1,2]])
    padding_idx = 3
    # Only [1,2] are non-padding: cumsum*mask = [0,0,1,2], + padding_idx = [3,3,4,5]
    expected = torch.tensor([[3,3,4,5]])
    codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 44.2μs -> 34.1μs (29.8% faster)

def test_basic_past_key_values_length():
    # Test with past_key_values_length
    input_ids = torch.tensor([[1,0,2,3]])
    padding_idx = 0
    past_key_values_length = 5
    # Positions: [6,0,7,8]
    expected = torch.tensor([[6,0,7,8]])
    codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx, past_key_values_length); result = codeflash_output # 42.2μs -> 35.8μs (17.9% faster)

# ----------- Edge Test Cases -----------

def test_edge_empty_tensor():
    # Empty tensor
    input_ids = torch.empty((0,0), dtype=torch.long)
    padding_idx = 0
    expected = torch.empty((0,0), dtype=torch.long)
    codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 40.7μs -> 33.0μs (23.5% faster)

def test_edge_single_token_non_padding():
    # Single token, non-padding
    input_ids = torch.tensor([[42]])
    padding_idx = 0
    expected = torch.tensor([[1]])
    codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 43.0μs -> 33.5μs (28.4% faster)

def test_edge_single_token_padding():
    # Single token, padding
    input_ids = torch.tensor([[0]])
    padding_idx = 0
    expected = torch.tensor([[0]])
    codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 42.5μs -> 33.1μs (28.5% faster)

def test_edge_all_padding_nonzero_idx():
    # All padding, nonzero padding idx
    input_ids = torch.tensor([[3,3,3]])
    padding_idx = 3
    expected = torch.tensor([[3,3,3]])
    codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 40.5μs -> 35.0μs (15.7% faster)

def test_edge_alternating_padding_nonpadding():
    # Alternating padding and non-padding tokens
    input_ids = torch.tensor([[0,1,0,2,0,3]])
    padding_idx = 0
    # Positions: [0,1,0,2,0,3]
    expected = torch.tensor([[0,1,0,2,0,3]])
    codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 42.9μs -> 32.0μs (34.1% faster)

def test_edge_large_padding_idx():
    # Large padding index
    input_ids = torch.tensor([[100,101,100,102]])
    padding_idx = 100
    # cumsum*mask = [0,1,0,2], + padding_idx = [100,101,100,102]
    expected = torch.tensor([[100,101,100,102]])
    codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 46.1μs -> 36.7μs (25.6% faster)

def test_edge_negative_padding_idx():
    # Negative padding index (should work for negative values)
    input_ids = torch.tensor([[-1, 2, -1, 3]])
    padding_idx = -1
    # mask: [0,1,0,1]
    # cumsum: [0,1,1,2], after masking: [0,1,0,2]
    # positions: [0,1,0,2] + padding_idx = [-1,0,-1,1]
    expected = torch.tensor([[-1,0,-1,1]])
    codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 46.0μs -> 32.9μs (39.7% faster)

# ----------- Large Scale Test Cases -----------

def test_large_scale_long_sequence():
    # Sequence of length 1000, no padding
    input_ids = torch.arange(1, 1001).unsqueeze(0) # shape (1,1000)
    padding_idx = 0
    # Positions should be [1,2,3,...,1000]
    expected = torch.arange(1, 1001).unsqueeze(0)
    codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 42.7μs -> 35.1μs (21.7% faster)

def test_large_scale_long_sequence_with_padding():
    # Sequence of length 1000, padding every 10th token
    input_ids = torch.arange(1, 1001)
    input_ids[::10] = 0 # set every 10th token to padding
    input_ids = input_ids.unsqueeze(0)
    padding_idx = 0
    # Positions: positions increment for non-padding, 0 for padding
    expected = torch.zeros_like(input_ids)
    counter = 0
    for i in range(input_ids.shape[1]):
        if input_ids[0,i] != padding_idx:
            counter += 1
            expected[0,i] = counter
        else:
            expected[0,i] = 0
    codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 82.6μs -> 71.4μs (15.7% faster)

def test_large_scale_batch():
    # Batch of 10 sequences, each length 100
    batch_size = 10
    seq_len = 100
    input_ids = torch.arange(1, seq_len+1).repeat(batch_size,1)
    # Set padding_idx at index 0 for each row
    input_ids[:,0] = 0
    padding_idx = 0
    expected = torch.zeros_like(input_ids)
    for b in range(batch_size):
        counter = 0
        for i in range(seq_len):
            if input_ids[b,i] != padding_idx:
                counter += 1
                expected[b,i] = counter
            else:
                expected[b,i] = 0
    codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 82.1μs -> 71.8μs (14.5% faster)

def test_large_scale_past_key_values_length():
    # Test with large past_key_values_length
    input_ids = torch.tensor([[1]*100])
    padding_idx = 0
    past_key_values_length = 500
    # Positions: [501,502,...,600]
    expected = torch.arange(501,601).unsqueeze(0)
    codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx, past_key_values_length); result = codeflash_output # 43.9μs -> 36.0μs (22.0% faster)

def test_large_scale_all_padding():
    # All padding, large sequence
    input_ids = torch.zeros((1,1000), dtype=torch.long)
    padding_idx = 0
    expected = torch.zeros((1,1000), dtype=torch.long)
    codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 50.6μs -> 39.9μs (26.8% faster)

# ----------- Determinism Test -----------

def test_determinism():
    # Running the same input twice should yield the same output
    input_ids = torch.tensor([[1,0,2,3]])
    padding_idx = 0
    codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result1 = codeflash_output # 47.3μs -> 35.9μs (31.8% faster)
    codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result2 = codeflash_output # 14.6μs -> 8.75μs (66.4% faster)

# ----------- Type and Shape Test -----------

def test_type_and_shape():
    # Output should be long dtype, same shape as input
    input_ids = torch.tensor([[1,2,0,3]])
    padding_idx = 0
    codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 43.0μs -> 31.4μs (37.0% faster)

codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

#------------------------------------------------
import pytest # used for our unit tests
import torch
from transformers.models.markuplm.modeling_markuplm import MarkupLMEmbeddings

class XPathEmbeddings:
    # Dummy stub for XPathEmbeddings, not used in the tests
    def __init__(self, config):
        pass

from transformers.models.markuplm.modeling_markuplm import MarkupLMEmbeddings

# unit tests

# 1. Basic Test Cases

def test_basic_single_row_no_padding():
    # Simple case: no padding, single row
    input_ids = torch.tensor([[1, 2, 3, 4]])
    padding_idx = 0
    # Positions: [1,2,3,4] + padding_idx = [1,2,3,4]
    expected = torch.tensor([[1,2,3,4]])
    codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 48.6μs -> 38.7μs (25.5% faster)

def test_basic_single_row_with_padding():
    # Padding at start and end
    input_ids = torch.tensor([[0, 1, 2, 0, 3, 0]])
    padding_idx = 0
    # Mask: [0,1,1,0,1,0]
    # Cumsum: [0,1,2,2,3,3]
    # Positions: cumsum*mask + padding_idx = [0,1,2,0,3,0]
    expected = torch.tensor([[0,1,2,0,3,0]])
    codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 47.5μs -> 37.5μs (26.4% faster)

def test_basic_batch_rows():
    # Batch of two rows, mixed padding
    input_ids = torch.tensor([
        [0, 1, 2, 0, 3],
        [1, 2, 0, 0, 0]
    ])
    padding_idx = 0
    expected = torch.tensor([
        [0,1,2,0,3],
        [1,2,0,0,0]
    ])
    codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 41.5μs -> 32.4μs (28.2% faster)

def test_basic_nonzero_padding_idx():
    # Padding index is not zero
    input_ids = torch.tensor([[5, 1, 5, 2, 3]])
    padding_idx = 5
    # Mask: [0,1,0,1,1]
    # Cumsum: [0,1,1,2,3]
    # Positions: [0,1,0,2,3] + padding_idx = [5,6,5,7,8]
    expected = torch.tensor([[5,6,5,7,8]])
    codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 46.3μs -> 37.0μs (25.0% faster)

def test_basic_past_key_values_length():
    # Test with past_key_values_length
    input_ids = torch.tensor([[0, 1, 2, 0, 3]])
    padding_idx = 0
    past_key_values_length = 2
    # Mask: [0,1,1,0,1]
    # Cumsum: [0,1,2,2,3]
    # Positions: (cumsum + 2) * mask = [0,3,4,0,5]
    # Positions: [0,3,4,0,5] + padding_idx = [0,3,4,0,5]
    expected = torch.tensor([[0,3,4,0,5]])
    codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx, past_key_values_length); result = codeflash_output # 42.1μs -> 36.4μs (15.8% faster)

# 2. Edge Test Cases

def test_edge_all_padding():
    # All tokens are padding
    input_ids = torch.tensor([[0,0,0,0]])
    padding_idx = 0
    expected = torch.tensor([[0,0,0,0]])
    codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 43.3μs -> 31.8μs (36.1% faster)

def test_edge_no_tokens():
    # Empty input
    input_ids = torch.empty((1,0), dtype=torch.long)
    padding_idx = 0
    expected = torch.empty((1,0), dtype=torch.long)
    codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 40.7μs -> 32.3μs (26.1% faster)

def test_edge_single_token_padding():
    # Single token, padding
    input_ids = torch.tensor([[0]])
    padding_idx = 0
    expected = torch.tensor([[0]])
    codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 44.3μs -> 34.5μs (28.5% faster)

def test_edge_single_token_non_padding():
    # Single token, non-padding
    input_ids = torch.tensor([[2]])
    padding_idx = 0
    expected = torch.tensor([[1]])
    codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 46.4μs -> 33.6μs (38.2% faster)

def test_edge_high_padding_idx():
    # High padding_idx value
    input_ids = torch.tensor([[99, 1, 99, 2]])
    padding_idx = 99
    # Mask: [0,1,0,1]
    # Cumsum: [0,1,1,2]
    # Positions: [99,100,99,101]
    expected = torch.tensor([[99,100,99,101]])
    codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 43.2μs -> 37.3μs (15.7% faster)

def test_edge_past_key_values_length_zero():
    # Explicitly set past_key_values_length=0, should be same as default
    input_ids = torch.tensor([[0, 1, 2]])
    padding_idx = 0
    expected = torch.tensor([[0,1,2]])
    codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx, past_key_values_length=0); result = codeflash_output # 42.5μs -> 31.9μs (33.0% faster)

def test_edge_past_key_values_length_large():
    # Large past_key_values_length
    input_ids = torch.tensor([[0, 1, 2, 0, 3]])
    padding_idx = 0
    past_key_values_length = 100
    # Mask: [0,1,1,0,1]
    # Cumsum: [0,1,2,2,3]
    # Positions: [0,101,102,0,103]
    expected = torch.tensor([[0,101,102,0,103]])
    codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx, past_key_values_length); result = codeflash_output # 46.3μs -> 34.1μs (35.8% faster)

def test_edge_2d_batch_padding():
    # 2D batch, mixed padding
    input_ids = torch.tensor([
        [0, 1, 2],
        [3, 0, 4]
    ])
    padding_idx = 0
    # First row: [0,1,2] -> [0,1,2]
    # Second row: [3,0,4] -> [1,0,2]
    expected = torch.tensor([
        [0,1,2],
        [1,0,2]
    ])
    codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 46.3μs -> 33.9μs (36.7% faster)

def test_edge_different_types():
    # Input IDs as int32
    input_ids = torch.tensor([[0, 1, 2, 0, 3]], dtype=torch.int32)
    padding_idx = 0
    expected = torch.tensor([[0,1,2,0,3]])
    codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 46.2μs -> 35.6μs (29.7% faster)

# 3. Large Scale Test Cases

def test_large_batch_size():
    # Large batch size, small sequence length
    batch_size = 500
    seq_len = 5
    input_ids = torch.zeros((batch_size, seq_len), dtype=torch.long)
    # Set first token in each row to non-padding
    for i in range(batch_size):
        input_ids[i,0] = 1
    padding_idx = 0
    # Should be [1,0,0,0,0] for each row
    expected = torch.zeros((batch_size, seq_len), dtype=torch.long)
    expected[:,0] = 1
    codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 47.1μs -> 37.1μs (26.9% faster)

def test_large_seq_len():
    # Large sequence length, single batch
    seq_len = 900
    input_ids = torch.ones((1, seq_len), dtype=torch.long) # no padding
    padding_idx = 0
    # Should be [1,2,...,900]
    expected = torch.arange(1, seq_len+1).unsqueeze(0)
    codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 45.6μs -> 35.6μs (28.0% faster)

def test_large_batch_and_seq_len_some_padding():
    # Large batch and sequence length, with padding in random places
    batch_size = 50
    seq_len = 200
    input_ids = torch.ones((batch_size, seq_len), dtype=torch.long)
    padding_idx = 0
    # Set every 10th token to padding
    for i in range(batch_size):
        input_ids[i,::10] = 0
    codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 63.0μs -> 52.9μs (19.1% faster)
    # Check that every 10th token is padding_idx
    for i in range(batch_size):
        pass
    # Check that the positions increase between paddings
    for i in range(batch_size):
        pos = 1
        for j in range(seq_len):
            if j % 10 == 0:
                pass
            else:
                pos += 1

def test_large_past_key_values_length():
    # Large past_key_values_length, large sequence
    seq_len = 500
    input_ids = torch.ones((1, seq_len), dtype=torch.long)
    padding_idx = 0
    past_key_values_length = 100
    expected = torch.arange(1+past_key_values_length, seq_len+1+past_key_values_length).unsqueeze(0)
    codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx, past_key_values_length); result = codeflash_output # 45.6μs -> 37.6μs (21.2% faster)

def test_large_all_padding():
    # Large input, all padding
    batch_size = 100
    seq_len = 100
    input_ids = torch.zeros((batch_size, seq_len), dtype=torch.long)
    padding_idx = 0
    expected = torch.zeros((batch_size, seq_len), dtype=torch.long)
    codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 71.2μs -> 61.2μs (16.3% faster)

codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, run `git checkout codeflash/optimize-MarkupLMEmbeddings.create_position_ids_from_input_ids-mhvlsner` and push.


@codeflash-ai codeflash-ai bot requested a review from mashraf-222 November 12, 2025 06:12
@codeflash-ai codeflash-ai bot added the ⚡️ codeflash (Optimization PR opened by Codeflash AI) and 🎯 Quality: High (Optimization Quality according to Codeflash) labels Nov 12, 2025
