⚡️ Speed up method MarkupLMEmbeddings.create_position_ids_from_input_ids by 26%
#135
📄 26% (0.26x) speedup for `MarkupLMEmbeddings.create_position_ids_from_input_ids` in `src/transformers/models/markuplm/modeling_markuplm.py`
⏱️ Runtime: 1.88 milliseconds → 1.49 milliseconds (best of 191 runs)
📝 Explanation and details
The optimized code achieves a 26% speedup by eliminating unnecessary tensor operations and type conversions in the `create_position_ids_from_input_ids` method.

Key optimizations applied:

- Eliminated the redundant `.int()` cast: the original code converted the boolean mask to int unnecessarily. The optimized version keeps the mask as a boolean tensor, which `torch.cumsum()` can work with directly.
- Removed the `.type_as()` operation: the original code used `.type_as(mask)` to match tensor types, but this is redundant since `torch.cumsum()` on a boolean tensor already returns the appropriate integer type (long).
- Simplified the conditional addition: instead of always adding `past_key_values_length` in the expression, the optimized code performs the addition only when `past_key_values_length != 0`, avoiding unnecessary operations in the common case where it is zero.
- Eliminated the final `.long()` cast: the cumsum operation already produces long tensors, making the explicit cast redundant.

Why this leads to a speedup:

- Dropping the dtype conversions (`.int()`, `.type_as()`, `.long()`) eliminates temporary tensor creation.
- The `if` check for `past_key_values_length != 0` avoids the extra arithmetic in ~87.5% of test cases (35 out of 40 calls in the profiler).

Performance characteristics by test case:

The optimization is particularly valuable since this function is likely called frequently in transformer model forward passes, making even small per-call improvements significant for overall model performance.
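For reference, a minimal before/after sketch reconstructed from the bullet points above (the exact diff is not included in this report, so treat the function bodies and names below as an approximation rather than the committed code):

```python
import torch

# As described for the original version: extra dtype casts and an unconditional addition.
def create_position_ids_original(input_ids, padding_idx, past_key_values_length=0):
    mask = input_ids.ne(padding_idx).int()  # boolean mask cast to int
    incremental_indices = (torch.cumsum(mask, dim=1).type_as(mask) + past_key_values_length) * mask
    return incremental_indices.long() + padding_idx  # explicit cast back to long

# As described for the optimized version: keep the boolean mask, rely on cumsum's
# long output, and only add past_key_values_length when it is nonzero.
def create_position_ids_optimized(input_ids, padding_idx, past_key_values_length=0):
    mask = input_ids.ne(padding_idx)                 # stays boolean
    incremental_indices = torch.cumsum(mask, dim=1)  # cumsum over bool already yields torch.long
    if past_key_values_length != 0:
        incremental_indices = incremental_indices + past_key_values_length
    return incremental_indices * mask + padding_idx
```

Both variants produce the same position ids (which the regression tests below exercise); the second simply skips three dtype conversions and, in the common `past_key_values_length == 0` case, one tensor addition.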
✅ Correctness verification report:
🌀 Generated Regression Tests and Runtime
import pytest # used for our unit tests
import torch # used for tensor operations
from transformers.models.markuplm.modeling_markuplm import MarkupLMEmbeddings
# unit tests

# ----------- Basic Test Cases -----------
def test_basic_no_padding():
# All tokens are non-padding, padding_idx=0
input_ids = torch.tensor([[1, 2, 3, 4]])
padding_idx = 0
# Positions should be [1,2,3,4] + padding_idx = [1,2,3,4]
expected = torch.tensor([[1,2,3,4]])
codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 48.2μs -> 38.2μs (26.2% faster)
def test_basic_with_padding_middle():
# Padding in the middle
input_ids = torch.tensor([[5, 0, 6, 7]])
padding_idx = 0
# Positions: [1,0,2,3] + padding_idx = [1,0,2,3]
expected = torch.tensor([[1,0,2,3]])
codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 46.5μs -> 35.0μs (32.8% faster)
def test_basic_all_padding():
# All tokens are padding
input_ids = torch.tensor([[0,0,0,0]])
padding_idx = 0
# All positions should be 0
expected = torch.tensor([[0,0,0,0]])
codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 45.0μs -> 32.0μs (40.3% faster)
def test_basic_batch_multiple_rows():
# Batch of 2 sentences
input_ids = torch.tensor([[1,2,0,3], [0,4,5,0]])
padding_idx = 0
# First row: [1,2,0,3]
# Positions: [1,2,0,3]
# Second row: [0,4,5,0]
# Positions: [0,1,2,0]
expected = torch.tensor([[1,2,0,3],[0,1,2,0]])
codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 43.9μs -> 33.8μs (30.0% faster)
def test_basic_nonzero_padding_idx():
# Padding index is not zero
input_ids = torch.tensor([[3,3,1,2]])
padding_idx = 3
# Only [1,2] are non-padding; cumsum * mask = [0,0,1,2], + padding_idx = [3,3,4,5]
expected = torch.tensor([[3,3,4,5]])
codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 44.2μs -> 34.1μs (29.8% faster)
def test_basic_past_key_values_length():
# Test with past_key_values_length
input_ids = torch.tensor([[1,0,2,3]])
padding_idx = 0
past_key_values_length = 5
# Positions: [6,0,7,8]
expected = torch.tensor([[6,0,7,8]])
codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx, past_key_values_length); result = codeflash_output # 42.2μs -> 35.8μs (17.9% faster)
# ----------- Edge Test Cases -----------
def test_edge_empty_tensor():
# Empty tensor
input_ids = torch.empty((0,0), dtype=torch.long)
padding_idx = 0
expected = torch.empty((0,0), dtype=torch.long)
codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 40.7μs -> 33.0μs (23.5% faster)
def test_edge_single_token_non_padding():
# Single token, non-padding
input_ids = torch.tensor([[42]])
padding_idx = 0
expected = torch.tensor([[1]])
codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 43.0μs -> 33.5μs (28.4% faster)
def test_edge_single_token_padding():
# Single token, padding
input_ids = torch.tensor([[0]])
padding_idx = 0
expected = torch.tensor([[0]])
codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 42.5μs -> 33.1μs (28.5% faster)
def test_edge_all_padding_nonzero_idx():
# All padding, nonzero padding idx
input_ids = torch.tensor([[3,3,3]])
padding_idx = 3
expected = torch.tensor([[3,3,3]])
codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 40.5μs -> 35.0μs (15.7% faster)
def test_edge_alternating_padding_nonpadding():
# Alternating padding and non-padding tokens
input_ids = torch.tensor([[0,1,0,2,0,3]])
padding_idx = 0
# Positions: [0,1,0,2,0,3]
expected = torch.tensor([[0,1,0,2,0,3]])
codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 42.9μs -> 32.0μs (34.1% faster)
def test_edge_large_padding_idx():
# Large padding index
input_ids = torch.tensor([[100,101,100,102]])
padding_idx = 100
# cumsum * mask = [0,1,0,2], + padding_idx = [100,101,100,102]
expected = torch.tensor([[100,101,100,102]])
codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 46.1μs -> 36.7μs (25.6% faster)
def test_edge_negative_padding_idx():
# Negative padding index (should work for negative values)
input_ids = torch.tensor([[-1, 2, -1, 3]])
padding_idx = -1
# mask: [0,1,0,1]
# cumsum * mask: [0,1,0,2]
# positions: cumsum * mask + padding_idx = [-1, 0, -1, 1]
expected = torch.tensor([[-1,0,-1,1]])
codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 46.0μs -> 32.9μs (39.7% faster)
# ----------- Large Scale Test Cases -----------
def test_large_scale_long_sequence():
# Sequence of length 1000, no padding
input_ids = torch.arange(1, 1001).unsqueeze(0) # shape (1,1000)
padding_idx = 0
# Positions should be [1,2,3,...,1000]
expected = torch.arange(1, 1001).unsqueeze(0)
codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 42.7μs -> 35.1μs (21.7% faster)
def test_large_scale_long_sequence_with_padding():
# Sequence of length 1000, padding every 10th token
input_ids = torch.arange(1, 1001)
input_ids[::10] = 0 # set every 10th token to padding
input_ids = input_ids.unsqueeze(0)
padding_idx = 0
# Positions: positions increment for non-padding, 0 for padding
expected = torch.zeros_like(input_ids)
counter = 0
for i in range(input_ids.shape[1]):
if input_ids[0,i] != padding_idx:
counter += 1
expected[0,i] = counter
else:
expected[0,i] = 0
codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 82.6μs -> 71.4μs (15.7% faster)
def test_large_scale_batch():
# Batch of 10 sequences, each length 100
batch_size = 10
seq_len = 100
input_ids = torch.arange(1, seq_len+1).repeat(batch_size,1)
# Set padding_idx at index 0 for each row
input_ids[:,0] = 0
padding_idx = 0
expected = torch.zeros_like(input_ids)
for b in range(batch_size):
counter = 0
for i in range(seq_len):
if input_ids[b,i] != padding_idx:
counter += 1
expected[b,i] = counter
else:
expected[b,i] = 0
codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 82.1μs -> 71.8μs (14.5% faster)
def test_large_scale_past_key_values_length():
# Test with large past_key_values_length
input_ids = torch.tensor([[1]*100])
padding_idx = 0
past_key_values_length = 500
# Positions: [501,502,...,600]
expected = torch.arange(501,601).unsqueeze(0)
codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx, past_key_values_length); result = codeflash_output # 43.9μs -> 36.0μs (22.0% faster)
def test_large_scale_all_padding():
# All padding, large sequence
input_ids = torch.zeros((1,1000), dtype=torch.long)
padding_idx = 0
expected = torch.zeros((1,1000), dtype=torch.long)
codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 50.6μs -> 39.9μs (26.8% faster)
# ----------- Determinism Test -----------
def test_determinism():
# Running the same input twice should yield the same output
input_ids = torch.tensor([[1,0,2,3]])
padding_idx = 0
codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result1 = codeflash_output # 47.3μs -> 35.9μs (31.8% faster)
codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result2 = codeflash_output # 14.6μs -> 8.75μs (66.4% faster)
# ----------- Type and Shape Test -----------
def test_type_and_shape():
# Output should be long dtype, same shape as input
input_ids = torch.tensor([[1,2,0,3]])
padding_idx = 0
codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 43.0μs -> 31.4μs (37.0% faster)
codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
import pytest # used for our unit tests
import torch
from transformers.models.markuplm.modeling_markuplm import MarkupLMEmbeddings
class XPathEmbeddings:
# Dummy stub for XPathEmbeddings, not used in test
def __init__(self, config):
pass
from transformers.models.markuplm.modeling_markuplm import MarkupLMEmbeddings
# unit tests

# 1. Basic Test Cases
def test_basic_single_row_no_padding():
# Simple case: no padding, single row
input_ids = torch.tensor([[1, 2, 3, 4]])
padding_idx = 0
# Positions: [1,2,3,4] + padding_idx = [1,2,3,4]
expected = torch.tensor([[1,2,3,4]])
codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 48.6μs -> 38.7μs (25.5% faster)
def test_basic_single_row_with_padding():
# Padding at start and end
input_ids = torch.tensor([[0, 1, 2, 0, 3, 0]])
padding_idx = 0
# Positions: [0,1,2,0,3,0] -> [0,1,2,0,3,0]
# Mask: [0,1,1,0,1,0]
# Cumsum: [0,1,2,2,3,3]
# Positions: [0,1,2,0,3,0] + padding_idx = [0,1,2,0,3,0]
expected = torch.tensor([[0,1,2,0,3,0]])
codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 47.5μs -> 37.5μs (26.4% faster)
def test_basic_batch_rows():
# Batch of two rows, mixed padding
input_ids = torch.tensor([
[0, 1, 2, 0, 3],
[1, 2, 0, 0, 0]
])
padding_idx = 0
expected = torch.tensor([
[0,1,2,0,3],
[1,2,0,0,0]
])
codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 41.5μs -> 32.4μs (28.2% faster)
def test_basic_nonzero_padding_idx():
# Padding index is not zero
input_ids = torch.tensor([[5, 1, 5, 2, 3]])
padding_idx = 5
# Mask: [0,1,0,1,1]
# Cumsum: [0,1,1,2,3]
# Positions: [0,1,0,2,3] + padding_idx = [5,6,5,7,8]
expected = torch.tensor([[5,6,5,7,8]])
codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 46.3μs -> 37.0μs (25.0% faster)
def test_basic_past_key_values_length():
# Test with past_key_values_length
input_ids = torch.tensor([[0, 1, 2, 0, 3]])
padding_idx = 0
past_key_values_length = 2
# Mask: [0,1,1,0,1]
# Cumsum: [0,1,2,2,3]
# Positions: ([0,1,2,2,3] + 2) * mask = [0,3,4,0,5]
# Positions: [0,3,4,0,5] + padding_idx = [0,3,4,0,5]
expected = torch.tensor([[0,3,4,0,5]])
codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx, past_key_values_length); result = codeflash_output # 42.1μs -> 36.4μs (15.8% faster)
# 2. Edge Test Cases
def test_edge_all_padding():
# All tokens are padding
input_ids = torch.tensor([[0,0,0,0]])
padding_idx = 0
expected = torch.tensor([[0,0,0,0]])
codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 43.3μs -> 31.8μs (36.1% faster)
def test_edge_no_tokens():
# Empty input
input_ids = torch.empty((1,0), dtype=torch.long)
padding_idx = 0
expected = torch.empty((1,0), dtype=torch.long)
codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 40.7μs -> 32.3μs (26.1% faster)
def test_edge_single_token_padding():
# Single token, padding
input_ids = torch.tensor([[0]])
padding_idx = 0
expected = torch.tensor([[0]])
codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 44.3μs -> 34.5μs (28.5% faster)
def test_edge_single_token_non_padding():
# Single token, non-padding
input_ids = torch.tensor([[2]])
padding_idx = 0
expected = torch.tensor([[1]])
codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 46.4μs -> 33.6μs (38.2% faster)
def test_edge_high_padding_idx():
# High padding_idx value
input_ids = torch.tensor([[99, 1, 99, 2]])
padding_idx = 99
# Mask: [0,1,0,1]
# Cumsum: [0,1,1,2]
# Positions: [99,100,99,101]
expected = torch.tensor([[99,100,99,101]])
codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 43.2μs -> 37.3μs (15.7% faster)
def test_edge_past_key_values_length_zero():
# Explicitly set past_key_values_length=0, should be same as default
input_ids = torch.tensor([[0, 1, 2]])
padding_idx = 0
expected = torch.tensor([[0,1,2]])
codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx, past_key_values_length=0); result = codeflash_output # 42.5μs -> 31.9μs (33.0% faster)
def test_edge_past_key_values_length_large():
# Large past_key_values_length
input_ids = torch.tensor([[0, 1, 2, 0, 3]])
padding_idx = 0
past_key_values_length = 100
# Mask: [0,1,1,0,1]
# Cumsum: [0,1,2,2,3]
# Positions: [0,101,102,0,103]
expected = torch.tensor([[0,101,102,0,103]])
codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx, past_key_values_length); result = codeflash_output # 46.3μs -> 34.1μs (35.8% faster)
def test_edge_2d_batch_padding():
# 2D batch, mixed padding
input_ids = torch.tensor([
[0, 1, 2],
[3, 0, 4]
])
padding_idx = 0
# First row: [0,1,2] -> [0,1,2]
# Second row: [3,0,4] -> [1,0,2]
expected = torch.tensor([
[0,1,2],
[1,0,2]
])
codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 46.3μs -> 33.9μs (36.7% faster)
def test_edge_different_types():
# Input IDs as int32
input_ids = torch.tensor([[0, 1, 2, 0, 3]], dtype=torch.int32)
padding_idx = 0
expected = torch.tensor([[0,1,2,0,3]])
codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 46.2μs -> 35.6μs (29.7% faster)
# 3. Large Scale Test Cases
def test_large_batch_size():
# Large batch size, small sequence length
batch_size = 500
seq_len = 5
input_ids = torch.zeros((batch_size, seq_len), dtype=torch.long)
# Set first token in each row to non-padding
for i in range(batch_size):
input_ids[i,0] = 1
padding_idx = 0
# Should be [1,0,0,0,0] for each row
expected = torch.zeros((batch_size, seq_len), dtype=torch.long)
expected[:,0] = 1
codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 47.1μs -> 37.1μs (26.9% faster)
def test_large_seq_len():
# Large sequence length, single batch
seq_len = 900
input_ids = torch.ones((1, seq_len), dtype=torch.long) # no padding
padding_idx = 0
# Should be [1,2,...,900]
expected = torch.arange(1, seq_len+1).unsqueeze(0)
codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 45.6μs -> 35.6μs (28.0% faster)
def test_large_batch_and_seq_len_some_padding():
# Large batch and sequence length, with padding in random places
batch_size = 50
seq_len = 200
input_ids = torch.ones((batch_size, seq_len), dtype=torch.long)
padding_idx = 0
# Set every 10th token to padding
for i in range(batch_size):
input_ids[i,::10] = 0
codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 63.0μs -> 52.9μs (19.1% faster)
# Check that every 10th token is padding_idx
for i in range(batch_size):
pass
# Check that the positions increase between paddings
for i in range(batch_size):
pos = 1
for j in range(seq_len):
if j % 10 == 0:
pass
else:
pos += 1
def test_large_past_key_values_length():
# Large past_key_values_length, large sequence
seq_len = 500
input_ids = torch.ones((1, seq_len), dtype=torch.long)
padding_idx = 0
past_key_values_length = 100
expected = torch.arange(1+past_key_values_length, seq_len+1+past_key_values_length).unsqueeze(0)
codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx, past_key_values_length); result = codeflash_output # 45.6μs -> 37.6μs (21.2% faster)
def test_large_all_padding():
# Large input, all padding
batch_size = 100
seq_len = 100
input_ids = torch.zeros((batch_size, seq_len), dtype=torch.long)
padding_idx = 0
expected = torch.zeros((batch_size, seq_len), dtype=torch.long)
codeflash_output = MarkupLMEmbeddings.create_position_ids_from_input_ids(input_ids, padding_idx); result = codeflash_output # 71.2μs -> 61.2μs (16.3% faster)
codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
To edit these changes, run `git checkout codeflash/optimize-MarkupLMEmbeddings.create_position_ids_from_input_ids-mhvlsner` and push.