
Conversation

@codeflash-ai codeflash-ai bot commented Nov 12, 2025

📄 24% (0.24x) speedup for RteProcessor._create_examples in src/transformers/data/processors/glue.py

⏱️ Runtime: 2.73 milliseconds → 2.20 milliseconds (best of 250 runs)

📝 Explanation and details

The optimized code achieves a 24% speedup through several key algorithmic improvements that reduce per-iteration overhead:

Key Optimizations:

  1. Eliminated branch condition in tight loop: The original code checked if i == 0: continue on every iteration (7,175 times according to profiler). The optimized version skips the header upfront using iter(lines) and next(), removing this conditional check entirely from the main processing loop.

  2. Replaced list.append() with a list comprehension: The original code called examples.append() in a loop, paying a repeated attribute lookup and method call on every iteration (46.5% of total time). The optimized version uses a list comprehension, which builds the list through a specialized bytecode path with no per-item method-call overhead.

  3. Hoisted invariant computations:

    • is_test = set_type == "test" is computed once instead of on every iteration
    • Example = InputExample localizes the class reference to avoid repeated global lookups
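Put together, the pattern described above looks roughly like the following sketch. This is an illustrative reconstruction, not the actual diff: the real method lives on `RteProcessor` in `src/transformers/data/processors/glue.py`, and the local `InputExample` here is a minimal stand-in.

```python
class InputExample:
    # Stand-in for transformers' InputExample, for illustration only
    def __init__(self, guid, text_a, text_b=None, label=None):
        self.guid = guid
        self.text_a = text_a
        self.text_b = text_b
        self.label = label


def create_examples(lines, set_type):
    it = iter(lines)
    try:
        next(it)  # optimization 1: skip the header row up front
    except StopIteration:
        return []  # empty input: no header, nothing to process
    is_test = set_type == "test"  # optimization 3: hoisted invariant
    Example = InputExample  # optimization 3: localized class lookup
    # optimization 2: list comprehension instead of append() in a loop
    return [
        Example(
            guid=f"{set_type}-{line[0]}",
            text_a=line[1],
            text_b=line[2],
            label=None if is_test else line[-1],
        )
        for line in it
    ]
```

Skipping the header with `next()` also gives a natural place to handle the empty-input edge case via `StopIteration`, which is what keeps the empty and header-only cases correct.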

Performance Impact:

  • Small datasets (1-10 examples): 5-15% slower due to setup overhead of iterator creation
  • Large datasets (500-1000+ examples): 21-28% faster, where the loop optimizations dominate
  • Edge cases (empty/header-only): Slight overhead but maintains correctness with proper StopIteration handling

The optimization is particularly effective for large-scale data processing workloads typical in ML preprocessing pipelines, where the GLUE processor would handle thousands of examples. The trade-off of slightly slower performance on tiny datasets is acceptable given the substantial gains on realistic workloads.
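The append-vs-comprehension gap is easy to reproduce in isolation. Below is a minimal micro-benchmark sketch; timings are machine-dependent, and the row processing is simplified to building tuples rather than `InputExample` objects.

```python
import timeit

# Synthetic input: one header row plus 1000 data rows
rows = [["index", "s1", "s2", "label"]] + [
    [str(i), f"A{i}", f"B{i}", "entailment"] for i in range(1000)
]


def with_append():
    # Original pattern: per-iteration header check plus examples.append()
    out = []
    for i, line in enumerate(rows):
        if i == 0:
            continue
        out.append((f"train-{line[0]}", line[1], line[2], line[-1]))
    return out


def with_comprehension():
    # Optimized pattern: skip header once, then build via comprehension
    it = iter(rows)
    next(it)
    return [(f"train-{line[0]}", line[1], line[2], line[-1]) for line in it]


# Both variants must produce identical output
assert with_append() == with_comprehension()

print("append loop:   ", timeit.timeit(with_append, number=200))
print("comprehension: ", timeit.timeit(with_comprehension, number=200))
```

Exact ratios vary by Python version and hardware, but the comprehension variant should trend faster as the row count grows, consistent with the large-dataset numbers reported above.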

Correctness verification report:

Test                            Status
⚙️ Existing Unit Tests          🔘 None Found
🌀 Generated Regression Tests   82 Passed
⏪ Replay Tests                 🔘 None Found
🔎 Concolic Coverage Tests      🔘 None Found
📊 Tests Coverage               100.0%
🌀 Generated Regression Tests and Runtime

```python
import warnings

# imports
import pytest  # used for our unit tests

from transformers.data.processors.glue import RteProcessor


# Minimal InputExample class for testing
class InputExample:
    def __init__(self, guid, text_a, text_b, label=None):
        self.guid = guid
        self.text_a = text_a
        self.text_b = text_b
        self.label = label

    def __eq__(self, other):
        if not isinstance(other, InputExample):
            return False
        return (
            self.guid == other.guid and
            self.text_a == other.text_a and
            self.text_b == other.text_b and
            self.label == other.label
        )

    def __repr__(self):
        return f"InputExample(guid={self.guid!r}, text_a={self.text_a!r}, text_b={self.text_b!r}, label={self.label!r})"


# Minimal DataProcessor class for testing
class DataProcessor:
    pass


DEPRECATION_WARNING = (
    "This {0} will be removed from the library soon, preprocessing should be handled with the 🤗 Datasets "
    "library. You can have a look at this example script for pointers: "
    "https://github.com/huggingface/transformers/blob/main/examples/pytorch/text-classification/run_glue.py"
)


# unit tests
@pytest.fixture
def processor():
    # Fixture to create a processor instance
    return RteProcessor()
```

--------------------------

1. Basic Test Cases

--------------------------

```python
def test_basic_single_example_train(processor):
    # Test a single example in train set
    lines = [
        ["index", "sentence1", "sentence2", "label"],  # header
        ["1", "The sky is blue.", "The sky is clear.", "entailment"],
    ]
    codeflash_output = processor._create_examples(lines, "train"); examples = codeflash_output  # 2.21μs -> 2.34μs (5.85% slower)
    ex = examples[0]


def test_basic_single_example_dev(processor):
    # Test a single example in dev set
    lines = [
        ["index", "sentence1", "sentence2", "label"],  # header
        ["2", "Grass is green.", "Plants are green.", "entailment"],
    ]
    codeflash_output = processor._create_examples(lines, "dev"); examples = codeflash_output  # 2.21μs -> 2.42μs (8.65% slower)
    ex = examples[0]


def test_basic_single_example_test(processor):
    # Test a single example in test set (label should be None)
    lines = [
        ["index", "sentence1", "sentence2", "label"],  # header
        ["3", "Birds can fly.", "Birds have wings.", "not_entailment"],
    ]
    codeflash_output = processor._create_examples(lines, "test"); examples = codeflash_output  # 2.27μs -> 2.48μs (8.28% slower)
    ex = examples[0]


def test_basic_multiple_examples(processor):
    # Test multiple examples in train set
    lines = [
        ["index", "sentence1", "sentence2", "label"],  # header
        ["1", "A", "B", "entailment"],
        ["2", "C", "D", "not_entailment"],
        ["3", "E", "F", "entailment"],
    ]
    codeflash_output = processor._create_examples(lines, "train"); examples = codeflash_output  # 3.35μs -> 3.32μs (0.753% faster)
```

--------------------------

2. Edge Test Cases

--------------------------

```python
def test_empty_lines(processor):
    # Test with empty lines list
    lines = []
    codeflash_output = processor._create_examples(lines, "train"); examples = codeflash_output  # 655ns -> 962ns (31.9% slower)


def test_header_only(processor):
    # Test with only header, no examples
    lines = [["index", "sentence1", "sentence2", "label"]]
    codeflash_output = processor._create_examples(lines, "train"); examples = codeflash_output  # 810ns -> 1.24μs (34.7% slower)


def test_missing_text_b(processor):
    # Test with missing text_b (should raise IndexError)
    lines = [
        ["index", "sentence1", "sentence2", "label"],
        ["5", "Sentence A"],  # missing text_b and label
    ]
    with pytest.raises(IndexError):
        processor._create_examples(lines, "train")  # 1.67μs -> 2.26μs (25.8% slower)


def test_label_none_in_non_test(processor):
    # Test with label None in train set (should be accepted)
    lines = [
        ["index", "sentence1", "sentence2", "label"],
        ["6", "A", "B", None],
    ]
    codeflash_output = processor._create_examples(lines, "train"); examples = codeflash_output  # 3.10μs -> 3.12μs (0.801% slower)


def test_extra_columns(processor):
    # Test with extra columns in input (should use only the required columns)
    lines = [
        ["index", "sentence1", "sentence2", "label", "extra"],
        ["7", "A", "B", "entailment", "ignored"],
    ]
    codeflash_output = processor._create_examples(lines, "train"); examples = codeflash_output  # 2.49μs -> 2.73μs (8.83% slower)


def test_non_string_fields(processor):
    # Test with non-string fields (should accept any type)
    lines = [
        ["index", "sentence1", "sentence2", "label"],
        [8, 123, 456, True],
    ]
    codeflash_output = processor._create_examples(lines, "train"); examples = codeflash_output  # 2.44μs -> 2.73μs (10.4% slower)


def test_guid_uniqueness(processor):
    # Test that GUID is constructed correctly and is unique per example
    lines = [
        ["index", "sentence1", "sentence2", "label"],
        ["1", "A", "B", "entailment"],
        ["1", "C", "D", "not_entailment"],
    ]
    codeflash_output = processor._create_examples(lines, "train"); examples = codeflash_output  # 3.17μs -> 3.13μs (1.21% faster)


def test_label_is_none_for_test(processor):
    # Test that label is always None for test set, even if present
    lines = [
        ["index", "sentence1", "sentence2", "label"],
        ["9", "A", "B", "entailment"],
        ["10", "C", "D", "not_entailment"],
    ]
    codeflash_output = processor._create_examples(lines, "test"); examples = codeflash_output  # 3.15μs -> 3.17μs (0.661% slower)


def test_set_type_case_sensitivity(processor):
    # Test that set_type is case-sensitive ("Test" != "test")
    lines = [
        ["index", "sentence1", "sentence2", "label"],
        ["11", "A", "B", "entailment"],
    ]
    codeflash_output = processor._create_examples(lines, "Test"); examples = codeflash_output  # 2.35μs -> 2.69μs (12.5% slower)


def test_empty_strings(processor):
    # Test with empty strings for text_a, text_b, label
    lines = [
        ["index", "sentence1", "sentence2", "label"],
        ["12", "", "", ""],
    ]
    codeflash_output = processor._create_examples(lines, "train"); examples = codeflash_output  # 2.35μs -> 2.58μs (8.72% slower)


def test_whitespace_strings(processor):
    # Test with whitespace strings for text_a, text_b, label
    lines = [
        ["index", "sentence1", "sentence2", "label"],
        ["13", " ", " ", " "],
    ]
    codeflash_output = processor._create_examples(lines, "train"); examples = codeflash_output  # 2.32μs -> 2.60μs (10.8% slower)
```

--------------------------

3. Large Scale Test Cases

--------------------------

```python
def test_large_scale_1000_examples(processor):
    # Test with 1000 examples to check scalability
    lines = [["index", "sentence1", "sentence2", "label"]]
    for i in range(1, 1001):
        lines.append([str(i), f"Sentence {i}A", f"Sentence {i}B", "entailment" if i % 2 == 0 else "not_entailment"])
    codeflash_output = processor._create_examples(lines, "train"); examples = codeflash_output  # 370μs -> 290μs (27.4% faster)


def test_large_scale_test_set(processor):
    # Test with 500 examples in test set (label should always be None)
    lines = [["index", "sentence1", "sentence2", "label"]]
    for i in range(1, 501):
        lines.append([str(i), f"TestA{i}", f"TestB{i}", "entailment"])
    codeflash_output = processor._create_examples(lines, "test"); examples = codeflash_output  # 184μs -> 147μs (24.8% faster)


def test_large_scale_empty_fields(processor):
    # Test with many examples with empty fields
    lines = [["index", "sentence1", "sentence2", "label"]]
    for i in range(1, 1001):
        lines.append([str(i), "", "", ""])
    codeflash_output = processor._create_examples(lines, "train"); examples = codeflash_output  # 370μs -> 294μs (26.0% faster)


def test_large_scale_non_string_fields(processor):
    # Test with many examples with non-string fields
    lines = [["index", "sentence1", "sentence2", "label"]]
    for i in range(1, 1001):
        lines.append([i, i * 10, i * 100, i % 2 == 0])
    codeflash_output = processor._create_examples(lines, "train"); examples = codeflash_output  # 404μs -> 333μs (21.2% faster)


def test_large_scale_extra_columns(processor):
    # Test with extra columns in large scale
    lines = [["index", "sentence1", "sentence2", "label", "extra"]]
    for i in range(1, 1001):
        lines.append([str(i), f"A{i}", f"B{i}", "entailment", f"extra{i}"])
    codeflash_output = processor._create_examples(lines, "train"); examples = codeflash_output  # 366μs -> 288μs (27.3% faster)
```

codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

```python
# ------------------------------------------------
import pytest  # used for our unit tests

from transformers.data.processors.glue import RteProcessor


# Minimal InputExample class for testing
class InputExample:
    def __init__(self, guid, text_a, text_b=None, label=None):
        self.guid = guid
        self.text_a = text_a
        self.text_b = text_b
        self.label = label

    def __eq__(self, other):
        # Equality check for test assertions
        return (
            isinstance(other, InputExample)
            and self.guid == other.guid
            and self.text_a == other.text_a
            and self.text_b == other.text_b
            and self.label == other.label
        )

    def __repr__(self):
        return f"InputExample(guid={self.guid!r}, text_a={self.text_a!r}, text_b={self.text_b!r}, label={self.label!r})"


# Minimal DataProcessor class for testing
class DataProcessor:
    pass


# ------------------ UNIT TESTS ------------------
@pytest.fixture
def processor():
    # Fixture to create a processor instance for tests
    return RteProcessor()
```

1. Basic Test Cases

```python
def test_empty_lines(processor):
    # Test with empty lines (should return empty list)
    lines = []
    codeflash_output = processor._create_examples(lines, "train"); result = codeflash_output  # 808ns -> 1.08μs (24.9% slower)


def test_only_header(processor):
    # Test with only header row (should return empty list)
    lines = [["index", "sentence1", "sentence2", "label"]]
    codeflash_output = processor._create_examples(lines, "train"); result = codeflash_output  # 865ns -> 1.33μs (34.8% slower)


def test_single_train_example(processor):
    # Test with one train example
    lines = [
        ["index", "sentence1", "sentence2", "label"],
        ["0", "The sky is blue.", "The sky is colored.", "entailment"],
    ]
    codeflash_output = processor._create_examples(lines, "train"); result = codeflash_output  # 2.67μs -> 2.92μs (8.69% slower)
    expected = [
        InputExample(guid="train-0", text_a="The sky is blue.", text_b="The sky is colored.", label="entailment")
    ]


def test_single_dev_example(processor):
    # Test with one dev example
    lines = [
        ["index", "sentence1", "sentence2", "label"],
        ["1", "Cats are animals.", "Cats are mammals.", "not_entailment"],
    ]
    codeflash_output = processor._create_examples(lines, "dev"); result = codeflash_output  # 2.35μs -> 2.59μs (9.37% slower)
    expected = [
        InputExample(guid="dev-1", text_a="Cats are animals.", text_b="Cats are mammals.", label="not_entailment")
    ]


def test_single_test_example(processor):
    # Test with one test example (label should be None)
    lines = [
        ["index", "sentence1", "sentence2", "label"],
        ["2", "Dogs bark.", "Dogs make noise.", "entailment"],
    ]
    codeflash_output = processor._create_examples(lines, "test"); result = codeflash_output  # 2.31μs -> 2.49μs (7.27% slower)
    expected = [
        InputExample(guid="test-2", text_a="Dogs bark.", text_b="Dogs make noise.", label=None)
    ]


def test_multiple_examples(processor):
    # Test with multiple examples
    lines = [
        ["index", "sentence1", "sentence2", "label"],
        ["3", "Water is wet.", "Water is dry.", "not_entailment"],
        ["4", "Fire is hot.", "Fire is cold.", "not_entailment"],
    ]
    codeflash_output = processor._create_examples(lines, "train"); result = codeflash_output  # 3.01μs -> 3.08μs (2.37% slower)
    expected = [
        InputExample(guid="train-3", text_a="Water is wet.", text_b="Water is dry.", label="not_entailment"),
        InputExample(guid="train-4", text_a="Fire is hot.", text_b="Fire is cold.", label="not_entailment"),
    ]
```

2. Edge Test Cases

```python
def test_missing_label_for_train(processor):
    # Test with missing label in train (should raise IndexError)
    lines = [
        ["index", "sentence1", "sentence2", "label"],
        ["5", "Birds fly.", "Birds swim."],  # Missing label
    ]
    try:
        processor._create_examples(lines, "train")
    except IndexError:
        pass


def test_missing_label_for_test(processor):
    # Test with missing label in test (should work, label=None)
    lines = [
        ["index", "sentence1", "sentence2", "label"],
        ["6", "Fish swim.", "Fish walk."],  # Missing label
    ]
    codeflash_output = processor._create_examples(lines, "test"); result = codeflash_output  # 2.32μs -> 2.51μs (7.62% slower)
    expected = [
        InputExample(guid="test-6", text_a="Fish swim.", text_b="Fish walk.", label=None)
    ]


def test_extra_columns(processor):
    # Test with extra columns (should use the last column as label)
    lines = [
        ["index", "sentence1", "sentence2", "label", "extra"],
        ["7", "Sun rises.", "Sun sets.", "entailment", "extra_info"],
    ]
    codeflash_output = processor._create_examples(lines, "train"); result = codeflash_output  # 2.25μs -> 2.47μs (8.63% slower)
    expected = [
        InputExample(guid="train-7", text_a="Sun rises.", text_b="Sun sets.", label="extra_info")
    ]


def test_non_string_inputs(processor):
    # Test with non-string values (should handle them as is)
    lines = [
        ["index", "sentence1", "sentence2", "label"],
        [8, 123, None, True],
    ]
    codeflash_output = processor._create_examples(lines, "train"); result = codeflash_output  # 2.34μs -> 2.60μs (9.94% slower)
    expected = [
        InputExample(guid="train-8", text_a=123, text_b=None, label=True)
    ]


def test_empty_strings(processor):
    # Test with empty strings
    lines = [
        ["index", "sentence1", "sentence2", "label"],
        ["9", "", "", ""],
    ]
    codeflash_output = processor._create_examples(lines, "train"); result = codeflash_output  # 2.30μs -> 2.55μs (9.72% slower)
    expected = [
        InputExample(guid="train-9", text_a="", text_b="", label="")
    ]


def test_long_texts(processor):
    # Test with very long text fields
    long_text = "a" * 10000
    lines = [
        ["index", "sentence1", "sentence2", "label"],
        ["10", long_text, long_text, "entailment"],
    ]
    codeflash_output = processor._create_examples(lines, "train"); result = codeflash_output  # 2.25μs -> 2.49μs (9.65% slower)
    expected = [
        InputExample(guid="train-10", text_a=long_text, text_b=long_text, label="entailment")
    ]


def test_guid_uniqueness(processor):
    # Test that guids are correctly composed and unique
    lines = [
        ["index", "sentence1", "sentence2", "label"],
        ["11", "A", "B", "entailment"],
        ["12", "C", "D", "not_entailment"],
    ]
    codeflash_output = processor._create_examples(lines, "train"); result = codeflash_output  # 3.09μs -> 3.06μs (1.08% faster)
    guids = [ex.guid for ex in result]


def test_label_none_for_test(processor):
    # Test that label is None for test set, even if label is present
    lines = [
        ["index", "sentence1", "sentence2", "label"],
        ["13", "E", "F", "entailment"],
    ]
    codeflash_output = processor._create_examples(lines, "test"); result = codeflash_output  # 2.22μs -> 2.42μs (8.42% slower)


def test_label_not_none_for_train(processor):
    # Test that label is not None for train set
    lines = [
        ["index", "sentence1", "sentence2", "label"],
        ["14", "G", "H", "entailment"],
    ]
    codeflash_output = processor._create_examples(lines, "train"); result = codeflash_output  # 2.28μs -> 2.32μs (1.81% slower)


def test_label_not_none_for_dev(processor):
    # Test that label is not None for dev set
    lines = [
        ["index", "sentence1", "sentence2", "label"],
        ["15", "I", "J", "not_entailment"],
    ]
    codeflash_output = processor._create_examples(lines, "dev"); result = codeflash_output  # 2.33μs -> 2.54μs (8.25% slower)


def test_incorrect_line_length(processor):
    # Test with lines of incorrect length (should raise IndexError)
    lines = [
        ["index", "sentence1", "sentence2", "label"],
        ["16", "K"],  # Not enough columns
    ]
    try:
        processor._create_examples(lines, "train")
    except IndexError:
        pass
```

3. Large Scale Test Cases

```python
def test_large_scale_train(processor):
    # Test with a large number of train examples (1000)
    lines = [["index", "sentence1", "sentence2", "label"]]
    for i in range(1000):
        lines.append([str(i), f"sentenceA_{i}", f"sentenceB_{i}", "entailment" if i % 2 == 0 else "not_entailment"])
    codeflash_output = processor._create_examples(lines, "train"); result = codeflash_output  # 372μs -> 294μs (26.7% faster)


def test_large_scale_test(processor):
    # Test with a large number of test examples (1000)
    lines = [["index", "sentence1", "sentence2", "label"]]
    for i in range(1000):
        lines.append([str(i), f"testA_{i}", f"testB_{i}", "entailment"])
    codeflash_output = processor._create_examples(lines, "test"); result = codeflash_output  # 368μs -> 288μs (27.6% faster)
    # Check that all labels are None
    for ex in result:
        pass


def test_large_scale_long_text(processor):
    # Test with large number of examples and long text fields
    long_text = "x" * 1000
    lines = [["index", "sentence1", "sentence2", "label"]]
    for i in range(500):
        lines.append([str(i), long_text, long_text, "entailment"])
    codeflash_output = processor._create_examples(lines, "train"); result = codeflash_output  # 185μs -> 148μs (24.6% faster)
    for ex in result:
        pass


def test_large_scale_edge_labels(processor):
    # Test with large number of examples and edge labels
    lines = [["index", "sentence1", "sentence2", "label"]]
    for i in range(100):
        lines.append([str(i), f"A{i}", f"B{i}", None])
    codeflash_output = processor._create_examples(lines, "train"); result = codeflash_output  # 39.0μs -> 32.2μs (21.1% faster)
    for ex in result:
        pass
```


To edit these changes, run `git checkout codeflash/optimize-RteProcessor._create_examples-mhviy7a9` and push.


@codeflash-ai codeflash-ai bot requested a review from mashraf-222 November 12, 2025 04:53
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Nov 12, 2025