codeflash-ai bot commented Nov 12, 2025

📄 20% (0.20x) speedup for MnliProcessor._create_examples in src/transformers/data/processors/glue.py

⏱️ Runtime: 2.68 milliseconds → 2.23 milliseconds (best of 250 runs)

📝 Explanation and details

The optimized code achieves a 20% speedup through three key optimizations that eliminate redundant operations in the main loop:

What was optimized:

  1. Early exit for empty datasets: Added if len(lines) <= 1: return [] to avoid unnecessary loop setup when there are no data rows beyond the header.

  2. Pre-computed test condition: Moved set_type.startswith("test") outside the loop into is_test = set_type.startswith("test"), eliminating 6,420 repeated string method calls per execution.

  3. Direct slice iteration: Replaced enumerate(lines) with for line in lines[1:] to skip the header directly, eliminating the need for an index variable and the if i == 0: continue check on every iteration.
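
Taken together, the three changes amount to only a few lines. The sketch below is a minimal, illustrative reconstruction of the optimized method (not the exact PR diff), assuming the MNLI TSV layout exercised by the tests in this report: `text_a` in column 8, `text_b` in column 9, and the label in the last column. `InputExample` here is a local stand-in for the transformers class.

```python
class InputExample:
    """Stand-in for transformers' InputExample, for illustration only."""

    def __init__(self, guid, text_a, text_b, label=None):
        self.guid = guid
        self.text_a = text_a
        self.text_b = text_b
        self.label = label


def create_examples(lines, set_type):
    if len(lines) <= 1:  # (1) early exit: header only, no data rows
        return []
    is_test = set_type.startswith("test")  # (2) hoisted out of the hot loop
    examples = []
    for line in lines[1:]:  # (3) slice skips the header, no index bookkeeping
        guid = f"{set_type}-{line[0]}"
        label = None if is_test else line[-1]
        examples.append(
            InputExample(guid=guid, text_a=line[8], text_b=line[9], label=label)
        )
    return examples
```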

Why this leads to speedup:

The line profiler shows the original code spent significant time on the set_type.startswith("test") check (13.1% of total time) and the enumerate overhead (9.2% of total time). By pre-computing the test condition and using direct slicing, these operations are eliminated from the hot loop. The optimizations are particularly effective for larger datasets, as shown in the test results where improvements range from 19-22% for 500-1000 examples.
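
The effect of the two loop-level changes is easy to reproduce in isolation. The micro-benchmark below is an illustrative sketch (not part of the PR) that compares the original per-iteration pattern against the hoisted/sliced version on a 1000-row input; absolute times will vary by machine, but the hoisted variant should be measurably faster.

```python
import timeit

lines = [["header"]] + [[str(i)] for i in range(1000)]
set_type = "train"


def per_iteration():
    # original pattern: index check and startswith() on every iteration
    out = []
    for i, line in enumerate(lines):
        if i == 0:
            continue
        out.append(None if set_type.startswith("test") else line[0])
    return out


def hoisted():
    # optimized pattern: condition computed once, header skipped by slicing
    is_test = set_type.startswith("test")
    out = []
    for line in lines[1:]:
        out.append(None if is_test else line[0])
    return out


print(f"per-iteration: {timeit.timeit(per_iteration, number=2000):.3f}s")
print(f"hoisted:       {timeit.timeit(hoisted, number=2000):.3f}s")
```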

Performance characteristics:

  • Small datasets (single examples): 2-12% improvement
  • Medium datasets (100-300 examples): 16-20% improvement
  • Large datasets (500-1000 examples): 19-22% improvement
  • Edge cases (empty datasets): Up to 67% improvement due to early exit

The optimizations preserve all behavior including error handling for malformed data and maintain the same output format and exception conditions.

Correctness verification report:

| Test | Status |
|------|--------|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 59 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |

🌀 Generated Regression Tests and Runtime

```python
import pytest  # used for our unit tests

from transformers.data.processors.glue import MnliProcessor


# Minimal InputExample and DataProcessor for testing
class InputExample:
    def __init__(self, guid, text_a, text_b, label=None):
        self.guid = guid
        self.text_a = text_a
        self.text_b = text_b
        self.label = label


class DataProcessor:
    pass


# unit tests

# --- Basic Test Cases ---


def test_basic_single_example_train():
    # Test with a single train example (after header)
    processor = MnliProcessor()
    header = ["index", "other", "fields", "not", "used", "in", "this", "test", "sentence1", "sentence2", "label"]
    line = ["0", "", "", "", "", "", "", "", "The cat sat.", "The feline rested.", "entailment"]
    lines = [header, line]
    codeflash_output = processor._create_examples(lines, "train"); examples = codeflash_output  # 2.45μs -> 2.18μs (12.0% faster)
    ex = examples[0]


def test_basic_multiple_examples_dev():
    # Test with multiple dev examples
    processor = MnliProcessor()
    header = ["index", "other", "fields", "not", "used", "in", "this", "test", "sentence1", "sentence2", "label"]
    line1 = ["1", "", "", "", "", "", "", "", "Dogs bark.", "Canines make noise.", "neutral"]
    line2 = ["2", "", "", "", "", "", "", "", "Birds fly.", "Penguins swim.", "contradiction"]
    lines = [header, line1, line2]
    codeflash_output = processor._create_examples(lines, "dev"); examples = codeflash_output  # 2.86μs -> 2.75μs (4.07% faster)


def test_basic_test_set_label_none():
    # Test that label is None for test set
    processor = MnliProcessor()
    header = ["index", "other", "fields", "not", "used", "in", "this", "test", "sentence1", "sentence2", "label"]
    line = ["3", "", "", "", "", "", "", "", "Fish swim.", "Salmon migrate.", "should_be_ignored"]
    lines = [header, line]
    codeflash_output = processor._create_examples(lines, "test"); examples = codeflash_output  # 2.12μs -> 2.18μs (2.84% slower)
    ex = examples[0]


# --- Edge Test Cases ---


def test_edge_empty_lines():
    # Test with only header, no examples
    processor = MnliProcessor()
    header = ["index", "other", "fields", "not", "used", "in", "this", "test", "sentence1", "sentence2", "label"]
    lines = [header]
    codeflash_output = processor._create_examples(lines, "train"); examples = codeflash_output  # 800ns -> 485ns (64.9% faster)


def test_edge_missing_fields():
    # Test with missing fields in the line (should raise IndexError)
    processor = MnliProcessor()
    header = ["index", "other", "fields", "not", "used", "in", "this", "test", "sentence1", "sentence2", "label"]
    line = ["4", "", "", "", "", "", "", "", "Only text_a"]  # Missing text_b and label
    lines = [header, line]
    with pytest.raises(IndexError):
        processor._create_examples(lines, "train")  # 1.50μs -> 1.53μs (1.96% slower)


def test_edge_empty_strings():
    # Test with empty strings for text_a, text_b, and label
    processor = MnliProcessor()
    header = ["index", "other", "fields", "not", "used", "in", "this", "test", "sentence1", "sentence2", "label"]
    line = ["5", "", "", "", "", "", "", "", "", "", ""]
    lines = [header, line]
    codeflash_output = processor._create_examples(lines, "train"); examples = codeflash_output  # 2.72μs -> 2.54μs (6.80% faster)
    ex = examples[0]


def test_edge_non_string_fields():
    # Test with non-string types in fields (should be converted to str in guid, but not in text_a/text_b/label)
    processor = MnliProcessor()
    header = ["index", "other", "fields", "not", "used", "in", "this", "test", "sentence1", "sentence2", "label"]
    line = [6, "", "", "", "", "", "", "", 123, 456, True]
    lines = [header, line]
    codeflash_output = processor._create_examples(lines, "train"); examples = codeflash_output  # 2.56μs -> 2.44μs (4.63% faster)
    ex = examples[0]


def test_edge_set_type_variations():
    # Test with set_type that starts with "test" but is not exactly "test"
    processor = MnliProcessor()
    header = ["index", "other", "fields", "not", "used", "in", "this", "test", "sentence1", "sentence2", "label"]
    line = ["7", "", "", "", "", "", "", "", "A", "B", "label"]
    lines = [header, line]
    codeflash_output = processor._create_examples(lines, "test_matched"); examples = codeflash_output  # 2.42μs -> 2.38μs (2.06% faster)
    ex = examples[0]


def test_edge_duplicate_indices():
    # Test with duplicate indices, should not crash and both examples should be present
    processor = MnliProcessor()
    header = ["index", "other", "fields", "not", "used", "in", "this", "test", "sentence1", "sentence2", "label"]
    line1 = ["8", "", "", "", "", "", "", "", "Text A", "Text B", "label1"]
    line2 = ["8", "", "", "", "", "", "", "", "Text C", "Text D", "label2"]
    lines = [header, line1, line2]
    codeflash_output = processor._create_examples(lines, "train"); examples = codeflash_output  # 3.21μs -> 3.10μs (3.49% faster)


# --- Large Scale Test Cases ---


def test_large_scale_1000_examples():
    # Test with 1000 examples, check performance and correctness
    processor = MnliProcessor()
    header = ["index", "other", "fields", "not", "used", "in", "this", "test", "sentence1", "sentence2", "label"]
    lines = [header]
    for i in range(1000):
        line = [str(i), "", "", "", "", "", "", "", f"Sentence {i}", f"Premise {i}", "entailment"]
        lines.append(line)
    codeflash_output = processor._create_examples(lines, "train"); examples = codeflash_output  # 417μs -> 349μs (19.2% faster)
    # Check a few random samples
    for idx in [0, 499, 999]:
        ex = examples[idx]


def test_large_scale_test_set_label_none():
    # Test with 500 test examples, label should be None
    processor = MnliProcessor()
    header = ["index", "other", "fields", "not", "used", "in", "this", "test", "sentence1", "sentence2", "label"]
    lines = [header]
    for i in range(500):
        line = [str(i), "", "", "", "", "", "", "", f"TextA{i}", f"TextB{i}", "label_should_be_none"]
        lines.append(line)
    codeflash_output = processor._create_examples(lines, "test"); examples = codeflash_output  # 202μs -> 166μs (21.9% faster)
    for ex in examples:
        pass


def test_large_scale_empty_strings():
    # Test with 300 examples, all fields empty strings
    processor = MnliProcessor()
    header = ["index", "other", "fields", "not", "used", "in", "this", "test", "sentence1", "sentence2", "label"]
    lines = [header]
    for i in range(300):
        line = [str(i), "", "", "", "", "", "", "", "", "", ""]
        lines.append(line)
    codeflash_output = processor._create_examples(lines, "train"); examples = codeflash_output  # 123μs -> 102μs (20.0% faster)
    for ex in examples:
        pass


def test_large_scale_varied_labels():
    # Test with 100 examples, varied labels
    processor = MnliProcessor()
    header = ["index", "other", "fields", "not", "used", "in", "this", "test", "sentence1", "sentence2", "label"]
    labels = ["entailment", "neutral", "contradiction"]
    lines = [header]
    for i in range(100):
        label = labels[i % 3]
        line = [str(i), "", "", "", "", "", "", "", f"TextA{i}", f"TextB{i}", label]
        lines.append(line)
    codeflash_output = processor._create_examples(lines, "train"); examples = codeflash_output  # 41.9μs -> 35.9μs (16.6% faster)
    for i, ex in enumerate(examples):
        pass
```

`codeflash_output` is used to check that the output of the original code is the same as that of the optimized code.

```python
# ------------------------------------------------
import warnings

# imports
import pytest  # used for our unit tests

from transformers.data.processors.glue import MnliProcessor


# Minimal InputExample class for testing (since we're not importing from transformers)
class InputExample:
    def __init__(self, guid, text_a, text_b, label):
        self.guid = guid
        self.text_a = text_a
        self.text_b = text_b
        self.label = label

    def __eq__(self, other):
        return (
            isinstance(other, InputExample)
            and self.guid == other.guid
            and self.text_a == other.text_a
            and self.text_b == other.text_b
            and self.label == other.label
        )

    def __repr__(self):
        return f"InputExample(guid={self.guid!r}, text_a={self.text_a!r}, text_b={self.text_b!r}, label={self.label!r})"


# Minimal DataProcessor class for testing (since we're not importing from transformers)
class DataProcessor:
    pass


DEPRECATION_WARNING = (
    "This {0} will be removed from the library soon, preprocessing should be handled with the 🤗 Datasets "
    "library. You can have a look at this example script for pointers: "
    "https://github.com/huggingface/transformers/blob/main/examples/pytorch/text-classification/run_glue.py"
)


# unit tests

@pytest.fixture
def processor():
    # Fixture to instantiate MnliProcessor for each test
    return MnliProcessor()


# -------------------------
# Basic Test Cases
# -------------------------


def test_basic_single_example(processor):
    # Test with a single valid line (after header)
    lines = [
        ["header0", "header1", "header2", "header3", "header4", "header5", "header6", "header7", "header8", "header9", "header10"],  # header
        ["123", "a", "b", "c", "d", "e", "f", "g", "Premise text", "Hypothesis text", "entailment"],
    ]
    set_type = "train"
    codeflash_output = processor._create_examples(lines, set_type); examples = codeflash_output  # 2.41μs -> 2.26μs (6.50% faster)
    ex = examples[0]


def test_basic_multiple_examples(processor):
    # Test with multiple valid lines (after header)
    lines = [
        ["header"] * 11,
        ["1", "a", "b", "c", "d", "e", "f", "g", "Premise1", "Hypothesis1", "neutral"],
        ["2", "a", "b", "c", "d", "e", "f", "g", "Premise2", "Hypothesis2", "contradiction"],
        ["3", "a", "b", "c", "d", "e", "f", "g", "Premise3", "Hypothesis3", "entailment"],
    ]
    set_type = "dev"
    codeflash_output = processor._create_examples(lines, set_type); examples = codeflash_output  # 3.54μs -> 3.23μs (9.47% faster)


def test_basic_test_set_type(processor):
    # Test with set_type starting with "test" (should set label to None)
    lines = [
        ["header"] * 11,
        ["42", "a", "b", "c", "d", "e", "f", "g", "PremiseX", "HypothesisX", "should_be_ignored"],
    ]
    set_type = "test"
    codeflash_output = processor._create_examples(lines, set_type); examples = codeflash_output  # 2.35μs -> 2.29μs (2.93% faster)
    ex = examples[0]


# -------------------------
# Edge Test Cases
# -------------------------


def test_empty_lines(processor):
    # Test with only header (no examples)
    lines = [
        ["header"] * 11,
    ]
    set_type = "train"
    codeflash_output = processor._create_examples(lines, set_type); examples = codeflash_output  # 802ns -> 481ns (66.7% faster)


def test_missing_columns(processor):
    # Test with lines that have fewer than 10 columns (should raise IndexError)
    lines = [
        ["header"] * 11,
        ["123", "a", "b", "c", "d", "e", "f", "g", "Premise text"],  # Only 9 columns
    ]
    set_type = "train"
    with pytest.raises(IndexError):
        processor._create_examples(lines, set_type)  # 1.47μs -> 1.59μs (7.59% slower)


def test_extra_columns(processor):
    # Test with lines that have more than 11 columns (should ignore extras)
    lines = [
        ["header"] * 12,
        ["999", "a", "b", "c", "d", "e", "f", "g", "Premise Extra", "Hypothesis Extra", "entailment", "extra_column"],
    ]
    set_type = "train"
    codeflash_output = processor._create_examples(lines, set_type); examples = codeflash_output  # 2.49μs -> 2.47μs (0.849% faster)
    ex = examples[0]


def test_non_string_labels(processor):
    # Test with a numeric label
    lines = [
        ["header"] * 11,
        ["123", "a", "b", "c", "d", "e", "f", "g", "Premise", "Hypothesis", 42],
    ]
    set_type = "train"
    codeflash_output = processor._create_examples(lines, set_type); examples = codeflash_output  # 2.56μs -> 2.47μs (3.36% faster)


def test_guid_with_special_characters(processor):
    # Test guid generation with special characters in line[0]
    lines = [
        ["header"] * 11,
        ["id-!@#", "a", "b", "c", "d", "e", "f", "g", "Premise", "Hypothesis", "neutral"],
    ]
    set_type = "dev"
    codeflash_output = processor._create_examples(lines, set_type); examples = codeflash_output  # 2.56μs -> 2.41μs (6.31% faster)


def test_empty_text_fields(processor):
    # Test with empty premise and hypothesis
    lines = [
        ["header"] * 11,
        ["123", "a", "b", "c", "d", "e", "f", "g", "", "", "entailment"],
    ]
    set_type = "train"
    codeflash_output = processor._create_examples(lines, set_type); examples = codeflash_output  # 2.46μs -> 2.38μs (3.75% faster)


def test_label_none_for_test_variants(processor):
    # Test with set_type "test_matched" and "test_mismatched"
    lines = [
        ["header"] * 11,
        ["777", "a", "b", "c", "d", "e", "f", "g", "Premise", "Hypothesis", "label_should_be_none"],
    ]
    for set_type in ["test_matched", "test_mismatched"]:
        codeflash_output = processor._create_examples(lines, set_type); examples = codeflash_output  # 3.63μs -> 3.39μs (7.11% faster)


def test_label_empty_string(processor):
    # Test with label as empty string
    lines = [
        ["header"] * 11,
        ["123", "a", "b", "c", "d", "e", "f", "g", "Premise", "Hypothesis", ""],
    ]
    set_type = "train"
    codeflash_output = processor._create_examples(lines, set_type); examples = codeflash_output  # 2.35μs -> 2.29μs (2.67% faster)


# -------------------------
# Large Scale Test Cases
# -------------------------


def test_large_scale_1000_examples(processor):
    # Test with 1000 examples
    header = ["header"] * 11
    lines = [header]
    for i in range(1000):
        lines.append([str(i), "a", "b", "c", "d", "e", "f", "g", f"Premise {i}", f"Hypothesis {i}", "entailment"])
    set_type = "train"
    codeflash_output = processor._create_examples(lines, set_type); examples = codeflash_output  # 423μs -> 349μs (21.3% faster)


def test_large_scale_test_set_type(processor):
    # Test with 500 test examples (label should be None)
    header = ["header"] * 11
    lines = [header]
    for i in range(500):
        lines.append([str(i), "a", "b", "c", "d", "e", "f", "g", f"Premise {i}", f"Hypothesis {i}", "should_be_ignored"])
    set_type = "test"
    codeflash_output = processor._create_examples(lines, set_type); examples = codeflash_output  # 203μs -> 167μs (21.8% faster)
    for ex in examples:
        pass


def test_large_scale_varied_labels(processor):
    # Test with 1000 examples with alternating labels
    header = ["header"] * 11
    lines = [header]
    labels = ["entailment", "neutral", "contradiction"]
    for i in range(1000):
        label = labels[i % 3]
        lines.append([str(i), "a", "b", "c", "d", "e", "f", "g", f"Premise {i}", f"Hypothesis {i}", label])
    set_type = "train"
    codeflash_output = processor._create_examples(lines, set_type); examples = codeflash_output  # 406μs -> 338μs (20.0% faster)
    # Check label cycling
    for i, ex in enumerate(examples):
        pass


def test_large_scale_empty_fields(processor):
    # Test with 1000 examples with some empty fields
    header = ["header"] * 11
    lines = [header]
    for i in range(1000):
        premise = "" if i % 10 == 0 else f"Premise {i}"
        hypothesis = "" if i % 20 == 0 else f"Hypothesis {i}"
        label = "entailment"
        lines.append([str(i), "a", "b", "c", "d", "e", "f", "g", premise, hypothesis, label])
    set_type = "train"
    codeflash_output = processor._create_examples(lines, set_type); examples = codeflash_output  # 407μs -> 339μs (20.2% faster)
    # Check for empty fields at correct intervals
    for i, ex in enumerate(examples):
        if i % 10 == 0:
            pass
        if i % 20 == 0:
            pass


def test_large_scale_performance(processor):
    # Performance test: ensure function completes quickly for 999 examples
    import time

    header = ["header"] * 11
    lines = [header]
    for i in range(999):
        lines.append([str(i), "a", "b", "c", "d", "e", "f", "g", f"Premise {i}", f"Hypothesis {i}", "entailment"])
    set_type = "train"
    start = time.time()
    codeflash_output = processor._create_examples(lines, set_type); examples = codeflash_output  # 403μs -> 337μs (19.7% faster)
    duration = time.time() - start
```

`codeflash_output` is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, run `git checkout codeflash/optimize-MnliProcessor._create_examples-mhvekv50` and push.


@codeflash-ai codeflash-ai bot requested a review from mashraf-222 November 12, 2025 02:50
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Nov 12, 2025