⚡️ Speed up method MnliProcessor._create_examples by 20%
#127
📄 20% (0.20x) speedup for `MnliProcessor._create_examples` in `src/transformers/data/processors/glue.py`

⏱️ Runtime: 2.68 milliseconds → 2.23 milliseconds (best of 250 runs)

📝 Explanation and details
The optimized code achieves a 20% speedup through three key optimizations that eliminate redundant operations in the main loop:
What was optimized:
- **Early exit for empty datasets:** Added `if len(lines) <= 1: return []` to avoid unnecessary loop setup when there are no data rows beyond the header.
- **Pre-computed test condition:** Moved `set_type.startswith("test")` outside the loop into `is_test = set_type.startswith("test")`, eliminating 6,420 repeated string method calls per execution.
- **Direct slice iteration:** Replaced `enumerate(lines)` with `for line in lines[1:]` to skip the header directly, eliminating the need for an index variable and the `if i == 0: continue` check on every iteration.

Why this leads to speedup:

The line profiler shows the original code spent significant time on the `set_type.startswith("test")` check (13.1% of total time) and the `enumerate` overhead (9.2% of total time). By pre-computing the test condition and using direct slicing, these operations are eliminated from the hot loop. The optimizations are particularly effective for larger datasets, as shown in the test results, where improvements range from 19-22% for 500-1000 examples.

Performance characteristics:
The optimizations preserve all behavior including error handling for malformed data and maintain the same output format and exception conditions.
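Below is a minimal sketch of what the optimized method looks like, reconstructed from the description above rather than copied verbatim from `src/transformers/data/processors/glue.py`; it assumes the MNLI TSV layout exercised by the tests below (row index in column 0, premise in column 8, hypothesis in column 9, label in the last column):

```python
# Hypothetical reconstruction of the optimized MnliProcessor._create_examples,
# based on the explanation above; not a verbatim copy of the committed code.
def _create_examples(self, lines, set_type):
    """Creates examples for the training, dev and test sets."""
    # Early exit: only a header row (or nothing) means there are no examples.
    if len(lines) <= 1:
        return []
    # Hoist the string check out of the loop; it is constant for the whole call.
    is_test = set_type.startswith("test")
    examples = []
    # Slice off the header instead of enumerate() plus an `if i == 0: continue` check.
    for line in lines[1:]:
        guid = f"{set_type}-{line[0]}"
        text_a = line[8]
        text_b = line[9]
        label = None if is_test else line[-1]
        examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
    return examples
```

Error behavior is unchanged in this sketch: a short row still raises `IndexError` when `line[8]` or `line[9]` is indexed, which is exactly what the regression tests below check.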
✅ Correctness verification report:
🌀 Generated Regression Tests and Runtime
import pytest # used for our unit tests
from transformers.data.processors.glue import MnliProcessor

# Minimal InputExample and DataProcessor for testing
class InputExample:
    def __init__(self, guid, text_a, text_b, label=None):
        self.guid = guid
        self.text_a = text_a
        self.text_b = text_b
        self.label = label

class DataProcessor:
    pass

from transformers.data.processors.glue import MnliProcessor

# unit tests

# --- Basic Test Cases ---
def test_basic_single_example_train():
    # Test with a single train example (after header)
    processor = MnliProcessor()
    header = ["index", "other", "fields", "not", "used", "in", "this", "test", "sentence1", "sentence2", "label"]
    line = ["0", "", "", "", "", "", "", "", "The cat sat.", "The feline rested.", "entailment"]
    lines = [header, line]
    codeflash_output = processor._create_examples(lines, "train"); examples = codeflash_output # 2.45μs -> 2.18μs (12.0% faster)
    ex = examples[0]

def test_basic_multiple_examples_dev():
    # Test with multiple dev examples
    processor = MnliProcessor()
    header = ["index", "other", "fields", "not", "used", "in", "this", "test", "sentence1", "sentence2", "label"]
    line1 = ["1", "", "", "", "", "", "", "", "Dogs bark.", "Canines make noise.", "neutral"]
    line2 = ["2", "", "", "", "", "", "", "", "Birds fly.", "Penguins swim.", "contradiction"]
    lines = [header, line1, line2]
    codeflash_output = processor._create_examples(lines, "dev"); examples = codeflash_output # 2.86μs -> 2.75μs (4.07% faster)

def test_basic_test_set_label_none():
    # Test that label is None for test set
    processor = MnliProcessor()
    header = ["index", "other", "fields", "not", "used", "in", "this", "test", "sentence1", "sentence2", "label"]
    line = ["3", "", "", "", "", "", "", "", "Fish swim.", "Salmon migrate.", "should_be_ignored"]
    lines = [header, line]
    codeflash_output = processor._create_examples(lines, "test"); examples = codeflash_output # 2.12μs -> 2.18μs (2.84% slower)
    ex = examples[0]
# --- Edge Test Cases ---

def test_edge_empty_lines():
    # Test with only header, no examples
    processor = MnliProcessor()
    header = ["index", "other", "fields", "not", "used", "in", "this", "test", "sentence1", "sentence2", "label"]
    lines = [header]
    codeflash_output = processor._create_examples(lines, "train"); examples = codeflash_output # 800ns -> 485ns (64.9% faster)

def test_edge_missing_fields():
    # Test with missing fields in the line (should raise IndexError)
    processor = MnliProcessor()
    header = ["index", "other", "fields", "not", "used", "in", "this", "test", "sentence1", "sentence2", "label"]
    line = ["4", "", "", "", "", "", "", "", "Only text_a"] # Missing text_b and label
    lines = [header, line]
    with pytest.raises(IndexError):
        processor._create_examples(lines, "train") # 1.50μs -> 1.53μs (1.96% slower)

def test_edge_empty_strings():
    # Test with empty strings for text_a, text_b, and label
    processor = MnliProcessor()
    header = ["index", "other", "fields", "not", "used", "in", "this", "test", "sentence1", "sentence2", "label"]
    line = ["5", "", "", "", "", "", "", "", "", "", ""]
    lines = [header, line]
    codeflash_output = processor._create_examples(lines, "train"); examples = codeflash_output # 2.72μs -> 2.54μs (6.80% faster)
    ex = examples[0]

def test_edge_non_string_fields():
    # Test with non-string types in fields (should be converted to str in guid, but not in text_a/text_b/label)
    processor = MnliProcessor()
    header = ["index", "other", "fields", "not", "used", "in", "this", "test", "sentence1", "sentence2", "label"]
    line = [6, "", "", "", "", "", "", "", 123, 456, True]
    lines = [header, line]
    codeflash_output = processor._create_examples(lines, "train"); examples = codeflash_output # 2.56μs -> 2.44μs (4.63% faster)
    ex = examples[0]

def test_edge_set_type_variations():
    # Test with set_type that starts with "test" but is not exactly "test"
    processor = MnliProcessor()
    header = ["index", "other", "fields", "not", "used", "in", "this", "test", "sentence1", "sentence2", "label"]
    line = ["7", "", "", "", "", "", "", "", "A", "B", "label"]
    lines = [header, line]
    codeflash_output = processor._create_examples(lines, "test_matched"); examples = codeflash_output # 2.42μs -> 2.38μs (2.06% faster)
    ex = examples[0]

def test_edge_duplicate_indices():
    # Test with duplicate indices, should not crash and both examples should be present
    processor = MnliProcessor()
    header = ["index", "other", "fields", "not", "used", "in", "this", "test", "sentence1", "sentence2", "label"]
    line1 = ["8", "", "", "", "", "", "", "", "Text A", "Text B", "label1"]
    line2 = ["8", "", "", "", "", "", "", "", "Text C", "Text D", "label2"]
    lines = [header, line1, line2]
    codeflash_output = processor._create_examples(lines, "train"); examples = codeflash_output # 3.21μs -> 3.10μs (3.49% faster)
# --- Large Scale Test Cases ---

def test_large_scale_1000_examples():
    # Test with 1000 examples, check performance and correctness
    processor = MnliProcessor()
    header = ["index", "other", "fields", "not", "used", "in", "this", "test", "sentence1", "sentence2", "label"]
    lines = [header]
    for i in range(1000):
        line = [str(i), "", "", "", "", "", "", "", f"Sentence {i}", f"Premise {i}", "entailment"]
        lines.append(line)
    codeflash_output = processor._create_examples(lines, "train"); examples = codeflash_output # 417μs -> 349μs (19.2% faster)
    # Check a few random samples
    for idx in [0, 499, 999]:
        ex = examples[idx]

def test_large_scale_test_set_label_none():
    # Test with 500 test examples, label should be None
    processor = MnliProcessor()
    header = ["index", "other", "fields", "not", "used", "in", "this", "test", "sentence1", "sentence2", "label"]
    lines = [header]
    for i in range(500):
        line = [str(i), "", "", "", "", "", "", "", f"TextA{i}", f"TextB{i}", "label_should_be_none"]
        lines.append(line)
    codeflash_output = processor._create_examples(lines, "test"); examples = codeflash_output # 202μs -> 166μs (21.9% faster)
    for ex in examples:
        pass

def test_large_scale_empty_strings():
    # Test with 300 examples, all fields empty strings
    processor = MnliProcessor()
    header = ["index", "other", "fields", "not", "used", "in", "this", "test", "sentence1", "sentence2", "label"]
    lines = [header]
    for i in range(300):
        line = [str(i), "", "", "", "", "", "", "", "", "", ""]
        lines.append(line)
    codeflash_output = processor._create_examples(lines, "train"); examples = codeflash_output # 123μs -> 102μs (20.0% faster)
    for ex in examples:
        pass

def test_large_scale_varied_labels():
    # Test with 100 examples, varied labels
    processor = MnliProcessor()
    header = ["index", "other", "fields", "not", "used", "in", "this", "test", "sentence1", "sentence2", "label"]
    labels = ["entailment", "neutral", "contradiction"]
    lines = [header]
    for i in range(100):
        label = labels[i % 3]
        line = [str(i), "", "", "", "", "", "", "", f"TextA{i}", f"TextB{i}", label]
        lines.append(line)
    codeflash_output = processor._create_examples(lines, "train"); examples = codeflash_output # 41.9μs -> 35.9μs (16.6% faster)
    for i, ex in enumerate(examples):
        pass
codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
import warnings

# imports
import pytest # used for our unit tests
from transformers.data.processors.glue import MnliProcessor

# Minimal InputExample class for testing (since we're not importing from transformers)
class InputExample:
    def __init__(self, guid, text_a, text_b, label):
        self.guid = guid
        self.text_a = text_a
        self.text_b = text_b
        self.label = label

# Minimal DataProcessor class for testing (since we're not importing from transformers)
class DataProcessor:
    pass

DEPRECATION_WARNING = (
    "This {0} will be removed from the library soon, preprocessing should be handled with the 🤗 Datasets "
    "library. You can have a look at this example script for pointers: "
    "https://github.com/huggingface/transformers/blob/main/examples/pytorch/text-classification/run_glue.py"
)

from transformers.data.processors.glue import MnliProcessor

# unit tests

@pytest.fixture
def processor():
    # Fixture to instantiate MnliProcessor for each test
    return MnliProcessor()

# -------------------------
# Basic Test Cases
# -------------------------
def test_basic_single_example(processor):
    # Test with a single valid line (after header)
    lines = [
        ["header0", "header1", "header2", "header3", "header4", "header5", "header6", "header7", "header8", "header9", "header10"], # header
        ["123", "a", "b", "c", "d", "e", "f", "g", "Premise text", "Hypothesis text", "entailment"]
    ]
    set_type = "train"
    codeflash_output = processor._create_examples(lines, set_type); examples = codeflash_output # 2.41μs -> 2.26μs (6.50% faster)
    ex = examples[0]

def test_basic_multiple_examples(processor):
    # Test with multiple valid lines (after header)
    lines = [
        ["header"] * 11,
        ["1", "a", "b", "c", "d", "e", "f", "g", "Premise1", "Hypothesis1", "neutral"],
        ["2", "a", "b", "c", "d", "e", "f", "g", "Premise2", "Hypothesis2", "contradiction"],
        ["3", "a", "b", "c", "d", "e", "f", "g", "Premise3", "Hypothesis3", "entailment"],
    ]
    set_type = "dev"
    codeflash_output = processor._create_examples(lines, set_type); examples = codeflash_output # 3.54μs -> 3.23μs (9.47% faster)

def test_basic_test_set_type(processor):
    # Test with set_type starting with "test" (should set label to None)
    lines = [
        ["header"] * 11,
        ["42", "a", "b", "c", "d", "e", "f", "g", "PremiseX", "HypothesisX", "should_be_ignored"]
    ]
    set_type = "test"
    codeflash_output = processor._create_examples(lines, set_type); examples = codeflash_output # 2.35μs -> 2.29μs (2.93% faster)
    ex = examples[0]
# -------------------------
# Edge Test Cases
# -------------------------

def test_empty_lines(processor):
    # Test with only header (no examples)
    lines = [
        ["header"] * 11
    ]
    set_type = "train"
    codeflash_output = processor._create_examples(lines, set_type); examples = codeflash_output # 802ns -> 481ns (66.7% faster)

def test_missing_columns(processor):
    # Test with lines that have fewer than 10 columns (should raise IndexError)
    lines = [
        ["header"] * 11,
        ["123", "a", "b", "c", "d", "e", "f", "g", "Premise text"] # Only 9 columns
    ]
    set_type = "train"
    with pytest.raises(IndexError):
        processor._create_examples(lines, set_type) # 1.47μs -> 1.59μs (7.59% slower)

def test_extra_columns(processor):
    # Test with lines that have more than 11 columns (should ignore extras)
    lines = [
        ["header"] * 12,
        ["999", "a", "b", "c", "d", "e", "f", "g", "Premise Extra", "Hypothesis Extra", "entailment", "extra_column"]
    ]
    set_type = "train"
    codeflash_output = processor._create_examples(lines, set_type); examples = codeflash_output # 2.49μs -> 2.47μs (0.849% faster)
    ex = examples[0]

def test_non_string_labels(processor):
    # Test with a numeric label
    lines = [
        ["header"] * 11,
        ["123", "a", "b", "c", "d", "e", "f", "g", "Premise", "Hypothesis", 42]
    ]
    set_type = "train"
    codeflash_output = processor._create_examples(lines, set_type); examples = codeflash_output # 2.56μs -> 2.47μs (3.36% faster)

def test_guid_with_special_characters(processor):
    # Test guid generation with special characters in line[0]
    lines = [
        ["header"] * 11,
        ["id-!@#", "a", "b", "c", "d", "e", "f", "g", "Premise", "Hypothesis", "neutral"]
    ]
    set_type = "dev"
    codeflash_output = processor._create_examples(lines, set_type); examples = codeflash_output # 2.56μs -> 2.41μs (6.31% faster)

def test_empty_text_fields(processor):
    # Test with empty premise and hypothesis
    lines = [
        ["header"] * 11,
        ["123", "a", "b", "c", "d", "e", "f", "g", "", "", "entailment"]
    ]
    set_type = "train"
    codeflash_output = processor._create_examples(lines, set_type); examples = codeflash_output # 2.46μs -> 2.38μs (3.75% faster)

def test_label_none_for_test_variants(processor):
    # Test with set_type "test_matched" and "test_mismatched"
    lines = [
        ["header"] * 11,
        ["777", "a", "b", "c", "d", "e", "f", "g", "Premise", "Hypothesis", "label_should_be_none"]
    ]
    for set_type in ["test_matched", "test_mismatched"]:
        codeflash_output = processor._create_examples(lines, set_type); examples = codeflash_output # 3.63μs -> 3.39μs (7.11% faster)

def test_label_empty_string(processor):
    # Test with label as empty string
    lines = [
        ["header"] * 11,
        ["123", "a", "b", "c", "d", "e", "f", "g", "Premise", "Hypothesis", ""]
    ]
    set_type = "train"
    codeflash_output = processor._create_examples(lines, set_type); examples = codeflash_output # 2.35μs -> 2.29μs (2.67% faster)
# -------------------------
# Large Scale Test Cases
# -------------------------

def test_large_scale_1000_examples(processor):
    # Test with 1000 examples
    header = ["header"] * 11
    lines = [header]
    for i in range(1000):
        lines.append([str(i), "a", "b", "c", "d", "e", "f", "g", f"Premise {i}", f"Hypothesis {i}", "entailment"])
    set_type = "train"
    codeflash_output = processor._create_examples(lines, set_type); examples = codeflash_output # 423μs -> 349μs (21.3% faster)

def test_large_scale_test_set_type(processor):
    # Test with 500 test examples (label should be None)
    header = ["header"] * 11
    lines = [header]
    for i in range(500):
        lines.append([str(i), "a", "b", "c", "d", "e", "f", "g", f"Premise {i}", f"Hypothesis {i}", "should_be_ignored"])
    set_type = "test"
    codeflash_output = processor._create_examples(lines, set_type); examples = codeflash_output # 203μs -> 167μs (21.8% faster)
    for ex in examples:
        pass

def test_large_scale_varied_labels(processor):
    # Test with 1000 examples with alternating labels
    header = ["header"] * 11
    lines = [header]
    labels = ["entailment", "neutral", "contradiction"]
    for i in range(1000):
        label = labels[i % 3]
        lines.append([str(i), "a", "b", "c", "d", "e", "f", "g", f"Premise {i}", f"Hypothesis {i}", label])
    set_type = "train"
    codeflash_output = processor._create_examples(lines, set_type); examples = codeflash_output # 406μs -> 338μs (20.0% faster)
    # Check label cycling
    for i, ex in enumerate(examples):
        pass

def test_large_scale_empty_fields(processor):
    # Test with 1000 examples with some empty fields
    header = ["header"] * 11
    lines = [header]
    for i in range(1000):
        premise = "" if i % 10 == 0 else f"Premise {i}"
        hypothesis = "" if i % 20 == 0 else f"Hypothesis {i}"
        label = "entailment"
        lines.append([str(i), "a", "b", "c", "d", "e", "f", "g", premise, hypothesis, label])
    set_type = "train"
    codeflash_output = processor._create_examples(lines, set_type); examples = codeflash_output # 407μs -> 339μs (20.2% faster)
    # Check for empty fields at correct intervals
    for i, ex in enumerate(examples):
        if i % 10 == 0:
            pass
        if i % 20 == 0:
            pass

def test_large_scale_performance(processor):
    # Performance test: ensure function completes quickly for 999 examples
    import time
    header = ["header"] * 11
    lines = [header]
    for i in range(999):
        lines.append([str(i), "a", "b", "c", "d", "e", "f", "g", f"Premise {i}", f"Hypothesis {i}", "entailment"])
    set_type = "train"
    start = time.time()
    codeflash_output = processor._create_examples(lines, set_type); examples = codeflash_output # 403μs -> 337μs (19.7% faster)
    duration = time.time() - start
codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
To edit these changes, `git checkout codeflash/optimize-MnliProcessor._create_examples-mhvekv50` and push.