⚡️ Speed up method RteProcessor._create_examples by 24%
#132
+23
−9
📄 24% (0.24x) speedup for RteProcessor._create_examples in src/transformers/data/processors/glue.py
⏱️ Runtime: 2.73 milliseconds → 2.20 milliseconds (best of 250 runs)
📝 Explanation and details
The optimized code achieves a 24% speedup through several changes that reduce per-iteration overhead:
Key Optimizations:
- Eliminated branch condition in tight loop: The original code checked if i == 0: continue on every iteration (7,175 times according to the profiler). The optimized version skips the header up front using iter(lines) and next(), removing this check from the main processing loop entirely.
- Replaced list.append() with a list comprehension: The original code called examples.append() in a loop, which costs an attribute lookup and a method call on every row (46.5% of total time). The optimized version builds the list with a list comprehension, which does the same appends with less interpreter overhead per element.
- Hoisted invariant computations: is_test = set_type == "test" is computed once instead of on every iteration, and Example = InputExample binds the class to a local name to avoid repeated global lookups.
Performance Impact:
The optimization pays off on the large inputs typical of ML preprocessing pipelines, where the GLUE processor handles thousands of examples. In the generated tests below, inputs of 100 to 1,000 rows run roughly 21% to 28% faster, while empty, header-only, and single-row inputs run up to about 35% slower, a difference of less than a microsecond per call. That trade-off is acceptable given the gains on realistic workloads.
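The three changes described above can be sketched side by side. This is a minimal illustration, not the actual diff: `InputExample` is a stand-in dataclass, `create_examples_original` and `create_examples_optimized` are hypothetical free-function versions of the method, the row layout (index, two sentences, label in the last column) is inferred from the generated tests, and the `None` default passed to `next()` is an assumption about how empty input is handled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InputExample:
    # Stand-in for the library class; only the fields used here.
    guid: str
    text_a: str
    text_b: Optional[str] = None
    label: Optional[str] = None


def create_examples_original(lines, set_type):
    """Loop-and-append shape with a header check on every row."""
    examples = []
    for i, line in enumerate(lines):
        if i == 0:
            continue  # header row; this branch is evaluated on every iteration
        label = None if set_type == "test" else line[-1]
        examples.append(
            InputExample(guid=f"{set_type}-{line[0]}", text_a=line[1], text_b=line[2], label=label)
        )
    return examples


def create_examples_optimized(lines, set_type):
    """Header skipped once, invariants hoisted, list built by comprehension."""
    rows = iter(lines)
    next(rows, None)  # consume the header; the default avoids StopIteration on empty input
    is_test = set_type == "test"  # loop invariant, computed once
    Example = InputExample  # local alias avoids a global lookup per row
    return [
        Example(
            guid=f"{set_type}-{line[0]}",
            text_a=line[1],
            text_b=line[2],
            label=None if is_test else line[-1],
        )
        for line in rows
    ]
```

Because the stand-in is a dataclass, the two results can be compared directly with `==`, which is a quick way to confirm that a rewrite of this kind preserves behavior.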
✅ Correctness verification report:
🌀 Generated Regression Tests and Runtime
import warnings

# imports
import pytest  # used for our unit tests
from transformers.data.processors.glue import RteProcessor


# Minimal InputExample class for testing
class InputExample:
    def __init__(self, guid, text_a, text_b, label=None):
        self.guid = guid
        self.text_a = text_a
        self.text_b = text_b
        self.label = label


# Minimal DataProcessor class for testing
class DataProcessor:
    pass


DEPRECATION_WARNING = (
    "This {0} will be removed from the library soon, preprocessing should be handled with the 🤗 Datasets "
    "library. You can have a look at this example script for pointers: "
    "https://github.com/huggingface/transformers/blob/main/examples/pytorch/text-classification/run_glue.py"
)

from transformers.data.processors.glue import RteProcessor


# unit tests
@pytest.fixture
def processor():
    # Fixture to create a processor instance
    return RteProcessor()

# --------------------------
# 1. Basic Test Cases
# --------------------------

def test_basic_single_example_train(processor):
    # Test a single example in train set
    lines = [
        ["index", "sentence1", "sentence2", "label"],  # header
        ["1", "The sky is blue.", "The sky is clear.", "entailment"]
    ]
    codeflash_output = processor._create_examples(lines, "train"); examples = codeflash_output # 2.21μs -> 2.34μs (5.85% slower)
    ex = examples[0]


def test_basic_single_example_dev(processor):
    # Test a single example in dev set
    lines = [
        ["index", "sentence1", "sentence2", "label"],  # header
        ["2", "Grass is green.", "Plants are green.", "entailment"]
    ]
    codeflash_output = processor._create_examples(lines, "dev"); examples = codeflash_output # 2.21μs -> 2.42μs (8.65% slower)
    ex = examples[0]


def test_basic_single_example_test(processor):
    # Test a single example in test set (label should be None)
    lines = [
        ["index", "sentence1", "sentence2", "label"],  # header
        ["3", "Birds can fly.", "Birds have wings.", "not_entailment"]
    ]
    codeflash_output = processor._create_examples(lines, "test"); examples = codeflash_output # 2.27μs -> 2.48μs (8.28% slower)
    ex = examples[0]


def test_basic_multiple_examples(processor):
    # Test multiple examples in train set
    lines = [
        ["index", "sentence1", "sentence2", "label"],  # header
        ["1", "A", "B", "entailment"],
        ["2", "C", "D", "not_entailment"],
        ["3", "E", "F", "entailment"],
    ]
    codeflash_output = processor._create_examples(lines, "train"); examples = codeflash_output # 3.35μs -> 3.32μs (0.753% faster)

# --------------------------
# 2. Edge Test Cases
# --------------------------

def test_empty_lines(processor):
    # Test with empty lines list
    lines = []
    codeflash_output = processor._create_examples(lines, "train"); examples = codeflash_output # 655ns -> 962ns (31.9% slower)


def test_header_only(processor):
    # Test with only header, no examples
    lines = [["index", "sentence1", "sentence2", "label"]]
    codeflash_output = processor._create_examples(lines, "train"); examples = codeflash_output # 810ns -> 1.24μs (34.7% slower)


def test_missing_text_b(processor):
    # Test with missing text_b (should raise IndexError)
    lines = [
        ["index", "sentence1", "sentence2", "label"],
        ["5", "Sentence A"]  # missing text_b and label
    ]
    with pytest.raises(IndexError):
        processor._create_examples(lines, "train") # 1.67μs -> 2.26μs (25.8% slower)


def test_label_none_in_non_test(processor):
    # Test with label None in train set (should be accepted)
    lines = [
        ["index", "sentence1", "sentence2", "label"],
        ["6", "A", "B", None]
    ]
    codeflash_output = processor._create_examples(lines, "train"); examples = codeflash_output # 3.10μs -> 3.12μs (0.801% slower)


def test_extra_columns(processor):
    # Test with extra columns in input (should use only the required columns)
    lines = [
        ["index", "sentence1", "sentence2", "label", "extra"],
        ["7", "A", "B", "entailment", "ignored"]
    ]
    codeflash_output = processor._create_examples(lines, "train"); examples = codeflash_output # 2.49μs -> 2.73μs (8.83% slower)


def test_non_string_fields(processor):
    # Test with non-string fields (should accept any type)
    lines = [
        ["index", "sentence1", "sentence2", "label"],
        [8, 123, 456, True]
    ]
    codeflash_output = processor._create_examples(lines, "train"); examples = codeflash_output # 2.44μs -> 2.73μs (10.4% slower)


def test_guid_uniqueness(processor):
    # Test that GUID is constructed correctly and is unique per example
    lines = [
        ["index", "sentence1", "sentence2", "label"],
        ["1", "A", "B", "entailment"],
        ["1", "C", "D", "not_entailment"],
    ]
    codeflash_output = processor._create_examples(lines, "train"); examples = codeflash_output # 3.17μs -> 3.13μs (1.21% faster)


def test_label_is_none_for_test(processor):
    # Test that label is always None for test set, even if present
    lines = [
        ["index", "sentence1", "sentence2", "label"],
        ["9", "A", "B", "entailment"],
        ["10", "C", "D", "not_entailment"],
    ]
    codeflash_output = processor._create_examples(lines, "test"); examples = codeflash_output # 3.15μs -> 3.17μs (0.661% slower)


def test_set_type_case_sensitivity(processor):
    # Test that set_type is case-sensitive ("Test" != "test")
    lines = [
        ["index", "sentence1", "sentence2", "label"],
        ["11", "A", "B", "entailment"],
    ]
    codeflash_output = processor._create_examples(lines, "Test"); examples = codeflash_output # 2.35μs -> 2.69μs (12.5% slower)


def test_empty_strings(processor):
    # Test with empty strings for text_a, text_b, label
    lines = [
        ["index", "sentence1", "sentence2", "label"],
        ["12", "", "", ""]
    ]
    codeflash_output = processor._create_examples(lines, "train"); examples = codeflash_output # 2.35μs -> 2.58μs (8.72% slower)


def test_whitespace_strings(processor):
    # Test with whitespace strings for text_a, text_b, label
    lines = [
        ["index", "sentence1", "sentence2", "label"],
        ["13", " ", " ", " "]
    ]
    codeflash_output = processor._create_examples(lines, "train"); examples = codeflash_output # 2.32μs -> 2.60μs (10.8% slower)

# --------------------------
# 3. Large Scale Test Cases
# --------------------------

def test_large_scale_1000_examples(processor):
    # Test with 1000 examples to check scalability
    lines = [["index", "sentence1", "sentence2", "label"]]
    for i in range(1, 1001):
        lines.append([str(i), f"Sentence {i}A", f"Sentence {i}B", "entailment" if i % 2 == 0 else "not_entailment"])
    codeflash_output = processor._create_examples(lines, "train"); examples = codeflash_output # 370μs -> 290μs (27.4% faster)


def test_large_scale_test_set(processor):
    # Test with 500 examples in test set (label should always be None)
    lines = [["index", "sentence1", "sentence2", "label"]]
    for i in range(1, 501):
        lines.append([str(i), f"TestA{i}", f"TestB{i}", "entailment"])
    codeflash_output = processor._create_examples(lines, "test"); examples = codeflash_output # 184μs -> 147μs (24.8% faster)


def test_large_scale_empty_fields(processor):
    # Test with many examples with empty fields
    lines = [["index", "sentence1", "sentence2", "label"]]
    for i in range(1, 1001):
        lines.append([str(i), "", "", ""])
    codeflash_output = processor._create_examples(lines, "train"); examples = codeflash_output # 370μs -> 294μs (26.0% faster)


def test_large_scale_non_string_fields(processor):
    # Test with many examples with non-string fields
    lines = [["index", "sentence1", "sentence2", "label"]]
    for i in range(1, 1001):
        lines.append([i, i * 10, i * 100, i % 2 == 0])
    codeflash_output = processor._create_examples(lines, "train"); examples = codeflash_output # 404μs -> 333μs (21.2% faster)


def test_large_scale_extra_columns(processor):
    # Test with extra columns in large scale
    lines = [["index", "sentence1", "sentence2", "label", "extra"]]
    for i in range(1, 1001):
        lines.append([str(i), f"A{i}", f"B{i}", "entailment", f"extra{i}"])
    codeflash_output = processor._create_examples(lines, "train"); examples = codeflash_output # 366μs -> 288μs (27.3% faster)

# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
import pytest  # used for our unit tests
from transformers.data.processors.glue import RteProcessor


# Minimal InputExample class for testing
class InputExample:
    def __init__(self, guid, text_a, text_b=None, label=None):
        self.guid = guid
        self.text_a = text_a
        self.text_b = text_b
        self.label = label


# Minimal DataProcessor class for testing
class DataProcessor:
    pass


from transformers.data.processors.glue import RteProcessor


# ------------------ UNIT TESTS ------------------

@pytest.fixture
def processor():
    # Fixture to create a processor instance for tests
    return RteProcessor()

# 1. Basic Test Cases

def test_empty_lines(processor):
    # Test with empty lines (should return empty list)
    lines = []
    codeflash_output = processor._create_examples(lines, "train"); result = codeflash_output # 808ns -> 1.08μs (24.9% slower)


def test_only_header(processor):
    # Test with only header row (should return empty list)
    lines = [["index", "sentence1", "sentence2", "label"]]
    codeflash_output = processor._create_examples(lines, "train"); result = codeflash_output # 865ns -> 1.33μs (34.8% slower)


def test_single_train_example(processor):
    # Test with one train example
    lines = [
        ["index", "sentence1", "sentence2", "label"],
        ["0", "The sky is blue.", "The sky is colored.", "entailment"],
    ]
    codeflash_output = processor._create_examples(lines, "train"); result = codeflash_output # 2.67μs -> 2.92μs (8.69% slower)
    expected = [
        InputExample(guid="train-0", text_a="The sky is blue.", text_b="The sky is colored.", label="entailment")
    ]


def test_single_dev_example(processor):
    # Test with one dev example
    lines = [
        ["index", "sentence1", "sentence2", "label"],
        ["1", "Cats are animals.", "Cats are mammals.", "not_entailment"],
    ]
    codeflash_output = processor._create_examples(lines, "dev"); result = codeflash_output # 2.35μs -> 2.59μs (9.37% slower)
    expected = [
        InputExample(guid="dev-1", text_a="Cats are animals.", text_b="Cats are mammals.", label="not_entailment")
    ]


def test_single_test_example(processor):
    # Test with one test example (label should be None)
    lines = [
        ["index", "sentence1", "sentence2", "label"],
        ["2", "Dogs bark.", "Dogs make noise.", "entailment"],
    ]
    codeflash_output = processor._create_examples(lines, "test"); result = codeflash_output # 2.31μs -> 2.49μs (7.27% slower)
    expected = [
        InputExample(guid="test-2", text_a="Dogs bark.", text_b="Dogs make noise.", label=None)
    ]


def test_multiple_examples(processor):
    # Test with multiple examples
    lines = [
        ["index", "sentence1", "sentence2", "label"],
        ["3", "Water is wet.", "Water is dry.", "not_entailment"],
        ["4", "Fire is hot.", "Fire is cold.", "not_entailment"],
    ]
    codeflash_output = processor._create_examples(lines, "train"); result = codeflash_output # 3.01μs -> 3.08μs (2.37% slower)
    expected = [
        InputExample(guid="train-3", text_a="Water is wet.", text_b="Water is dry.", label="not_entailment"),
        InputExample(guid="train-4", text_a="Fire is hot.", text_b="Fire is cold.", label="not_entailment"),
    ]

# 2. Edge Test Cases

def test_missing_label_for_train(processor):
    # Test with missing label in train (should raise IndexError)
    lines = [
        ["index", "sentence1", "sentence2", "label"],
        ["5", "Birds fly.", "Birds swim."],  # Missing label
    ]
    try:
        processor._create_examples(lines, "train")
    except IndexError:
        pass


def test_missing_label_for_test(processor):
    # Test with missing label in test (should work, label=None)
    lines = [
        ["index", "sentence1", "sentence2", "label"],
        ["6", "Fish swim.", "Fish walk."],  # Missing label
    ]
    codeflash_output = processor._create_examples(lines, "test"); result = codeflash_output # 2.32μs -> 2.51μs (7.62% slower)
    expected = [
        InputExample(guid="test-6", text_a="Fish swim.", text_b="Fish walk.", label=None)
    ]


def test_extra_columns(processor):
    # Test with extra columns (should use the last column as label)
    lines = [
        ["index", "sentence1", "sentence2", "label", "extra"],
        ["7", "Sun rises.", "Sun sets.", "entailment", "extra_info"],
    ]
    codeflash_output = processor._create_examples(lines, "train"); result = codeflash_output # 2.25μs -> 2.47μs (8.63% slower)
    expected = [
        InputExample(guid="train-7", text_a="Sun rises.", text_b="Sun sets.", label="extra_info")
    ]


def test_non_string_inputs(processor):
    # Test with non-string values (should handle them as is)
    lines = [
        ["index", "sentence1", "sentence2", "label"],
        [8, 123, None, True],
    ]
    codeflash_output = processor._create_examples(lines, "train"); result = codeflash_output # 2.34μs -> 2.60μs (9.94% slower)
    expected = [
        InputExample(guid="train-8", text_a=123, text_b=None, label=True)
    ]


def test_empty_strings(processor):
    # Test with empty strings
    lines = [
        ["index", "sentence1", "sentence2", "label"],
        ["9", "", "", ""],
    ]
    codeflash_output = processor._create_examples(lines, "train"); result = codeflash_output # 2.30μs -> 2.55μs (9.72% slower)
    expected = [
        InputExample(guid="train-9", text_a="", text_b="", label="")
    ]


def test_long_texts(processor):
    # Test with very long text fields
    long_text = "a" * 10000
    lines = [
        ["index", "sentence1", "sentence2", "label"],
        ["10", long_text, long_text, "entailment"],
    ]
    codeflash_output = processor._create_examples(lines, "train"); result = codeflash_output # 2.25μs -> 2.49μs (9.65% slower)
    expected = [
        InputExample(guid="train-10", text_a=long_text, text_b=long_text, label="entailment")
    ]


def test_guid_uniqueness(processor):
    # Test that guids are correctly composed and unique
    lines = [
        ["index", "sentence1", "sentence2", "label"],
        ["11", "A", "B", "entailment"],
        ["12", "C", "D", "not_entailment"],
    ]
    codeflash_output = processor._create_examples(lines, "train"); result = codeflash_output # 3.09μs -> 3.06μs (1.08% faster)
    guids = [ex.guid for ex in result]


def test_label_none_for_test(processor):
    # Test that label is None for test set, even if label is present
    lines = [
        ["index", "sentence1", "sentence2", "label"],
        ["13", "E", "F", "entailment"],
    ]
    codeflash_output = processor._create_examples(lines, "test"); result = codeflash_output # 2.22μs -> 2.42μs (8.42% slower)


def test_label_not_none_for_train(processor):
    # Test that label is not None for train set
    lines = [
        ["index", "sentence1", "sentence2", "label"],
        ["14", "G", "H", "entailment"],
    ]
    codeflash_output = processor._create_examples(lines, "train"); result = codeflash_output # 2.28μs -> 2.32μs (1.81% slower)


def test_label_not_none_for_dev(processor):
    # Test that label is not None for dev set
    lines = [
        ["index", "sentence1", "sentence2", "label"],
        ["15", "I", "J", "not_entailment"],
    ]
    codeflash_output = processor._create_examples(lines, "dev"); result = codeflash_output # 2.33μs -> 2.54μs (8.25% slower)


def test_incorrect_line_length(processor):
    # Test with lines of incorrect length (should raise IndexError)
    lines = [
        ["index", "sentence1", "sentence2", "label"],
        ["16", "K"],  # Not enough columns
    ]
    try:
        processor._create_examples(lines, "train")
    except IndexError:
        pass

# 3. Large Scale Test Cases

def test_large_scale_train(processor):
    # Test with a large number of train examples (1000)
    lines = [["index", "sentence1", "sentence2", "label"]]
    for i in range(1000):
        lines.append([str(i), f"sentenceA_{i}", f"sentenceB_{i}", "entailment" if i % 2 == 0 else "not_entailment"])
    codeflash_output = processor._create_examples(lines, "train"); result = codeflash_output # 372μs -> 294μs (26.7% faster)


def test_large_scale_test(processor):
    # Test with a large number of test examples (1000)
    lines = [["index", "sentence1", "sentence2", "label"]]
    for i in range(1000):
        lines.append([str(i), f"testA_{i}", f"testB_{i}", "entailment"])
    codeflash_output = processor._create_examples(lines, "test"); result = codeflash_output # 368μs -> 288μs (27.6% faster)
    # Check that all labels are None
    for ex in result:
        pass


def test_large_scale_long_text(processor):
    # Test with large number of examples and long text fields
    long_text = "x" * 1000
    lines = [["index", "sentence1", "sentence2", "label"]]
    for i in range(500):
        lines.append([str(i), long_text, long_text, "entailment"])
    codeflash_output = processor._create_examples(lines, "train"); result = codeflash_output # 185μs -> 148μs (24.6% faster)
    for ex in result:
        pass


def test_large_scale_edge_labels(processor):
    # Test with large number of examples and edge labels
    lines = [["index", "sentence1", "sentence2", "label"]]
    for i in range(100):
        lines.append([str(i), f"A{i}", f"B{i}", None])
    codeflash_output = processor._create_examples(lines, "train"); result = codeflash_output # 39.0μs -> 32.2μs (21.1% faster)
    for ex in result:
        pass

# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
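
The per-test timings annotated in the comments above (slower on empty or one-row inputs, faster at hundreds of rows) could be reproduced with a small harness along these lines. It is only a sketch: `make_lines`, `best_of`, `loop_version`, and `comprehension_version` are hypothetical names introduced here, and the two toy functions build plain tuples rather than `InputExample` objects so that the block runs without `transformers` installed.

```python
import timeit


def make_lines(n_rows):
    """Build a header row plus n_rows synthetic RTE-style rows."""
    header = ["index", "sentence1", "sentence2", "label"]
    return [header] + [[str(i), f"A{i}", f"B{i}", "entailment"] for i in range(n_rows)]


def best_of(func, lines, set_type, runs=50):
    """Best wall-clock time in seconds for a single call, over `runs` repeats."""
    timer = timeit.Timer(lambda: func(lines, set_type))
    return min(timer.repeat(repeat=runs, number=1))


def loop_version(lines, set_type):
    # Toy stand-in for the original shape: per-row header check plus append.
    out = []
    for i, line in enumerate(lines):
        if i == 0:
            continue
        out.append((f"{set_type}-{line[0]}", line[1], line[2]))
    return out


def comprehension_version(lines, set_type):
    # Toy stand-in for the optimized shape: skip the header once, then comprehend.
    rows = iter(lines)
    next(rows, None)
    return [(f"{set_type}-{line[0]}", line[1], line[2]) for line in rows]


if __name__ == "__main__":
    for n in (0, 1, 1000):
        lines = make_lines(n)
        loop_s = best_of(loop_version, lines, "train")
        comp_s = best_of(comprehension_version, lines, "train")
        print(f"{n:>5} rows: loop {loop_s * 1e6:.2f}us, comprehension {comp_s * 1e6:.2f}us")
```

To time the real method instead of the toy functions, `RteProcessor()._create_examples` can be passed as `func` on each branch in turn. Absolute numbers will vary with the machine and Python version.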
To edit these changes, git checkout codeflash/optimize-RteProcessor._create_examples-mhviy7a9 and push.