⚡️ Speed up method SentenceSplitter._split by 28%
#122
📄 28% (0.28x) speedup for `SentenceSplitter._split` in `llama-index-core/llama_index/core/node_parser/text/sentence.py`

⏱️ Runtime: 612 microseconds → 478 microseconds (best of 35 runs)

📝 Explanation and details
The optimized code achieves a 28% speedup by implementing batch tokenization to reduce redundant tokenizer calls.
Key optimization: a new `_token_size_batch()` method attempts to tokenize multiple text splits at once, falling back to individual tokenization if batch processing isn't supported by the tokenizer.

Why this works: the line profiler shows `_token_size()` was called 154 times, with each call invoking the tokenizer individually; batching those calls eliminates most of the per-call overhead.

Performance impact: the optimization shows strong gains on larger texts that are split into many chunks (the 500-word test case improves by 82.7%), with smaller improvements when fewer splits are needed (11.2% in the overlap scenario).
The fallback mechanism ensures compatibility with tokenizers that don't support batch operations, maintaining the same behavior while providing significant speedups when possible.
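For illustration, here is a minimal standalone sketch of the batch-with-fallback pattern the PR describes. It is not the actual `_token_size_batch()` implementation; the function name `token_sizes_batch` and its parameters are hypothetical.

```python
from typing import Callable, List, Optional


def token_sizes_batch(
    splits: List[str],
    tokenizer: Callable[[str], List[int]],
    batch_tokenizer: Optional[Callable[[List[str]], List[List[int]]]] = None,
) -> List[int]:
    """Token count for each split, using one batch call when available.

    Hypothetical sketch of the batch-with-fallback idea, not the
    llama-index implementation.
    """
    if batch_tokenizer is not None:
        try:
            # One call for all splits amortizes the per-call tokenizer
            # overhead that dominates when sizes are computed one split
            # at a time.
            return [len(ids) for ids in batch_tokenizer(splits)]
        except Exception:
            pass  # tokenizer rejected batch input; fall through
    # Fallback: tokenize each split individually (the original behavior).
    return [len(tokenizer(s)) for s in splits]
```

With a tokenizer that exposes a batch API (tiktoken's `Encoding`, for example, has both `encode` and `encode_batch`), the batch path is taken; otherwise the per-split loop preserves the original behavior.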
✅ Correctness verification report:
🌀 Generated Regression Tests and Runtime
```python
import pytest
from llama_index.core.node_parser.text.sentence import SentenceSplitter

# ----------- Basic Test Cases -----------

def test_chunk_overlap_greater_than_chunk_size_raises():
    with pytest.raises(ValueError):
        SentenceSplitter(chunk_size=2, chunk_overlap=3)

def test_chunk_overlap_equals_chunk_size_allowed():
    # Should not raise
    SentenceSplitter(chunk_size=2, chunk_overlap=2)

def test_large_chunk_size():
    splitter = SentenceSplitter(chunk_size=1000)
    text = " ".join(["word"] * 500)
    codeflash_output = splitter._split(text, chunk_size=1000); splits = codeflash_output  # 208μs -> 114μs (82.7% faster)

def test_large_chunk_overlap():
    # Should allow overlap == chunk_size
    splitter = SentenceSplitter(chunk_size=100, chunk_overlap=100)
    text = " ".join(["word"] * 150)
    codeflash_output = splitter._split(text, chunk_size=100); splits = codeflash_output  # 403μs -> 363μs (11.2% faster)

# ----------- Mutation Testing (Functional Behavior) -----------
# ------------------------------------------------

import re
from typing import List

# imports
import pytest
from llama_index.core.node_parser.text.sentence import SentenceSplitter

# ------------------- UNIT TESTS -------------------

# Basic Test Cases

def test_chunk_overlap_greater_than_chunk_size():
    # Should raise ValueError
    with pytest.raises(ValueError):
        SentenceSplitter(chunk_size=2, chunk_overlap=3)
```
To edit these changes, run `git checkout codeflash/optimize-SentenceSplitter._split-mhv7mjig` and push.