⚡️ Speed up method SentenceSplitter._split by 28%
#122
📄 28% (0.28x) speedup for `SentenceSplitter._split` in `llama-index-core/llama_index/core/node_parser/text/sentence.py`

⏱️ Runtime: 612 microseconds → 478 microseconds (best of 35 runs)

📝 Explanation and details
The optimized code achieves a 28% speedup by implementing batch tokenization to reduce redundant tokenizer calls.
Key optimization: a new `_token_size_batch()` method attempts to tokenize multiple text splits at once, falling back to individual tokenization if batch processing isn't supported by the tokenizer.

Why this works: the line profiler shows `_token_size()` was called 154 times, with each call invoking the tokenizer individually; batching those calls eliminates most of the per-call overhead.

Performance impact: the optimization shows strong gains on larger texts that are split into many chunks (the 500-word test case improves by 82.7%), with smaller improvements when fewer splits are needed (11.2% in the overlap scenario).
The fallback mechanism ensures compatibility with tokenizers that don't support batch operations, maintaining the same behavior while providing significant speedups when possible.
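For illustration, here is a minimal standalone sketch of the batch-with-fallback pattern the PR describes. It is not the actual `_token_size_batch()` implementation; the function name `token_sizes_batch` and its parameters are hypothetical.

```python
from typing import Callable, List, Optional


def token_sizes_batch(
    splits: List[str],
    tokenizer: Callable[[str], List[int]],
    batch_tokenizer: Optional[Callable[[List[str]], List[List[int]]]] = None,
) -> List[int]:
    """Token count for each split, using one batch call when available.

    Hypothetical sketch of the batch-with-fallback idea, not the
    llama-index implementation.
    """
    if batch_tokenizer is not None:
        try:
            # One call for all splits amortizes the per-call tokenizer
            # overhead that dominates when sizes are computed one split
            # at a time.
            return [len(ids) for ids in batch_tokenizer(splits)]
        except Exception:
            pass  # tokenizer rejected batch input; fall through
    # Fallback: tokenize each split individually (the original behavior).
    return [len(tokenizer(s)) for s in splits]
```

With a tokenizer that exposes a batch API (tiktoken's `Encoding`, for example, has both `encode` and `encode_batch`), the batch path is taken; otherwise the per-split loop preserves the original behavior.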
✅ Correctness verification report:
🌀 Generated Regression Tests and Runtime
```python
import pytest
from llama_index.core.node_parser.text.sentence import SentenceSplitter

# ----------- Basic Test Cases -----------

def test_chunk_overlap_greater_than_chunk_size_raises():
    with pytest.raises(ValueError):
        SentenceSplitter(chunk_size=2, chunk_overlap=3)

def test_chunk_overlap_equals_chunk_size_allowed():
    # Should not raise
    SentenceSplitter(chunk_size=2, chunk_overlap=2)

def test_large_chunk_size():
    splitter = SentenceSplitter(chunk_size=1000)
    text = " ".join(["word"] * 500)
    codeflash_output = splitter._split(text, chunk_size=1000); splits = codeflash_output  # 208μs -> 114μs (82.7% faster)

def test_large_chunk_overlap():
    # Should allow overlap == chunk_size
    splitter = SentenceSplitter(chunk_size=100, chunk_overlap=100)
    text = " ".join(["word"] * 150)
    codeflash_output = splitter._split(text, chunk_size=100); splits = codeflash_output  # 403μs -> 363μs (11.2% faster)

# ----------- Mutation Testing (Functional Behavior) -----------
# ------------------------------------------------

import re
from typing import List

# imports
import pytest
from llama_index.core.node_parser.text.sentence import SentenceSplitter

# ------------------- UNIT TESTS -------------------

# Basic Test Cases

def test_chunk_overlap_greater_than_chunk_size():
    # Should raise ValueError
    with pytest.raises(ValueError):
        SentenceSplitter(chunk_size=2, chunk_overlap=3)
```
To edit these changes, run `git checkout codeflash/optimize-SentenceSplitter._split-mhv7mjig` and push.