Conversation

@codeflash-ai codeflash-ai bot commented Nov 11, 2025

📄 28% (0.28x) speedup for SentenceSplitter._split in llama-index-core/llama_index/core/node_parser/text/sentence.py

⏱️ Runtime : 612 microseconds → 478 microseconds (best of 35 runs)

📝 Explanation and details

The optimized code achieves a 28% speedup by implementing batch tokenization to reduce redundant tokenizer calls.

Key optimization: Added _token_size_batch() method that attempts to tokenize multiple text splits at once, falling back to individual tokenization if batch processing isn't supported by the tokenizer.

Why this works:

  • In the original code, _token_size() was called 154 times (per the line profiler), and each call invoked the tokenizer on a single string
  • The optimized version reduces tokenizer calls from 154 to just 2 by batching 150 text splits together in one call
  • Tokenizers like tiktoken support efficient batch encoding, avoiding the repeated per-string call overhead; a sketch of the pattern follows this list
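
To make the batching-with-fallback idea concrete, here is a minimal sketch of the pattern. It is not the PR's actual implementation: the helper name `token_sizes_batch` and its signature are hypothetical stand-ins for the real `_token_size_batch()` method on `SentenceSplitter`.

```python
from typing import Callable, List


def token_sizes_batch(tokenizer: Callable[[str], List[int]], splits: List[str]) -> List[int]:
    """Token counts for each split, using one batched call when the tokenizer allows it."""
    # A tiktoken tokenizer is typically passed around as the bound method
    # Encoding.encode, so look for encode_batch both on the callable itself
    # and on the object that owns it.
    owner = getattr(tokenizer, "__self__", None)
    encode_batch = getattr(tokenizer, "encode_batch", None) or getattr(owner, "encode_batch", None)
    if encode_batch is not None:
        try:
            # One call tokenizes every split at once.
            return [len(tokens) for tokens in encode_batch(splits)]
        except Exception:
            pass  # batch call not supported after all; fall through
    # Fallback: one tokenizer call per split (the original per-string behavior).
    return [len(tokenizer(s)) for s in splits]
```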

Performance impact:

  • Tokenization overhead drops from 813μs to 228μs (72% reduction)
  • Total method time improves from 3.18ms to 2.70ms
  • Most effective for workloads with many small text chunks that need tokenization

Test case analysis: The optimization shows strong gains on larger texts that get split into many chunks (like the 500-word test case improving 82.7%), while smaller improvements occur when fewer splits are needed (11.2% for overlap scenarios).

The fallback mechanism ensures compatibility with tokenizers that don't support batch operations, maintaining the same behavior while providing significant speedups when possible.
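
For reference, this is roughly what batch encoding looks like with a tiktoken tokenizer; the encoding name and sample strings below are arbitrary example values.

```python
import tiktoken  # assumes tiktoken is available

enc = tiktoken.get_encoding("cl100k_base")
splits = ["The first sentence.", "A second, slightly longer sentence.", "Third."]

# One tokenizer call per split (what the original _token_size() loop amounts to).
per_split_counts = [len(enc.encode(s)) for s in splits]

# A single batched call covering every split.
batched_counts = [len(tokens) for tokens in enc.encode_batch(splits)]

assert per_split_counts == batched_counts  # same counts, far fewer tokenizer calls
```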

Correctness verification report:

Test                          | Status
----------------------------- | -------------
⚙️ Existing Unit Tests        | 🔘 None Found
🌀 Generated Regression Tests | 7 Passed
⏪ Replay Tests               | 🔘 None Found
🔎 Concolic Coverage Tests    | 🔘 None Found
📊 Tests Coverage             | 83.3%
🌀 Generated Regression Tests and Runtime

```python
import pytest
from llama_index.core.node_parser.text.sentence import SentenceSplitter

# ----------- Basic Test Cases -----------

def test_chunk_overlap_greater_than_chunk_size_raises():
    with pytest.raises(ValueError):
        SentenceSplitter(chunk_size=2, chunk_overlap=3)

def test_chunk_overlap_equals_chunk_size_allowed():
    # Should not raise
    SentenceSplitter(chunk_size=2, chunk_overlap=2)

def test_large_chunk_size():
    splitter = SentenceSplitter(chunk_size=1000)
    text = " ".join(["word"] * 500)
    codeflash_output = splitter._split(text, chunk_size=1000); splits = codeflash_output  # 208μs -> 114μs (82.7% faster)

def test_large_chunk_overlap():
    # Should allow overlap == chunk_size
    splitter = SentenceSplitter(chunk_size=100, chunk_overlap=100)
    text = " ".join(["word"] * 150)
    codeflash_output = splitter._split(text, chunk_size=100); splits = codeflash_output  # 403μs -> 363μs (11.2% faster)

# ----------- Mutation Testing (Functional Behavior) -----------

# ------------------------------------------------
import re
from typing import List

# imports
import pytest
from llama_index.core.node_parser.text.sentence import SentenceSplitter

# ------------------- UNIT TESTS -------------------

# Basic Test Cases

def test_chunk_overlap_greater_than_chunk_size():
    # Should raise ValueError
    with pytest.raises(ValueError):
        SentenceSplitter(chunk_size=2, chunk_overlap=3)
```
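
As a usage note, the generated tests call the private `_split()` helper directly; typical code goes through the splitter's public interface instead. A minimal sketch, where the chunk sizes are arbitrary example values:

```python
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(chunk_size=256, chunk_overlap=32)
text = " ".join(["word"] * 500)

# split_text() is the public entry point; _split() is the internal helper
# exercised directly by the generated tests above.
chunks = splitter.split_text(text)
print(len(chunks), len(chunks[0]))
```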

To edit these changes, `git checkout codeflash/optimize-SentenceSplitter._split-mhv7mjig` and push.


@codeflash-ai codeflash-ai bot requested a review from mashraf-222 November 11, 2025 23:36
@codeflash-ai codeflash-ai bot added the ⚡️ codeflash (Optimization PR opened by Codeflash AI) and 🎯 Quality: High (Optimization Quality according to Codeflash) labels Nov 11, 2025