
feat(text-splitters): add strict_chunk_size to RecursiveCharacterTextSplitter and chunk_position to TextSplitter #30220


Open: AdeStAff wants to merge 7 commits into master from AdeStAff:feat-strict-chunk-size

@AdeStAff AdeStAff commented Mar 11, 2025

Description: This PR proposes a stricter chunking approach for TextSplitter and RecursiveCharacterTextSplitter. The current RecursiveCharacterTextSplitter does not guarantee that chunks stay within the chunk_size limit: when no separators are found, chunks can exceed chunk_size, leading to:

  • Inconsistent chunk sizes, which negatively impact embedding models that expect uniform input lengths, or at least a maximum input length.
  • Performance overhead, as excessive splitting and merging occur when " " is used as a fallback separator.

This PR improves chunking by:

  • Prioritizing meaningful separators for splitting.
  • Enforcing chunk_size strictly when no separators remain, ensuring all chunks are ≤ chunk_size.
  • Introducing chunk_position metadata, enabling windowed retrieval for better context-aware lookups.
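
The first two points can be illustrated with a minimal, self-contained sketch. This is not the PR's implementation: `split_text_sketch` and its `strict` flag are hypothetical names used only for illustration, and chunk overlap is ignored. It tries separators in priority order and, when none remain, either returns the oversized piece as-is (non-strict) or slices it into pieces of at most `chunk_size` characters (strict).

```python
from typing import List


def split_text_sketch(
    text: str, chunk_size: int, separators: List[str], strict: bool = False
) -> List[str]:
    """Hypothetical sketch of separator-first splitting with an optional hard cap."""
    if len(text) <= chunk_size:
        return [text] if text else []

    # Use the highest-priority separator that actually occurs in the text.
    for i, sep in enumerate(separators):
        if sep and sep in text:
            remaining = separators[i + 1:]
            chunks: List[str] = []
            current = ""
            for piece in (p for p in text.split(sep) if p):
                candidate = piece if not current else current + sep + piece
                if len(candidate) <= chunk_size:
                    # The piece still fits into the chunk being built.
                    current = candidate
                    continue
                if current:
                    chunks.append(current)
                    current = ""
                if len(piece) <= chunk_size:
                    current = piece
                else:
                    # The piece alone is too long: retry with lower-priority separators.
                    chunks.extend(
                        split_text_sketch(piece, chunk_size, remaining, strict)
                    )
            if current:
                chunks.append(current)
            return chunks

    # No separator is left for this oversized piece.
    if strict:
        # Hard cut: every resulting chunk has at most chunk_size characters.
        return [text[j:j + chunk_size] for j in range(0, len(text), chunk_size)]
    return [text]  # Non-strict: the chunk may exceed chunk_size.


text = "aaaa bbbbbbbbbbbb cc"
loose = split_text_sketch(text, 5, ["\n\n", " "])
strict_chunks = split_text_sketch(text, 5, ["\n\n", " "], strict=True)
```

In this sketch, `loose` still contains the 12-character word as a single chunk, while every element of `strict_chunks` has at most 5 characters.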

What has changed:

  • strict_chunk_size (new argument): when strict_chunk_size=True, chunks never exceed chunk_size, even if no separators exist. It avoids falling back to " " as a separator when unnecessary, which reduces redundant merging operations. Default: strict_chunk_size=False (preserves existing behavior).
  • chunk_position metadata (new): records each chunk's position as "X/N", where X is the current chunk number and N is the total number of chunks. This enables windowed retrieval, i.e. fetching a chunk's neighbors (previous and next). To activate it, set add_chunk_position=True when initializing TextSplitter.
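
To make the "X/N" format and windowed retrieval concrete, here is a small sketch using plain dictionaries rather than LangChain `Document` objects. The helper names `attach_chunk_position` and `neighbor_positions` are hypothetical, and the sketch assumes X is 1-based, which the description does not state explicitly.

```python
from typing import Dict, List


def attach_chunk_position(chunks: List[str]) -> List[Dict[str, object]]:
    """Hypothetical helper: tag each chunk with an "X/N" position (X assumed 1-based)."""
    total = len(chunks)
    return [
        {"page_content": chunk, "metadata": {"chunk_position": f"{i}/{total}"}}
        for i, chunk in enumerate(chunks, start=1)
    ]


def neighbor_positions(chunk_position: str, window: int = 1) -> List[str]:
    """Hypothetical helper: positions within `window` chunks of the given one."""
    current, total = (int(part) for part in chunk_position.split("/"))
    first = max(1, current - window)
    last = min(total, current + window)
    return [f"{i}/{total}" for i in range(first, last + 1)]


docs = attach_chunk_position(["intro", "body", "conclusion"])
hit = docs[1]["metadata"]["chunk_position"]  # position of the retrieved chunk
window = neighbor_positions(hit)  # the hit plus its previous and next chunk
```

A retriever could then fetch the chunks whose `chunk_position` appears in `window` and return them alongside the original hit; how that lookup is done depends on the vector store and is outside this sketch.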

Issue: this issue was discussed by the community here

Tests: Unit tests added in test_text_splitters.py to validate:

  • strict_chunk_size enforcement (ensures no chunk exceeds chunk_size).
  • chunk_position correctness (verifies metadata assignment).
  • Default behavior is unchanged when strict_chunk_size=False.

Tested with different chunk sizes and edge cases (long words, no separators, mixed separators). All tests pass.

Additional notes:

  • Backward compatibility maintained (strict_chunk_size and add_chunk_position default to False).
  • This change is intended to improve both chunking efficiency and retrieval accuracy in RAG pipelines.

vercel bot commented Mar 11, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Skipped Deployment
langchain: ⬜️ Ignored, updated Jul 16, 2025 3:19pm (UTC)

@dosubot dosubot bot added the size:L label Mar 11, 2025
@mdrxy mdrxy changed the title from "text-splitters: Add strict_chunk_size to RecursiveCharacterTextSplitter and chunk_position to TextSplitter" to "feat(text-splitters): add strict_chunk_size to RecursiveCharacterTextSplitter and chunk_position to TextSplitter" Jul 16, 2025
codspeed-hq bot commented Jul 16, 2025

CodSpeed WallTime Performance Report

Merging #30220 will not alter performance

Comparing AdeStAff:feat-strict-chunk-size (ee356a5) with master (12d370a)

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

Summary

✅ 13 untouched benchmarks

codspeed-hq bot commented Jul 16, 2025

CodSpeed Instrumentation Performance Report

Merging #30220 will not alter performance

Comparing AdeStAff:feat-strict-chunk-size (ee356a5) with master (12d370a)

Summary

✅ 14 untouched benchmarks

@mdrxy mdrxy added the text-splitters Related to the package `text-splitters` label Aug 7, 2025
@mdrxy mdrxy changed the title from "feat(text-splitters): add strict_chunk_size to RecursiveCharacterTextSplitter and chunk_position to TextSplitter" to "feat(text-splitters): add strict_chunk_size to RecursiveCharacterTextSplitter and chunk_position to TextSplitter" Aug 7, 2025
Labels
text-splitters Related to the package `text-splitters`