feat(text-splitters): add strict_chunk_size to RecursiveCharacterTextSplitter and chunk_position to TextSplitter #30220
base: master
…plitter with tests
7bfe089 to ef6dbf6
CodSpeed WallTime Performance Report: Merging #30220 will not alter performance.
CodSpeed Instrumentation Performance Report: Merging #30220 will not alter performance.
Description: This PR enhances TextSplitter and RecursiveCharacterTextSplitter by proposing a new approach to chunking. The current RecursiveCharacterTextSplitter does not guarantee that chunks stay within the chunk_size limit. When no separators are found, chunks can exceed chunk_size, leading to:
This PR improves chunking by:
What has changed:
Reduces redundant merging operations for better efficiency. Default: strict_chunk_size=False (preserves existing behavior).
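To make the size guarantee concrete, here is a minimal, self-contained sketch of the idea. It is not the code from this PR and does not use LangChain: the function `split_text` and its `strict` flag are hypothetical stand-ins for the behavior that `strict_chunk_size` is described as controlling. With `strict=False` a piece with no separator can exceed `chunk_size`; with `strict=True` such a piece is hard-cut so no chunk is longer than the limit.

```python
from __future__ import annotations


def split_text(
    text: str,
    chunk_size: int,
    separator: str = " ",
    strict: bool = False,  # hypothetical flag, mirrors the idea of strict_chunk_size
) -> list[str]:
    """Greedy separator-based splitter (illustrative only, not LangChain's code)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")

    chunks: list[str] = []
    current = ""
    for piece in text.split(separator):
        # Try to merge the next piece into the chunk being built.
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue

        # The merge would overflow: flush the current chunk and start a new one.
        if current:
            chunks.append(current)
        current = piece

        if strict:
            # Assumed strict behavior: hard-cut a piece that has no separator
            # and is longer than chunk_size, keeping the remainder for merging.
            while len(current) > chunk_size:
                chunks.append(current[:chunk_size])
                current = current[chunk_size:]

    if current:
        chunks.append(current)
    return chunks


text = "aaaaaaaaaaaa bb cc"  # one 12-character word with no separator inside it
loose = split_text(text, chunk_size=5)
tight = split_text(text, chunk_size=5, strict=True)
print(loose)  # the long word survives as one oversized chunk
print(tight)  # every chunk is at most 5 characters
```

In this sketch the default stays `strict=False`, so callers that rely on the existing behavior see no change, which matches the default described above.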
Issue: this issue was discussed by the community here
Tests: Unit tests added in test_text_splitters.py to validate:
Tested with different chunk sizes and edge cases (long words, no separators, mixed separators). All tests pass.
Additional notes: