User Story
As a developer maintaining text processing utilities,
I want the split_text_by_words function to use regex-based whitespace splitting
so that it consistently handles multiple whitespace types and preserves trailing space information when needed.
Background
The current implementation in main.py uses text.split(), which:
- Collapses all whitespace (spaces, tabs, newlines) into single splits
- Automatically trims leading/trailing whitespace, potentially losing meaningful trailing spaces in text chunks
- Fails test cases with multiple consecutive spaces (see test_multiple_spaces in tests/test_eknowledge.py)
The regex-based call re.split(r'\s+', text) will:
- Split on any sequence of whitespace characters
- Operate on the original, unstripped text, so leading and trailing whitespace is not silently discarded
- Return an empty string at the start or end of the result when the text begins or ends with whitespace, which marks where that whitespace was
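The difference between the two approaches can be checked directly. The snippet below is a minimal sketch using only the standard library; the sample string is made up for illustration and is not taken from the test suite.

```python
import re

# Hypothetical sample: leading spaces, a tab, a blank line, trailing spaces.
text = "  alpha \t beta\n\ngamma  "

# str.split() with no arguments splits on whitespace runs and drops
# any empty strings at the start and end.
plain = text.split()

# re.split keeps an empty string at each end when the text starts or
# ends with whitespace.
raw = re.split(r"\s+", text)

# Filtering out the empty strings discards that positional information.
filtered = [w for w in raw if w]

print(plain)     # ['alpha', 'beta', 'gamma']
print(raw)       # ['', 'alpha', 'beta', 'gamma', '']
print(filtered)  # ['alpha', 'beta', 'gamma']
```

For this input the filtered list is identical to the text.split() result, so the leading and trailing markers survive only if the empty strings are kept.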
Acceptance Criteria
In split_text_by_words in eknowledge/main.py:
- Replace words = text.split() with words = [w for w in re.split(r'\s+', text) if w] to split while preserving positional whitespace info
- Add import re at the top of the file if not present
In tests/test_eknowledge.py:
- Update test_multiple_spaces to verify preservation of trailing spaces in the final chunk
- Add a test case with leading and trailing whitespace (for example, " Leading and trailing ")
- Verify that " ".join(chunks) reconstructs the original whitespace pattern
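The round-trip criterion can be tried on the " Leading and trailing " input. This is only a sketch of the splitting step: it does not model split_text_by_words itself, whose chunking logic is not shown here, and it assumes every whitespace run in the input is a single space.

```python
import re

text = " Leading and trailing "

# Unfiltered tokens keep an empty string for the leading and trailing space.
tokens = re.split(r"\s+", text)

# The list comprehension from the acceptance criteria removes those markers.
words = [w for w in tokens if w]

# Joining the unfiltered tokens reproduces the input exactly
# (only because each whitespace run here is one space).
assert " ".join(tokens) == text

# Joining the filtered words yields the stripped text instead.
assert " ".join(words) == text.strip()
```

If the tests are expected to reconstruct leading and trailing spaces through " ".join(chunks), the empty strings probably need to reach the chunking step rather than being filtered out first.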