Fix regex-based whitespace splitting in split_text_by_words function #3

**User Story**  
As a developer maintaining text processing utilities,  
I want the `split_text_by_words` function to use regex-based whitespace splitting  
so that it consistently handles multiple whitespace types and preserves trailing space information when needed.

**Background**  
The current implementation in `main.py` uses `text.split()`, which:  
1. Collapses all whitespace (spaces, tabs, newlines) into single splits  
2. Automatically trims leading/trailing whitespace, potentially losing meaningful trailing spaces in text chunks  
3. Fails test cases with multiple consecutive spaces (see `test_multiple_spaces` in `tests/test_eknowledge.py`)  
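
To make the first two points concrete, here is a minimal illustration of how `str.split()` with no arguments behaves. The `sample` string is made up for this example and is not taken from the repository's tests.

```python
# Illustrative input (an assumption, not from the repository's test suite).
sample = "  alpha \t beta\n\ngamma   "

# With no arguments, str.split() treats any run of whitespace as one
# separator and drops leading and trailing whitespace entirely.
words = sample.split()
print(words)  # ['alpha', 'beta', 'gamma']

# The original whitespace layout cannot be recovered from the result.
print(" ".join(words) == sample)  # False
```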

The regex pattern `re.split(r'\s+', text)` will:  
- Split on any run of whitespace characters (spaces, tabs, newlines)  
- Leave the input unstripped, so leading or trailing whitespace shows up as an empty string at the start or end of the result  
- Discard the whitespace characters themselves; keeping them requires a capturing group, as in `re.split(r'(\s+)', text)`  
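
The behaviour of the regex variants can be compared side by side. This is a small sketch using only the standard `re` module; the `sample` string and variable names are illustrative assumptions.

```python
import re

# Illustrative input (an assumption, not from the repository's test suite).
sample = "  alpha \t beta\n\ngamma   "

# Plain split: empty strings at both ends mark leading/trailing whitespace.
plain = re.split(r'\s+', sample)
print(plain)  # ['', 'alpha', 'beta', 'gamma', '']

# Filtering out empty strings removes those markers again,
# giving the same words as sample.split().
filtered = [w for w in plain if w]
print(filtered == sample.split())  # True

# A capturing group keeps the separators in the result,
# which is what makes exact reconstruction possible.
with_seps = re.split(r'(\s+)', sample)
print("".join(with_seps) == sample)  # True
```

If trailing-space information really matters, the capturing form is the only one of the three that retains it.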

**Acceptance Criteria**  
- [ ] Modify `split_text_by_words` in `eknowledge/main.py`:  
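  - Replace `words = text.split()` with `words = [w for w in re.split(r'\s+', text) if w]`. Note that the `if w` filter drops the empty strings that mark leading/trailing whitespace, so this yields effectively the same words as `text.split()`; omit the filter if positional whitespace information must be kept  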
  - Add `import re` at the top of the file if not present  
- [ ] Update unit tests in `tests/test_eknowledge.py`:  
  - Expand `test_multiple_spaces` to verify preservation of trailing spaces in final chunk  
  - Add test case for text starting/ending with whitespace (e.g., `"   Leading and trailing   "`)  
- [ ] Validation:  
  - All existing tests pass after refactoring  
  - New tests confirm handling of 5+ consecutive spaces and tab/newline mixes  
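  - Chunk recombination via `" ".join(chunks)` reconstructs the original words in order; runs of whitespace come back as single spaces, since `\s+` splitting does not retain the original separators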
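
If exact reconstruction of the original whitespace is the goal, one possible approach is to keep each word together with its surrounding whitespace. The sketch below is only an illustration: `split_preserving_whitespace` and its `words_per_chunk` parameter are hypothetical names, since the real signature of `split_text_by_words` is not shown in this issue, and chunks are recombined with `"".join` rather than `" ".join`.

```python
import re


def split_preserving_whitespace(text, words_per_chunk):
    """Hypothetical helper: chunk text by word count, keeping whitespace.

    Not the repository's split_text_by_words; name and signature are
    assumptions made for this sketch.
    """
    if words_per_chunk < 1:
        raise ValueError("words_per_chunk must be at least 1")

    # Each token is one word plus the whitespace around it. Because the
    # trailing \s* is greedy, whitespace between words attaches to the
    # preceding word; only the first token can carry leading whitespace.
    tokens = re.findall(r'\s*\S+\s*', text)

    # Whitespace-only (or empty) input contains no words to match.
    if not tokens:
        return [text] if text else []

    return [
        "".join(tokens[i:i + words_per_chunk])
        for i in range(0, len(tokens), words_per_chunk)
    ]


text = "   Leading and trailing   "
chunks = split_preserving_whitespace(text, 2)
print(chunks)                   # ['   Leading and ', 'trailing   ']
print("".join(chunks) == text)  # True
```

The trade-off is that chunks now carry whitespace, so callers that expect clean words would need to strip them.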
