Fix regex-based whitespace splitting in split_text_by_words function #3

@chigwell

Description

User Story
As a developer maintaining text processing utilities,
I want the split_text_by_words function to use regex-based whitespace splitting
so that it consistently handles multiple whitespace types and preserves trailing space information when needed.

Background
The current implementation in eknowledge/main.py uses text.split(), which:

  1. Treats any run of whitespace (spaces, tabs, newlines) as a single separator
  2. Discards leading and trailing whitespace (no empty strings are returned), potentially losing meaningful trailing spaces in text chunks
  3. Fails test cases with multiple consecutive spaces (see test_multiple_spaces in tests/test_eknowledge.py)
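
The first two points can be checked directly. A minimal sketch follows; the sample string is illustrative and is not taken from the project's test suite.

```python
# Illustrative input: leading spaces, a run of spaces, a tab/newline mix,
# and trailing spaces. Not taken from tests/test_eknowledge.py.
sample = "  alpha   beta\t\ngamma  "

# With no arguments, str.split() treats every whitespace run as one
# separator and returns no empty strings for leading/trailing whitespace.
words = sample.split()
print(words)  # ['alpha', 'beta', 'gamma']
```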

The regex pattern re.split(r'\s+', text) will:

  • Split on any sequence of whitespace characters
  • Leave leading and trailing whitespace detectable, because the text is not stripped before splitting
  • Maintain empty strings in split results to indicate leading/trailing whitespace
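
For comparison, the same kind of input run through the regex pattern. This is a sketch with an illustrative sample string, not project code.

```python
import re

# Illustrative input, not taken from the project's tests.
sample = "  alpha   beta\t\ngamma  "

# re.split keeps an empty string at each end where the text starts or
# ends with whitespace; interior runs still act as a single separator.
parts = re.split(r"\s+", sample)
print(parts)  # ['', 'alpha', 'beta', 'gamma', '']

# The empty strings act as markers for leading/trailing whitespace.
has_leading_ws = parts[0] == ""
has_trailing_ws = parts[-1] == ""
```

Note that the markers only record that whitespace was present; the length and kind of each whitespace run are not kept.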

Acceptance Criteria

  • Modify split_text_by_words in eknowledge/main.py:
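    • Replace words = text.split() with words = [w for w in re.split(r'\s+', text) if w]. Note that the if w filter drops the empty strings, so this form returns the same words as text.split(); omit the filter if the empty strings marking leading/trailing whitespace must be kept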
    • Add import re at the top of the file if not present
  • Update unit tests in tests/test_eknowledge.py:
    • Expand test_multiple_spaces to verify preservation of trailing spaces in final chunk
    • Add test case for text starting/ending with whitespace (e.g., " Leading and trailing ")
  • Validation:
    • All existing tests pass after refactoring
    • New tests confirm handling of 5+ consecutive spaces and tab/newline mixes
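    • Chunk recombination via " ".join(chunks) reproduces the original word sequence with single-space separators. Exact reconstruction of the original whitespace would need the separators to be captured, e.g. re.split(r'(\s+)', text)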
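
The criteria above can be sketched as follows. The signature of split_text_by_words is not given in this issue, so the max_words parameter and the chunking logic here are assumptions made for illustration only; just the words = ... line comes from the proposal.

```python
import re


def split_text_by_words(text, max_words):
    """Hypothetical stand-in for the real function in eknowledge/main.py.

    Groups whitespace-separated words into chunks of at most max_words
    (assumed to be >= 1). The real signature may differ.
    """
    # Line proposed in the issue: regex split, empty strings filtered out.
    words = [w for w in re.split(r"\s+", text) if w]
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


# Leading space, five consecutive spaces, and a trailing tab/newline mix.
text = " Leading and     trailing\t\n"

chunks = split_text_by_words(text, 2)
print(chunks)  # ['Leading and', 'trailing']

# Joining the chunks yields normalized text, not the original whitespace.
rejoined = " ".join(chunks)
print(rejoined == text)  # False

# A capturing group keeps the separators, so the pieces can be
# concatenated back into the exact original string.
tokens = re.split(r"(\s+)", text)
print("".join(tokens) == text)  # True
```

With the filtered form, the chunks match what text.split() would produce, so tests that expect trailing spaces in the final chunk would need the unfiltered or capturing variant.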
