Skip to content

Conversation

Copy link
Contributor

Copilot AI commented Jan 6, 2026

Applies fixes from code review comments on the text preprocessing notebook, addressing validation gaps, regex precision, security issues, and documentation clarity.

Input Validation

  • Added parameter validation to preprocess_text() to prevent negative/invalid max_chunk_chars and chunk_overlap values
  • Prevents potential infinite loops from chunk_overlap >= max_chunk_chars

Chunking Logic

  • Fixed overlap calculation to match actual join behavior (spaces between sentences)
  • Changed from sum(len(s) + 1 for s in current) to sum(len(s) for s in current) + max(0, len(current) - 1)

Regex Patterns

  • Email: Now accepts numeric TLDs with 2-63 char limit: [A-Z0-9-]{2,63}
  • URL: Handles trailing punctuation via lookahead: (?=[\s\)\]\}>"'.,!?]|$)
  • Credit card: Tightened from (?:\d[ -]*?){13,19} to (?:\d{13,19}|\d{4}(?:[ -]\d{4}){3})

Security

  • PII redaction returns [REDACTED] placeholders instead of actual sensitive values
  • Renamed pii variable to redacted_entities for clarity
  • Added error handling to strip_html() and redact_pii() to prevent malformed input from breaking the pipeline

Sentence Splitting

  • Added NLTK Punkt tokenizer support with regex fallback:
if _nltk_sent_tokenize is not None:
    sentences.extend(_nltk_sent_tokenize(para))
else:
    # Regex fallback with known limitations

Documentation

  • Expanded Unicode normalization comment to mention glyph changes (fi → fi)
  • Added language limitation warnings to syllable estimation and FRE scoring
  • Documented FRE formula constants (206.835, 1.015, 84.6) with Wikipedia reference
  • Renamed should_send_to_llm() to requires_llm_simplification() for clarity

Code Quality

  • Removed unreachable conditional: (word_count / sentence_count) if sentence_count else 0.0word_count / sentence_count
  • Made drop_all_caps_short more selective: requires spaces and 8+ chars to avoid filtering acronyms
  • Consistent double-quote style in return dictionaries

💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.

Copilot AI changed the title [WIP] Add data_text_prep.ipynb for text preprocessing Address code review feedback for text preprocessing pipeline Jan 6, 2026
Copilot AI requested a review from HussiJS January 6, 2026 01:12
@HussiJS HussiJS marked this pull request as ready for review January 6, 2026 01:15
@HussiJS HussiJS merged commit 1566257 into notebook/data_text_prep Jan 6, 2026
@esmahoney esmahoney deleted the copilot/sub-pr-2 branch January 7, 2026 14:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants