added data_text_prep.ipynb for preparing the text #2
Conversation
Pull request overview
This PR adds a new Jupyter notebook 02_data_text_prep.ipynb that implements a comprehensive text preprocessing pipeline for preparing text before it is sent to an LLM. The notebook takes a standalone, standard-library-first approach, with optional dependencies for enhanced functionality.
Key changes:
- Implements a complete text preprocessing pipeline with HTML stripping, Unicode normalization, whitespace handling, and PII redaction
- Adds text statistics computation including Flesch Reading Ease scores to determine if text needs LLM simplification
- Provides sentence splitting and chunking utilities for managing large documents
| "def preprocess_text(\n", | ||
| " text: str,\n", | ||
| " *,\n", | ||
| " strip_html_input: bool = True,\n", | ||
| " redact: bool = True,\n", | ||
| " drop_noise: bool = True,\n", | ||
| " dedupe_lines: bool = True,\n", | ||
| " max_chunk_chars: int = 1800,\n", | ||
| " chunk_overlap: int = 1,\n", | ||
| ") -> Dict[str, object]:\n", |
Copilot AI · Jan 5, 2026
The preprocess_text function lacks input validation for the max_chunk_chars and chunk_overlap parameters. If a user passes negative values or if chunk_overlap is greater than or equal to max_chunk_chars, it could lead to unexpected behavior or infinite loops in the chunking logic. Consider adding validation to ensure these parameters are within reasonable ranges.
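A minimal sketch of the kind of guard this comment asks for (the parameter names mirror the notebook's; treating chunk_overlap as a sentence count is an assumption based on the flush logic quoted further down):

```python
def _validate_chunk_params(max_chunk_chars: int, chunk_overlap: int) -> None:
    # Guard against values that would break or stall the chunking loop.
    if max_chunk_chars <= 0:
        raise ValueError("max_chunk_chars must be a positive number of characters")
    if chunk_overlap < 0:
        raise ValueError("chunk_overlap cannot be negative")
```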
notebooks/02_data_text_prep.ipynb (Outdated)
| " if overlap > 0:\n", | ||
| " current = current[-overlap:]\n", | ||
| " current_len = sum(len(s) + 1 for s in current)\n", |
Copilot AI · Jan 5, 2026
The overlap calculation in the flush function may not work as intended. When overlap > 0, it keeps the last N sentences but recalculates current_len by adding +1 for each sentence. Joining N sentences with spaces adds only N − 1 separator characters, so this running length doesn't exactly track the length of the joined chunk. The length calculation should match the joining logic on lines 195 and 214.
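One hedged way to keep the running length consistent with joining (a sketch, assuming chunks are built with `" ".join(current)` as the comment implies):

```python
def _joined_len(sentences: list[str]) -> int:
    # len(" ".join(xs)) == total sentence length plus one separator per
    # adjacent pair, i.e. len(xs) - 1 spaces (not len(xs)).
    return sum(len(s) for s in sentences) + max(0, len(sentences) - 1)
```

Using this helper both when accumulating and when recomputing current_len after trimming to the overlap keeps the two code paths in agreement.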
notebooks/02_data_text_prep.ipynb (Outdated)
| " )\n", | ||
| "\n", | ||
| "\n", | ||
| "def should_send_to_llm(\n", |
Copilot AI · Jan 5, 2026
The should_send_to_llm function name doesn't clearly convey what criteria it uses for the decision. A more descriptive name like should_send_to_llm_based_on_complexity or requires_llm_simplification would better communicate that this is based on readability metrics, not just size or content type.
notebooks/02_data_text_prep.ipynb (Outdated)
| "\n", | ||
| "\n", | ||
| "def normalize_unicode(text: str) -> str:\n", | ||
| " # NFKC makes quotes, full-width chars, and compatibility forms consistent\n", |
Copilot AI · Jan 5, 2026
The comment "NFKC makes quotes, full-width chars, and compatibility forms consistent" is incomplete and could be misleading. NFKC normalization does more than this - it also decomposes and recomposes characters, which can have significant effects on text (e.g., converting fi ligature to fi). Consider expanding this comment to mention potential side effects or providing a reference to Unicode normalization documentation.
| " # NFKC makes quotes, full-width chars, and compatibility forms consistent\n", | |
| " # Unicode NFKC normalization:\n", | |
| " # - Makes quotes, full-width chars, and compatibility forms consistent\n", | |
| " # - Also decomposes/recomposes characters and may change glyphs (e.g. fi -> fi)\n", | |
| " # See Unicode Normalization Forms: https://unicode.org/reports/tr15/\n", |
notebooks/02_data_text_prep.ipynb (Outdated)
| " return [c for c in chunks if c]\n", | ||
| "\n", | ||
| "\n", | ||
| "_RE_EMAIL = re.compile(r\"\\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,}\\b\", re.IGNORECASE)\n", |
Copilot AI · Jan 5, 2026
The regex pattern for email addresses is too restrictive. The domain extension pattern [A-Z]{2,} only matches alphabetic characters, but the ASCII (punycode) form of internationalized TLDs contains digits and hyphens (e.g., xn--p1ai), so this pattern won't properly handle internationalized domain names. Consider using a more comprehensive pattern or a library for email validation.
| "_RE_EMAIL = re.compile(r\"\\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,}\\b\", re.IGNORECASE)\n", | |
| "_RE_EMAIL = re.compile(r\"\\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z0-9-]{2,63}\\b\", re.IGNORECASE)\n", |
| "def compute_stats(text: str) -> TextStats:\n", | ||
| " lang = detect_language(text)\n", | ||
| " words = re.findall(r\"[A-Za-z]+(?:'[A-Za-z]+)?\", text)\n", | ||
| " sentences = split_sentences(text)\n", | ||
| "\n", | ||
| " word_count = len(words)\n", | ||
| " sentence_count = max(1, len(sentences))\n", | ||
| " avg_sentence_len = (word_count / sentence_count) if sentence_count else 0.0\n", | ||
| " avg_word_len = (sum(len(w) for w in words) / max(1, word_count))\n", | ||
| "\n", | ||
| " long_words = [w for w in words if _estimate_syllables(w) >= 3 or len(w) >= 10]\n", | ||
| " long_ratio = len(long_words) / max(1, word_count)\n", | ||
| "\n", | ||
| " # Flesch Reading Ease: 206.835 − 1.015*(words/sentences) − 84.6*(syllables/words)\n", | ||
| " syllables = sum(_estimate_syllables(w) for w in words)\n", | ||
| " syllables_per_word = syllables / max(1, word_count)\n", | ||
| " fre = 206.835 - 1.015 * avg_sentence_len - 84.6 * syllables_per_word\n", | ||
| "\n", | ||
| " return TextStats(\n", | ||
| " language=lang,\n", | ||
| " char_count=len(text),\n", | ||
| " word_count=word_count,\n", | ||
| " sentence_count=len(sentences),\n", | ||
| " avg_sentence_len_words=float(avg_sentence_len),\n", | ||
| " avg_word_len_chars=float(avg_word_len),\n", | ||
| " long_word_ratio=float(long_ratio),\n", | ||
| " flesch_reading_ease=float(fre),\n", | ||
| " )\n", |
Copilot AI · Jan 5, 2026
The _estimate_syllables function is described as "English-ish" and "good enough for gating, not research-grade," but there's no documentation or warning in the main functions that use it (like compute_stats and should_send_to_llm) about this language limitation. Users working with non-English text might get unreliable results from the Flesch Reading Ease score and the LLM gating decision without understanding why.
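A small sketch of the kind of runtime caveat this comment asks for (a hypothetical helper, not in the notebook; it assumes detect_language returns ISO-style codes or "unknown"):

```python
import warnings

def _warn_if_not_english(lang: str) -> None:
    # Surface the English-only limitation instead of failing silently.
    if lang not in ("en", "unknown"):
        warnings.warn(
            "Flesch Reading Ease and syllable counts are English-focused "
            f"heuristics; results for detected language {lang!r} may be unreliable."
        )
```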
| "\n", | ||
| " word_count = len(words)\n", | ||
| " sentence_count = max(1, len(sentences))\n", | ||
| " avg_sentence_len = (word_count / sentence_count) if sentence_count else 0.0\n", | ||
| " avg_word_len = (sum(len(w) for w in words) / max(1, word_count))\n", | ||
| "\n", | ||
| " long_words = [w for w in words if _estimate_syllables(w) >= 3 or len(w) >= 10]\n", | ||
| " long_ratio = len(long_words) / max(1, word_count)\n", | ||
| "\n", | ||
| " # Flesch Reading Ease: 206.835 − 1.015*(words/sentences) − 84.6*(syllables/words)\n", | ||
| " syllables = sum(_estimate_syllables(w) for w in words)\n", | ||
| " syllables_per_word = syllables / max(1, word_count)\n", | ||
| " fre = 206.835 - 1.015 * avg_sentence_len - 84.6 * syllables_per_word\n", | ||
| "\n", | ||
| " return TextStats(\n", | ||
| " language=lang,\n", | ||
| " char_count=len(text),\n", | ||
| " word_count=word_count,\n", |
Copilot AI · Jan 5, 2026
The Flesch Reading Ease formula implementation uses a simplified sentence splitter that has known limitations (as noted on line 163), which could lead to inaccurate FRE scores. Since the FRE score directly influences the should_send_to_llm decision (line 364), incorrect sentence counts could cause documents to be incorrectly routed. Consider either improving the sentence splitter or adding a disclaimer about FRE accuracy.
| "\n", | |
| " word_count = len(words)\n", | |
| " sentence_count = max(1, len(sentences))\n", | |
| " avg_sentence_len = (word_count / sentence_count) if sentence_count else 0.0\n", | |
| " avg_word_len = (sum(len(w) for w in words) / max(1, word_count))\n", | |
| "\n", | |
| " long_words = [w for w in words if _estimate_syllables(w) >= 3 or len(w) >= 10]\n", | |
| " long_ratio = len(long_words) / max(1, word_count)\n", | |
| "\n", | |
| " # Flesch Reading Ease: 206.835 − 1.015*(words/sentences) − 84.6*(syllables/words)\n", | |
| " syllables = sum(_estimate_syllables(w) for w in words)\n", | |
| " syllables_per_word = syllables / max(1, word_count)\n", | |
| " fre = 206.835 - 1.015 * avg_sentence_len - 84.6 * syllables_per_word\n", | |
| "\n", | |
| " return TextStats(\n", | |
| " language=lang,\n", | |
| " char_count=len(text),\n", | |
| " word_count=word_count,\n", | |
| "\n", | |
| " word_count = len(words)\n", | |
| " sentence_count = max(1, len(sentences))\n", | |
| " avg_sentence_len = (word_count / sentence_count) if sentence_count else 0.0\n", | |
| " avg_word_len = (sum(len(w) for w in words) / max(1, word_count))\n", | |
| "\n", | |
| " long_words = [w for w in words if _estimate_syllables(w) >= 3 or len(w) >= 10]\n", | |
| " long_ratio = len(long_words) / max(1, word_count)\n", | |
| "\n", | |
| " # Flesch Reading Ease: 206.835 − 1.015*(words/sentences) − 84.6*(syllables/words)\n", | |
| " # Note: Sentence splitting is heuristic, so this FRE score is an approximation and may mis-estimate readability.\n", | |
| " syllables = sum(_estimate_syllables(w) for w in words)\n", | |
| " syllables_per_word = syllables / max(1, word_count)\n", | |
| " fre = 206.835 - 1.015 * avg_sentence_len - 84.6 * syllables_per_word\n", | |
| "\n", | |
| " return TextStats(\n", | |
| " language=lang,\n", | |
| " char_count=len(text),\n", | |
| " word_count=word_count,\n", |
notebooks/02_data_text_prep.ipynb (Outdated)
| " # Flesch Reading Ease: 206.835 − 1.015*(words/sentences) − 84.6*(syllables/words)\n", | ||
| " syllables = sum(_estimate_syllables(w) for w in words)\n", | ||
| " syllables_per_word = syllables / max(1, word_count)\n", | ||
| " fre = 206.835 - 1.015 * avg_sentence_len - 84.6 * syllables_per_word\n", |
Copilot AI · Jan 5, 2026
The magic numbers 206.835, 1.015, and 84.6 in the Flesch Reading Ease formula are not explained. While these are standard constants for the FRE formula, adding a comment or reference to the formula would improve code maintainability and help future developers understand the calculation.
| " # Flesch Reading Ease: 206.835 − 1.015*(words/sentences) − 84.6*(syllables/words)\n", | |
| " syllables = sum(_estimate_syllables(w) for w in words)\n", | |
| " syllables_per_word = syllables / max(1, word_count)\n", | |
| " fre = 206.835 - 1.015 * avg_sentence_len - 84.6 * syllables_per_word\n", | |
| " # Flesch Reading Ease (FRE) score.\n", | |
| " # Coefficients from the standard FRE formula:\n", | |
| " # FRE = 206.835 − 1.015*(words/sentences) − 84.6*(syllables/words)\n", | |
| " # See e.g. https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests\n", | |
| " FRE_BASE = 206.835 # Base score for English texts in the FRE formula.\n", | |
| " FRE_SENTENCE_WEIGHT = 1.015 # Penalty per average sentence length (words per sentence).\n", | |
| " FRE_SYLLABLE_WEIGHT = 84.6 # Penalty per average syllables per word.\n", | |
| " syllables = sum(_estimate_syllables(w) for w in words)\n", | |
| " syllables_per_word = syllables / max(1, word_count)\n", | |
| " fre = FRE_BASE - FRE_SENTENCE_WEIGHT * avg_sentence_len - FRE_SYLLABLE_WEIGHT * syllables_per_word\n", |
| "\n", | ||
| " word_count = len(words)\n", | ||
| " sentence_count = max(1, len(sentences))\n", | ||
| " avg_sentence_len = (word_count / sentence_count) if sentence_count else 0.0\n", | ||
| " avg_word_len = (sum(len(w) for w in words) / max(1, word_count))\n", | ||
| "\n", |
Copilot AI · Jan 5, 2026
The else branch of (word_count / sentence_count) if sentence_count else 0.0 is unreachable because sentence_count is already set to max(1, len(sentences)) on line 327, which guarantees it is at least 1. The conditional is redundant and should be simplified to just word_count / sentence_count.
| "\n", | |
| " word_count = len(words)\n", | |
| " sentence_count = max(1, len(sentences))\n", | |
| " avg_sentence_len = (word_count / sentence_count) if sentence_count else 0.0\n", | |
| " avg_word_len = (sum(len(w) for w in words) / max(1, word_count))\n", | |
| "\n", | |
| "\n", | |
| " word_count = len(words)\n", | |
| " sentence_count = max(1, len(sentences))\n", | |
| " avg_sentence_len = word_count / sentence_count\n", | |
| " avg_word_len = (sum(len(w) for w in words) / max(1, word_count))\n", | |
| "\n", |
notebooks/02_data_text_prep.ipynb (Outdated)
| " return val\n", | ||
| " entities[\"card\"].append(val)\n", | ||
| " return f\"[CARD_{len(entities['card'])}]\"\n", | ||
| "\n", | ||
| " text = _RE_CREDIT_CARD.sub(card_repl, text)\n", | ||
| " return text, {k: v for k, v in entities.items() if v}\n", |
Copilot AI · Jan 5, 2026
The PII redaction feature stores actual PII values (emails, phone numbers, credit cards, URLs) in the returned dictionary. This creates a security risk as these sensitive values are retained in memory and could be logged, saved, or exposed. Consider either not returning the redacted values at all, or at minimum, add clear documentation warning about the security implications of storing this data.
| " return val\n", | |
| " entities[\"card\"].append(val)\n", | |
| " return f\"[CARD_{len(entities['card'])}]\"\n", | |
| "\n", | |
| " text = _RE_CREDIT_CARD.sub(card_repl, text)\n", | |
| " return text, {k: v for k, v in entities.items() if v}\n", | |
| " return val\n", | |
| " entities[\"card\"].append(val)\n", | |
| " return f\"[CARD_{len(entities['card'])}]\"\n", | |
| "\n", | |
| " text = _RE_CREDIT_CARD.sub(card_repl, text)\n", | |
| " # Do not return raw PII values to avoid retaining sensitive data in memory.\n", | |
| " # Instead, return only non-sensitive placeholders for detected entities.\n", | |
| " safe_entities = {k: [\"[REDACTED]\" for _ in v] for k, v in entities.items() if v}\n", | |
| " return text, safe_entities\n", |
@copilot open a new pull request to apply changes based on the comments in this thread
…ror handling Co-authored-by: HussiJS <[email protected]>
Address code review feedback for text preprocessing pipeline
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 14 comments.
| " raise ValueError(\"chunk_overlap cannot be negative\")\n", | ||
| " if chunk_overlap >= max_chunk_chars:\n", | ||
| " raise ValueError(\"chunk_overlap must be less than max_chunk_chars\")\n", |
Copilot AI · Jan 6, 2026
The validation compares quantities in different units: chunk_overlap is a sentence count, while max_chunk_chars is a character count, so the check chunk_overlap >= max_chunk_chars is comparing apples to oranges and doesn't enforce anything meaningful. Whether the comparison is strict or not is beside the point; the two parameters simply can't be meaningfully compared. Validate chunk_overlap in its own unit (sentences) instead.
| " raise ValueError(\"chunk_overlap cannot be negative\")\n", | |
| " if chunk_overlap >= max_chunk_chars:\n", | |
| " raise ValueError(\"chunk_overlap must be less than max_chunk_chars\")\n", | |
| " raise ValueError(\"chunk_overlap cannot be negative\")\n", |
| " return True\n", | ||
| " if ratio_punct > max_punct_ratio:\n", | ||
| " return True\n", | ||
| " if drop_all_caps_short and ln.isupper() and 8 <= len(ln) <= 25 and \" \" in ln:\n", |
Copilot AI · Jan 6, 2026
Magic numbers used without explanation. The thresholds 8 and 25 for detecting all-caps lines appear arbitrary. These should either be defined as named constants with explanatory comments, or the rationale for these specific values should be documented. This makes the code harder to maintain and adjust.
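A sketch of what naming those thresholds could look like (the values 8 and 25 come from the quoted code; the rationale comments are assumptions about the heuristic's intent):

```python
# All-caps lines in this length band containing at least one space are treated
# as shouty headings/boilerplate; shorter all-caps runs are often acronyms,
# and longer all-caps lines are more likely real content.
ALL_CAPS_MIN_LEN = 8
ALL_CAPS_MAX_LEN = 25

if drop_all_caps_short and ln.isupper() and ALL_CAPS_MIN_LEN <= len(ln) <= ALL_CAPS_MAX_LEN and " " in ln:
    return True
```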
| " FRE_BASE = 206.835 # Base score for English texts in the FRE formula.\n", | ||
| " FRE_SENTENCE_WEIGHT = 1.015 # Penalty per average sentence length (words per sentence).\n", | ||
| " FRE_SYLLABLE_WEIGHT = 84.6 # Penalty per average syllables per word.\n", | ||
| " syllables = sum(_estimate_syllables(w) for w in words)\n", | ||
| " syllables_per_word = syllables / max(1, word_count)\n", | ||
| " fre = FRE_BASE - FRE_SENTENCE_WEIGHT * avg_sentence_len - FRE_SYLLABLE_WEIGHT * syllables_per_word\n", |
Copilot AI · Jan 6, 2026
Inconsistent naming convention. The variable uses all caps (FRE_BASE, FRE_SENTENCE_WEIGHT, FRE_SYLLABLE_WEIGHT) which typically indicates module-level constants, but these are defined as local variables inside a function. Either move these to module level as true constants, or use lowercase naming to indicate they are local variables. The current naming is misleading.
| " FRE_BASE = 206.835 # Base score for English texts in the FRE formula.\n", | |
| " FRE_SENTENCE_WEIGHT = 1.015 # Penalty per average sentence length (words per sentence).\n", | |
| " FRE_SYLLABLE_WEIGHT = 84.6 # Penalty per average syllables per word.\n", | |
| " syllables = sum(_estimate_syllables(w) for w in words)\n", | |
| " syllables_per_word = syllables / max(1, word_count)\n", | |
| " fre = FRE_BASE - FRE_SENTENCE_WEIGHT * avg_sentence_len - FRE_SYLLABLE_WEIGHT * syllables_per_word\n", | |
| " fre_base = 206.835 # Base score for English texts in the FRE formula.\n", | |
| " fre_sentence_weight = 1.015 # Penalty per average sentence length (words per sentence).\n", | |
| " fre_syllable_weight = 84.6 # Penalty per average syllables per word.\n", | |
| " syllables = sum(_estimate_syllables(w) for w in words)\n", | |
| " syllables_per_word = syllables / max(1, word_count)\n", | |
| " fre = fre_base - fre_sentence_weight * avg_sentence_len - fre_syllable_weight * syllables_per_word\n", |
| "def detect_language(text: str) -> str:\n", | ||
| " # Optional: only if langdetect installed\n", | ||
| " if lang_detect is None:\n", | ||
| " return \"unknown\"\n", | ||
| " sample = text\n", | ||
| " if len(sample) > 4000:\n", | ||
| " sample = sample[:4000]\n", |
Copilot AI · Jan 6, 2026
Magic number without explanation. The value 4000 is used to limit the sample size for language detection, but there's no comment explaining why this specific limit was chosen. Consider defining this as a named constant with a descriptive comment about why this size is optimal for language detection accuracy vs performance.
| "def detect_language(text: str) -> str:\n", | |
| " # Optional: only if langdetect installed\n", | |
| " if lang_detect is None:\n", | |
| " return \"unknown\"\n", | |
| " sample = text\n", | |
| " if len(sample) > 4000:\n", | |
| " sample = sample[:4000]\n", | |
| "# Max characters to sample for language detection to balance accuracy and performance.\n", | |
| "LANG_DETECT_SAMPLE_LIMIT = 4000\n", | |
| "\n", | |
| "def detect_language(text: str) -> str:\n", | |
| " # Optional: only if langdetect installed\n", | |
| " if lang_detect is None:\n", | |
| " return \"unknown\"\n", | |
| " sample = text\n", | |
| " if len(sample) > LANG_DETECT_SAMPLE_LIMIT:\n", | |
| " sample = sample[:LANG_DETECT_SAMPLE_LIMIT]\n", |
| " Compute text statistics including Flesch Reading Ease score.\n", | ||
| " WARNING: Syllable estimation and sentence splitting are English-focused heuristics.\n", |
Copilot AI · Jan 6, 2026
Documentation issue: The function computes both statistics and makes a decision about LLM simplification, but the docstring only mentions computing statistics. The docstring should document that this function also triggers language detection as a side effect, and note the return value is a TextStats dataclass. Additionally, the warnings about English-focused heuristics should mention which specific metrics are affected (Flesch Reading Ease, syllable counting, sentence splitting).
| " Compute text statistics including Flesch Reading Ease score.\n", | |
| " WARNING: Syllable estimation and sentence splitting are English-focused heuristics.\n", | |
| " Compute text statistics and run language detection.\n", | |
| "\n", | |
| " This function computes basic text metrics (character, word, and sentence counts,\n", | |
| " average sentence length, average word length, long-word ratio) and the Flesch\n", | |
| " Reading Ease score, and returns them in a TextStats dataclass.\n", | |
| "\n", | |
| " As a side effect, this function calls `detect_language` on the input text and\n", | |
| " stores the detected language in the resulting TextStats instance.\n", | |
| "\n", | |
| " WARNING: Several metrics rely on English-focused heuristics, specifically\n", | |
| " syllable counting, sentence splitting, and the Flesch Reading Ease score.\n", |
| "\n", | ||
| "_RE_EMAIL = re.compile(r\"\\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z0-9-]{2,63}\\b\", re.IGNORECASE)\n", | ||
| "_RE_PHONE = re.compile(r\"\\b(?:\\+?\\d{1,3}[-. ]?)?(?:\\(?\\d{2,4}\\)?[-. ]?)?\\d{3,4}[-. ]?\\d{3,4}\\b\")\n", | ||
| "_RE_URL = re.compile(r\"\\bhttps?://[^\\s]+?(?=[\\s\\)\\]\\}>\\\"\\'.,!?]|$)\", re.IGNORECASE)\n", |
Copilot AI · Jan 6, 2026
The URL regex pattern has a potential issue with the lookahead. The pattern ends with a complex lookahead that may not match URLs correctly in all contexts. Specifically, the escaped backslash before the quote characters appears incorrect: in a raw Python string this would be fine, but in a notebook string it should be just a single backslash, or the quote should not be escaped. This could cause the regex to fail to match URLs that appear before quotes.
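One hedged alternative that sidesteps both the escaping question and the lookahead entirely (a sketch, not the notebook's pattern): match greedily, then strip trailing punctuation.

```python
import re

# Match the URL greedily up to whitespace, then trim punctuation that
# commonly trails a URL in prose (closing brackets, quotes, sentence marks).
_RE_URL_ALT = re.compile(r"\bhttps?://\S+", re.IGNORECASE)

def find_urls(text: str) -> list[str]:
    return [m.group(0).rstrip(").]}>\"'.,!?") for m in _RE_URL_ALT.finditer(text)]

print(find_urls('see "https://example.com/a." here'))  # ['https://example.com/a']
```

Note that a lazy `[^\s]+?` combined with a lookahead class containing `.` also stops at the first dot inside a URL (e.g. matching only `https://example` out of `https://example.com`), so the greedy-then-strip approach is safer here.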
| " # Phone regex can overmatch; keep conservative by only replacing matches with enough digits\n", | ||
| " def phone_repl(m: re.Match) -> str:\n", | ||
| " val = m.group(0)\n", | ||
| " digits = re.sub(r\"\\D\", \"\", val)\n", | ||
| " if len(digits) < 9:\n", |
Copilot AI · Jan 6, 2026
Missing input validation for the phone number replacement threshold. The hardcoded value of 9 digits on line 288 is used to filter phone matches, but this arbitrary threshold is not validated against the regex pattern on line 249 which matches sequences with as few as 6 digits (e.g., 3+3). This inconsistency could lead to unexpected behavior. Consider adding a comment explaining the relationship between the regex pattern and the digit threshold, or making the threshold configurable.
| " # Phone regex can overmatch; keep conservative by only replacing matches with enough digits\n", | |
| " def phone_repl(m: re.Match) -> str:\n", | |
| " val = m.group(0)\n", | |
| " digits = re.sub(r\"\\D\", \"\", val)\n", | |
| " if len(digits) < 9:\n", | |
| " # Phone regex can overmatch; keep conservative by only replacing matches with enough digits.\n", | |
| " # NOTE: _RE_PHONE currently matches sequences with as few as 6 digits; this threshold must be\n", | |
| " # kept in sync with that minimum and is defined as a single constant for easier validation\n", | |
| " # and configuration alongside the regex.\n", | |
| " PHONE_MIN_DIGITS = 9 # configurable threshold for phone replacement\n", | |
| " def phone_repl(m: re.Match) -> str:\n", | |
| " val = m.group(0)\n", | |
| " digits = re.sub(r\"\\D\", \"\", val)\n", | |
| " if len(digits) < PHONE_MIN_DIGITS:\n", |
| " # Coefficients from the standard FRE formula:\n", | ||
| " # FRE = 206.835 − 1.015*(words/sentences) − 84.6*(syllables/words)\n", | ||
| " # See e.g. https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests\n", | ||
| " # Note: Sentence splitting is heuristic, so this FRE score is an approximation and may mis-estimate readability.\n", |
Copilot AI · Jan 6, 2026
Spelling error: 'mis-estimate' should be 'misestimate' (without hyphen). The word 'misestimate' is a standard English word that doesn't require hyphenation.
| " # Note: Sentence splitting is heuristic, so this FRE score is an approximation and may mis-estimate readability.\n", | |
| " # Note: Sentence splitting is heuristic, so this FRE score is an approximation and may misestimate readability.\n", |
| " dedupe_lines: bool = True,\n", | ||
| " max_chunk_chars: int = 1800,\n", | ||
| " chunk_overlap: int = 1,\n", | ||
| ") -> Dict[str, object]:\n", |
Copilot AI · Jan 6, 2026
Missing docstring. This is a public-facing function that serves as the main entry point for the preprocessing pipeline, but it lacks a docstring. A comprehensive docstring should document the purpose, all parameters with their types and defaults, the return value structure, and any exceptions that may be raised. This is especially important given the complexity of the returned dictionary.
| ") -> Dict[str, object]:\n", | |
| ") -> Dict[str, object]:\n", | |
| " \"\"\"Run the full text preprocessing pipeline on a single document.\n", | |
| "\n", | |
| " This is the main entry point for the preprocessing pipeline. It can strip\n", | |
| " HTML from the input, redact sensitive patterns, drop noisy lines,\n", | |
| " de-duplicate lines, and then chunk the cleaned text into overlapping\n", | |
| " segments suitable for downstream tasks such as embedding or indexing.\n", | |
| "\n", | |
| " Parameters\n", | |
| " ----------\n", | |
| " text : str\n", | |
| " Raw input text to preprocess. If ``None`` or an empty string is\n", | |
| " provided, it is treated as an empty document.\n", | |
| " strip_html_input : bool, default True\n", | |
| " Whether to remove HTML tags from the input text before any other\n", | |
| " processing.\n", | |
| " redact : bool, default True\n", | |
| " Whether to apply redaction of sensitive or unwanted patterns (for\n", | |
| " example, certain identifiers) before chunking.\n", | |
| " drop_noise : bool, default True\n", | |
| " Whether to filter out lines that are likely to be boilerplate or\n", | |
| " otherwise unhelpful noise.\n", | |
| " dedupe_lines : bool, default True\n", | |
| " Whether to remove duplicate lines after cleaning and noise removal.\n", | |
| " max_chunk_chars : int, default 1800\n", | |
| " Target maximum character length for each generated text chunk.\n", | |
| " chunk_overlap : int, default 1\n", | |
| " Number of overlapping characters between consecutive chunks. Must be\n", | |
| " non-negative and strictly less than ``max_chunk_chars``.\n", | |
| "\n", | |
| " Returns\n", | |
| " -------\n", | |
| " Dict[str, object]\n", | |
| " A dictionary containing the original text, intermediate preprocessing\n", | |
| " artifacts, and the final list of text chunks and related metadata.\n", | |
| " The exact structure is intended to be consumed by downstream stages\n", | |
| " of the pipeline and may include additional fields beyond those\n", | |
| " described here.\n", | |
| "\n", | |
| " Raises\n", | |
| " ------\n", | |
| " ValueError\n", | |
| " If ``max_chunk_chars`` is not positive, if ``chunk_overlap`` is\n", | |
| " negative, or if ``chunk_overlap`` is greater than or equal to\n", | |
| " ``max_chunk_chars``.\n", | |
| " \"\"\"\n", |
| "except Exception:\n", | ||
| " BeautifulSoup = None\n", | ||
| "\n", | ||
| "try:\n", | ||
| " from langdetect import detect as lang_detect # type: ignore\n", | ||
| "except Exception:\n", |
Copilot AI · Jan 6, 2026
Best practice violation: Catching bare Exception is too broad. The code catches generic Exception on lines 33, 38, 115, 170, 304, and 323, which can hide unexpected errors and make debugging difficult. Consider catching specific exception types (e.g., ImportError for optional dependencies, ValueError for parsing errors) to handle known failure cases while allowing unexpected errors to propagate.
| "except Exception:\n", | |
| " BeautifulSoup = None\n", | |
| "\n", | |
| "try:\n", | |
| " from langdetect import detect as lang_detect # type: ignore\n", | |
| "except Exception:\n", | |
| "except ImportError:\n", | |
| " BeautifulSoup = None\n", | |
| "\n", | |
| "try:\n", | |
| " from langdetect import detect as lang_detect # type: ignore\n", | |
| "except ImportError:\n", |
@copilot open a new pull request to apply changes based on the comments in this thread
Request clarification on review thread feedback
No description provided.