added data_text_prep.ipynb for preparing the text #2
Conversation
Pull request overview
This PR adds a new Jupyter notebook 02_data_text_prep.ipynb that implements a comprehensive text preprocessing pipeline for preparing text before it is sent to an LLM. The notebook takes a standalone, standard-library-first approach, with optional dependencies for enhanced functionality.
Key changes:
- Implements a complete text preprocessing pipeline with HTML stripping, Unicode normalization, whitespace handling, and PII redaction
- Adds text statistics computation including Flesch Reading Ease scores to determine if text needs LLM simplification
- Provides sentence splitting and chunking utilities for managing large documents
| "def preprocess_text(\n", | ||
| " text: str,\n", | ||
| " *,\n", | ||
| " strip_html_input: bool = True,\n", | ||
| " redact: bool = True,\n", | ||
| " drop_noise: bool = True,\n", | ||
| " dedupe_lines: bool = True,\n", | ||
| " max_chunk_chars: int = 1800,\n", | ||
| " chunk_overlap: int = 1,\n", | ||
| ") -> Dict[str, object]:\n", |
Copilot AI · Jan 5, 2026
The preprocess_text function lacks input validation for the max_chunk_chars and chunk_overlap parameters. If a user passes negative values or if chunk_overlap is greater than or equal to max_chunk_chars, it could lead to unexpected behavior or infinite loops in the chunking logic. Consider adding validation to ensure these parameters are within reasonable ranges.
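A minimal sketch of the kind of guard this comment asks for (the parameter names mirror the notebook's; treating chunk_overlap as a sentence count is an assumption based on the flush logic quoted further down):

```python
def _validate_chunk_params(max_chunk_chars: int, chunk_overlap: int) -> None:
    # Guard against values that would break or stall the chunking loop.
    if max_chunk_chars <= 0:
        raise ValueError("max_chunk_chars must be a positive number of characters")
    if chunk_overlap < 0:
        raise ValueError("chunk_overlap cannot be negative")
```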
notebooks/02_data_text_prep.ipynb (Outdated)
| " if overlap > 0:\n", | ||
| " current = current[-overlap:]\n", | ||
| " current_len = sum(len(s) + 1 for s in current)\n", |
Copilot AI · Jan 5, 2026
The overlap calculation in the flush function may not work as intended. When overlap > 0, it keeps the last N sentences but recalculates current_len by adding +1 for each sentence. Joining N sentences with spaces adds only N − 1 separator characters, so this running length doesn't exactly track the length of the joined chunk. The length calculation should match the joining logic on lines 195 and 214.
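One hedged way to keep the running length consistent with joining (a sketch, assuming chunks are built with `" ".join(current)` as the comment implies):

```python
def _joined_len(sentences: list[str]) -> int:
    # len(" ".join(xs)) == total sentence length plus one separator per
    # adjacent pair, i.e. len(xs) - 1 spaces (not len(xs)).
    return sum(len(s) for s in sentences) + max(0, len(sentences) - 1)
```

Using this helper both when accumulating and when recomputing current_len after trimming to the overlap keeps the two code paths in agreement.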
notebooks/02_data_text_prep.ipynb (Outdated)
| " )\n", | ||
| "\n", | ||
| "\n", | ||
| "def should_send_to_llm(\n", |
Copilot AI · Jan 5, 2026
The should_send_to_llm function name doesn't clearly convey what criteria it uses for the decision. A more descriptive name like should_send_to_llm_based_on_complexity or requires_llm_simplification would better communicate that this is based on readability metrics, not just size or content type.
notebooks/02_data_text_prep.ipynb (Outdated)
| "\n", | ||
| "\n", | ||
| "def normalize_unicode(text: str) -> str:\n", | ||
| " # NFKC makes quotes, full-width chars, and compatibility forms consistent\n", |
Copilot AI · Jan 5, 2026
The comment "NFKC makes quotes, full-width chars, and compatibility forms consistent" is incomplete and could be misleading. NFKC normalization does more than this - it also decomposes and recomposes characters, which can have significant effects on text (e.g., converting fi ligature to fi). Consider expanding this comment to mention potential side effects or providing a reference to Unicode normalization documentation.
| " # NFKC makes quotes, full-width chars, and compatibility forms consistent\n", | |
| " # Unicode NFKC normalization:\n", | |
| " # - Makes quotes, full-width chars, and compatibility forms consistent\n", | |
| " # - Also decomposes/recomposes characters and may change glyphs (e.g. fi -> fi)\n", | |
| " # See Unicode Normalization Forms: https://unicode.org/reports/tr15/\n", |
notebooks/02_data_text_prep.ipynb (Outdated)
| " return [c for c in chunks if c]\n", | ||
| "\n", | ||
| "\n", | ||
| "_RE_EMAIL = re.compile(r\"\\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,}\\b\", re.IGNORECASE)\n", |
Copilot AI · Jan 5, 2026
The regex pattern for email addresses is too restrictive. The domain extension pattern [A-Z]{2,} only matches alphabetic characters, but the ASCII (punycode) form of internationalized TLDs contains digits and hyphens (e.g., xn--p1ai), so this pattern won't properly handle internationalized domain names. Consider using a more comprehensive pattern or a library for email validation.
| "_RE_EMAIL = re.compile(r\"\\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,}\\b\", re.IGNORECASE)\n", | |
| "_RE_EMAIL = re.compile(r\"\\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z0-9-]{2,63}\\b\", re.IGNORECASE)\n", |
| "def compute_stats(text: str) -> TextStats:\n", | ||
| " lang = detect_language(text)\n", | ||
| " words = re.findall(r\"[A-Za-z]+(?:'[A-Za-z]+)?\", text)\n", | ||
| " sentences = split_sentences(text)\n", | ||
| "\n", | ||
| " word_count = len(words)\n", | ||
| " sentence_count = max(1, len(sentences))\n", | ||
| " avg_sentence_len = (word_count / sentence_count) if sentence_count else 0.0\n", | ||
| " avg_word_len = (sum(len(w) for w in words) / max(1, word_count))\n", | ||
| "\n", | ||
| " long_words = [w for w in words if _estimate_syllables(w) >= 3 or len(w) >= 10]\n", | ||
| " long_ratio = len(long_words) / max(1, word_count)\n", | ||
| "\n", | ||
| " # Flesch Reading Ease: 206.835 − 1.015*(words/sentences) − 84.6*(syllables/words)\n", | ||
| " syllables = sum(_estimate_syllables(w) for w in words)\n", | ||
| " syllables_per_word = syllables / max(1, word_count)\n", | ||
| " fre = 206.835 - 1.015 * avg_sentence_len - 84.6 * syllables_per_word\n", | ||
| "\n", | ||
| " return TextStats(\n", | ||
| " language=lang,\n", | ||
| " char_count=len(text),\n", | ||
| " word_count=word_count,\n", | ||
| " sentence_count=len(sentences),\n", | ||
| " avg_sentence_len_words=float(avg_sentence_len),\n", | ||
| " avg_word_len_chars=float(avg_word_len),\n", | ||
| " long_word_ratio=float(long_ratio),\n", | ||
| " flesch_reading_ease=float(fre),\n", | ||
| " )\n", |
Copilot AI · Jan 5, 2026
The _estimate_syllables function is described as "English-ish" and "good enough for gating, not research-grade," but there's no documentation or warning in the main functions that use it (like compute_stats and should_send_to_llm) about this language limitation. Users working with non-English text might get unreliable results from the Flesch Reading Ease score and the LLM gating decision without understanding why.
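A small sketch of the kind of runtime caveat this comment asks for (a hypothetical helper, not in the notebook; it assumes detect_language returns ISO-style codes or "unknown"):

```python
import warnings

def _warn_if_not_english(lang: str) -> None:
    # Surface the English-only limitation instead of failing silently.
    if lang not in ("en", "unknown"):
        warnings.warn(
            "Flesch Reading Ease and syllable counts are English-focused "
            f"heuristics; results for detected language {lang!r} may be unreliable."
        )
```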
| "\n", | ||
| " word_count = len(words)\n", | ||
| " sentence_count = max(1, len(sentences))\n", | ||
| " avg_sentence_len = (word_count / sentence_count) if sentence_count else 0.0\n", | ||
| " avg_word_len = (sum(len(w) for w in words) / max(1, word_count))\n", | ||
| "\n", | ||
| " long_words = [w for w in words if _estimate_syllables(w) >= 3 or len(w) >= 10]\n", | ||
| " long_ratio = len(long_words) / max(1, word_count)\n", | ||
| "\n", | ||
| " # Flesch Reading Ease: 206.835 − 1.015*(words/sentences) − 84.6*(syllables/words)\n", | ||
| " syllables = sum(_estimate_syllables(w) for w in words)\n", | ||
| " syllables_per_word = syllables / max(1, word_count)\n", | ||
| " fre = 206.835 - 1.015 * avg_sentence_len - 84.6 * syllables_per_word\n", | ||
| "\n", | ||
| " return TextStats(\n", | ||
| " language=lang,\n", | ||
| " char_count=len(text),\n", | ||
| " word_count=word_count,\n", |
Copilot AI · Jan 5, 2026
The Flesch Reading Ease formula implementation uses a simplified sentence splitter that has known limitations (as noted on line 163), which could lead to inaccurate FRE scores. Since the FRE score directly influences the should_send_to_llm decision (line 364), incorrect sentence counts could cause documents to be incorrectly routed. Consider either improving the sentence splitter or adding a disclaimer about FRE accuracy.
| "\n", | |
| " word_count = len(words)\n", | |
| " sentence_count = max(1, len(sentences))\n", | |
| " avg_sentence_len = (word_count / sentence_count) if sentence_count else 0.0\n", | |
| " avg_word_len = (sum(len(w) for w in words) / max(1, word_count))\n", | |
| "\n", | |
| " long_words = [w for w in words if _estimate_syllables(w) >= 3 or len(w) >= 10]\n", | |
| " long_ratio = len(long_words) / max(1, word_count)\n", | |
| "\n", | |
| " # Flesch Reading Ease: 206.835 − 1.015*(words/sentences) − 84.6*(syllables/words)\n", | |
| " syllables = sum(_estimate_syllables(w) for w in words)\n", | |
| " syllables_per_word = syllables / max(1, word_count)\n", | |
| " fre = 206.835 - 1.015 * avg_sentence_len - 84.6 * syllables_per_word\n", | |
| "\n", | |
| " return TextStats(\n", | |
| " language=lang,\n", | |
| " char_count=len(text),\n", | |
| " word_count=word_count,\n", | |
| "\n", | |
| " word_count = len(words)\n", | |
| " sentence_count = max(1, len(sentences))\n", | |
| " avg_sentence_len = (word_count / sentence_count) if sentence_count else 0.0\n", | |
| " avg_word_len = (sum(len(w) for w in words) / max(1, word_count))\n", | |
| "\n", | |
| " long_words = [w for w in words if _estimate_syllables(w) >= 3 or len(w) >= 10]\n", | |
| " long_ratio = len(long_words) / max(1, word_count)\n", | |
| "\n", | |
| " # Flesch Reading Ease: 206.835 − 1.015*(words/sentences) − 84.6*(syllables/words)\n", | |
| " # Note: Sentence splitting is heuristic, so this FRE score is an approximation and may mis-estimate readability.\n", | |
| " syllables = sum(_estimate_syllables(w) for w in words)\n", | |
| " syllables_per_word = syllables / max(1, word_count)\n", | |
| " fre = 206.835 - 1.015 * avg_sentence_len - 84.6 * syllables_per_word\n", | |
| "\n", | |
| " return TextStats(\n", | |
| " language=lang,\n", | |
| " char_count=len(text),\n", | |
| " word_count=word_count,\n", |
notebooks/02_data_text_prep.ipynb (Outdated)
| " # Flesch Reading Ease: 206.835 − 1.015*(words/sentences) − 84.6*(syllables/words)\n", | ||
| " syllables = sum(_estimate_syllables(w) for w in words)\n", | ||
| " syllables_per_word = syllables / max(1, word_count)\n", | ||
| " fre = 206.835 - 1.015 * avg_sentence_len - 84.6 * syllables_per_word\n", |
Copilot AI · Jan 5, 2026
The magic numbers 206.835, 1.015, and 84.6 in the Flesch Reading Ease formula are not explained. While these are standard constants for the FRE formula, adding a comment or reference to the formula would improve code maintainability and help future developers understand the calculation.
| " # Flesch Reading Ease: 206.835 − 1.015*(words/sentences) − 84.6*(syllables/words)\n", | |
| " syllables = sum(_estimate_syllables(w) for w in words)\n", | |
| " syllables_per_word = syllables / max(1, word_count)\n", | |
| " fre = 206.835 - 1.015 * avg_sentence_len - 84.6 * syllables_per_word\n", | |
| " # Flesch Reading Ease (FRE) score.\n", | |
| " # Coefficients from the standard FRE formula:\n", | |
| " # FRE = 206.835 − 1.015*(words/sentences) − 84.6*(syllables/words)\n", | |
| " # See e.g. https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests\n", | |
| " FRE_BASE = 206.835 # Base score for English texts in the FRE formula.\n", | |
| " FRE_SENTENCE_WEIGHT = 1.015 # Penalty per average sentence length (words per sentence).\n", | |
| " FRE_SYLLABLE_WEIGHT = 84.6 # Penalty per average syllables per word.\n", | |
| " syllables = sum(_estimate_syllables(w) for w in words)\n", | |
| " syllables_per_word = syllables / max(1, word_count)\n", | |
| " fre = FRE_BASE - FRE_SENTENCE_WEIGHT * avg_sentence_len - FRE_SYLLABLE_WEIGHT * syllables_per_word\n", |
| "\n", | ||
| " word_count = len(words)\n", | ||
| " sentence_count = max(1, len(sentences))\n", | ||
| " avg_sentence_len = (word_count / sentence_count) if sentence_count else 0.0\n", | ||
| " avg_word_len = (sum(len(w) for w in words) / max(1, word_count))\n", | ||
| "\n", |
Copilot AI · Jan 5, 2026
The else branch of (word_count / sentence_count) if sentence_count else 0.0 is unreachable because sentence_count is already set to max(1, len(sentences)) on line 327, which guarantees it is at least 1. The conditional is redundant and should be simplified to just word_count / sentence_count.
| "\n", | |
| " word_count = len(words)\n", | |
| " sentence_count = max(1, len(sentences))\n", | |
| " avg_sentence_len = (word_count / sentence_count) if sentence_count else 0.0\n", | |
| " avg_word_len = (sum(len(w) for w in words) / max(1, word_count))\n", | |
| "\n", | |
| "\n", | |
| " word_count = len(words)\n", | |
| " sentence_count = max(1, len(sentences))\n", | |
| " avg_sentence_len = word_count / sentence_count\n", | |
| " avg_word_len = (sum(len(w) for w in words) / max(1, word_count))\n", | |
| "\n", |
notebooks/02_data_text_prep.ipynb (Outdated)
| " return val\n", | ||
| " entities[\"card\"].append(val)\n", | ||
| " return f\"[CARD_{len(entities['card'])}]\"\n", | ||
| "\n", | ||
| " text = _RE_CREDIT_CARD.sub(card_repl, text)\n", | ||
| " return text, {k: v for k, v in entities.items() if v}\n", |
Copilot AI · Jan 5, 2026
The PII redaction feature stores actual PII values (emails, phone numbers, credit cards, URLs) in the returned dictionary. This creates a security risk as these sensitive values are retained in memory and could be logged, saved, or exposed. Consider either not returning the redacted values at all, or at minimum, add clear documentation warning about the security implications of storing this data.
| " return val\n", | |
| " entities[\"card\"].append(val)\n", | |
| " return f\"[CARD_{len(entities['card'])}]\"\n", | |
| "\n", | |
| " text = _RE_CREDIT_CARD.sub(card_repl, text)\n", | |
| " return text, {k: v for k, v in entities.items() if v}\n", | |
| " return val\n", | |
| " entities[\"card\"].append(val)\n", | |
| " return f\"[CARD_{len(entities['card'])}]\"\n", | |
| "\n", | |
| " text = _RE_CREDIT_CARD.sub(card_repl, text)\n", | |
| " # Do not return raw PII values to avoid retaining sensitive data in memory.\n", | |
| " # Instead, return only non-sensitive placeholders for detected entities.\n", | |
| " safe_entities = {k: [\"[REDACTED]\" for _ in v] for k, v in entities.items() if v}\n", | |
| " return text, safe_entities\n", |
@copilot open a new pull request to apply changes based on the comments in this thread
…ror handling Co-authored-by: HussiJS <[email protected]>
Address code review feedback for text preprocessing pipeline
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 14 comments.
| " raise ValueError(\"chunk_overlap cannot be negative\")\n", | ||
| " if chunk_overlap >= max_chunk_chars:\n", | ||
| " raise ValueError(\"chunk_overlap must be less than max_chunk_chars\")\n", |
Copilot AI · Jan 6, 2026
The validation compares quantities in different units: chunk_overlap is a sentence count, while max_chunk_chars is a character count, so the check chunk_overlap >= max_chunk_chars is comparing apples to oranges and doesn't enforce anything meaningful. Whether the comparison is strict or not is beside the point; the two parameters simply can't be meaningfully compared. Validate chunk_overlap in its own unit (sentences) instead.
| " raise ValueError(\"chunk_overlap cannot be negative\")\n", | |
| " if chunk_overlap >= max_chunk_chars:\n", | |
| " raise ValueError(\"chunk_overlap must be less than max_chunk_chars\")\n", | |
| " raise ValueError(\"chunk_overlap cannot be negative\")\n", |
| " return True\n", | ||
| " if ratio_punct > max_punct_ratio:\n", | ||
| " return True\n", | ||
| " if drop_all_caps_short and ln.isupper() and 8 <= len(ln) <= 25 and \" \" in ln:\n", |
Copilot AI · Jan 6, 2026
Magic numbers used without explanation. The thresholds 8 and 25 for detecting all-caps lines appear arbitrary. These should either be defined as named constants with explanatory comments, or the rationale for these specific values should be documented. This makes the code harder to maintain and adjust.
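A sketch of what naming those thresholds could look like (the values 8 and 25 come from the quoted code; the rationale comments are assumptions about the heuristic's intent):

```python
# All-caps lines in this length band containing at least one space are treated
# as shouty headings/boilerplate; shorter all-caps runs are often acronyms,
# and longer all-caps lines are more likely real content.
ALL_CAPS_MIN_LEN = 8
ALL_CAPS_MAX_LEN = 25

if drop_all_caps_short and ln.isupper() and ALL_CAPS_MIN_LEN <= len(ln) <= ALL_CAPS_MAX_LEN and " " in ln:
    return True
```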
| " FRE_BASE = 206.835 # Base score for English texts in the FRE formula.\n", | ||
| " FRE_SENTENCE_WEIGHT = 1.015 # Penalty per average sentence length (words per sentence).\n", | ||
| " FRE_SYLLABLE_WEIGHT = 84.6 # Penalty per average syllables per word.\n", | ||
| " syllables = sum(_estimate_syllables(w) for w in words)\n", | ||
| " syllables_per_word = syllables / max(1, word_count)\n", | ||
| " fre = FRE_BASE - FRE_SENTENCE_WEIGHT * avg_sentence_len - FRE_SYLLABLE_WEIGHT * syllables_per_word\n", |
Copilot AI · Jan 6, 2026
Inconsistent naming convention. The variable uses all caps (FRE_BASE, FRE_SENTENCE_WEIGHT, FRE_SYLLABLE_WEIGHT) which typically indicates module-level constants, but these are defined as local variables inside a function. Either move these to module level as true constants, or use lowercase naming to indicate they are local variables. The current naming is misleading.
| " FRE_BASE = 206.835 # Base score for English texts in the FRE formula.\n", | |
| " FRE_SENTENCE_WEIGHT = 1.015 # Penalty per average sentence length (words per sentence).\n", | |
| " FRE_SYLLABLE_WEIGHT = 84.6 # Penalty per average syllables per word.\n", | |
| " syllables = sum(_estimate_syllables(w) for w in words)\n", | |
| " syllables_per_word = syllables / max(1, word_count)\n", | |
| " fre = FRE_BASE - FRE_SENTENCE_WEIGHT * avg_sentence_len - FRE_SYLLABLE_WEIGHT * syllables_per_word\n", | |
| " fre_base = 206.835 # Base score for English texts in the FRE formula.\n", | |
| " fre_sentence_weight = 1.015 # Penalty per average sentence length (words per sentence).\n", | |
| " fre_syllable_weight = 84.6 # Penalty per average syllables per word.\n", | |
| " syllables = sum(_estimate_syllables(w) for w in words)\n", | |
| " syllables_per_word = syllables / max(1, word_count)\n", | |
| " fre = fre_base - fre_sentence_weight * avg_sentence_len - fre_syllable_weight * syllables_per_word\n", |
| "def detect_language(text: str) -> str:\n", | ||
| " # Optional: only if langdetect installed\n", | ||
| " if lang_detect is None:\n", | ||
| " return \"unknown\"\n", | ||
| " sample = text\n", | ||
| " if len(sample) > 4000:\n", | ||
| " sample = sample[:4000]\n", |
Copilot AI · Jan 6, 2026
Magic number without explanation. The value 4000 is used to limit the sample size for language detection, but there's no comment explaining why this specific limit was chosen. Consider defining this as a named constant with a descriptive comment about why this size is optimal for language detection accuracy vs performance.
| "def detect_language(text: str) -> str:\n", | |
| " # Optional: only if langdetect installed\n", | |
| " if lang_detect is None:\n", | |
| " return \"unknown\"\n", | |
| " sample = text\n", | |
| " if len(sample) > 4000:\n", | |
| " sample = sample[:4000]\n", | |
| "# Max characters to sample for language detection to balance accuracy and performance.\n", | |
| "LANG_DETECT_SAMPLE_LIMIT = 4000\n", | |
| "\n", | |
| "def detect_language(text: str) -> str:\n", | |
| " # Optional: only if langdetect installed\n", | |
| " if lang_detect is None:\n", | |
| " return \"unknown\"\n", | |
| " sample = text\n", | |
| " if len(sample) > LANG_DETECT_SAMPLE_LIMIT:\n", | |
| " sample = sample[:LANG_DETECT_SAMPLE_LIMIT]\n", |
| " Compute text statistics including Flesch Reading Ease score.\n", | ||
| " WARNING: Syllable estimation and sentence splitting are English-focused heuristics.\n", |
Copilot AI · Jan 6, 2026
Documentation issue: The function computes both statistics and makes a decision about LLM simplification, but the docstring only mentions computing statistics. The docstring should document that this function also triggers language detection as a side effect, and note the return value is a TextStats dataclass. Additionally, the warnings about English-focused heuristics should mention which specific metrics are affected (Flesch Reading Ease, syllable counting, sentence splitting).
| " Compute text statistics including Flesch Reading Ease score.\n", | |
| " WARNING: Syllable estimation and sentence splitting are English-focused heuristics.\n", | |
| " Compute text statistics and run language detection.\n", | |
| "\n", | |
| " This function computes basic text metrics (character, word, and sentence counts,\n", | |
| " average sentence length, average word length, long-word ratio) and the Flesch\n", | |
| " Reading Ease score, and returns them in a TextStats dataclass.\n", | |
| "\n", | |
| " As a side effect, this function calls `detect_language` on the input text and\n", | |
| " stores the detected language in the resulting TextStats instance.\n", | |
| "\n", | |
| " WARNING: Several metrics rely on English-focused heuristics, specifically\n", | |
| " syllable counting, sentence splitting, and the Flesch Reading Ease score.\n", |
| "\n", | ||
| "_RE_EMAIL = re.compile(r\"\\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z0-9-]{2,63}\\b\", re.IGNORECASE)\n", | ||
| "_RE_PHONE = re.compile(r\"\\b(?:\\+?\\d{1,3}[-. ]?)?(?:\\(?\\d{2,4}\\)?[-. ]?)?\\d{3,4}[-. ]?\\d{3,4}\\b\")\n", | ||
| "_RE_URL = re.compile(r\"\\bhttps?://[^\\s]+?(?=[\\s\\)\\]\\}>\\\"\\'.,!?]|$)\", re.IGNORECASE)\n", |
Copilot AI · Jan 6, 2026
The URL regex pattern has a potential issue with the lookahead. The pattern ends with a complex lookahead that may not match URLs correctly in all contexts. Specifically, the escaped backslash before the quote characters appears incorrect: in a raw Python string this would be fine, but in a notebook string it should be just a single backslash, or the quote should not be escaped. This could cause the regex to fail to match URLs that appear before quotes.
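One hedged alternative that sidesteps both the escaping question and the lookahead entirely (a sketch, not the notebook's pattern): match greedily, then strip trailing punctuation.

```python
import re

# Match the URL greedily up to whitespace, then trim punctuation that
# commonly trails a URL in prose (closing brackets, quotes, sentence marks).
_RE_URL_ALT = re.compile(r"\bhttps?://\S+", re.IGNORECASE)

def find_urls(text: str) -> list[str]:
    return [m.group(0).rstrip(").]}>\"'.,!?") for m in _RE_URL_ALT.finditer(text)]

print(find_urls('see "https://example.com/a." here'))  # ['https://example.com/a']
```

Note that a lazy `[^\s]+?` combined with a lookahead class containing `.` also stops at the first dot inside a URL (e.g. matching only `https://example` out of `https://example.com`), so the greedy-then-strip approach is safer here.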
| " # Phone regex can overmatch; keep conservative by only replacing matches with enough digits\n", | ||
| " def phone_repl(m: re.Match) -> str:\n", | ||
| " val = m.group(0)\n", | ||
| " digits = re.sub(r\"\\D\", \"\", val)\n", | ||
| " if len(digits) < 9:\n", |
Copilot AI · Jan 6, 2026
Missing input validation for the phone number replacement threshold. The hardcoded value of 9 digits on line 288 is used to filter phone matches, but this arbitrary threshold is not validated against the regex pattern on line 249 which matches sequences with as few as 6 digits (e.g., 3+3). This inconsistency could lead to unexpected behavior. Consider adding a comment explaining the relationship between the regex pattern and the digit threshold, or making the threshold configurable.
| " # Phone regex can overmatch; keep conservative by only replacing matches with enough digits\n", | |
| " def phone_repl(m: re.Match) -> str:\n", | |
| " val = m.group(0)\n", | |
| " digits = re.sub(r\"\\D\", \"\", val)\n", | |
| " if len(digits) < 9:\n", | |
| " # Phone regex can overmatch; keep conservative by only replacing matches with enough digits.\n", | |
| " # NOTE: _RE_PHONE currently matches sequences with as few as 6 digits; this threshold must be\n", | |
| " # kept in sync with that minimum and is defined as a single constant for easier validation\n", | |
| " # and configuration alongside the regex.\n", | |
| " PHONE_MIN_DIGITS = 9 # configurable threshold for phone replacement\n", | |
| " def phone_repl(m: re.Match) -> str:\n", | |
| " val = m.group(0)\n", | |
| " digits = re.sub(r\"\\D\", \"\", val)\n", | |
| " if len(digits) < PHONE_MIN_DIGITS:\n", |
| " # Coefficients from the standard FRE formula:\n", | ||
| " # FRE = 206.835 − 1.015*(words/sentences) − 84.6*(syllables/words)\n", | ||
| " # See e.g. https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests\n", | ||
| " # Note: Sentence splitting is heuristic, so this FRE score is an approximation and may mis-estimate readability.\n", |
Copilot AI · Jan 6, 2026
Spelling error: 'mis-estimate' should be 'misestimate' (without hyphen). The word 'misestimate' is a standard English word that doesn't require hyphenation.
| " # Note: Sentence splitting is heuristic, so this FRE score is an approximation and may mis-estimate readability.\n", | |
| " # Note: Sentence splitting is heuristic, so this FRE score is an approximation and may misestimate readability.\n", |
| " dedupe_lines: bool = True,\n", | ||
| " max_chunk_chars: int = 1800,\n", | ||
| " chunk_overlap: int = 1,\n", | ||
| ") -> Dict[str, object]:\n", |
Copilot AI · Jan 6, 2026
Missing docstring. This is a public-facing function that serves as the main entry point for the preprocessing pipeline, but it lacks a docstring. A comprehensive docstring should document the purpose, all parameters with their types and defaults, the return value structure, and any exceptions that may be raised. This is especially important given the complexity of the returned dictionary.
| ") -> Dict[str, object]:\n", | |
| ") -> Dict[str, object]:\n", | |
| " \"\"\"Run the full text preprocessing pipeline on a single document.\n", | |
| "\n", | |
| " This is the main entry point for the preprocessing pipeline. It can strip\n", | |
| " HTML from the input, redact sensitive patterns, drop noisy lines,\n", | |
| " de-duplicate lines, and then chunk the cleaned text into overlapping\n", | |
| " segments suitable for downstream tasks such as embedding or indexing.\n", | |
| "\n", | |
| " Parameters\n", | |
| " ----------\n", | |
| " text : str\n", | |
| " Raw input text to preprocess. If ``None`` or an empty string is\n", | |
| " provided, it is treated as an empty document.\n", | |
| " strip_html_input : bool, default True\n", | |
| " Whether to remove HTML tags from the input text before any other\n", | |
| " processing.\n", | |
| " redact : bool, default True\n", | |
| " Whether to apply redaction of sensitive or unwanted patterns (for\n", | |
| " example, certain identifiers) before chunking.\n", | |
| " drop_noise : bool, default True\n", | |
| " Whether to filter out lines that are likely to be boilerplate or\n", | |
| " otherwise unhelpful noise.\n", | |
| " dedupe_lines : bool, default True\n", | |
| " Whether to remove duplicate lines after cleaning and noise removal.\n", | |
| " max_chunk_chars : int, default 1800\n", | |
| " Target maximum character length for each generated text chunk.\n", | |
| " chunk_overlap : int, default 1\n", | |
| " Number of overlapping characters between consecutive chunks. Must be\n", | |
| " non-negative and strictly less than ``max_chunk_chars``.\n", | |
| "\n", | |
| " Returns\n", | |
| " -------\n", | |
| " Dict[str, object]\n", | |
| " A dictionary containing the original text, intermediate preprocessing\n", | |
| " artifacts, and the final list of text chunks and related metadata.\n", | |
| " The exact structure is intended to be consumed by downstream stages\n", | |
| " of the pipeline and may include additional fields beyond those\n", | |
| " described here.\n", | |
| "\n", | |
| " Raises\n", | |
| " ------\n", | |
| " ValueError\n", | |
| " If ``max_chunk_chars`` is not positive, if ``chunk_overlap`` is\n", | |
| " negative, or if ``chunk_overlap`` is greater than or equal to\n", | |
| " ``max_chunk_chars``.\n", | |
| " \"\"\"\n", |
| "except Exception:\n", | ||
| " BeautifulSoup = None\n", | ||
| "\n", | ||
| "try:\n", | ||
| " from langdetect import detect as lang_detect # type: ignore\n", | ||
| "except Exception:\n", |
Copilot AI · Jan 6, 2026
Best practice violation: Catching bare Exception is too broad. The code catches generic Exception on lines 33, 38, 115, 170, 304, and 323, which can hide unexpected errors and make debugging difficult. Consider catching specific exception types (e.g., ImportError for optional dependencies, ValueError for parsing errors) to handle known failure cases while allowing unexpected errors to propagate.
| "except Exception:\n", | |
| " BeautifulSoup = None\n", | |
| "\n", | |
| "try:\n", | |
| " from langdetect import detect as lang_detect # type: ignore\n", | |
| "except Exception:\n", | |
| "except ImportError:\n", | |
| " BeautifulSoup = None\n", | |
| "\n", | |
| "try:\n", | |
| " from langdetect import detect as lang_detect # type: ignore\n", | |
| "except ImportError:\n", |
@copilot open a new pull request to apply changes based on the comments in this thread
Request clarification on review thread feedback
No description provided.