CHANGELOG.md: 26 additions & 0 deletions
@@ -3,6 +3,32 @@
 All notable changes to this project are documented in this file.
 Format based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).
 
+## [0.12.0] - 2025-06-27
+
+### Added
+- **5 new languages** — Arabic (ar), Chinese Simplified (zh), Japanese (ja), Korean (ko), Turkish (tr). Total: **14 languages** with full deep processing support.
+- **Japanese** — 60+ entries per category, keigo→casual register replacements
+- **Korean** — 60+ entries per category, honorific→casual register replacements
+- **Turkish** — 60+ entries per category, Ottoman→modern Turkish replacements
+- **Placeholder guard system** — all 6 text processing modules (structure, naturalizer, universal, decancel, repetitions, liveliness) now skip words and sentences containing placeholder tokens. Prevents `\x00THZ_*\x00` artifacts from leaking into output.
+- **HTML block protection** — entire `<ul>`, `<ol>`, `<table>`, `<pre>`, `<code>`, `<script>`, `<style>`, `<blockquote>` blocks are now protected as single segments. Individual `<li>` items also protected.
+- **Bare domain protection** — domains like `site.com.ua`, `portal.kh.ua`, `example.co.uk` are now protected without requiring `http://` prefix. Covers 24 TLDs and 18 country sub-TLDs.
+- **Watermark cleaning in pipeline** — `WatermarkDetector.clean()` now runs automatically as the first pipeline stage (before segmentation), removing zero-width characters, homoglyphs, invisible Unicode, and spacing anomalies. Supports plugin hooks (`before`/`after` the `watermark` stage).
+- **Language detection for new scripts** — Arabic (Unicode \u0600–\u06FF), CJK (Chinese \u4E00–\u9FFF, Japanese hiragana/katakana, Korean hangul), Turkish (marker-based with ş, ğ, ı).
+- **54 new tests** for all v0.12.0 features — HTML protection, domain safety, placeholder safety, new languages, watermark pipeline, language detection, restore robustness.
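Reviewer note on the language detection entry: the range-based detection it describes could look roughly like the sketch below. The function name `detect_script_language` is hypothetical, the kana and hangul ranges are the standard Unicode blocks rather than values quoted in the changelog, and the real detector may weigh evidence differently (for example with thresholds instead of a first-hit rule).

```python
from typing import Optional, Tuple

# Arabic and CJK ideograph ranges are the ones named in the changelog entry.
ARABIC = (0x0600, 0x06FF)
CJK_IDEOGRAPHS = (0x4E00, 0x9FFF)
# Assumption: standard Unicode blocks for kana and hangul syllables.
HIRAGANA = (0x3040, 0x309F)
KATAKANA = (0x30A0, 0x30FF)
HANGUL_SYLLABLES = (0xAC00, 0xD7A3)
# Marker letters listed in the changelog entry for Turkish.
TURKISH_MARKERS = frozenset("şğı")


def _count_in(text: str, block: Tuple[int, int]) -> int:
    """Count the characters of `text` whose code point falls inside `block`."""
    lo, hi = block
    return sum(1 for ch in text if lo <= ord(ch) <= hi)


def detect_script_language(text: str) -> Optional[str]:
    """Hypothetical helper: guess a language code from script evidence alone."""
    kana = _count_in(text, HIRAGANA) + _count_in(text, KATAKANA)
    # Japanese mixes kana with CJK ideographs, so kana is checked before "zh".
    if kana:
        return "ja"
    if _count_in(text, HANGUL_SYLLABLES):
        return "ko"
    if _count_in(text, CJK_IDEOGRAPHS):
        return "zh"
    if _count_in(text, ARABIC):
        return "ar"
    if any(ch in TURKISH_MARKERS for ch in text):
        return "tr"
    return None  # no script evidence; a caller would fall back to other detection
```

Checking kana before ideographs matters because Japanese text normally contains both, while Chinese text contains no kana.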
+
+### Fixed
+- **Placeholder token leaks** — processing stages no longer corrupt `\x00THZ_*\x00` tokens through word-boundary regex, `.lower()` operations, or sentence splitting. 3-pass `restore()` recovery: exact match → case-insensitive → orphan cleanup.
+- **Homoglyph detector corrupting Cyrillic** — removed Cyrillic `е` (U+0435), `а` (U+0430), `і` (U+0456) from `_SPECIAL_HOMOGLYPHS` table. These are normal Cyrillic/Ukrainian characters, not watermark homoglyphs. Contextual detection via `_CYRILLIC_TO_LATIN` / `_LATIN_TO_CYRILLIC` remains intact.
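Reviewer note on the placeholder fix: the three recovery passes named in the entry (exact match, case-insensitive, orphan cleanup) could be sketched as follows. This is an illustration under assumptions, not the project's `restore()`: the signature, the `placeholders` mapping, and the exact token suffix format are all hypothetical.

```python
import re
from typing import Dict

# Any leftover token shaped like \x00THZ_<suffix>\x00, in either case.
_ORPHAN = re.compile(r"\x00thz_[^\x00]*\x00", re.IGNORECASE)


def restore(text: str, placeholders: Dict[str, str]) -> str:
    """Hypothetical 3-pass restore; `placeholders` maps token -> original text."""
    # Pass 1: exact match.
    for token, original in placeholders.items():
        text = text.replace(token, original)
    # Pass 2: case-insensitive match, for tokens altered by .lower() or similar.
    for token, original in placeholders.items():
        pattern = re.compile(re.escape(token), re.IGNORECASE)
        # A function replacement keeps backslashes in `original` literal.
        text = pattern.sub(lambda _match, o=original: o, text)
    # Pass 3: orphan cleanup, dropping tokens that no longer map to anything.
    return _ORPHAN.sub("", text)
```

Running the exact pass first keeps the common case cheap, and the orphan pass guarantees that no NUL-delimited token reaches the output even when its mapping is lost.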
 | 🚀 **Blazing fast** | 30,000+ chars/sec — process a full article in milliseconds, not seconds |
 | 🔒 **100% private** | All processing is local. Your text never leaves your machine |
 | 🎯 **Precise control** | Intensity 0–100, 9 profiles, keyword preservation, max change ratio |
-| 🌍 **9 languages + universal** | Full dictionaries for 9 languages; statistical processor for any other |
+| 🌍 **14 languages + universal** | Full dictionaries for 14 languages; statistical processor for any other |
 | 📦 **Zero dependencies** | Pure Python stdlib — no pip packages, no model downloads |
 | 🔁 **Reproducible** | Seed-based PRNG — same input + same seed = identical output |
 | 🔌 **Extensible** | Plugin system to inject custom stages before/after any pipeline step |
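Reviewer note on the "Reproducible" row: the guarantee that the same input and seed give identical output usually comes from routing every random choice through one seeded generator. A minimal sketch, assuming a hypothetical `rewrite` function and synonym table that are not part of the project's API:

```python
import random
from typing import Dict, List


def rewrite(words: List[str], synonyms: Dict[str, List[str]], seed: int) -> List[str]:
    """Hypothetical example: swap words for synonyms using a seeded generator."""
    rng = random.Random(seed)  # private generator; global random state is untouched
    return [rng.choice(synonyms[w]) if w in synonyms else w for w in words]
```

Because the generator is created from the seed inside the call, two calls with the same words, table, and seed make the same sequence of choices.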
@@ -1283,7 +1283,7 @@ The pipeline automatically adjusts processing based on how "AI-like" the input i
 
 If processing exceeds `max_change_ratio`, the pipeline automatically retries at lower intensity (×0.4, then ×0.15) instead of discarding all changes. This ensures maximum quality within constraints.
 
-**Stages 3–6** require full dictionary support (9 languages).
+**Stages 3–6** require full dictionary support (14 languages).
 **Stages 2, 7–8** work for any language, including those without dictionaries.
 **Stage 10** validates quality and retries if needed (configurable via `max_change_ratio`).
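Reviewer note on the retry behaviour described in this hunk: the fallback to ×0.4 and then ×0.15 intensity could be sketched as below. Several details are assumptions: the helper names, measuring change with `difflib.SequenceMatcher`, applying both factors to the original intensity rather than cumulatively, and returning the input unchanged when every attempt exceeds the limit.

```python
from difflib import SequenceMatcher
from typing import Callable

# Full intensity first, then the two fallback factors named in the README.
RETRY_FACTORS = (1.0, 0.4, 0.15)


def change_ratio(before: str, after: str) -> float:
    """Assumed metric: 0.0 for identical texts, 1.0 for nothing in common."""
    return 1.0 - SequenceMatcher(None, before, after).ratio()


def process_with_retry(
    text: str,
    intensity: float,
    max_change_ratio: float,
    transform: Callable[[str, float], str],
) -> str:
    """Try `transform` at decreasing intensity until the change limit is met."""
    for factor in RETRY_FACTORS:
        candidate = transform(text, intensity * factor)
        if change_ratio(text, candidate) <= max_change_ratio:
            return candidate
    return text  # assumption: give up and keep the input if all attempts fail
```

Here `transform` stands in for the processing stages; any callable that takes the text and an intensity works for experimenting with the limit.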