ksanyok
diff --git a/‎CHANGELOG.md‎
Lines changed: 26 additions & 0 deletions b/‎CHANGELOG.md‎
diff --git a/‎README.md‎
Lines changed: 14 additions & 9 deletions b/‎README.md‎
diff --git a/‎composer.json‎
Lines changed: 1 addition & 1 deletion b/‎composer.json‎
diff --git a/‎package.json‎
Lines changed: 1 addition & 1 deletion b/‎package.json‎
diff --git a/‎php/composer.json‎
Lines changed: 1 addition & 1 deletion b/‎php/composer.json‎
diff --git a/‎php/src/TextHumanize.php‎
Lines changed: 1 addition & 1 deletion b/‎php/src/TextHumanize.php‎
diff --git a/‎php/tests/TextHumanizeTest.php‎
Lines changed: 1 addition & 1 deletion b/‎php/tests/TextHumanizeTest.php‎
diff --git a/‎pyproject.toml‎
Lines changed: 1 addition & 1 deletion b/‎pyproject.toml‎
diff --git a/‎tests/test_cli.py‎
Lines changed: 1 addition & 1 deletion b/‎tests/test_cli.py‎
diff --git a/‎tests/test_multilang.py‎
Lines changed: 4 additions & 4 deletions b/‎tests/test_multilang.py‎
@@ -3,6 +3,32 @@
 All notable changes to this project are documented in this file.
 Format based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).
 
+## [0.12.0] - 2025-06-27
+
+### Added
+- **5 new languages** — Arabic (ar), Chinese Simplified (zh), Japanese (ja), Korean (ko), Turkish (tr). Total: **14 languages** with full deep processing support.
+  - **Arabic** — 81 bureaucratic, 80 synonyms, 49 AI connectors, 40 colloquial markers, 47 abbreviations, 40 perplexity boosters, 30 sentence starters, 40 bureaucratic phrases, 39 split conjunctions
+  - **Chinese** — 80 bureaucratic, 80 synonyms, 36 AI connectors, 40 colloquial markers, 32 abbreviations, 40 perplexity boosters, 30 sentence starters, 40 bureaucratic phrases, 41 split conjunctions
+  - **Japanese** — 60+ entries per category, keigo→casual register replacements
+  - **Korean** — 60+ entries per category, honorific→casual register replacements
+  - **Turkish** — 60+ entries per category, Ottoman→modern Turkish replacements
+- **Placeholder guard system** — all 6 text processing modules (structure, naturalizer, universal, decancel, repetitions, liveliness) now skip words and sentences containing placeholder tokens. Prevents `\x00THZ_*\x00` artifacts from leaking into output.
+- **HTML block protection** — entire `<ul>`, `<ol>`, `<table>`, `<pre>`, `<code>`, `<script>`, `<style>`, `<blockquote>` blocks are now protected as single segments. Individual `<li>` items are also protected.
+- **Bare domain protection** — domains like `site.com.ua`, `portal.kh.ua`, `example.co.uk` are now protected without requiring an `http://` prefix. Covers 24 TLDs and 18 country sub-TLDs.
+- **Watermark cleaning in pipeline** — `WatermarkDetector.clean()` now runs automatically as the first pipeline stage (before segmentation), removing zero-width characters, homoglyphs, invisible Unicode, and spacing anomalies. Supports plugin hooks (`before`/`after` the `watermark` stage).
+- **Language detection for new scripts** — Arabic (Unicode \u0600–\u06FF), CJK (Chinese \u4E00–\u9FFF, Japanese hiragana/katakana, Korean hangul), Turkish (marker-based with ş, ğ, ı).
+- **54 new tests** for all v0.12.0 features — HTML protection, domain safety, placeholder safety, new languages, watermark pipeline, language detection, restore robustness.
+
+### Fixed
+- **Placeholder token leaks** — processing stages no longer corrupt `\x00THZ_*\x00` tokens through word-boundary regex, `.lower()` operations, or sentence splitting. `restore()` now recovers tokens in 3 passes: exact match → case-insensitive match → orphan cleanup.
+- **Homoglyph detector corrupting Cyrillic** — removed Cyrillic `е` (U+0435), `а` (U+0430), `і` (U+0456) from `_SPECIAL_HOMOGLYPHS` table. These are normal Cyrillic/Ukrainian characters, not watermark homoglyphs. Contextual detection via `_CYRILLIC_TO_LATIN` / `_LATIN_TO_CYRILLIC` remains intact.
+- **Duplicate dictionary keys** — removed F601 duplicates in ar.py (1), ja.py (1), tr.py (4).
+- **Test for unknown language** — updated the tests to use unsupported language codes (`xx`, `sw`) instead of the now-supported `zh`/`ja`.
+
+### Changed
+- **Pipeline stages** — now 12 stages (was 11): watermark → segmentation → typography → debureaucratization → structure → repetitions → liveliness → paraphrasing → universal → naturalization → validation → restore.
+- **1,509 Python tests** — up from 1,455 (100% pass rate).
+
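Review note: the 3-pass `restore()` recovery described in the Fixed section can be pictured roughly as follows. This is a minimal sketch, not the library's implementation. The function name `restore_sketch`, the `protected` mapping, and the concrete token `\x00THZ_URL_0\x00` are assumptions made for illustration; only the `\x00THZ_*\x00` token shape and the order of the three passes come from the changelog.

```python
import re


def restore_sketch(text: str, protected: dict[str, str]) -> str:
    """Put protected originals back in place of placeholder tokens.

    `protected` maps a placeholder token to the original text it stands for
    (a hypothetical structure, assumed for this sketch).
    """
    # Pass 1: exact match. The common case, where the token survived untouched.
    for token, original in protected.items():
        text = text.replace(token, original)

    # Pass 2: case-insensitive match. Catches tokens that a stage lowercased.
    # A lambda is used so backslashes in `original` are not treated as escapes.
    for token, original in protected.items():
        text = re.sub(
            re.escape(token),
            lambda _match, o=original: o,
            text,
            flags=re.IGNORECASE,
        )

    # Pass 3: orphan cleanup. Drop leftover token fragments that still carry
    # at least one NUL delimiter and could not be mapped back to an original.
    orphan = r"\x00THZ_\w*\x00?|THZ_\w*\x00"
    return re.sub(orphan, "", text, flags=re.IGNORECASE)
```

Calling `restore_sketch(text, {"\x00THZ_URL_0\x00": "https://example.com"})` on a pipeline result would substitute the URL back even if an intermediate stage lowercased the token.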
 ## [0.11.0] - 2025-06-26
 
 ### Added
 
@@ -12,7 +12,7 @@
 [![TypeScript](https://img.shields.io/badge/TypeScript-5.x-3178C6.svg?logo=typescript&logoColor=white)]()
 [![PHP 8.1+](https://img.shields.io/badge/php-8.1+-777BB4.svg?logo=php&logoColor=white)](https://www.php.net/)
 &nbsp;&nbsp;
-[![Python Tests](https://img.shields.io/badge/tests-1455%20passed-2ea44f.svg?logo=pytest&logoColor=white)]()
+[![Python Tests](https://img.shields.io/badge/tests-1509%20passed-2ea44f.svg?logo=pytest&logoColor=white)]()
 [![PHP Tests](https://img.shields.io/badge/tests-223%20passed-2ea44f.svg?logo=php&logoColor=white)]()
 [![JS Tests](https://img.shields.io/badge/tests-28%20passed-2ea44f.svg?logo=vitest&logoColor=white)]()
 &nbsp;&nbsp;
@@ -27,7 +27,7 @@
 
 <br/>
 
-**27,000+ lines of code** · **44 Python modules** · **11-stage pipeline** · **9 languages + universal**
+**27,000+ lines of code** · **44 Python modules** · **12-stage pipeline** · **14 languages + universal**
 
 [Quick Start](#quick-start) · [API Reference](#api-reference) · [AI Detection](#ai-detection--how-it-works) · [Cookbook](docs/COOKBOOK.md)
 
@@ -46,7 +46,7 @@ It normalizes typography, simplifies bureaucratic language, diversifies sentence
 **Python** (full) · **TypeScript/JavaScript** (core pipeline) · **PHP** (full)
 
 ### Languages:
-🇷🇺 Russian · 🇺🇦 Ukrainian · 🇬🇧 English · 🇩🇪 German · 🇫🇷 French · 🇪🇸 Spanish · 🇵🇱 Polish · 🇧🇷 Portuguese · 🇮🇹 Italian · 🌍 **any language** via universal processor
+🇷🇺 Russian · 🇺🇦 Ukrainian · 🇬🇧 English · 🇩🇪 German · 🇫🇷 French · 🇪🇸 Spanish · 🇵🇱 Polish · 🇧🇷 Portuguese · 🇮🇹 Italian · 🇸🇦 Arabic · 🇨🇳 Chinese · 🇯🇵 Japanese · 🇰🇷 Korean · 🇹🇷 Turkish · 🌍 **any language** via universal processor
 
 ---
 
@@ -111,7 +111,7 @@ It normalizes typography, simplifies bureaucratic language, diversifies sentence
 | 🚀 **Blazing fast** | 30,000+ chars/sec — process a full article in milliseconds, not seconds |
 | 🔒 **100% private** | All processing is local. Your text never leaves your machine |
 | 🎯 **Precise control** | Intensity 0–100, 9 profiles, keyword preservation, max change ratio |
-| 🌍 **9 languages + universal** | Full dictionaries for 9 languages; statistical processor for any other |
+| 🌍 **14 languages + universal** | Full dictionaries for 14 languages; statistical processor for any other |
 | 📦 **Zero dependencies** | Pure Python stdlib — no pip packages, no model downloads |
 | 🔁 **Reproducible** | Seed-based PRNG — same input + same seed = identical output |
 | 🔌 **Extensible** | Plugin system to inject custom stages before/after any pipeline step |
@@ -1283,7 +1283,7 @@ The pipeline automatically adjusts processing based on how "AI-like" the input i
 
 If processing exceeds `max_change_ratio`, the pipeline automatically retries at lower intensity (×0.4, then ×0.15) instead of discarding all changes. This ensures maximum quality within constraints.
 
-**Stages 3–6** require full dictionary support (9 languages).
+**Stages 3–6** require full dictionary support (14 languages).
 **Stages 2, 7–8** work for any language, including those without dictionaries.
 **Stage 10** validates quality and retries if needed (configurable via `max_change_ratio`).
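
Review note: the retry behaviour in the README context above (rerun at ×0.4, then ×0.15 when `max_change_ratio` is exceeded) can be sketched as below. This is an illustration under stated assumptions, not the pipeline's code. `humanize_with_retry`, `change_ratio`, and the `process` callback are hypothetical names; the word-level ratio is a stand-in metric; and returning the unchanged input when every attempt fails is an assumption, since the README does not say what happens in that case.

```python
from typing import Callable

# Full intensity first, then the two reduction factors named in the README.
RETRY_FACTORS = (1.0, 0.4, 0.15)


def change_ratio(before: str, after: str) -> float:
    """Crude word-level change ratio (a stand-in for the real metric)."""
    a, b = before.split(), after.split()
    if not a:
        return 0.0
    changed = sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))
    return changed / len(a)


def humanize_with_retry(
    text: str,
    process: Callable[[str, float], str],
    intensity: float,
    max_change_ratio: float,
) -> str:
    """Try progressively lower intensities until the result fits the limit."""
    for factor in RETRY_FACTORS:
        candidate = process(text, intensity * factor)
        if change_ratio(text, candidate) <= max_change_ratio:
            return candidate
    # Assumption: fall back to the unchanged input if no attempt fits.
    return text
```

The point of the design is that a too-aggressive first pass costs only extra processing time: a gentler pass is kept rather than throwing away every change.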
 
@@ -1472,7 +1472,7 @@ print(f"Verdict: {result['verdict']}")   # → "human_written"
 
 ## Language Support
 
-### Full Dictionary Support (9 languages)
+### Full Dictionary Support (14 languages)
 
 Each language pack includes:
 - Bureaucratic word → natural replacements
@@ -1497,6 +1497,11 @@ Each language pack includes:
 | Polish | `pl` | 18 | 12 | 18 | 15+ | 8+ |
 | Portuguese | `pt` | 16 | 12 | 17 | 12+ | 6+ |
 | Italian | `it` | 16 | 12 | 17 | 12+ | 6+ |
+| Arabic | `ar` | 81 | 49 | 80 | 40+ | 47 |
+| Chinese | `zh` | 80 | 36 | 80 | 40+ | 32 |
+| Japanese | `ja` | 60+ | 30+ | 60+ | 30+ | 25+ |
+| Korean | `ko` | 60+ | 30+ | 60+ | 30+ | 25+ |
+| Turkish | `tr` | 60+ | 30+ | 60+ | 30+ | 25+ |
 
 ### Universal Processor
 
@@ -2043,7 +2048,7 @@ All benchmarks on Apple Silicon (M1 Pro), Python 3.12, single thread. Reproducib
 
 ### Quality Benchmark
 
-Tested on 45 curated samples across 9 languages, multiple profiles, and edge cases:
+Tested on 45 curated samples across 14 languages, multiple profiles, and edge cases:
 
 ```
 ┌──────────────────────────────────────────────────┐
@@ -2211,7 +2216,7 @@ texthumanize/                   # 44 Python modules, 16,820 lines
 ├── context.py                  # Context-aware synonyms (WSD + negative collocations)
 ├── autotune.py                 # Auto-Tuner (feedback loop + JSON persistence)
 │
-├── lang_detect.py              # Language detection (9 languages)
+├── lang_detect.py              # Language detection (14 languages)
 ├── utils.py                    # Options, profiles, result classes
 ├── __main__.py                 # python -m texthumanize
 │
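
Review note: the changelog says `lang_detect.py` now recognises Arabic and CJK by Unicode range and Turkish by marker characters. A minimal sketch of that idea follows. The function name `detect_script_language`, the "any kana means Japanese" rule, and the exact hangul range are assumptions for illustration. The Arabic and Chinese ranges and the Turkish markers ş, ğ, ı are taken from the changelog.

```python
from typing import Optional

TURKISH_MARKERS = "şğı"  # marker characters listed in the changelog


def detect_script_language(text: str) -> Optional[str]:
    """Guess a language code from script ranges; None if nothing matches."""
    counts = {"ar": 0, "zh": 0, "ja": 0, "ko": 0}
    for ch in text:
        cp = ord(ch)
        if 0x0600 <= cp <= 0x06FF:      # Arabic block
            counts["ar"] += 1
        elif 0x3040 <= cp <= 0x30FF:    # hiragana + katakana
            counts["ja"] += 1
        elif 0xAC00 <= cp <= 0xD7A3:    # hangul syllables
            counts["ko"] += 1
        elif 0x4E00 <= cp <= 0x9FFF:    # CJK unified ideographs
            counts["zh"] += 1

    # Japanese mixes kana with ideographs, so any kana wins over "zh"
    # (an assumed tie-break, not confirmed by the source).
    if counts["ja"]:
        return "ja"

    best = max(counts, key=lambda code: counts[code])
    if counts[best]:
        return best

    # Latin-script fallback: Turkish is detected by marker characters.
    if any(ch in TURKISH_MARKERS for ch in text):
        return "tr"
    return None
```

A real detector would need more than this, for example scoring against other Latin-script languages, but the sketch shows why range checks are enough to separate the four non-Latin scripts.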
@@ -2327,7 +2332,7 @@ $casual = TextHumanize::adjustTone("Formal text", target: 'casual');
 | Watermark Detection | `WatermarkDetector` | ✅ |
 | Content Spinning | `ContentSpinner` | ✅ |
 | Coherence Analysis | `CoherenceAnalyzer` | ✅ |
-| Language Packs | 9 languages | ✅ |
+| Language Packs | 14 languages | ✅ |
 
 ```bash
 cd php/
 
@@ -1,6 +1,6 @@
 {
     "name": "ksanyok/text-humanize",
-    "version": "0.11.0",
+    "version": "0.12.0",
     "description": "Zero-dependency PHP library for algorithmic text humanization — transforms machine-generated text into natural prose",
     "type": "library",
     "keywords": [
 
@@ -1,6 +1,6 @@
 {
   "name": "texthumanize",
-  "version": "0.11.0",
+  "version": "0.12.0",
   "private": true,
   "description": "Text style normalization & readability engine — Python, TypeScript, PHP",
   "repository": {
 
@@ -1,6 +1,6 @@
 {
     "name": "ksanyok/text-humanize",
-    "version": "0.11.0",
+    "version": "0.12.0",
     "description": "Zero-dependency PHP library for algorithmic text humanization — transforms machine-generated text into natural prose",
     "type": "library",
     "license": "proprietary",
 
@@ -21,7 +21,7 @@
  */
 class TextHumanize
 {
-    public const VERSION = '0.11.0';
+    public const VERSION = '0.12.0';
 
     /**
      * Humanize text — the primary API method.
 
@@ -390,7 +390,7 @@ public function testResultChangeRatioModified(): void
 
     public function testVersion(): void
     {
-        $this->assertSame('0.11.0', TextHumanize::VERSION);
+        $this->assertSame('0.12.0', TextHumanize::VERSION);
     }
 
     // ==================== Integration ====================
 
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "texthumanize"
-version = "0.11.0"
+version = "0.12.0"
 description = "Algorithmic text humanization with AI detection, tone analysis, paraphrasing, and spinning"
 readme = "README.md"
 license = {text = "Dual License — Free for personal use, commercial license required for business"}
 
@@ -22,7 +22,7 @@ def test_version_flag(self, capsys):
             run_cli('--version')
         assert exc.value.code == 0
         out = capsys.readouterr().out
-        assert '0.11.0' in out
+        assert '0.12.0' in out
 
 
 class TestCLIHumanize:
 
@@ -111,14 +111,14 @@ def test_all_deep_languages(self):
 
     def test_unknown_language_returns_empty(self):
         """An unknown language returns an empty pack (no error is raised)."""
-        pack = get_lang_pack("zh")
-        assert pack["code"] == "zh"
+        pack = get_lang_pack("xx")
+        assert pack["code"] == "xx"
         assert pack["bureaucratic"] == {}
 
     def test_has_deep_support_unknown(self):
         """An unknown language has no deep support."""
-        assert not has_deep_support("zh")
-        assert not has_deep_support("ja")
+        assert not has_deep_support("xx")
+        assert not has_deep_support("sw")
 
 
 class TestMultilingualProcessing: