Commit cfc7136
v0.15.0: 9 new modules — AI backend, POS tagger, CJK segmenter, syntax rewriter, statistical detector, word LM, collocation engine, fingerprint randomizer, benchmark suite
Major changes:
- Three-tier AI backend: OpenAI → OSS model → built-in rules (humanize_ai)
- POS tagger: rule-based EN/RU/UK/DE with 500+ exceptions
- CJK segmenter: Chinese BiMM, Japanese char-type, Korean space+particle
- Syntax rewriter: 8 sentence-level transforms, 150+ irregular verbs
- Statistical AI detector: 35-feature classifier, integrated into detect_ai
- Word-level language model: 14-language perplexity + naturalness scoring
- Collocation engine: PMI-based context-aware synonym ranking
- Fingerprint randomizer: anti-detection output diversification
- Benchmark suite: 6-dimension automated quality scoring
- Pipeline expanded to 17 stages (was 16)
- Fixed NO-OP _reduce_adjacent_repeats with whitespace preservation
- 1696 tests (92 new), all passing
1 parent 7314c37 commit cfc7136

19 files changed: +9,856 −24 lines

CHANGELOG.md

Lines changed: 26 additions & 0 deletions

@@ -3,6 +3,32 @@
All notable changes to this project are documented in this file.
Format based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).

## [0.15.0] - 2025-06-28

### Added

- **9 new core modules** — full audit gap closure (100% of C1-C4, H1-H7, M1-M5, N1-N8 items):
  - `ai_backend` — three-tier AI backend: OpenAI API → OSS Gradio model (rate-limited, circuit-breaker) → built-in rules. New `humanize_ai()` function in core.
  - `pos_tagger` — rule-based POS tagger for EN (500+ exceptions), RU/UK (200+ each), DE (300+). Universal tagset with context disambiguation.
  - `cjk_segmenter` — Chinese BiMM (2,504-entry dictionary), Japanese character-type, and Korean space+particle segmentation. Functions: `segment_cjk()`, `is_cjk_text()`, `detect_cjk_lang()`.
  - `syntax_rewriter` — 8 sentence-level transformations (active↔passive, clause inversion, enumeration reorder, adverb migration, etc.). 150+ irregular verbs; EN/RU/UK/DE support. Integrated as pipeline stage 7b.
  - `statistical_detector` — 35-feature AI-text classifier with logistic regression. 85+ AI markers for EN, 38+ for RU. Integrated into `detect_ai()` with a 60/40 weighted merge (heuristic/statistical).
  - `word_lm` — word-level unigram/bigram language model replacing character-trigram perplexity. Frequency tables for 14 languages. Perplexity, burstiness, and naturalness scoring.
  - `collocation_engine` — PMI-based collocation scoring for context-aware synonym ranking. ~130 EN, ~30 RU, ~20 DE, ~15 FR, ~12 ES collocations.
  - `fingerprint_randomizer` — anti-fingerprint diversification: plan randomization, synonym-pool variation, whitespace jitter, paragraph intensity variation. Integrated as pipeline stage 13b.
  - `benchmark_suite` — automated quality benchmarking across 6 dimensions: detection evasion, naturalness, meaning retention, diversity, length preservation, perplexity boost.
- **Pipeline expanded to 17 stages** — added `syntax_rewriting` (stage 7b) and anti-fingerprint diversification (stage 13b).
- **92 new tests** covering all v0.15.0 modules — AI backend, POS tagger, CJK segmenter, syntax rewriter, statistical detector, word LM, collocation engine, fingerprint randomizer, benchmark suite, plus integration tests.

### Fixed

- **NO-OP `_reduce_adjacent_repeats()`** — found repeated words but then did `pass`. It now removes second occurrences within a sliding window of 8 words, with article-removal support.
- **Paragraph whitespace preservation** — `_reduce_adjacent_repeats()` now tokenizes with `re.split(r'(\s+)')`, preserving `\n\n` paragraph breaks.
- **Syntax rewriter placeholder safety** — sentences containing `THZ_*` placeholders are skipped to prevent email/URL mangling.
- **Operator precedence bug** in the syntax rewriter pipeline stage — `return t, changes if ...` corrected to `return (t, changes) if ...`.

### Changed

- **1,696 Python tests** — up from 1,604 (100% pass rate).
- **`detect_ai()` enhanced** — now returns `combined_score` (60% heuristic + 40% statistical) and `stat_probability` in the results dict.

## [0.14.0] - 2025-06-27

### Added

README.md

Lines changed: 164 additions & 6 deletions
@@ -12,7 +12,7 @@
[![TypeScript](https://img.shields.io/badge/TypeScript-5.x-3178C6.svg?logo=typescript&logoColor=white)]()
[![PHP 8.1+](https://img.shields.io/badge/php-8.1+-777BB4.svg?logo=php&logoColor=white)](https://www.php.net/)

- [![Python Tests](https://img.shields.io/badge/tests-1604%20passed-2ea44f.svg?logo=pytest&logoColor=white)]()
+ [![Python Tests](https://img.shields.io/badge/tests-1696%20passed-2ea44f.svg?logo=pytest&logoColor=white)]()
[![PHP Tests](https://img.shields.io/badge/tests-223%20passed-2ea44f.svg?logo=php&logoColor=white)]()
[![JS Tests](https://img.shields.io/badge/tests-28%20passed-2ea44f.svg?logo=vitest&logoColor=white)]()
@@ -152,15 +152,16 @@ It normalizes typography, simplifies bureaucratic language, diversifies sentence

| Category | Feature | Python | TS/JS | PHP |
|:---------|:--------|:------:|:-----:|:---:|
- | **Core** | `humanize()` — 16-stage pipeline ||||
+ | **Core** | `humanize()` — 17-stage pipeline ||||
| | `humanize_batch()` — parallel processing ||||
| | `humanize_chunked()` — large text support ||||
| | `analyze()` — artificiality scoring ||||
| | `explain()` — change report ||||
- | **AI Detection** | `detect_ai()` — 13-metric ensemble ||||
+ | **AI Detection** | `detect_ai()` — 13-metric + statistical ML ||||
| | `detect_ai_batch()` — batch detection ||||
| | `detect_ai_sentences()` — per-sentence ||||
| | `detect_ai_mixed()` — mixed content ||||
+ | | `detect_ai_statistical()` — 35-feature ML ||||
| **Paraphrasing** | `paraphrase()` — syntactic transforms ||||
| **Tone** | `analyze_tone()` — formality analysis ||||
| | `adjust_tone()` — 7-level adjustment ||||

@@ -170,6 +171,16 @@ It normalizes typography, simplifies bureaucratic language, diversifies sentence

| **Analysis** | `analyze_coherence()` — paragraph flow ||||
| | `full_readability()` — 6 indices ||||
| | Stylistic fingerprinting ||||
+ | **NLP** | `POSTagger` — rule-based POS tagger (EN/RU/UK/DE) ||||
+ | | `CJKSegmenter` — Chinese/Japanese/Korean word segmentation ||||
+ | | `SyntaxRewriter` — 8 sentence-level transforms ||||
+ | | `WordLanguageModel` — word-level LM (14 langs) ||||
+ | | `CollocEngine` — PMI collocation scoring ||||
+ | **AI Backend** | `humanize_ai()` — three-tier AI rewriting ||||
+ | | OpenAI API integration ||||
+ | | OSS model fallback (rate-limited) ||||
+ | **Quality** | `BenchmarkSuite` — 6-dimension quality scoring ||||
+ | | `FingerprintRandomizer` — anti-detection diversity ||||
| **Advanced** | Style presets (5 personas) ||||
| | Auto-Tuner (feedback loop) ||||
| | Plugin system ||||
@@ -220,10 +231,10 @@ It normalizes typography, simplifies bureaucratic language, diversifies sentence

| Feature | TextHumanize v0.8 | Typical Alternatives |
|:--------|:------------------:|:--------------------:|
- | Pipeline stages | **11** | 2–4 |
- | Languages | **9 + universal** | 1–2 |
+ | Pipeline stages | **17** | 2–4 |
+ | Languages | **9 + universal + CJK** | 1–2 |
| AI detection built-in | ✅ 13 metrics + ensemble | ❌ |
- | Total test count | **1,584** (Py+PHP+JS) | 10–50 |
+ | Total test count | **1,696** (Py+PHP+JS) | 10–50 |
| Test coverage | **99%** | Unknown |
| Benchmark pass rate | **100%** (45/45) | No benchmark |
| Codebase size | **27K+ lines** | 500–2K |
@@ -883,6 +894,153 @@
print(f"Dale-Chall: {r.get('dale_chall', 0):.1f}")

---

## v0.15.0 — New Modules & APIs
### `humanize_ai(text, lang, **options)`

Three-tier AI-powered humanization: OpenAI → OSS model → built-in rules.

```python
from texthumanize import humanize_ai

# Default: uses built-in rules (zero dependencies)
result = humanize_ai("AI-generated text here.", lang="en")
print(result.text)

# With OpenAI API (best quality):
result = humanize_ai(
    "Text to humanize.",
    lang="en",
    openai_api_key="sk-...",
    openai_model="gpt-4o-mini",
)

# With OSS model (free, rate-limited):
result = humanize_ai("Text to humanize.", lang="en", enable_oss=True)
```
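The tier order above (OpenAI → OSS → built-in rules) suggests a simple try-next-tier loop. A minimal sketch under that assumption — the helper names (`rewrite_via_rules`, `humanize_with_fallback`) are illustrative, not texthumanize internals:

```python
from typing import Callable, List

def rewrite_via_rules(text: str) -> str:
    """Stand-in for the built-in rule pipeline (always available)."""
    return text

def humanize_with_fallback(text: str, tiers: List[Callable[[str], str]]) -> str:
    """Try each tier in order; on any failure, fall through to the next."""
    for tier in tiers:
        try:
            return tier(text)
        except Exception:
            continue  # this tier failed (network error, rate limit, ...)
    return text  # nothing succeeded; return the input unchanged

# Example: the first tier fails, so the rules tier handles the text.
def flaky_api(text: str) -> str:
    raise RuntimeError("rate limited")

out = humanize_with_fallback("Some text.", [flaky_api, rewrite_via_rules])
```

The key property is that the rules tier never raises, so the function always returns something usable.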
### `StatisticalDetector` — ML-based AI Detection

35-feature classifier with logistic regression, integrated into `detect_ai()`.

```python
from texthumanize import StatisticalDetector, detect_ai_statistical

# Standalone usage
det = StatisticalDetector(lang="en")
result = det.detect("Text to analyze for AI patterns.")
print(f"Probability: {result['probability']:.1%}")
print(f"Verdict: {result['verdict']}")  # human / mixed / ai

# Or the convenience function
result = detect_ai_statistical("Your text here.", lang="en")
```
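The 60/40 merge noted in the changelog is plain arithmetic; a sketch of that weighting (`detect_ai()` performs this internally — the helper name here is illustrative):

```python
def combine_scores(heuristic: float, stat_probability: float) -> float:
    """Weighted merge described in the changelog:
    combined_score = 60% heuristic ensemble + 40% statistical classifier."""
    return 0.6 * heuristic + 0.4 * stat_probability

combined = combine_scores(0.5, 0.75)
# 0.6 * 0.5 + 0.4 * 0.75 = 0.30 + 0.30 = 0.60
```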
### `POSTagger` — Rule-based POS Tagging

Part-of-speech tagger for EN (500+ exceptions), RU/UK (200+ each), DE (300+).

```python
from texthumanize import POSTagger

tagger = POSTagger(lang="en")
for word, tag in tagger.tag("The quick brown fox jumps"):
    print(f"{word:12s} → {tag}")
# The   → DET
# quick → ADJ
# brown → ADJ
# fox   → NOUN
# jumps → VERB
```
### `CJKSegmenter` — Chinese/Japanese/Korean Word Segmentation

```python
from texthumanize import CJKSegmenter, is_cjk_text, detect_cjk_lang

seg = CJKSegmenter(lang="zh")
words = seg.segment("我们是中国人")  # ['我们', '是', '中国', '人']

is_cjk_text("这是中文")         # True
detect_cjk_lang("東京は大きい")  # "ja"
```
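For reference, Chinese BiMM (bidirectional maximum matching) can be sketched as follows. The four-entry dictionary is illustrative — the library ships a 2,504-entry one, and its tie-breaking rules may differ:

```python
# Toy dictionary for illustration only.
DICT = {"我们", "是", "中国", "人"}
MAX_LEN = 4  # longest dictionary entry considered

def forward_mm(text: str) -> list:
    """Greedily match the longest dictionary word from the left."""
    out, i = [], 0
    while i < len(text):
        for n in range(min(MAX_LEN, len(text) - i), 0, -1):
            if n == 1 or text[i:i + n] in DICT:
                out.append(text[i:i + n])
                i += n
                break
    return out

def backward_mm(text: str) -> list:
    """Same idea, matching from the right."""
    out, j = [], len(text)
    while j > 0:
        for n in range(min(MAX_LEN, j), 0, -1):
            if n == 1 or text[j - n:j] in DICT:
                out.insert(0, text[j - n:j])
                j -= n
                break
    return out

def bimm(text: str) -> list:
    """Prefer the segmentation with fewer tokens; tie-break on fewer
    single-character tokens (a common BiMM heuristic)."""
    f, b = forward_mm(text), backward_mm(text)
    if len(f) != len(b):
        return f if len(f) < len(b) else b
    f_singles = sum(len(w) == 1 for w in f)
    b_singles = sum(len(w) == 1 for w in b)
    return f if f_singles <= b_singles else b
```

With this toy dictionary both passes agree, reproducing the segmentation shown in the example above.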
### `SyntaxRewriter` — Sentence-level Transforms

8 transformations: active↔passive, clause inversion, enumeration reorder, adverb migration, etc.

```python
from texthumanize import SyntaxRewriter

sr = SyntaxRewriter(lang="en", seed=42)
variants = sr.rewrite("The team completed the project on time.")
for v in variants:
    print(v)
```
### `WordLanguageModel` — Word-level Perplexity

14-language word-level unigram/bigram LM with naturalness scoring.

```python
from texthumanize import WordLanguageModel, word_perplexity, word_naturalness

lm = WordLanguageModel(lang="en")
pp = lm.perplexity("Some text to measure complexity")
score = lm.naturalness_score("Your multi-sentence text here. Another one.")
print(f"Verdict: {score['verdict']}")  # human / mixed / ai

# Convenience functions:
pp = word_perplexity("Quick check.", lang="en")
ns = word_naturalness("Full analysis.", lang="en")
```
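A word-level bigram perplexity like the one described above can be sketched in a few lines. The probabilities here are toy values — the library's 14-language frequency tables and smoothing will differ:

```python
import math

def bigram_perplexity(tokens, bigram_prob):
    """PP = exp(-(1/N) * sum(log P(w_i | w_{i-1}))) over adjacent word pairs."""
    pairs = list(zip(tokens, tokens[1:]))
    if not pairs:
        return float("inf")  # too short to score
    # Floor unseen pairs at a tiny probability instead of proper smoothing.
    logsum = sum(math.log(bigram_prob.get(p, 1e-6)) for p in pairs)
    return math.exp(-logsum / len(pairs))

probs = {("the", "cat"): 0.01, ("cat", "sat"): 0.02}
pp = bigram_perplexity(["the", "cat", "sat"], probs)
```

Lower perplexity means the word sequence is more predictable under the model; AI text tends to score lower (more predictable) than human text.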
### `CollocEngine` — Collocation-Aware Synonym Ranking

PMI-based scoring for choosing the most natural synonym in context.

```python
from texthumanize import CollocEngine

eng = CollocEngine(lang="en")
best = eng.best_synonym("important", ["crucial", "key", "significant"], context=["decision"])
print(best)  # "crucial" (strongest collocation with "decision")
```
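The PMI score itself is simple; a sketch with made-up counts (the library ships prebuilt per-language collocation tables, so the exact counts and smoothing are assumptions here):

```python
import math

def pmi(pair_count, x_count, y_count, total):
    """PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ).
    Positive when x and y co-occur more often than chance."""
    p_xy = pair_count / total
    p_x = x_count / total
    p_y = y_count / total
    return math.log2(p_xy / (p_x * p_y))

# "crucial decision" co-occurs far more often than chance would predict:
score = pmi(pair_count=50, x_count=1_000, y_count=2_000, total=1_000_000)
```

Ranking synonyms then amounts to computing PMI of each candidate against the context words and picking the highest-scoring one.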
### `FingerprintRandomizer` — Anti-Detection Diversity

Prevents detectable patterns in humanized output.

```python
from texthumanize import FingerprintRandomizer

r = FingerprintRandomizer(seed=42, jitter_level=0.3)
text1 = r.diversify_output("Some humanized text.")
text2 = r.diversify_output("Some humanized text.")  # different each time
```
### `BenchmarkSuite` — Quality Measurement

6-dimension automated quality benchmarking.

```python
from texthumanize import BenchmarkSuite, quick_benchmark

# Quick single-pair benchmark:
report = quick_benchmark("Original AI text.", "Humanized version.")
print(report.summary())

# Full suite:
suite = BenchmarkSuite(lang="en")
report = suite.run_all([
    {"original": "AI text 1.", "humanized": "Human text 1."},
    {"original": "AI text 2.", "humanized": "Human text 2."},
])
print(f"Overall score: {report.overall_score:.1f}/100")
```

---
## Profiles

Nine built-in profiles control the processing style:

pyproject.toml

Lines changed: 1 addition & 1 deletion

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

[project]
name = "texthumanize"
- version = "0.14.0"
+ version = "0.15.0"
description = "Algorithmic text humanization with AI detection, tone analysis, paraphrasing, and spinning"
readme = "README.md"
license = {text = "Dual License — Free for personal use, commercial license required for business"}

tests/test_cli.py

Lines changed: 1 addition & 1 deletion

@@ -22,7 +22,7 @@ def test_version_flag(self, capsys):
        run_cli('--version')
    assert exc.value.code == 0
    out = capsys.readouterr().out
-   assert '0.14.0' in out
+   assert '0.15.0' in out


class TestCLIHumanize:

tests/test_v013_features.py

Lines changed: 2 additions & 2 deletions

@@ -16,13 +16,13 @@ class TestPipeline16Stages(unittest.TestCase):
    """Verify the 16-stage pipeline."""

    def test_stage_count(self):
-       self.assertEqual(len(Pipeline.STAGE_NAMES), 16)
+       self.assertEqual(len(Pipeline.STAGE_NAMES), 17)

    def test_stage_names(self):
        expected = (
            "watermark", "segmentation", "typography", "debureaucratization",
            "structure", "repetitions", "liveliness",
-           "paraphrasing", "tone", "universal", "naturalization",
+           "paraphrasing", "syntax_rewriting", "tone", "universal", "naturalization",
            "readability", "grammar", "coherence",
            "validation", "restore",
        )
