Commit 121d079

v0.12.0: 5 new languages, placeholder safety, HTML/domain protection, watermark pipeline
New languages (14 total):
- Arabic (ar): 81 bureaucratic, 80 synonyms, 49 AI connectors
- Chinese (zh): 80 bureaucratic, 80 synonyms, 36 AI connectors
- Japanese (ja): 60+ per category, keigo→casual register
- Korean (ko): 60+ per category, honorific→casual register
- Turkish (tr): 60+ per category, Ottoman→modern Turkish

Critical fixes:
- Placeholder guard system: all 6 processing modules skip \x00 tokens
- 3-pass restore() recovery: exact → case-insensitive → orphan cleanup
- HTML block protection: ul/ol/table/pre/blockquote as single segments
- Bare domain protection: site.com.ua, portal.kh.ua, example.co.uk
- Homoglyph detector no longer corrupts Cyrillic (removed е/а/і from _SPECIAL_HOMOGLYPHS)

Pipeline:
- Watermark cleaning as automatic first stage (12 stages total)
- Language detection for Arabic, CJK, Turkish scripts

Tests: 1,509 passed (54 new)
1 parent 076eafe commit 121d079

28 files changed: +2975 -50 lines changed

CHANGELOG.md

Lines changed: 26 additions & 0 deletions

@@ -3,6 +3,32 @@
 All notable changes to this project are documented in this file.
 Format based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).
 
+## [0.12.0] - 2025-06-27
+
+### Added
+- **5 new languages** — Arabic (ar), Chinese Simplified (zh), Japanese (ja), Korean (ko), Turkish (tr). Total: **14 languages** with full deep processing support.
+- **Arabic** — 81 bureaucratic, 80 synonyms, 49 AI connectors, 40 colloquial markers, 47 abbreviations, 40 perplexity boosters, 30 sentence starters, 40 bureaucratic phrases, 39 split conjunctions
+- **Chinese** — 80 bureaucratic, 80 synonyms, 36 AI connectors, 40 colloquial markers, 32 abbreviations, 40 perplexity boosters, 30 sentence starters, 40 bureaucratic phrases, 41 split conjunctions
+- **Japanese** — 60+ entries per category, keigo→casual register replacements
+- **Korean** — 60+ entries per category, honorific→casual register replacements
+- **Turkish** — 60+ entries per category, Ottoman→modern Turkish replacements
+- **Placeholder guard system** — all 6 text processing modules (structure, naturalizer, universal, decancel, repetitions, liveliness) now skip words and sentences containing placeholder tokens. Prevents `\x00THZ_*\x00` artifacts from leaking into output.
+- **HTML block protection** — entire `<ul>`, `<ol>`, `<table>`, `<pre>`, `<code>`, `<script>`, `<style>`, `<blockquote>` blocks are now protected as single segments. Individual `<li>` items also protected.
+- **Bare domain protection** — domains like `site.com.ua`, `portal.kh.ua`, `example.co.uk` are now protected without requiring `http://` prefix. Covers 24 TLDs and 18 country sub-TLDs.
+- **Watermark cleaning in pipeline** — `WatermarkDetector.clean()` now runs automatically as the first pipeline stage (before segmentation), removing zero-width characters, homoglyphs, invisible Unicode, and spacing anomalies. Supports plugin hooks (`before`/`after` the `watermark` stage).
+- **Language detection for new scripts** — Arabic (Unicode \u0600–\u06FF), CJK (Chinese \u4E00–\u9FFF, Japanese hiragana/katakana, Korean hangul), Turkish (marker-based with ş, ğ, ı).
+- **54 new tests** for all v0.12.0 features — HTML protection, domain safety, placeholder safety, new languages, watermark pipeline, language detection, restore robustness.
+
+### Fixed
+- **Placeholder token leaks** — processing stages no longer corrupt `\x00THZ_*\x00` tokens through word-boundary regex, `.lower()` operations, or sentence splitting. 3-pass `restore()` recovery: exact match → case-insensitive → orphan cleanup.
+- **Homoglyph detector corrupting Cyrillic** — removed Cyrillic `е` (U+0435), `а` (U+0430), `і` (U+0456) from `_SPECIAL_HOMOGLYPHS` table. These are normal Cyrillic/Ukrainian characters, not watermark homoglyphs. Contextual detection via `_CYRILLIC_TO_LATIN` / `_LATIN_TO_CYRILLIC` remains intact.
+- **Duplicate dictionary keys** — removed F601 duplicates in ar.py (1), ja.py (1), tr.py (4).
+- **Test for unknown language** — updated test to use truly unknown language codes instead of now-supported zh/ja.
+
+### Changed
+- **Pipeline stages** — now 12 stages (was 11): watermark → segmentation → typography → debureaucratization → structure → repetitions → liveliness → paraphrasing → universal → naturalization → validation → restore.
+- **1,509 Python tests** — up from 1,455 (100% pass rate).
+
 ## [0.11.0] - 2025-06-26
 
 ### Added
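
The 3-pass `restore()` recovery named in the Fixed section above (exact match, then case-insensitive, then orphan cleanup) could look roughly like the sketch below. It is an illustration only, not the project's code: the token regex, the `placeholders` mapping, and the function signature are assumptions modelled on the `\x00THZ_*\x00` pattern from the changelog.

```python
import re
from typing import Dict

# Assumed token shape, based on the changelog's `\x00THZ_*\x00` pattern.
ORPHAN_RE = re.compile(r"\x00THZ_[A-Za-z0-9_]*\x00", re.IGNORECASE)


def restore(text: str, placeholders: Dict[str, str]) -> str:
    """Put protected fragments back in three passes (hypothetical sketch)."""
    # Pass 1: exact match, the normal case when no stage touched the token.
    for token, original in placeholders.items():
        text = text.replace(token, original)

    # Pass 2: case-insensitive match, for tokens altered by e.g. .lower().
    for token, original in placeholders.items():
        pattern = re.compile(re.escape(token), re.IGNORECASE)
        # A function replacement avoids backslash handling in `original`.
        text = pattern.sub(lambda _m, o=original: o, text)

    # Pass 3: orphan cleanup, dropping tokens with no known original.
    return ORPHAN_RE.sub("", text)


if __name__ == "__main__":
    mapping = {"\x00THZ_URL_0\x00": "https://example.com"}
    print(restore("Visit \x00thz_url_0\x00 today.", mapping))
```

Running the cheap exact pass first keeps the common path fast, and the regex passes only matter when an earlier stage has already damaged a token.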

README.md

Lines changed: 14 additions & 9 deletions

@@ -12,7 +12,7 @@
 [![TypeScript](https://img.shields.io/badge/TypeScript-5.x-3178C6.svg?logo=typescript&logoColor=white)]()
 [![PHP 8.1+](https://img.shields.io/badge/php-8.1+-777BB4.svg?logo=php&logoColor=white)](https://www.php.net/)
 &nbsp;&nbsp;
-[![Python Tests](https://img.shields.io/badge/tests-1455%20passed-2ea44f.svg?logo=pytest&logoColor=white)]()
+[![Python Tests](https://img.shields.io/badge/tests-1509%20passed-2ea44f.svg?logo=pytest&logoColor=white)]()
 [![PHP Tests](https://img.shields.io/badge/tests-223%20passed-2ea44f.svg?logo=php&logoColor=white)]()
 [![JS Tests](https://img.shields.io/badge/tests-28%20passed-2ea44f.svg?logo=vitest&logoColor=white)]()
 &nbsp;&nbsp;
@@ -27,7 +27,7 @@
 
 <br/>
 
-**27,000+ lines of code** · **44 Python modules** · **11-stage pipeline** · **9 languages + universal**
+**27,000+ lines of code** · **44 Python modules** · **12-stage pipeline** · **14 languages + universal**
 
 [Quick Start](#quick-start) · [API Reference](#api-reference) · [AI Detection](#ai-detection--how-it-works) · [Cookbook](docs/COOKBOOK.md)
 
@@ -46,7 +46,7 @@ It normalizes typography, simplifies bureaucratic language, diversifies sentence
 **Python** (full) · **TypeScript/JavaScript** (core pipeline) · **PHP** (full)
 
 ### Languages:
-🇷🇺 Russian · 🇺🇦 Ukrainian · 🇬🇧 English · 🇩🇪 German · 🇫🇷 French · 🇪🇸 Spanish · 🇵🇱 Polish · 🇧🇷 Portuguese · 🇮🇹 Italian · 🌍 **any language** via universal processor
+🇷🇺 Russian · 🇺🇦 Ukrainian · 🇬🇧 English · 🇩🇪 German · 🇫🇷 French · 🇪🇸 Spanish · 🇵🇱 Polish · 🇧🇷 Portuguese · 🇮🇹 Italian · 🇸🇦 Arabic · 🇨🇳 Chinese · 🇯🇵 Japanese · 🇰🇷 Korean · 🇹🇷 Turkish · 🌍 **any language** via universal processor
 
 ---
 
@@ -111,7 +111,7 @@ It normalizes typography, simplifies bureaucratic language, diversifies sentence
 | 🚀 **Blazing fast** | 30,000+ chars/sec — process a full article in milliseconds, not seconds |
 | 🔒 **100% private** | All processing is local. Your text never leaves your machine |
 | 🎯 **Precise control** | Intensity 0–100, 9 profiles, keyword preservation, max change ratio |
-| 🌍 **9 languages + universal** | Full dictionaries for 9 languages; statistical processor for any other |
+| 🌍 **14 languages + universal** | Full dictionaries for 14 languages; statistical processor for any other |
 | 📦 **Zero dependencies** | Pure Python stdlib — no pip packages, no model downloads |
 | 🔁 **Reproducible** | Seed-based PRNG — same input + same seed = identical output |
 | 🔌 **Extensible** | Plugin system to inject custom stages before/after any pipeline step |
@@ -1283,7 +1283,7 @@ The pipeline automatically adjusts processing based on how "AI-like" the input i
 
 If processing exceeds `max_change_ratio`, the pipeline automatically retries at lower intensity (×0.4, then ×0.15) instead of discarding all changes. This ensures maximum quality within constraints.
 
-**Stages 3–6** require full dictionary support (9 languages).
+**Stages 3–6** require full dictionary support (14 languages).
 **Stages 2, 7–8** work for any language, including those without dictionaries.
 **Stage 10** validates quality and retries if needed (configurable via `max_change_ratio`).
 
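
The retry behaviour quoted in this hunk (fall back to ×0.4, then ×0.15 intensity when `max_change_ratio` is exceeded) can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's implementation: `process_with_retry`, `change_ratio`, and `toy_transform` are hypothetical names, the factors are assumed to scale the original intensity rather than compound, the change ratio is approximated with `difflib`, and returning the input unchanged after the last attempt is a guess.

```python
from difflib import SequenceMatcher
from typing import Callable

# 1.0 is the first attempt; 0.4 and 0.15 are the retry factors from the README.
RETRY_FACTORS = (1.0, 0.4, 0.15)


def change_ratio(before: str, after: str) -> float:
    """Share of the text that changed, approximated via difflib (assumption)."""
    return 1.0 - SequenceMatcher(None, before, after, autojunk=False).ratio()


def process_with_retry(
    text: str,
    transform: Callable[[str, float], str],
    intensity: float,
    max_change_ratio: float,
) -> str:
    """Try full intensity first, then progressively weaker retries."""
    for factor in RETRY_FACTORS:
        candidate = transform(text, intensity * factor)
        if change_ratio(text, candidate) <= max_change_ratio:
            return candidate
    # Assumed fallback: keep the input when every attempt changes too much.
    return text


def toy_transform(text: str, intensity: float) -> str:
    """Stand-in for the real pipeline: overwrite intensity% of the text."""
    n = int(round(len(text) * intensity / 100))
    return "#" * n + text[n:]


if __name__ == "__main__":
    print(process_with_retry("a" * 20, toy_transform, 100, 0.5))
```

Keeping the weaker result rather than discarding everything matches the README's stated goal of preserving as many changes as the constraint allows.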

@@ -1472,7 +1472,7 @@ print(f"Verdict: {result['verdict']}") # → "human_written"
 
 ## Language Support
 
-### Full Dictionary Support (9 languages)
+### Full Dictionary Support (14 languages)
 
 Each language pack includes:
 - Bureaucratic word → natural replacements
@@ -1497,6 +1497,11 @@ Each language pack includes:
 | Polish | `pl` | 18 | 12 | 18 | 15+ | 8+ |
 | Portuguese | `pt` | 16 | 12 | 17 | 12+ | 6+ |
 | Italian | `it` | 16 | 12 | 17 | 12+ | 6+ |
+| Arabic | `ar` | 81 | 49 | 80 | 40+ | 47 |
+| Chinese | `zh` | 80 | 36 | 80 | 40+ | 32 |
+| Japanese | `ja` | 60+ | 30+ | 60+ | 30+ | 25+ |
+| Korean | `ko` | 60+ | 30+ | 60+ | 30+ | 25+ |
+| Turkish | `tr` | 60+ | 30+ | 60+ | 30+ | 25+ |
 
 ### Universal Processor
 
@@ -2043,7 +2048,7 @@ All benchmarks on Apple Silicon (M1 Pro), Python 3.12, single thread. Reproducib
 
 ### Quality Benchmark
 
-Tested on 45 curated samples across 9 languages, multiple profiles, and edge cases:
+Tested on 45 curated samples across 14 languages, multiple profiles, and edge cases:
 
 ```
 ┌──────────────────────────────────────────────────┐
@@ -2211,7 +2216,7 @@ texthumanize/ # 44 Python modules, 16,820 lines
 ├── context.py # Context-aware synonyms (WSD + negative collocations)
 ├── autotune.py # Auto-Tuner (feedback loop + JSON persistence)
 
-├── lang_detect.py # Language detection (9 languages)
+├── lang_detect.py # Language detection (14 languages)
 ├── utils.py # Options, profiles, result classes
 ├── __main__.py # python -m texthumanize
 
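
The script-based detection that the changelog describes for `lang_detect.py` (Arabic U+0600–U+06FF, Chinese U+4E00–U+9FFF, Japanese kana, Korean hangul, Turkish markers ş, ğ, ı) might be approximated as below. The function name, the exact kana and hangul ranges, and the tie-breaking rules are assumptions for illustration; the real module may weigh these signals differently.

```python
from typing import Optional

TURKISH_MARKERS = "şğı"  # marker letters listed in the changelog


def detect_script_language(text: str) -> Optional[str]:
    """Guess a language code from character ranges (hypothetical sketch)."""
    counts = {"ar": 0, "zh": 0, "ja": 0, "ko": 0}
    turkish_hits = 0
    for ch in text:
        cp = ord(ch)
        if 0x0600 <= cp <= 0x06FF:      # Arabic block
            counts["ar"] += 1
        elif 0x3040 <= cp <= 0x30FF:    # hiragana + katakana (assumed range)
            counts["ja"] += 1
        elif 0xAC00 <= cp <= 0xD7A3:    # hangul syllables (assumed range)
            counts["ko"] += 1
        elif 0x4E00 <= cp <= 0x9FFF:    # CJK unified ideographs
            counts["zh"] += 1
        elif ch in TURKISH_MARKERS:
            turkish_hits += 1

    # Japanese mixes kana with ideographs, so any kana wins over "zh" here.
    if counts["ja"]:
        return "ja"
    best = max(counts, key=lambda code: counts[code])
    if counts[best]:
        return best
    if turkish_hits:
        return "tr"
    return None  # let a dictionary- or statistics-based detector decide


if __name__ == "__main__":
    for sample in ("مرحبا بالعالم", "你好世界", "これは日本語です", "안녕하세요", "ışık ağır"):
        print(sample, detect_script_language(sample))
```

Checking kana before ideographs is a deliberate simplification: Japanese text normally contains both, while Chinese text contains no kana.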
@@ -2327,7 +2332,7 @@ $casual = TextHumanize::adjustTone("Formal text", target: 'casual');
 | Watermark Detection | `WatermarkDetector` ||
 | Content Spinning | `ContentSpinner` ||
 | Coherence Analysis | `CoherenceAnalyzer` ||
-| Language Packs | 9 languages ||
+| Language Packs | 14 languages ||
 
 ```bash
 cd php/

composer.json

Lines changed: 1 addition & 1 deletion

@@ -1,6 +1,6 @@
 {
     "name": "ksanyok/text-humanize",
-    "version": "0.11.0",
+    "version": "0.12.0",
     "description": "Zero-dependency PHP library for algorithmic text humanization — transforms machine-generated text into natural prose",
     "type": "library",
     "keywords": [

package.json

Lines changed: 1 addition & 1 deletion

@@ -1,6 +1,6 @@
 {
   "name": "texthumanize",
-  "version": "0.11.0",
+  "version": "0.12.0",
   "private": true,
   "description": "Text style normalization & readability engine — Python, TypeScript, PHP",
   "repository": {

php/composer.json

Lines changed: 1 addition & 1 deletion

@@ -1,6 +1,6 @@
 {
     "name": "ksanyok/text-humanize",
-    "version": "0.11.0",
+    "version": "0.12.0",
     "description": "Zero-dependency PHP library for algorithmic text humanization — transforms machine-generated text into natural prose",
     "type": "library",
     "license": "proprietary",

php/src/TextHumanize.php

Lines changed: 1 addition & 1 deletion

@@ -21,7 +21,7 @@
  */
 class TextHumanize
 {
-    public const VERSION = '0.11.0';
+    public const VERSION = '0.12.0';
 
     /**
      * Humanize text — the primary API method.

php/tests/TextHumanizeTest.php

Lines changed: 1 addition & 1 deletion

@@ -390,7 +390,7 @@ public function testResultChangeRatioModified(): void
 
     public function testVersion(): void
     {
-        $this->assertSame('0.11.0', TextHumanize::VERSION);
+        $this->assertSame('0.12.0', TextHumanize::VERSION);
     }
 
     // ==================== Integration ====================

pyproject.toml

Lines changed: 1 addition & 1 deletion

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "texthumanize"
-version = "0.11.0"
+version = "0.12.0"
 description = "Algorithmic text humanization with AI detection, tone analysis, paraphrasing, and spinning"
 readme = "README.md"
 license = {text = "Dual License — Free for personal use, commercial license required for business"}

tests/test_cli.py

Lines changed: 1 addition & 1 deletion

@@ -22,7 +22,7 @@ def test_version_flag(self, capsys):
             run_cli('--version')
         assert exc.value.code == 0
         out = capsys.readouterr().out
-        assert '0.11.0' in out
+        assert '0.12.0' in out
 
 
 class TestCLIHumanize:

tests/test_multilang.py

Lines changed: 4 additions & 4 deletions

@@ -111,14 +111,14 @@ def test_all_deep_languages(self):
 
     def test_unknown_language_returns_empty(self):
         """Unknown language returns an empty pack (no error)."""
-        pack = get_lang_pack("zh")
-        assert pack["code"] == "zh"
+        pack = get_lang_pack("xx")
+        assert pack["code"] == "xx"
         assert pack["bureaucratic"] == {}
 
     def test_has_deep_support_unknown(self):
         """Unknown language: no deep support."""
-        assert not has_deep_support("zh")
-        assert not has_deep_support("ja")
+        assert not has_deep_support("xx")
+        assert not has_deep_support("sw")
 
 
 class TestMultilingualProcessing:
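
The behaviour these tests pin down (an unknown code yields an empty pack rather than an error, and reports no deep support) can be illustrated with a toy registry. Only the names `get_lang_pack` and `has_deep_support` and the `code`/`bureaucratic` keys come from the diff; the registry layout and the sample `en` entry below are made up for the sketch.

```python
from typing import Any, Dict

# Hypothetical registry; the real packs live in per-language modules.
_LANG_PACKS: Dict[str, Dict[str, Any]] = {
    "en": {"code": "en", "bureaucratic": {"utilize": ["use"]}},  # toy entry
}


def has_deep_support(code: str) -> bool:
    """True only for languages that ship a full dictionary pack."""
    return code in _LANG_PACKS


def get_lang_pack(code: str) -> Dict[str, Any]:
    """Return the pack for `code`, or a fresh empty pack for unknown codes."""
    pack = _LANG_PACKS.get(code)
    if pack is not None:
        return pack
    # Unknown language: no exception, just empty dictionaries to iterate over.
    return {"code": code, "bureaucratic": {}}


if __name__ == "__main__":
    print(get_lang_pack("xx"))
    print(has_deep_support("sw"))
```

Returning an empty pack lets dictionary-driven stages run as no-ops for unsupported languages, which is consistent with the README's note that only some stages need full dictionaries.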
