Commit cfc7136
v0.15.0: 9 new modules — AI backend, POS tagger, CJK segmenter, syntax rewriter, statistical detector, word LM, collocation engine, fingerprint randomizer, benchmark suite
Major changes:
- Three-tier AI backend: OpenAI → OSS model → built-in rules (humanize_ai)
- POS tagger: rule-based EN/RU/UK/DE with 500+ exceptions
- CJK segmenter: Chinese BiMM, Japanese char-type, Korean space+particle
- Syntax rewriter: 8 sentence-level transforms, 150+ irregular verbs
- Statistical AI detector: 35-feature classifier, integrated into detect_ai
- Word-level language model: 14-language perplexity + naturalness scoring
- Collocation engine: PMI-based context-aware synonym ranking
- Fingerprint randomizer: anti-detection output diversification
- Benchmark suite: 6-dimension automated quality scoring
- Pipeline expanded to 17 stages (was 16)
- Fixed NO-OP _reduce_adjacent_repeats with whitespace preservation
- 1696 tests (92 new), all passing
1 parent 7314c37 commit cfc7136

19 files changed: +9,856 −24 lines

CHANGELOG.md

Lines changed: 26 additions & 0 deletions

@@ -3,6 +3,32 @@
All notable changes to this project are documented in this file.
Format based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).

## [0.15.0] - 2025-06-28

### Added

- **9 new core modules** — full audit gap closure (100% of C1-C4, H1-H7, M1-M5, N1-N8 items):
  - `ai_backend` — three-tier AI backend: OpenAI API → OSS Gradio model (rate-limited, circuit-breaker) → built-in rules. New `humanize_ai()` function in core.
  - `pos_tagger` — rule-based POS tagger for EN (500+ exceptions), RU/UK (200+ each), DE (300+). Universal tagset with context disambiguation.
  - `cjk_segmenter` — Chinese BiMM (2,504-entry dictionary), Japanese character-type, and Korean space+particle segmentation. Functions: `segment_cjk()`, `is_cjk_text()`, `detect_cjk_lang()`.
  - `syntax_rewriter` — 8 sentence-level transformations (active↔passive, clause inversion, enumeration reorder, adverb migration, etc.). 150+ irregular verbs; EN/RU/UK/DE support. Integrated as pipeline stage 7b.
  - `statistical_detector` — 35-feature AI-text classifier with logistic regression. 85+ AI markers for EN, 38+ for RU. Integrated into `detect_ai()` with a 60/40 weighted merge (heuristic/statistical).
  - `word_lm` — word-level unigram/bigram language model replacing character-trigram perplexity. Frequency tables for 14 languages. Perplexity, burstiness, and naturalness scoring.
  - `collocation_engine` — PMI-based collocation scoring for context-aware synonym ranking. ~130 EN, ~30 RU, ~20 DE, ~15 FR, ~12 ES collocations.
  - `fingerprint_randomizer` — anti-fingerprint diversification: plan randomization, synonym-pool variation, whitespace jitter, paragraph intensity variation. Integrated as pipeline stage 13b.
  - `benchmark_suite` — automated quality benchmarking across 6 dimensions: detection evasion, naturalness, meaning retention, diversity, length preservation, perplexity boost.
- **Pipeline expanded to 17 stages** — added `syntax_rewriting` (stage 7b) and anti-fingerprint diversification (stage 13b).
- **92 new tests** covering all v0.15.0 modules — AI backend, POS tagger, CJK segmenter, syntax rewriter, statistical detector, word LM, collocation engine, fingerprint randomizer, benchmark suite, plus integration tests.

### Fixed

- **NO-OP `_reduce_adjacent_repeats()`** — found repeated words but then did `pass`. It now removes second occurrences within a sliding window of 8 words, with article-removal support.
- **Paragraph whitespace preservation** — `_reduce_adjacent_repeats()` now tokenizes with `re.split(r'(\s+)')`, preserving `\n\n` paragraph breaks.
- **Syntax rewriter placeholder safety** — sentences containing `THZ_*` placeholders are skipped to prevent email/URL mangling.
- **Operator precedence bug** in the syntax rewriter pipeline stage — `return t, changes if ...` corrected to `return (t, changes) if ...`.

### Changed

- **1,696 Python tests** — up from 1,604 (100% pass rate).
- **`detect_ai()` enhanced** — now returns `combined_score` (60% heuristic + 40% statistical) and `stat_probability` in the results dict.

## [0.14.0] - 2025-06-27

### Added

README.md

Lines changed: 164 additions & 6 deletions
@@ -12,7 +12,7 @@
[![TypeScript](https://img.shields.io/badge/TypeScript-5.x-3178C6.svg?logo=typescript&logoColor=white)]()
[![PHP 8.1+](https://img.shields.io/badge/php-8.1+-777BB4.svg?logo=php&logoColor=white)](https://www.php.net/)

- [![Python Tests](https://img.shields.io/badge/tests-1604%20passed-2ea44f.svg?logo=pytest&logoColor=white)]()
+ [![Python Tests](https://img.shields.io/badge/tests-1696%20passed-2ea44f.svg?logo=pytest&logoColor=white)]()
[![PHP Tests](https://img.shields.io/badge/tests-223%20passed-2ea44f.svg?logo=php&logoColor=white)]()
[![JS Tests](https://img.shields.io/badge/tests-28%20passed-2ea44f.svg?logo=vitest&logoColor=white)]()
@@ -152,15 +152,16 @@ It normalizes typography, simplifies bureaucratic language, diversifies sentence

| Category | Feature | Python | TS/JS | PHP |
|:---------|:--------|:------:|:-----:|:---:|
- | **Core** | `humanize()` — 16-stage pipeline ||||
+ | **Core** | `humanize()` — 17-stage pipeline ||||
| | `humanize_batch()` — parallel processing ||||
| | `humanize_chunked()` — large text support ||||
| | `analyze()` — artificiality scoring ||||
| | `explain()` — change report ||||
- | **AI Detection** | `detect_ai()` — 13-metric ensemble ||||
+ | **AI Detection** | `detect_ai()` — 13-metric + statistical ML ||||
| | `detect_ai_batch()` — batch detection ||||
| | `detect_ai_sentences()` — per-sentence ||||
| | `detect_ai_mixed()` — mixed content ||||
+ | | `detect_ai_statistical()` — 35-feature ML ||||
| **Paraphrasing** | `paraphrase()` — syntactic transforms ||||
| **Tone** | `analyze_tone()` — formality analysis ||||
| | `adjust_tone()` — 7-level adjustment ||||

@@ -170,6 +171,16 @@ It normalizes typography, simplifies bureaucratic language, diversifies sentence

| **Analysis** | `analyze_coherence()` — paragraph flow ||||
| | `full_readability()` — 6 indices ||||
| | Stylistic fingerprinting ||||
+ | **NLP** | `POSTagger` — rule-based POS tagger (EN/RU/UK/DE) ||||
+ | | `CJKSegmenter` — Chinese/Japanese/Korean word segmentation ||||
+ | | `SyntaxRewriter` — 8 sentence-level transforms ||||
+ | | `WordLanguageModel` — word-level LM (14 langs) ||||
+ | | `CollocEngine` — PMI collocation scoring ||||
+ | **AI Backend** | `humanize_ai()` — three-tier AI rewriting ||||
+ | | OpenAI API integration ||||
+ | | OSS model fallback (rate-limited) ||||
+ | **Quality** | `BenchmarkSuite` — 6-dimension quality scoring ||||
+ | | `FingerprintRandomizer` — anti-detection diversity ||||
| **Advanced** | Style presets (5 personas) ||||
| | Auto-Tuner (feedback loop) ||||
| | Plugin system ||||
@@ -220,10 +231,10 @@ It normalizes typography, simplifies bureaucratic language, diversifies sentence

| Feature | TextHumanize v0.8 | Typical Alternatives |
|:--------|:------------------:|:--------------------:|
- | Pipeline stages | **11** | 2–4 |
- | Languages | **9 + universal** | 1–2 |
+ | Pipeline stages | **17** | 2–4 |
+ | Languages | **9 + universal + CJK** | 1–2 |
| AI detection built-in | ✅ 13 metrics + ensemble | ❌ |
- | Total test count | **1,584** (Py+PHP+JS) | 10–50 |
+ | Total test count | **1,696** (Py+PHP+JS) | 10–50 |
| Test coverage | **99%** | Unknown |
| Benchmark pass rate | **100%** (45/45) | No benchmark |
| Codebase size | **27K+ lines** | 500–2K |
@@ -883,6 +894,153 @@
print(f"Dale-Chall: {r.get('dale_chall', 0):.1f}")

---

## v0.15.0 — New Modules & APIs
### `humanize_ai(text, lang, **options)`

Three-tier AI-powered humanization: OpenAI → OSS model → built-in rules.

```python
from texthumanize import humanize_ai

# Default: uses built-in rules (zero dependencies)
result = humanize_ai("AI-generated text here.", lang="en")
print(result.text)

# With OpenAI API (best quality):
result = humanize_ai(
    "Text to humanize.",
    lang="en",
    openai_api_key="sk-...",
    openai_model="gpt-4o-mini",
)

# With OSS model (free, rate-limited):
result = humanize_ai("Text to humanize.", lang="en", enable_oss=True)
```
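The tier order above (OpenAI → OSS → built-in rules) suggests a simple try-next-tier loop. A minimal sketch under that assumption — the helper names (`rewrite_via_rules`, `humanize_with_fallback`) are illustrative, not texthumanize internals:

```python
from typing import Callable, List

def rewrite_via_rules(text: str) -> str:
    """Stand-in for the built-in rule pipeline (always available)."""
    return text

def humanize_with_fallback(text: str, tiers: List[Callable[[str], str]]) -> str:
    """Try each tier in order; on any failure, fall through to the next."""
    for tier in tiers:
        try:
            return tier(text)
        except Exception:
            continue  # this tier failed (network error, rate limit, ...)
    return text  # nothing succeeded; return the input unchanged

# Example: the first tier fails, so the rules tier handles the text.
def flaky_api(text: str) -> str:
    raise RuntimeError("rate limited")

out = humanize_with_fallback("Some text.", [flaky_api, rewrite_via_rules])
```

The key property is that the rules tier never raises, so the function always returns something usable.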
### `StatisticalDetector` — ML-based AI Detection

35-feature classifier with logistic regression, integrated into `detect_ai()`.

```python
from texthumanize import StatisticalDetector, detect_ai_statistical

# Standalone usage
det = StatisticalDetector(lang="en")
result = det.detect("Text to analyze for AI patterns.")
print(f"Probability: {result['probability']:.1%}")
print(f"Verdict: {result['verdict']}")  # human / mixed / ai

# Or the convenience function
result = detect_ai_statistical("Your text here.", lang="en")
```
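The 60/40 merge noted in the changelog is plain arithmetic; a sketch of that weighting (`detect_ai()` performs this internally — the helper name here is illustrative):

```python
def combine_scores(heuristic: float, stat_probability: float) -> float:
    """Weighted merge described in the changelog:
    combined_score = 60% heuristic ensemble + 40% statistical classifier."""
    return 0.6 * heuristic + 0.4 * stat_probability

combined = combine_scores(0.5, 0.75)
# 0.6 * 0.5 + 0.4 * 0.75 = 0.30 + 0.30 = 0.60
```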
### `POSTagger` — Rule-based POS Tagging

Part-of-speech tagger for EN (500+ exceptions), RU/UK (200+ each), DE (300+).

```python
from texthumanize import POSTagger

tagger = POSTagger(lang="en")
for word, tag in tagger.tag("The quick brown fox jumps"):
    print(f"{word:12s} → {tag}")
# The   → DET
# quick → ADJ
# brown → ADJ
# fox   → NOUN
# jumps → VERB
```
### `CJKSegmenter` — Chinese/Japanese/Korean Word Segmentation

```python
from texthumanize import CJKSegmenter, is_cjk_text, detect_cjk_lang

seg = CJKSegmenter(lang="zh")
words = seg.segment("我们是中国人")  # ['我们', '是', '中国', '人']

is_cjk_text("这是中文")         # True
detect_cjk_lang("東京は大きい")  # "ja"
```
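For reference, Chinese BiMM (bidirectional maximum matching) can be sketched as follows. The four-entry dictionary is illustrative — the library ships a 2,504-entry one, and its tie-breaking rules may differ:

```python
# Toy dictionary for illustration only.
DICT = {"我们", "是", "中国", "人"}
MAX_LEN = 4  # longest dictionary entry considered

def forward_mm(text: str) -> list:
    """Greedily match the longest dictionary word from the left."""
    out, i = [], 0
    while i < len(text):
        for n in range(min(MAX_LEN, len(text) - i), 0, -1):
            if n == 1 or text[i:i + n] in DICT:
                out.append(text[i:i + n])
                i += n
                break
    return out

def backward_mm(text: str) -> list:
    """Same idea, matching from the right."""
    out, j = [], len(text)
    while j > 0:
        for n in range(min(MAX_LEN, j), 0, -1):
            if n == 1 or text[j - n:j] in DICT:
                out.insert(0, text[j - n:j])
                j -= n
                break
    return out

def bimm(text: str) -> list:
    """Prefer the segmentation with fewer tokens; tie-break on fewer
    single-character tokens (a common BiMM heuristic)."""
    f, b = forward_mm(text), backward_mm(text)
    if len(f) != len(b):
        return f if len(f) < len(b) else b
    f_singles = sum(len(w) == 1 for w in f)
    b_singles = sum(len(w) == 1 for w in b)
    return f if f_singles <= b_singles else b
```

With this toy dictionary both passes agree, reproducing the segmentation shown in the example above.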
### `SyntaxRewriter` — Sentence-level Transforms

8 transformations: active↔passive, clause inversion, enumeration reorder, adverb migration, etc.

```python
from texthumanize import SyntaxRewriter

sr = SyntaxRewriter(lang="en", seed=42)
variants = sr.rewrite("The team completed the project on time.")
for v in variants:
    print(v)
```
### `WordLanguageModel` — Word-level Perplexity

14-language word-level unigram/bigram LM with naturalness scoring.

```python
from texthumanize import WordLanguageModel, word_perplexity, word_naturalness

lm = WordLanguageModel(lang="en")
pp = lm.perplexity("Some text to measure complexity")
score = lm.naturalness_score("Your multi-sentence text here. Another one.")
print(f"Verdict: {score['verdict']}")  # human / mixed / ai

# Convenience functions:
pp = word_perplexity("Quick check.", lang="en")
ns = word_naturalness("Full analysis.", lang="en")
```
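A word-level bigram perplexity like the one described above can be sketched in a few lines. The probabilities here are toy values — the library's 14-language frequency tables and smoothing will differ:

```python
import math

def bigram_perplexity(tokens, bigram_prob):
    """PP = exp(-(1/N) * sum(log P(w_i | w_{i-1}))) over adjacent word pairs."""
    pairs = list(zip(tokens, tokens[1:]))
    if not pairs:
        return float("inf")  # too short to score
    # Floor unseen pairs at a tiny probability instead of proper smoothing.
    logsum = sum(math.log(bigram_prob.get(p, 1e-6)) for p in pairs)
    return math.exp(-logsum / len(pairs))

probs = {("the", "cat"): 0.01, ("cat", "sat"): 0.02}
pp = bigram_perplexity(["the", "cat", "sat"], probs)
```

Lower perplexity means the word sequence is more predictable under the model; AI text tends to score lower (more predictable) than human text.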
### `CollocEngine` — Collocation-Aware Synonym Ranking

PMI-based scoring for choosing the most natural synonym in context.

```python
from texthumanize import CollocEngine

eng = CollocEngine(lang="en")
best = eng.best_synonym("important", ["crucial", "key", "significant"], context=["decision"])
print(best)  # "crucial" (strongest collocation with "decision")
```
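The PMI score itself is simple; a sketch with made-up counts (the library ships prebuilt per-language collocation tables, so the exact counts and smoothing are assumptions here):

```python
import math

def pmi(pair_count, x_count, y_count, total):
    """PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ).
    Positive when x and y co-occur more often than chance."""
    p_xy = pair_count / total
    p_x = x_count / total
    p_y = y_count / total
    return math.log2(p_xy / (p_x * p_y))

# "crucial decision" co-occurs far more often than chance would predict:
score = pmi(pair_count=50, x_count=1_000, y_count=2_000, total=1_000_000)
```

Ranking synonyms then amounts to computing PMI of each candidate against the context words and picking the highest-scoring one.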
### `FingerprintRandomizer` — Anti-Detection Diversity

Prevents detectable patterns in humanized output.

```python
from texthumanize import FingerprintRandomizer

r = FingerprintRandomizer(seed=42, jitter_level=0.3)
text1 = r.diversify_output("Some humanized text.")
text2 = r.diversify_output("Some humanized text.")  # different each time
```
### `BenchmarkSuite` — Quality Measurement

6-dimension automated quality benchmarking.

```python
from texthumanize import BenchmarkSuite, quick_benchmark

# Quick single-pair benchmark:
report = quick_benchmark("Original AI text.", "Humanized version.")
print(report.summary())

# Full suite:
suite = BenchmarkSuite(lang="en")
report = suite.run_all([
    {"original": "AI text 1.", "humanized": "Human text 1."},
    {"original": "AI text 2.", "humanized": "Human text 2."},
])
print(f"Overall score: {report.overall_score:.1f}/100")
```

---
## Profiles

Nine built-in profiles control the processing style:

pyproject.toml

Lines changed: 1 addition & 1 deletion

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

[project]
name = "texthumanize"
- version = "0.14.0"
+ version = "0.15.0"
description = "Algorithmic text humanization with AI detection, tone analysis, paraphrasing, and spinning"
readme = "README.md"
license = {text = "Dual License — Free for personal use, commercial license required for business"}

tests/test_cli.py

Lines changed: 1 addition & 1 deletion

@@ -22,7 +22,7 @@ def test_version_flag(self, capsys):
        run_cli('--version')
    assert exc.value.code == 0
    out = capsys.readouterr().out
-   assert '0.14.0' in out
+   assert '0.15.0' in out


class TestCLIHumanize:

tests/test_v013_features.py

Lines changed: 2 additions & 2 deletions

@@ -16,13 +16,13 @@ class TestPipeline16Stages(unittest.TestCase):
    """Verify the 16-stage pipeline."""

    def test_stage_count(self):
-       self.assertEqual(len(Pipeline.STAGE_NAMES), 16)
+       self.assertEqual(len(Pipeline.STAGE_NAMES), 17)

    def test_stage_names(self):
        expected = (
            "watermark", "segmentation", "typography", "debureaucratization",
            "structure", "repetitions", "liveliness",
-           "paraphrasing", "tone", "universal", "naturalization",
+           "paraphrasing", "syntax_rewriting", "tone", "universal", "naturalization",
            "readability", "grammar", "coherence",
            "validation", "restore",
        )
