Releases: ksanyok/TextHumanize
v0.25.0
What's Changed
Bug Fixes
- CRITICAL: Fixed naturalizer.py regex crash for RU/UK text (~50 patterns with non-capturing groups + backreferences). The entire naturalization stage was silently skipped.
- Added thread-safety locks to `_ai_cache` and `_AI_WORDS` for multi-threaded usage.
- Added division-by-zero guards in detector metric calculations.
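Both fixes follow standard patterns; a minimal standalone sketch of the idea (names and structure are illustrative, not the library's actual internals):

```python
import threading

_cache_lock = threading.Lock()
_ai_cache = {}  # hypothetical shared cache, guarded by the lock

def cache_get_or_compute(key, compute):
    """Thread-safe read-through access to the shared cache."""
    with _cache_lock:
        if key not in _ai_cache:
            _ai_cache[key] = compute(key)
        return _ai_cache[key]

def safe_ratio(numerator, denominator, default=0.0):
    """Division-by-zero guard for detector metric calculations."""
    return numerator / denominator if denominator else default
```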
Cleanup
- Removed dead module `tokenizer.py` (replaced by `sentence_split.py`).
- Removed 14 one-off diagnostic scripts, 4 outdated competitive analysis docs, and debug artifacts.
- Synced PHP and JS package versions to 0.25.0.
Documentation
- Corrected the pipeline stage count from 17 to 20 across all 15+ documentation files.
- Corrected test counts, LOC claims, and speed benchmarks for consistency.
- Fixed CHANGELOG date chronology.
CI
- Raised per-test timeout from 120s to 300s to prevent false failures on slow CI runners.
v0.24.0 — Deep Humanization for EN/RU/UK
Neural detector:
- Per-language feature normalization (RU/UK `char_entropy` baseline 4.8-4.9 vs 4.3 for EN)
- Expanded RU/UK conjunctions, transitions, and AI word sets for MLP features 27/33/34

Naturalizer:
- Transition-phrase deletion (22 EN / 23 RU / 23 UK patterns)
- Em-dash injection (`_comma_to_dash` + `_insert_dash_aside`)
- More aggressive burstiness (threshold 25 → 16-20, fragment insertion strategy)
- Light perplexity boost (rhetorical questions for formal profiles)
- Paragraph splitting (5+ sentence paragraphs)
- +30 EN word simplification entries

Pipeline:
- Intensity cap raised 70 → 85; multipliers 1.15 → 1.20 and 1.1 → 1.15
- Stage 13a: final entropy re-injection after grammar/coherence

Results (local backend, 3-sentence AI text, intensity=60):
- EN: 0.920 → 0.372 (human)
- RU: 0.880 → 0.390 (human)
- UK: 0.840 → 0.351 (human)

All 1,984 tests pass.
v0.23.0 - OSS LLM Backend, PyPI Publication
What's New
Backend Parameter
- New `backend` parameter: `local` (default), `oss`, `openai`, `auto`
- OSS backend: Free AI humanization via amd/gpt-oss-120b-chatbot on HuggingFace Spaces
- OpenAI backend: Optional paid backend using GPT-4o-mini
- Auto mode: tries OSS, then OpenAI, then falls back to local
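The auto-mode fallback chain can be sketched as a loop over backends, assuming each backend is a callable that raises on failure (illustrative only; the real selection logic lives inside the library):

```python
def auto_humanize(text, backends):
    """Try each (name, fn) backend in order, e.g. OSS -> OpenAI -> local;
    fall through to the next one on any failure."""
    last_error = None
    for name, fn in backends:
        try:
            return name, fn(text)
        except Exception as exc:  # network error, rate limit, missing key...
            last_error = exc
    raise RuntimeError("all backends failed") from last_error

def flaky_oss(text):
    raise TimeoutError("rate limited")

# Stand-in backends: the OSS one fails, the local one is a trivial rewrite.
backends = [("oss", flaky_oss),
            ("local", lambda t: t.replace("utilize", "use"))]
name, out = auto_humanize("please utilize it", backends)
```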
Install
```
pip install texthumanize==0.23.0
```
Usage
```python
from texthumanize import humanize
result = humanize('AI text', backend='oss')
```
Full Changelog: v0.15.0...v0.23.0
v0.15.0 — Full Audit Closure: 9 New Modules
What's New
9 New Core Modules
- `ai_backend` — Three-tier AI backend: OpenAI API → OSS Gradio model (rate-limited) → built-in rules. New `humanize_ai()` function.
- `pos_tagger` — Rule-based POS tagger for EN (500+ exceptions), RU/UK (200+), DE (300+). Universal tagset.
- `cjk_segmenter` — Chinese BiMM (2,504 entries), Japanese character-type, Korean space+particle segmentation.
- `syntax_rewriter` — 8 sentence-level transforms (active↔passive, clause inversion, enumeration reorder, adverb migration). 150+ irregular verbs.
- `statistical_detector` — 35-feature ML classifier for AI text detection. Integrated into `detect_ai()` with a 60/40 weighted merge.
- `word_lm` — Word-level unigram/bigram language model for 14 languages. Perplexity, burstiness, naturalness scoring.
- `collocation_engine` — PMI-based collocation scoring for context-aware synonym selection. EN ~130, RU ~30, DE ~20 collocations.
- `fingerprint_randomizer` — Anti-fingerprint diversification for output variety.
- `benchmark_suite` — 6-dimension automated quality benchmarking.
Pipeline & Detection
- Pipeline expanded to 17 stages (added syntax rewriting + anti-fingerprint diversification)
- `detect_ai()` now returns `combined_score` (statistical + heuristic)
- Fixed no-op `_reduce_adjacent_repeats()` — it now actually removes repetitions
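If the 60/40 merge is a plain weighted average, the combined score reduces to the following (an assumption for illustration, not a confirmed implementation detail):

```python
def combined_score(statistical, heuristic, w_stat=0.6, w_heur=0.4):
    """Weighted merge of the statistical classifier score and the
    heuristic score, both assumed to be in [0, 1]."""
    return w_stat * statistical + w_heur * heuristic
```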
Tests
- 1,696 tests — 92 new, all passing (100% pass rate)
v0.14.0
v0.14.0 -- Reliability, Analysis Tools & New APIs
New API Functions
- `humanize_sentences()` -- per-sentence AI scoring with graduated intensity; only rewrites sentences above a configurable AI probability threshold
- `humanize_variants()` -- generates 1-10 humanization variants with different random seeds, sorted by quality
- `humanize_stream()` -- generator that yields humanized text chunk-by-chunk with progress tracking
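The seeded-variants idea behind `humanize_variants()` can be sketched standalone (the rewrite and quality functions here are stand-ins, not the library's internals):

```python
import random

def make_variants(text, rewrite, quality, n=3):
    """Generate n seeded rewrites and return them sorted best-first."""
    variants = []
    for seed in range(n):
        rng = random.Random(seed)   # deterministic per-variant seed
        variants.append(rewrite(text, rng))
    return sorted(variants, key=quality, reverse=True)

def shuffle_words(text, rng):
    """Toy rewrite: permute the word order."""
    words = text.split()
    return " ".join(rng.sample(words, len(words)))

out = make_variants("one two three", shuffle_words, quality=len, n=3)
```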
New Analysis Modules (zero-dependency, offline)
- `perplexity_v2` -- character-level trigram cross-entropy model with `cross_entropy()` and `perplexity_score()` returning a naturalness score (0-100) and a verdict
- `dict_trainer` -- corpus analysis for custom dictionary building with `train_from_corpus()` and `export_custom_dict()`
- `plagiarism` -- offline originality detection via n-gram fingerprinting with `check_originality()` and `compare_originality()`
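Character-level trigram cross-entropy can be sketched roughly as follows; this is a toy model with add-one smoothing, and the module's real implementation may differ:

```python
import math
from collections import Counter

def train_trigrams(corpus):
    """Count character trigrams and their bigram contexts."""
    tri = Counter(corpus[i:i + 3] for i in range(len(corpus) - 2))
    bi = Counter(corpus[i:i + 2] for i in range(len(corpus) - 1))
    return tri, bi

def cross_entropy(text, model, vocab_size=128):
    """Average negative log2 probability per trigram (add-one smoothed).
    Higher values mean the text looks less like the training corpus."""
    tri, bi = model
    total, n = 0.0, 0
    for i in range(len(text) - 2):
        t = text[i:i + 3]
        p = (tri[t] + 1) / (bi[t[:2]] + vocab_size)
        total -= math.log2(p)
        n += 1
    return total / max(n, 1)

model = train_trigrams("the cat sat on the mat. the cat ran.")
```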
Pipeline Improvements
- Error isolation -- each processing stage wrapped in `_safe_stage()` with try/except; failing stages are skipped gracefully instead of crashing the pipeline
- Partial rollback -- the pipeline records checkpoints after each stage; on validation failure, it rolls back stage-by-stage to find the last valid state
- Pipeline profiling -- `stage_timings` dict and `total_time` included in `metrics_after` for performance analysis
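The error-isolation and profiling pattern described above, sketched standalone (the real `_safe_stage()` signature is internal and may differ):

```python
import time

def run_pipeline(text, stages):
    """Run named stages in order; skip any stage that raises, and
    record per-stage timings like a stage_timings dict."""
    timings, skipped = {}, []
    for name, fn in stages:
        start = time.perf_counter()
        try:
            text = fn(text)
        except Exception:
            skipped.append(name)  # failing stage is skipped, pipeline continues
        timings[name] = time.perf_counter() - start
    metrics = {"stage_timings": timings,
               "total_time": sum(timings.values()),
               "skipped": skipped}
    return text, metrics

def broken(text):
    raise ValueError("boom")

stages = [("lower", str.lower), ("broken", broken), ("strip", str.strip)]
out, metrics = run_pipeline("  HELLO  ", stages)
```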
Bug Fixes & Code Quality
- Fixed `adversarial_calibrate` intensity parameter (float 0-1 changed to int 0-100 to match the API)
- Added input sanitization: TypeError for non-str, ValueError for >500K chars, early return for empty text
- Thread-safe lazy loading with double-checked locking on all module loaders
- Instance-level plugins preventing cross-instance interference
- Fixed `humanize_sentences` crash (`detect_ai_sentences` returns a list, not a dict)
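Double-checked locking for lazy module loading, as a minimal standalone sketch (module state and loader are hypothetical):

```python
import threading

_lock = threading.Lock()
_module = None

def get_module(loader):
    """Load once; take the lock only on the slow path, then re-check
    under the lock (double-checked locking)."""
    global _module
    if _module is None:              # fast path: no lock once loaded
        with _lock:
            if _module is None:      # re-check: another thread may have won
                _module = loader()
    return _module

calls = []
def load():
    calls.append(1)                  # count actual loads
    return {"ready": True}

loaded = get_module(load)
loaded = get_module(load)            # second call hits the cache
```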
Tests
- 1,604 tests -- up from 1,560 (44 new tests for all v0.14.0 features)
- 100% pass rate
v0.13.0 — 16-Stage Pipeline, Grammar & Tone & Readability & Coherence
TextHumanize v0.13.0
4 new pipeline stages (12 to 16):
- Tone harmonization — match text tone to profile (academic/blog/seo/casual)
- Readability optimization — split complex sentences, join short ones
- Grammar correction — fix doubled words, spacing, typos (9 languages)
- Coherence repair — transitions between paragraphs, diversify openings
Dictionary expansion (~3,600 new entries):
- EN: +475 | RU: +430 | UK: +337
- DE/ES/FR/IT/PL/PT: ~235 each
- AR/ZH/JA/KO/TR: ~205 each
- Total: ~13,800 entries across 14 languages
Tests: 1,560 (all passing)
Full changelog: https://github.com/ksanyok/TextHumanize/blob/main/CHANGELOG.md
v0.12.0 — 14 Languages, Placeholder Safety, Watermark Pipeline
What's New
5 New Languages (14 total)
- Arabic (ar) — 81 bureaucratic, 80 synonyms, 49 AI connectors, 47 abbreviations
- Chinese Simplified (zh) — 80 bureaucratic, 80 synonyms, 36 AI connectors
- Japanese (ja) — 60+ per category, keigo to casual register replacements
- Korean (ko) — 60+ per category, honorific to casual register
- Turkish (tr) — 60+ per category, Ottoman to modern Turkish
Critical Bug Fixes
- Placeholder safety — all 6 processing modules now skip placeholder tokens; no more leaked placeholders in output
- 3-pass `restore()` — exact match, case-insensitive, orphan cleanup
- HTML block protection — ul, ol, table, pre, blockquote preserved as single segments
- Bare domain protection — site.com.ua, portal.kh.ua, example.co.uk etc.
- Homoglyph fix — removed Cyrillic characters from special homoglyphs table (was corrupting all Cyrillic text)
Pipeline Improvements
- Watermark cleaning — automatic first stage (12 stages total), removes zero-width chars, homoglyphs, invisible Unicode
- Language detection — Arabic/CJK/Turkish script detection added
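The zero-width-character part of watermark cleaning can be sketched as a simple character filter; the table below is a minimal illustration, and the library's actual list of invisible code points is likely larger:

```python
# Common invisible/zero-width code points used in text watermarking.
ZERO_WIDTH = {
    "\u200b",  # zero-width space
    "\u200c",  # zero-width non-joiner
    "\u200d",  # zero-width joiner
    "\u2060",  # word joiner
    "\ufeff",  # zero-width no-break space (BOM)
}

def strip_invisible(text):
    """Remove zero-width/invisible code points from text."""
    return "".join(ch for ch in text if ch not in ZERO_WIDTH)
```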
Tests
- 1,509 tests passed (54 new)
v0.11.0 — 3x Dictionary Expansion + Composer Fix
What's New
Massive Dictionary Expansion (3x total)
All 9 language dictionaries expanded from 2,281 to 6,881 entries (3.0x growth):
| Language | Before | After | Growth |
|---|---|---|---|
| English | 257 | 1,391 | 5.4x |
| Russian | 291 | 956 | 3.3x |
| Ukrainian | 252 | 780 | 3.1x |
| German | 235 | 724 | 3.1x |
| French | 263 | 599 | 2.3x |
| Spanish | 255 | 613 | 2.4x |
| Italian | 244 | 616 | 2.5x |
| Polish | 244 | 617 | 2.5x |
| Portuguese | 240 | 585 | 2.4x |
All 9 categories expanded: synonyms, bureaucratic words/phrases, AI connectors, sentence starters, colloquial markers, perplexity boosters, split conjunctions, abbreviations.
Bug Fixes
- Composer package name — the root `composer.json` had the incorrect name `ksanyok/texthumanize` (no hyphen). Fixed to `ksanyok/text-humanize`. Also changed `type` from `project` to `library` with proper Packagist metadata.
- TOC dots preservation — table-of-contents leader dots (`...........`) no longer collapse into an ellipsis.
Install
```
# Python
pip install texthumanize

# PHP
composer require ksanyok/text-humanize
```

1,455 tests passing.
v0.10.0 — Grammar, Uniqueness, Health Score, Semantic & Sentence Readability
What's New in v0.10.0
5 New Analysis Modules (all offline, no ML/API)
| Module | Function | Description |
|---|---|---|
| Grammar Checker | `check_grammar()` / `fix_grammar()` | Rule-based grammar checking for 9 languages |
| Uniqueness Score | `uniqueness_score()` / `compare_texts()` | N-gram fingerprinting uniqueness analysis |
| Content Health | `content_health()` | Composite quality: readability + grammar + uniqueness + AI + coherence |
| Semantic Similarity | `semantic_similarity()` | Measures semantic preservation between original and processed text |
| Sentence Readability | `sentence_readability()` | Per-sentence difficulty scoring (easy/medium/hard/very_hard) |
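If `content_health()` blends its five sub-scores linearly, the composite looks roughly like this; the weights below are hypothetical, not the library's actual values:

```python
def content_health(scores, weights=None):
    """Blend sub-scores (each 0-100) into one composite health score.
    Expected keys: readability, grammar, uniqueness, ai, coherence."""
    weights = weights or {"readability": 0.25, "grammar": 0.25,
                          "uniqueness": 0.2, "ai": 0.15, "coherence": 0.15}
    return sum(scores[k] * w for k, w in weights.items())

health = content_health({"readability": 80, "grammar": 90, "uniqueness": 70,
                         "ai": 60, "coherence": 100})
```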
Custom Dictionary API
```python
result = humanize(text, custom_dict={
    "implement": "build",
    "utilize": ["use", "apply", "employ"],  # random pick
})
```

Massively Expanded Dictionaries
All 9 language dictionaries balanced (367-439 entries each):
- FR: 281→397, ES: 275→388, IT: 272→379, PL: 257→368, PT: 256→367
- EN/RU/UK: added perplexity_boosters
Stats
- 28 files changed, +2333 lines
- 1455 tests passing (82 new)
- 17 new public exports
- Zero external dependencies
v0.9.0 — Kirchenbauer Watermark, HTML Diff, Quality Gate, Selective Humanization, Stylometric Anonymizer
What's New
Kirchenbauer Watermark Detector
Green-list z-test based on Kirchenbauer et al. 2023. Uses SHA-256 hash of previous token to partition vocabulary into green/red lists (γ=0.25), computes z-score and p-value. Flags AI watermark at z ≥ 4.0.
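The statistic described above reduces to a binomial z-test on the green-token count; a sketch of that computation, with a simplified stand-in for the SHA-256 vocabulary partition:

```python
import hashlib
import math

def is_green(prev_token, token, gamma=0.25):
    """Pseudo-random green/red partition seeded by the previous token
    (simplified: uses one hash byte as a uniform draw)."""
    h = hashlib.sha256(f"{prev_token}|{token}".encode()).digest()
    return (h[0] / 255.0) < gamma

def green_z(green_count, total, gamma=0.25):
    """z-score of the observed green fraction against the expected gamma;
    a value >= 4.0 would flag a likely watermark."""
    expected = gamma * total
    return (green_count - expected) / math.sqrt(total * gamma * (1 - gamma))
```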
```python
from texthumanize import detect_watermarks
report = detect_watermarks(text)
print(report.kirchenbauer_score, report.kirchenbauer_p_value)
```

HTML Diff Report
`explain()` now supports multiple output formats:

```python
html = explain(result, fmt='html')      # self-contained HTML page
json_str = explain(result, fmt='json')  # RFC 6902 JSON Patch
diff = explain(result, fmt='diff')      # unified diff
```

Quality Gate
CLI + GitHub Action + pre-commit hook to check text for AI artifacts:

```
python -m texthumanize.quality_gate README.md docs/ --ai-threshold 25
```

Selective Humanization
Process only AI-flagged sentences, leaving human text untouched:

```python
result = humanize(text, only_flagged=True)
```

Stylometric Anonymizer
Disguise authorship by transforming text toward a target style:

```python
from texthumanize import anonymize_style
result = anonymize_style(text, target='blogger')
```

Stats
- 1,373 Python tests passing
- 40 new tests for v0.9.0 features
- Ruff lint clean
- 22 files changed, 1,637 additions