Releases: ksanyok/TextHumanize

v0.25.0

02 Mar 00:14

What's Changed

Bug Fixes

  • CRITICAL: Fixed a naturalizer.py regex crash for RU/UK text (~50 patterns with non-capturing groups + backreferences). Previously, the crash caused the entire naturalization stage to be silently skipped.
  • Added thread-safety locks to _ai_cache and _AI_WORDS for multi-threaded usage.
  • Added division-by-zero guards in detector metric calculations.
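The locking pattern described above can be sketched as follows (names mirror the description but are illustrative, not the library's actual internals):

```python
import threading

# Illustrative sketch of guarding a shared cache with a lock; the real
# _ai_cache / _AI_WORDS structures in texthumanize may differ.
_ai_cache = {}
_ai_cache_lock = threading.Lock()

def cache_get_or_compute(key, compute):
    """Return a cached value, computing it under the lock if missing."""
    with _ai_cache_lock:
        if key not in _ai_cache:
            _ai_cache[key] = compute()
        return _ai_cache[key]
```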

Cleanup

  • Removed dead module tokenizer.py (replaced by sentence_split.py).
  • Removed 14 one-off diagnostic scripts, 4 outdated competitive analysis docs, debug artifacts.
  • Synced PHP and JS package versions to 0.25.0.

Documentation

  • Corrected the pipeline stage count from 17 to 20 across all 15+ documentation files.
  • Corrected test counts, LOC claims, and speed benchmarks for consistency.
  • Fixed CHANGELOG date chronology.

CI

  • Raised per-test timeout from 120s to 300s to prevent false failures on slow CI runners.

v0.24.0 — Deep Humanization for EN/RU/UK

01 Mar 00:32

Neural detector:
- Per-language feature normalization (RU/UK char_entropy baseline 4.8-4.9 vs. 4.3 for EN)
- Expanded RU/UK conjunctions, transitions, AI word sets for MLP features 27/33/34
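For reference, char_entropy in the standard Shannon sense can be computed as below; the library's exact feature extraction may normalize differently:

```python
import math
from collections import Counter

def char_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```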

Naturalizer:
- Transition-phrase deletion (22 EN / 23 RU / 23 UK patterns)
- Em-dash injection (_comma_to_dash + _insert_dash_aside)
- Aggressive burstiness (threshold 25→16-20, fragment insertion strategy)
- Light perplexity boost (rhetorical questions for formal profiles)
- Paragraph splitting (5+ sentence paragraphs)
- +30 EN word simplification entries
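As a rough illustration, burstiness here refers to variation in sentence length; a minimal proxy (not the library's implementation) is:

```python
import re
import statistics

def burstiness(text: str) -> float:
    """Standard deviation of sentence lengths in words, a common burstiness
    proxy; low values suggest a uniform, AI-like rhythm."""
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    if len(lengths) < 2:
        return 0.0
    return statistics.pstdev(lengths)
```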

Pipeline:
- Intensity cap raised from 70 to 85; multipliers raised 1.15→1.20 and 1.1→1.15
- Stage 13a: final entropy re-injection post-grammar/coherence

Results (local backend, 3-sentence AI text, intensity=60; AI-probability scores, lower is more human-like):
  EN: 0.920 → 0.372 (human)
  RU: 0.880 → 0.390 (human)
  UK: 0.840 → 0.351 (human)

All 1984 tests pass.

v0.23.0 - OSS LLM Backend, PyPI Publication

28 Feb 22:04

What's New

Backend Parameter

  • New backend parameter: local (default), oss, openai, auto
  • OSS backend: Free AI humanization via amd/gpt-oss-120b-chatbot on HuggingFace Spaces
  • OpenAI backend: Optional paid backend using GPT-4o-mini
  • Auto mode: tries OSS, then OpenAI, then falls back to local
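The auto fallback order can be sketched as a simple chain (backend names from above; the function and dict shape here are hypothetical, not the library's API):

```python
def humanize_auto(text, backends):
    """Try each backend in order (oss -> openai -> local), falling through
    to the next on failure. `backends` maps name -> callable."""
    last_error = None
    for name in ("oss", "openai", "local"):
        fn = backends.get(name)
        if fn is None:
            continue
        try:
            return fn(text)
        except Exception as exc:
            last_error = exc
    raise RuntimeError("all backends failed") from last_error
```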

Install

pip install texthumanize==0.23.0

Usage

from texthumanize import humanize
result = humanize('AI text', backend='oss')

Full Changelog: v0.15.0...v0.23.0

v0.15.0 — Full Audit Closure: 9 New Modules

26 Feb 20:31

What's New

9 New Core Modules

  • ai_backend — Three-tier AI backend: OpenAI API → OSS Gradio model (rate-limited) → built-in rules. New humanize_ai() function.
  • pos_tagger — Rule-based POS tagger for EN (500+ exceptions), RU/UK (200+), DE (300+). Universal tagset.
  • cjk_segmenter — Chinese BiMM (2504 entries), Japanese character-type, Korean space+particle segmentation.
  • syntax_rewriter — 8 sentence-level transforms (active↔passive, clause inversion, enumeration reorder, adverb migration). 150+ irregular verbs.
  • statistical_detector — 35-feature ML classifier for AI text detection. Integrated into detect_ai() with 60/40 weighted merge.
  • word_lm — Word-level unigram/bigram language model for 14 languages. Perplexity, burstiness, naturalness scoring.
  • collocation_engine — PMI-based collocation scoring for context-aware synonym selection. EN ~130, RU ~30, DE ~20 collocations.
  • fingerprint_randomizer — Anti-fingerprint diversification for output variety.
  • benchmark_suite — 6-dimension automated quality benchmarking.
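As an illustration of the PMI-based scoring behind collocation_engine, a minimal adjacent-bigram version (not the library's code) looks like:

```python
import math
from collections import Counter

def pmi_scores(tokens):
    """Pointwise mutual information for adjacent word pairs -- the kind of
    score a collocation engine can use to prefer natural word combinations."""
    total = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total_bigrams = total - 1
    scores = {}
    for (a, b), n in bigrams.items():
        p_ab = n / total_bigrams
        p_a = unigrams[a] / total
        p_b = unigrams[b] / total
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores
```

Pairs that co-occur more often than chance predicts get a positive score, so a candidate synonym can be rejected when it would break a strong collocation.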

Pipeline & Detection

  • Pipeline expanded to 17 stages (added syntax rewriting + anti-fingerprint diversification)
  • detect_ai() now returns combined_score (statistical + heuristic)
  • Fixed NO-OP _reduce_adjacent_repeats() — now actually removes repetitions

Tests

  • 1,696 tests — 92 new, all passing (100% pass rate)

v0.14.0

26 Feb 17:22

v0.14.0 -- Reliability, Analysis Tools & New APIs

New API Functions

  • humanize_sentences() -- per-sentence AI scoring with graduated intensity; only rewrites sentences above a configurable AI probability threshold
  • humanize_variants() -- generates 1-10 humanization variants with different random seeds, sorted by quality
  • humanize_stream() -- generator that yields humanized text chunk-by-chunk with progress tracking
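The seed-per-variant idea behind humanize_variants() can be sketched generically (the function name and parameters below are illustrative, not the library's API):

```python
import random

def generate_variants(text, transform, n=3, seed=0):
    """Produce n variants by running a seeded transform, one RNG per
    variant, so each variant is reproducible from its seed."""
    variants = []
    for i in range(n):
        rng = random.Random(seed + i)
        variants.append(transform(text, rng))
    return variants
```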

New Analysis Modules (zero-dependency, offline)

  • perplexity_v2 -- character-level trigram cross-entropy model with cross_entropy() and perplexity_score() returning naturalness score (0-100) and verdict
  • dict_trainer -- corpus analysis for custom dictionary building with train_from_corpus() and export_custom_dict()
  • plagiarism -- offline originality detection via n-gram fingerprinting with check_originality() and compare_originality()
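A minimal character-trigram cross-entropy in the spirit of perplexity_v2, using add-one smoothing (the real model's smoothing and 0-100 scoring certainly differ):

```python
import math
from collections import Counter

def trigram_cross_entropy(train: str, text: str) -> float:
    """Bits per character of `text` under trigram counts from `train`;
    lower means `text` looks more like the training distribution."""
    tri = Counter(train[i:i + 3] for i in range(len(train) - 2))
    bi = Counter(train[i:i + 2] for i in range(len(train) - 1))
    vocab = len(set(train)) or 1
    total, n = 0.0, 0
    for i in range(len(text) - 2):
        ctx, ch = text[i:i + 2], text[i + 2]
        p = (tri[ctx + ch] + 1) / (bi[ctx] + vocab)  # add-one smoothing
        total += -math.log2(p)
        n += 1
    return total / n if n else 0.0
```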

Pipeline Improvements

  • Error isolation -- each processing stage wrapped in _safe_stage() with try/except; failing stages are skipped gracefully instead of crashing the pipeline
  • Partial rollback -- pipeline records checkpoints after each stage; on validation failure, rolls back stage-by-stage to find the last valid state
  • Pipeline profiling -- stage_timings dict and total_time included in metrics_after for performance analysis
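The error-isolation wrapper amounts to a simple guarded call, a simplified stand-in for _safe_stage():

```python
def safe_stage(stage_fn, text, name="stage"):
    """Run one pipeline stage; on any exception, report and return the
    input unchanged so a single failing stage cannot crash the pipeline."""
    try:
        return stage_fn(text)
    except Exception as exc:
        print(f"[pipeline] skipping {name}: {exc!r}")
        return text
```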

Bug Fixes & Code Quality

  • Fixed adversarial_calibrate intensity parameter (float 0-1 changed to int 0-100 to match API)
  • Added input sanitization: TypeError for non-str, ValueError for >500K chars, early return for empty text
  • Thread-safe lazy loading with double-checked locking on all module loaders
  • Instance-level plugins preventing cross-instance interference
  • Fixed humanize_sentences crash (detect_ai_sentences returns list, not dict)
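Double-checked locking, as applied to the module loaders, follows this standard shape (names are illustrative):

```python
import threading

_model = None
_model_lock = threading.Lock()

def get_model(loader):
    """Lazily initialize a shared resource with double-checked locking:
    the fast path skips the lock once the object exists, and the second
    check under the lock prevents duplicate initialization."""
    global _model
    if _model is None:            # first check, no lock (fast path)
        with _model_lock:
            if _model is None:    # second check, under the lock
                _model = loader()
    return _model
```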

Tests

  • 1,604 tests -- up from 1,560 (44 new tests for all v0.14.0 features)
  • 100% pass rate

v0.13.0 — 16-Stage Pipeline, Grammar & Tone & Readability & Coherence

26 Feb 16:31

4 new pipeline stages (12 to 16):

  • Tone harmonization — match text tone to profile (academic/blog/seo/casual)
  • Readability optimization — split complex sentences, join short ones
  • Grammar correction — fix doubled words, spacing, typos (9 languages)
  • Coherence repair — transitions between paragraphs, diversify openings

Dictionary expansion (~3,600 new entries):

  • EN: +475 | RU: +430 | UK: +337
  • DE/ES/FR/IT/PL/PT: ~235 each
  • AR/ZH/JA/KO/TR: ~205 each
  • Total: ~13,800 entries across 14 languages

Tests: 1,560 (all passing)

Full changelog: https://github.com/ksanyok/TextHumanize/blob/main/CHANGELOG.md

v0.12.0 — 14 Languages, Placeholder Safety, Watermark Pipeline

26 Feb 14:54

What's New

5 New Languages (14 total)

  • Arabic (ar) — 81 bureaucratic, 80 synonyms, 49 AI connectors, 47 abbreviations
  • Chinese Simplified (zh) — 80 bureaucratic, 80 synonyms, 36 AI connectors
  • Japanese (ja) — 60+ per category, keigo to casual register replacements
  • Korean (ko) — 60+ per category, honorific to casual register
  • Turkish (tr) — 60+ per category, Ottoman to modern Turkish

Critical Bug Fixes

  • Placeholder safety — all 6 processing modules now skip placeholder tokens; no more leaked placeholders in output
  • 3-pass restore() — exact match, case-insensitive, orphan cleanup
  • HTML block protection — ul, ol, table, pre, blockquote preserved as single segments
  • Bare domain protection — site.com.ua, portal.kh.ua, example.co.uk etc.
  • Homoglyph fix — removed Cyrillic characters from special homoglyphs table (was corrupting all Cyrillic text)
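The 3-pass restore() strategy can be sketched as below (the __PH0__ token format is assumed for illustration; the library's placeholder scheme may differ):

```python
import re

def restore(text, placeholders):
    """Three-pass restore: exact match, then case-insensitive, then
    orphan cleanup. `placeholders` maps tokens to protected originals."""
    # Pass 1: exact replacement.
    for token, original in placeholders.items():
        text = text.replace(token, original)
    # Pass 2: case-insensitive, in case a stage re-cased the token.
    for token, original in placeholders.items():
        text = re.sub(re.escape(token), lambda m, o=original: o,
                      text, flags=re.IGNORECASE)
    # Pass 3: drop any orphaned placeholder-shaped tokens.
    return re.sub(r"__PH\d+__", "", text, flags=re.IGNORECASE)
```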

Pipeline Improvements

  • Watermark cleaning — automatic first stage (12 stages total), removes zero-width chars, homoglyphs, invisible Unicode
  • Language detection — Arabic/CJK/Turkish script detection added

Tests

  • 1,509 tests passed (54 new)

v0.11.0 — 3x Dictionary Expansion + Composer Fix

20 Feb 12:05

What's New

Massive Dictionary Expansion (3x total)

All 9 language dictionaries expanded from 2,281 to 6,881 entries (3.0x growth):

| Language   | Before | After | Growth |
|------------|-------:|------:|-------:|
| English    |    257 | 1,391 |   5.4x |
| Russian    |    291 |   956 |   3.3x |
| Ukrainian  |    252 |   780 |   3.1x |
| German     |    235 |   724 |   3.1x |
| French     |    263 |   599 |   2.3x |
| Spanish    |    255 |   613 |   2.4x |
| Italian    |    244 |   616 |   2.5x |
| Polish     |    244 |   617 |   2.5x |
| Portuguese |    240 |   585 |   2.4x |

All 9 categories expanded: synonyms, bureaucratic words/phrases, AI connectors, sentence starters, colloquial markers, perplexity boosters, split conjunctions, abbreviations.

Bug Fixes

  • Composer package name — root composer.json had incorrect name ksanyok/texthumanize (no hyphen). Fixed to ksanyok/text-humanize. Also changed type from project to library with proper Packagist metadata.
  • TOC dots preservation — table-of-contents leader dots (...........) no longer collapse into ellipsis.

Install

# Python
pip install texthumanize

# PHP
composer require ksanyok/text-humanize

1,455 tests passing.

v0.10.0 — Grammar, Uniqueness, Health Score, Semantic & Sentence Readability

20 Feb 09:36

What's New in v0.10.0

5 New Analysis Modules (all offline, no ML/API)

| Module | Function | Description |
|---|---|---|
| Grammar Checker | check_grammar() / fix_grammar() | Rule-based grammar checking for 9 languages |
| Uniqueness Score | uniqueness_score() / compare_texts() | N-gram fingerprinting uniqueness analysis |
| Content Health | content_health() | Composite quality: readability + grammar + uniqueness + AI + coherence |
| Semantic Similarity | semantic_similarity() | Measures semantic preservation between original and processed text |
| Sentence Readability | sentence_readability() | Per-sentence difficulty scoring (easy/medium/hard/very_hard) |
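The n-gram fingerprinting behind uniqueness_score() / compare_texts() can be illustrated with a minimal version (not the library's actual scoring):

```python
def ngram_fingerprint(text, n=3):
    """Set of word n-grams used as a lightweight document fingerprint."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def uniqueness(text_a, text_b, n=3):
    """Share of text_a's n-grams NOT found in text_b: 0.0 means an exact
    n-gram overlap (a copy), 1.0 means fully distinct."""
    fa, fb = ngram_fingerprint(text_a, n), ngram_fingerprint(text_b, n)
    if not fa:
        return 1.0
    return 1.0 - len(fa & fb) / len(fa)
```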

Custom Dictionary API

result = humanize(text, custom_dict={
    "implement": "build",
    "utilize": ["use", "apply", "employ"],  # random pick
})

Massively Expanded Dictionaries

All 9 language dictionaries balanced (367-439 entries each):

  • FR: 281→397, ES: 275→388, IT: 272→379, PL: 257→368, PT: 256→367
  • EN/RU/UK: added perplexity_boosters

Stats

  • 28 files changed, +2333 lines
  • 1455 tests passing (82 new)
  • 17 new public exports
  • Zero external dependencies

v0.9.0 — Kirchenbauer Watermark, HTML Diff, Quality Gate, Selective Humanization, Stylometric Anonymizer

20 Feb 09:05

What's New

Kirchenbauer Watermark Detector

Green-list z-test based on Kirchenbauer et al. (2023). Uses a SHA-256 hash of the previous token to partition the vocabulary into green/red lists (γ=0.25), then computes a z-score and p-value. Flags an AI watermark at z ≥ 4.0.

from texthumanize import detect_watermarks
report = detect_watermarks(text)
print(report.kirchenbauer_score, report.kirchenbauer_p_value)
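The green-list test reduces to a standard one-proportion z-score; a simplified sketch follows (the real detector's hashing and tokenization differ):

```python
import hashlib
import math

GAMMA = 0.25  # green-list fraction, as in the detector above

def is_green(prev_token: str, token: str) -> bool:
    """Deterministically assign `token` to the green list using a SHA-256
    hash seeded by the previous token (simplified from the real scheme)."""
    digest = hashlib.sha256(f"{prev_token}|{token}".encode()).digest()
    return digest[0] / 255.0 < GAMMA

def green_zscore(tokens) -> float:
    """z-score of the observed green-token count against the GAMMA * T
    expectation; a large positive z suggests a watermark."""
    trials = len(tokens) - 1
    if trials <= 0:
        return 0.0
    greens = sum(is_green(a, b) for a, b in zip(tokens, tokens[1:]))
    return (greens - GAMMA * trials) / math.sqrt(trials * GAMMA * (1 - GAMMA))
```

Unwatermarked text should hover near z = 0; text generated with a green-list bias accumulates far more green tokens than GAMMA predicts.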

HTML Diff Report

explain() now supports multiple output formats:

html = explain(result, fmt='html')      # self-contained HTML page
json_str = explain(result, fmt='json')  # RFC 6902 JSON Patch
diff = explain(result, fmt='diff')      # unified diff

Quality Gate

CLI + GitHub Action + pre-commit hook to check text for AI artifacts:

python -m texthumanize.quality_gate README.md docs/ --ai-threshold 25

Selective Humanization

Process only AI-flagged sentences, leaving human text untouched:

result = humanize(text, only_flagged=True)

Stylometric Anonymizer

Disguise authorship by transforming text toward a target style:

from texthumanize import anonymize_style
result = anonymize_style(text, target='blogger')

Stats

  • 1,373 Python tests passing
  • 40 new tests for v0.9.0 features
  • Ruff lint clean
  • 22 files changed, 1,637 additions