feat(i18n): locale-aware compress-md with pt-BR support#25

Merged
claudioemmanuel merged 1 commit into main from feat/i18n-compress-md-24 on Apr 7, 2026


Summary

  • Converts compress_md.rs to a module dir with locale.rs + locales/en.rs + locales/pt_br.rs
  • Adds Locale struct with per-locale word lists (articles, fillers, hedges, phrases, conjunctions, ultra_subs)
  • Unicode-correct helpers: is_clean_word, replace_word_boundary (char-stream + to_lowercase), drop_phrase_ci (dual-cursor invariant), clean_mid_orphan_punct
  • lang= config key and --lang <code> CLI flag; resolution: CLI > config > en
  • 28 new i18n integration tests + bench_i18n binary
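
The locale bundle and the `CLI > config > en` resolution order listed above could look roughly like the sketch below. The field names come from the summary, but the field types, the `resolve_locale` function, the word-list contents, and the fallback to English for unknown codes are all assumptions made for illustration, not the code merged in this PR.

```rust
/// Hypothetical shape of a per-locale word-list bundle. Field names follow
/// the PR summary; the types are assumptions of this sketch.
#[derive(Debug)]
pub struct Locale {
    pub code: &'static str,
    pub articles: &'static [&'static str],
    pub fillers: &'static [&'static str],
    pub hedges: &'static [&'static str],
    pub phrases: &'static [&'static str],
    pub conjunctions: &'static [&'static str],
    /// (long form, short form) pairs, applied only in Ultra mode.
    pub ultra_subs: &'static [(&'static str, &'static str)],
}

// Illustrative word lists only; the real lists live in locales/en.rs
// and locales/pt_br.rs and are certainly longer.
pub static EN: Locale = Locale {
    code: "en",
    articles: &["a", "an", "the"],
    fillers: &["basically", "just"],
    hedges: &["perhaps"],
    phrases: &["in general"],
    conjunctions: &["and", "or"],
    ultra_subs: &[("repository", "repo")],
};

pub static PT_BR: Locale = Locale {
    code: "pt-BR",
    articles: &["o", "a", "os", "as", "um", "uma"],
    fillers: &["basicamente", "apenas"],
    hedges: &["talvez"],
    phrases: &["de modo geral"],
    conjunctions: &["e", "ou"],
    ultra_subs: &[("repositório", "repo"), ("para", "p/"), ("função", "fn")],
};

/// Resolve the active locale: CLI flag first, then the config key, then "en".
/// Unknown codes fall back to English (an assumption of this sketch).
pub fn resolve_locale(cli: Option<&str>, config: Option<&str>) -> &'static Locale {
    let code = cli.or(config).unwrap_or("en");
    // Accept "pt-BR", "pt_BR", "PT-br", ... by normalising case and separator.
    match code.to_lowercase().replace('_', "-").as_str() {
        "pt-br" => &PT_BR,
        _ => &EN,
    }
}
```

With this shape, `resolve_locale(Some("pt-BR"), Some("en"))` picks pt-BR because the CLI value wins, and `resolve_locale(None, None)` falls through to English.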

Benchmarks

| Locale | Mode  | Reduction | Latency |
|--------|-------|-----------|---------|
| EN     | Full  | −14%      | 169µs   |
| pt-BR  | Full  | −13%      | 254µs   |
| pt-BR  | Ultra | −17%      |         |

PT-BR overhead: ~1.5× vs EN, still sub-millisecond.
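
For reference, the two quantities in the table can be measured along the lines of the sketch below. This is not the `bench_i18n` binary: `reduction_pct` and `mean_latency` are hypothetical helpers, and measuring reduction in UTF-8 bytes is an assumption (the PR does not say whether bytes, characters, or tokens are counted).

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Percentage of the input removed by compression, measured in UTF-8 bytes
/// (an assumption; the PR does not state the unit).
pub fn reduction_pct(input: &str, output: &str) -> f64 {
    if input.is_empty() {
        return 0.0;
    }
    let saved = input.len() as f64 - output.len() as f64;
    100.0 * saved / input.len() as f64
}

/// Mean wall-clock time of one call to `f`, averaged over `iters` runs.
/// `iters` must be greater than zero.
pub fn mean_latency<F: FnMut() -> String>(mut f: F, iters: u32) -> Duration {
    let start = Instant::now();
    for _ in 0..iters {
        // black_box keeps the optimiser from discarding the call.
        black_box(f());
    }
    start.elapsed() / iters
}
```

The ~1.5× figure is consistent with the table: 254µs / 169µs ≈ 1.50.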

Before / after (pt-BR Full):

IN:    O sistema é basicamente apenas uma ferramenta para configurar o repositório. De modo geral, você pode considerar que a função principal inicializa a documentação do projeto.
Full:  sistema é ferramenta para configurar repositório. função principal inicializa documentação projeto.
Ultra: sistema é ferramenta p/ configurar repo. fn principal inicializa docs projeto.
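
As a rough illustration of the single-word part of the Full pass, the sketch below drops articles and fillers from the first sentence of the example. `drop_words` is a hypothetical helper, not a function from this PR; it ignores phrase dropping and orphan-punctuation cleanup, and it loses any punctuation attached to a dropped token.

```rust
/// Drop every whitespace-separated token whose lowercased,
/// punctuation-trimmed form appears in `drop`.
pub fn drop_words(text: &str, drop: &[&str]) -> String {
    text.split_whitespace()
        .filter(|tok| {
            // Trim using Unicode-aware classification so that "é" or
            // "repositório." are handled as words, not as punctuation.
            let core = tok
                .trim_matches(|c: char| !c.is_alphanumeric())
                .to_lowercase();
            core.is_empty() || !drop.contains(&core.as_str())
        })
        .collect::<Vec<_>>()
        .join(" ")
}
```

Called with the drop list `["o", "a", "os", "as", "um", "uma", "basicamente", "apenas"]` on `O sistema é basicamente apenas uma ferramenta para configurar o repositório.`, it yields `sistema é ferramenta para configurar repositório.`, matching the first sentence of the Full output above.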

Test plan

  • cargo test — 315 tests, 38 suites, 0 failures
  • cargo build --release — clean
  • cargo run --release --bin bench_i18n — ratio + latency verified
  • All existing EN tests pass unchanged

Closes #24
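
To show why the word-boundary helper needs to be Unicode-aware, here is a minimal char-stream sketch in the spirit of `replace_word_boundary`. The name `replace_word_ci`, its signature, and the `flush_word` helper are assumptions; the merged implementation may differ. An ASCII-only boundary check would treat `ã` as a non-word character and see a standalone article `o` at the end of `documentação`; classifying characters with `char::is_alphanumeric` avoids that.

```rust
/// Replace whole-word, case-insensitive occurrences of `from` with `to`,
/// walking the text as a stream of chars rather than bytes.
pub fn replace_word_ci(text: &str, from: &str, to: &str) -> String {
    let needle = from.to_lowercase();
    let mut out = String::with_capacity(text.len());
    let mut word = String::new();
    for ch in text.chars() {
        if ch.is_alphanumeric() {
            // Still inside a word: keep accumulating.
            word.push(ch);
        } else {
            // Word ended: emit it (or its replacement), then the separator.
            flush_word(&mut out, &mut word, &needle, to);
            out.push(ch);
        }
    }
    flush_word(&mut out, &mut word, &needle, to);
    out
}

/// Emit the buffered word, substituting `to` when it matches `needle`
/// after lowercasing, and reset the buffer.
fn flush_word(out: &mut String, word: &mut String, needle: &str, to: &str) {
    if word.is_empty() {
        return;
    }
    if word.to_lowercase() == needle {
        out.push_str(to);
    } else {
        out.push_str(word.as_str());
    }
    word.clear();
}
```

For example, `replace_word_ci("Função principal; funçãozinha não.", "função", "fn")` rewrites only the first word and leaves `funçãozinha` and `não` intact.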

- Convert compress_md.rs to module dir (mod.rs + locale.rs + locales/)
- Add Locale struct with per-locale word lists (articles, fillers, hedges,
  phrases, conjunctions, ultra_subs); EN and pt-BR ship in v1
- Unicode-correct helpers: is_clean_word (char iter), replace_word_boundary
  (char-stream + to_lowercase), drop_phrase_ci (dual-cursor invariant),
  clean_mid_orphan_punct (post-phrase-drop cleanup)
- Wire lang= config key and --lang CLI flag; resolution: CLI > config > en
- Add 28 i18n integration tests (unit, feature, EN regression, cross-locale
  contract) + bench_i18n binary (ratio + latency)
- PT-BR overhead: ~1.5x vs EN, still sub-millisecond per call
- Update README with i18n benchmark table and before/after example

Closes #24
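
One possible reading of the "dual-cursor invariant" mentioned for `drop_phrase_ci` is sketched below: lowercasing can change a string's byte length, so offsets found in a lowercased copy cannot be applied to the original. The sketch keeps one cursor in the lowercased phrase and one byte cursor in the original text and advances them together. The function names and this interpretation are assumptions, not the merged code.

```rust
/// Remove case-insensitive occurrences of `phrase` that start and end on
/// word boundaries. Leftover punctuation is not cleaned up here.
pub fn drop_phrase_ci_sketch(text: &str, phrase: &str) -> String {
    let needle: Vec<char> = phrase.chars().flat_map(char::to_lowercase).collect();
    if needle.is_empty() {
        return text.to_string();
    }
    let mut out = String::with_capacity(text.len());
    let mut pos = 0; // byte cursor into the original text
    let mut prev_is_word = false;
    while pos < text.len() {
        let rest = &text[pos..];
        if !prev_is_word {
            if let Some(len) = match_len(rest, &needle) {
                pos += len; // skip the phrase in the original text
                continue;
            }
        }
        let ch = rest.chars().next().unwrap();
        out.push(ch);
        prev_is_word = ch.is_alphanumeric();
        pos += ch.len_utf8();
    }
    out
}

/// If `rest` starts with `needle` (compared char by char after lowercasing)
/// and the match ends on a word boundary, return the matched length in
/// bytes of the ORIGINAL text.
fn match_len(rest: &str, needle: &[char]) -> Option<usize> {
    let mut n = 0; // cursor into the lowercased needle
    let mut bytes = 0; // cursor into the original text
    for ch in rest.chars() {
        if n == needle.len() {
            return if ch.is_alphanumeric() { None } else { Some(bytes) };
        }
        for lc in ch.to_lowercase() {
            if n >= needle.len() || needle[n] != lc {
                return None;
            }
            n += 1;
        }
        bytes += ch.len_utf8();
    }
    if n == needle.len() { Some(bytes) } else { None }
}
```

Dropping `de modo geral` from `De modo geral, você pode.` leaves `, você pode.`; the orphaned comma is the kind of residue that a post-pass such as `clean_mid_orphan_punct` would presumably handle.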
@claudioemmanuel claudioemmanuel merged commit 2e392ec into main Apr 7, 2026
4 checks passed
@claudioemmanuel claudioemmanuel deleted the feat/i18n-compress-md-24 branch April 7, 2026 23:00

Linked issue: i18n: compress-md word lists and matching are English-only, break pt-BR and other languages
