feat(i18n): locale-aware compress-md with pt-BR support (#25)

claudioemmanuel · web-flow · commit 2e392ece4895 · 2026-04-07T19:59:59.000-03:00
- Convert compress_md.rs to module dir (mod.rs + locale.rs + locales/)
- Add Locale struct with per-locale word lists (articles, fillers, hedges, phrases, conjunctions, ultra_subs); EN and pt-BR ship in v1
- Unicode-correct helpers: is_clean_word (char iter), replace_word_boundary (char-stream + to_lowercase), drop_phrase_ci (dual-cursor invariant), clean_mid_orphan_punct (post-phrase-drop cleanup)
- Wire lang= config key and --lang CLI flag; resolution: CLI > config > en
- Add 28 i18n integration tests (unit, feature, EN regression, cross-locale contract) + bench_i18n binary (ratio + latency)
- PT-BR overhead: ~1.5x vs EN, still sub-millisecond per call
- Update README with i18n benchmark table and before/after example

Closes #24
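
The `CLI > config > en` resolution order from the commit message can be sketched as follows. This is a minimal illustration, not code from this commit: `resolve_lang` and its two `Option<&str>` parameters are hypothetical names, assuming the `--lang` flag and the `lang=` key are each parsed into an optional string before resolution.

```rust
/// Hypothetical helper (not from the commit): pick the locale code using
/// the precedence described above: CLI flag, then config key, then "en".
fn resolve_lang<'a>(cli: Option<&'a str>, config: Option<&'a str>) -> &'a str {
    cli.or(config).unwrap_or("en")
}

fn main() {
    // --lang wins over lang= in the config file.
    println!("{}", resolve_lang(Some("pt"), Some("en")));
    // No CLI flag: the config value is used.
    println!("{}", resolve_lang(None, Some("pt")));
    // Neither set: fall back to English.
    println!("{}", resolve_lang(None, None));
}
```

The resolved string would then be handed to `Locale::from_code`, which itself falls back to EN for unknown codes.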
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -37,17 +37,17 @@ No Makefile — all build tooling is Cargo-native.
 
 Cross-call awareness across 16 recent invocations:
 - **cache.rs** — tracks seen outputs, file paths, errors from Read/Glob/Grep/Bash results
-- **redundancy.rs** — two-path dedup: exact FNV-1a hash (fast), then fuzzy bottom-k MinHash trigram Jaccard ≥0.85 (whitespace/timestamp changes don't break match). Emits `[squeez: identical to ...]` or `[squeez: ~P% similar to ...]`
+- **redundancy.rs** — two-path dedup: exact FNV-1a hash (fast), then fuzzy bottom-k MinHash trigram Jaccard ≥0.85 (whitespace/timestamp changes don't break match). Emits `[squeez: identical to ...]`  `[squeez: ~P% similar to ...]`
 - **summarize.rs** — triggered at >500 lines; benign outputs (no error markers) get 2× threshold (1000 lines). Produces ≤40-line dense summary (errors, files, test status, verbatim tail)
-- **intensity.rs** — truly adaptive: **Full** (×0.6) when used < 80% of budget, **Ultra** (×0.3) when ≥80%. `[adaptive: Full]` or `[adaptive: Ultra]` in header
+- **intensity.rs** — truly adaptive: **Full** (×0.6) when used < 80% of budget, **Ultra** (×0.3) when ≥80%. `[adaptive: Full]`  `[adaptive: Ultra]` in header
 - **hash.rs** — FNV-1a-64 + `shingle_minhash()` (bottom-k=96, whitespace-token trigrams) + `jaccard()` (sorted-merge O(n+m))
 
 ### Key files
 
 | File | Role |
 |------|------|
 | `src/commands/wrap.rs` | Main orchestrator: spawn subprocess, capture, compress, inject header |
-| `src/commands/compress_md.rs` | Markdown compressor: preserves code blocks, URLs, tables; compresses prose |
+| `src/commands/compress_md/` | Markdown compressor module: `mod.rs` (core logic), `locale.rs` (Locale struct + `from_code`), `locales/en.rs` + `locales/pt_br.rs` (word lists). Exposes `compress_text` (EN default) and `compress_text_with_locale`. Select locale via `lang=` in config or `--lang` CLI flag. |
 | `src/commands/init.rs` | Session start: finalize previous session memory, inject persona prompt |
 | `src/commands/benchmark.rs` | 19-scenario reproducible benchmark suite |
 | `src/config.rs` | Config struct + `~/.claude/squeez/config.ini` parser; all fields have defaults |
@@ -62,7 +62,7 @@ Cross-call awareness across 16 recent invocations:
 
 ### Tests
 
-35 integration test files under `tests/`. Each strategy and handler has dedicated test file. Notable new ones: `test_redundancy_shingle.rs` (8 fuzzy-match tests), `test_mcp_server.rs` (10 JSON-RPC tests). Benchmark fixtures live in `bench/fixtures/`; capture new ones w/ `bash bench/capture.sh`.
+35 integration test files under `tests/`. Each strategy and handler has dedicated test file. Notable new ones: `test_redundancy_shingle.rs` (8 fuzzy-match tests), `test_mcp_server.rs` (10 JSON-RPC tests). Benchmark fixtures live in `bench/fixtures/`capture new ones w/ `bash bench/capture.sh`.
 
 ### Release & distribution
 
diff --git a/Cargo.toml b/Cargo.toml
@@ -8,11 +8,16 @@ license = "MIT"
 keywords = ["claude-code", "token", "compression", "llm", "cli"]
 categories = ["command-line-utilities", "development-tools"]
 readme = "README.md"
+autobenches = false
 
 [[bin]]
 name = "squeez"
 path = "src/main.rs"
 
+[[bin]]
+name = "bench_i18n"
+path = "benches/bench_i18n.rs"
+
 [lib]
 name = "squeez"
 path = "src/lib.rs"
diff --git a/README.md b/README.md
@@ -80,7 +80,7 @@ squeez update --insecure  # skip checksum (not recommended)
 | **MCP server** | `squeez mcp` runs a JSON-RPC 2.0 server over stdio exposing 6 read-only tools so any MCP-compatible LLM can query session memory directly. Hand-rolled, no `mcp.server` dependency. |
 | **Auto-teach payload** | `squeez protocol` (or the `squeez_protocol` MCP tool) prints a 2.4 KB self-describing payload — the LLM learns squeez's markers and protocol on first call. |
 | **Caveman persona** | Injects an ultra-terse prompt at session start so the model responds with fewer tokens. |
-| **Memory-file compression** | `squeez compress-md` compresses CLAUDE.md / AGENTS.md / copilot-instructions.md in-place — pure Rust, zero LLM. |
+| **Memory-file compression** | `squeez compress-md` compresses CLAUDE.md / AGENTS.md / copilot-instructions.md in-place — pure Rust, zero LLM. i18n-aware: set `lang = pt` (or `--lang pt`) for pt-BR article/filler/phrase dropping and Unicode-correct matching. |
 | **Session memory** | On `SessionStart`, injects a summary of the previous session (files touched, errors, test results, git events). Summaries carry temporal validity (`valid_from`/`valid_to`) so invalidated entries age from `valid_to`. |
 | **Token tracking** | Every `PostToolUse` result (Bash, Read, Grep, Glob) feeds a `SessionContext` so squeez knows what the agent has already seen. |
 
@@ -123,6 +123,28 @@ Measured on macOS (Apple Silicon). Token count = `chars / 4` (matches Claude's ~
 | Latency p50 (filter mode) | **< 0.3 ms** |
 | Latency p95 (incl. wrap/summarize) | **64 ms** |
 
+### compress-md i18n — EN vs pt-BR (Apple Silicon, release build)
+
+| Locale | Mode | Before | After | Reduction | Latency |
+|--------|------|--------|-------|-----------|---------|
+| EN | Full | 514 tk | 445 tk | **−14%** | 170 µs |
+| EN | Ultra | 514 tk | 434 tk | **−16%** | — |
+| pt-BR | Full | 558 tk | 488 tk | **−13%** | 256 µs |
+| pt-BR | Ultra | 558 tk | 468 tk | **−17%** | — |
+
+PT-BR is **~1.5× slower** than EN due to Unicode case folding — still sub-millisecond per call. Both locales produce `result.safe = true`. Run `cargo run --release --bin bench_i18n` to reproduce.
+
+**Before / after — pt-BR Full mode:**
+```
+IN:    O sistema é basicamente apenas uma ferramenta para configurar o repositório.
+       De modo geral, você pode considerar que a função principal inicializa a documentação do projeto.
+
+Full:  sistema é ferramenta para configurar repositório. função principal inicializa documentação projeto.
+Ultra: sistema é ferramenta p/ configurar repo. fn principal inicializa docs projeto.
+```
+
+Drops: articles (`o`, `a`, `do`), fillers (`basicamente`, `apenas`), phrases (`De modo geral`, `você pode considerar que`). Ultra adds abbreviations (`repositório→repo`, `função→fn`, `documentação→docs`, `para→p/`).
+
 ### Estimated cost savings — Claude Sonnet 4.6 · $3.00 / MTok input
 
 | Usage | Baseline / month | Saved / month |
@@ -177,8 +199,9 @@ docker logs mycontainer 2>&1 | squeez filter docker
 Pure-Rust, zero-LLM compressor for markdown files. Preserves code blocks, inline code, URLs, headings, file paths, and tables. Compresses prose only. Always writes a backup at `<stem>.original.md`.
 
 ```bash
-squeez compress-md CLAUDE.md             # Full mode
+squeez compress-md CLAUDE.md             # Full mode (English default)
 squeez compress-md --ultra CLAUDE.md    # + abbreviations (with→w/, fn, cfg, etc.)
+squeez compress-md --lang pt CLAUDE.md  # pt-BR locale (articles, fillers, phrases)
 squeez compress-md --dry-run CLAUDE.md  # preview, no write
 squeez compress-md --all                # compress all known locations automatically
 ```
@@ -264,6 +287,7 @@ memory_retention_days = 30
 # ── Output / persona ───────────────────────────────────────────
 persona          = ultra    # off | lite | full | ultra
 auto_compress_md = true     # run compress-md on every session start
+lang             = en       # compress-md locale: en | pt (pt-BR) — more languages extensible
 ```
 
 ### Adaptive intensity — Full / Ultra split
@@ -392,7 +416,7 @@ Requires Rust stable. Windows requires Git Bash.
 git clone https://github.com/claudioemmanuel/squeez.git
 cd squeez
 
-cargo test                  # run all tests
+cargo test                  # run all tests (315 tests, 38 suites)
 cargo build --release       # build release binary
 
 bash bench/run.sh           # filter-mode benchmark (14 fixtures)
diff --git a/bench/report.md b/bench/report.md
@@ -1,17 +1,17 @@
 FIXTURE                               BEFORE    AFTER  REDUCTION  LATENCY STATUS
 ──────────────────────────────────────────────────────────────────────────────
-docker_logs.txt                         665tk     186tk        73%       4ms  ✅
+docker_logs.txt                         665tk     186tk        73%       3ms  ✅
 env_dump.txt                            441tk     287tk        35%       3ms  ✅
 find_deep.txt                           424tk     134tk        69%       3ms  ✅
 git_copilot_session.txt                 639tk     421tk        35%       3ms  ✅
 git_diff.txt                            502tk     317tk        37%       3ms  ✅
-git_log_200.txt                        2667tk     819tk        70%       4ms  ✅
+git_log_200.txt                        2667tk     819tk        70%       3ms  ✅
 git_status.txt                           50tk      16tk        68%       3ms  ✅
-intensity_budget80.txt                 4418tk      52tk        99%       4ms  ✅
+intensity_budget80.txt                 4418tk      52tk        99%       3ms  ✅
 ls_la.txt                              1782tk     886tk        51%       3ms  ✅
 mdcompress_claude_md.txt                316tk     246tk        23%       3ms  ✅
 mdcompress_prose.txt                    187tk     138tk        27%       3ms  ✅
-npm_install.txt                         524tk     231tk        56%       4ms  ✅
+npm_install.txt                         524tk     231tk        56%       3ms  ✅
 ps_aux.txt                            40373tk    2352tk        95%       6ms  ✅
 summarize_huge.txt                    82257tk      47tk       100%      12ms  ✅
 
diff --git a/benches/bench_i18n.rs b/benches/bench_i18n.rs
@@ -0,0 +1,75 @@
+// Benchmark: EN vs PT-BR compression speed + ratio.
+// Run with: cargo run --release --bin bench_i18n
+use std::hint::black_box;
+use squeez::commands::compress_md::{compress_text_with_locale, Locale, Mode};
+
+fn tokens(bytes: usize) -> usize { bytes / 4 }
+
+fn print_ratio(label: &str, input: &str, locale: &'static Locale, mode: Mode) {
+    let r = compress_text_with_locale(input, mode, locale);
+    let before_tk = tokens(r.stats.orig_bytes);
+    let after_tk  = tokens(r.stats.new_bytes);
+    let pct = 100usize.saturating_sub(after_tk * 100 / before_tk.max(1));
+    println!("  {:<24} {:>6}tk → {:>5}tk  -{:>2}%  safe={}", label, before_tk, after_tk, pct, r.safe);
+}
+
+fn main() {
+    let en = Locale::from_code("en");
+    let pt = Locale::from_code("pt-BR");
+
+    let en_input = include_str!("fixtures/en_prose.txt");
+    let pt_input = include_str!("fixtures/pt_br_prose.txt");
+
+    println!("── Compression ratio ────────────────────────────────────────");
+    print_ratio("EN prose  / Full",   en_input, en, Mode::Full);
+    print_ratio("EN prose  / Ultra",  en_input, en, Mode::Ultra);
+    print_ratio("PT-BR prose / Full", pt_input, pt, Mode::Full);
+    print_ratio("PT-BR prose / Ultra",pt_input, pt, Mode::Ultra);
+    println!();
+
+    let iters = 1000u32;
+
+    println!("── Latency (×{iters} iterations) ──────────────────────────────────");
+    let start = std::time::Instant::now();
+    for _ in 0..iters {
+        black_box(compress_text_with_locale(black_box(en_input), Mode::Full, en));
+    }
+    let en_ms = start.elapsed().as_millis();
+
+    let start = std::time::Instant::now();
+    for _ in 0..iters {
+        black_box(compress_text_with_locale(black_box(pt_input), Mode::Full, pt));
+    }
+    let pt_ms = start.elapsed().as_millis();
+
+    println!("  EN Full:    {}ms  ({:.0}µs/call)", en_ms, en_ms as f64 * 1000.0 / iters as f64);
+    println!("  PT-BR Full: {}ms  ({:.0}µs/call)  {:.2}× vs EN", pt_ms,
+        pt_ms as f64 * 1000.0 / iters as f64,
+        pt_ms as f64 / en_ms.max(1) as f64);
+
+    assert!(pt_ms < en_ms * 3 + 100, "PT-BR too slow vs EN: {}ms vs {}ms", pt_ms, en_ms);
+
+    println!();
+    println!("── Before / after example (pt-BR) ────────────────────────────");
+    let demo = "O sistema é basicamente apenas uma ferramenta para configurar o repositório. \
+                De modo geral, você pode considerar que a função principal inicializa a documentação do projeto.";
+    let rf = compress_text_with_locale(demo, Mode::Full,  pt);
+    let ru = compress_text_with_locale(demo, Mode::Ultra, pt);
+    println!("  IN:    {}", demo);
+    println!("  Full:  {}", rf.output.trim());
+    println!("  Ultra: {}", ru.output.trim());
+}
+
+#[allow(dead_code)]
+fn show_example() {
+    let pt = Locale::from_code("pt-BR");
+    let inputs = [
+        ("O sistema é basicamente apenas uma ferramenta para configurar o repositório. De modo geral, você pode considerar que a função principal inicializa a documentação do projeto.", "pt-BR Full"),
+        ("O sistema é basicamente apenas uma ferramenta para configurar o repositório. De modo geral, você pode considerar que a função principal inicializa a documentação do projeto.", "pt-BR Ultra"),
+    ];
+    for (input, label) in inputs {
+        let mode = if label.contains("Ultra") { Mode::Ultra } else { Mode::Full };
+        let r = compress_text_with_locale(input, mode, pt);
+        println!("{}: {}", label, r.output.trim());
+    }
+}
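
The percentages in the README table follow from the integer arithmetic in `print_ratio`. The sketch below reruns that formula on the token counts copied from the table; `reduction_pct` is a hypothetical name for the expression used inside `print_ratio`. Because the remaining share is truncated by integer division, the reported reduction rounds up: 514 tk to 445 tk is a 13.4% reduction but prints as 14%.

```rust
/// Reduction percentage as `print_ratio` computes it. Integer division
/// truncates the remaining share, so the reported reduction rounds up.
fn reduction_pct(before_tk: usize, after_tk: usize) -> usize {
    100usize.saturating_sub(after_tk * 100 / before_tk.max(1))
}

fn main() {
    // (label, before, after) token counts copied from the README table.
    let rows = [
        ("EN Full", 514, 445),
        ("EN Ultra", 514, 434),
        ("pt-BR Full", 558, 488),
        ("pt-BR Ultra", 558, 468),
    ];
    for (label, before, after) in rows {
        println!("{label:<12} -{}%", reduction_pct(before, after));
    }
    // Per-call latency ratio from the same table: 256 µs vs 170 µs.
    println!("pt-BR / EN latency: {:.2}x", 256.0 / 170.0);
}
```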
diff --git a/benches/fixtures/en_prose.txt b/benches/fixtures/en_prose.txt
@@ -0,0 +1,27 @@
+# Project Guide
+
+This document describes the architecture of the system. The project is basically a tool for compressing markdown files. It really just provides a simple way to remove the filler words and unnecessary phrases from prose content.
+
+## Overview
+
+The compression pipeline works in several stages. First, the parser identifies code blocks, URLs, headings, and tables that must be preserved verbatim. Then the prose content is processed through a series of filters that drop articles, fillers, and hedges.
+
+## Configuration
+
+You can configure the behavior of the tool with the configuration file. The configuration is loaded from a standard location in the user home directory. Each parameter has a sensible default, so you only need to override the ones that you really care about.
+
+## Usage
+
+To use the tool, simply pass the path to the file that you want to compress. The tool will read the file, run the compression pipeline, and write the output back to the same file. A backup of the original file is created automatically.
+
+Of course, if you just want to preview the output without modifying the file, you can use the dry-run flag. This will print the compressed output to standard output instead. I'd be happy to help you configure the tool if you have any questions.
+
+## Architecture
+
+The architecture of the tool is really quite simple. Each stage of the pipeline is implemented as a separate function that takes the input and returns the transformed output. The stages are composed together in a fixed order, and the result is written to the destination.
+
+In general, you should not need to modify the architecture of the tool. The default configuration is suitable for most use cases. However, if you have special requirements, you can extend the tool by adding new stages to the pipeline.
+
+## Performance
+
+The tool is designed to be fast. It processes a typical markdown file in a few milliseconds. The compression ratio depends on the content, but you can typically expect a reduction of around fifty percent on prose-heavy documents.
diff --git a/benches/fixtures/pt_br_prose.txt b/benches/fixtures/pt_br_prose.txt
@@ -0,0 +1,27 @@
+# Guia do Projeto
+
+Este documento descreve a arquitetura do sistema. O projeto é basicamente uma ferramenta para comprimir arquivos markdown. Ele realmente apenas fornece uma maneira simples de remover as palavras de preenchimento e frases desnecessárias do conteúdo em prosa.
+
+## Visão Geral
+
+O pipeline de compressão funciona em várias etapas. Primeiro, o parser identifica os blocos de código, URLs, títulos e tabelas que devem ser preservados literalmente. Em seguida, o conteúdo em prosa é processado através de uma série de filtros que removem os artigos, preenchimentos e atenuações.
+
+## Configuração
+
+Você pode configurar o comportamento da ferramenta com o arquivo de configuração. A configuração é carregada de um local padrão no diretório home do usuário. Cada parâmetro tem um padrão razoável, então você só precisa substituir aqueles com os quais realmente se importa.
+
+## Uso
+
+Para usar a ferramenta, simplesmente passe o caminho do arquivo que você quer comprimir. A ferramenta lerá o arquivo, executará o pipeline de compressão e escreverá a saída de volta no mesmo arquivo. Um backup do arquivo original é criado automaticamente.
+
+Claro que, se você apenas quer visualizar a saída sem modificar o arquivo, você pode usar a flag dry-run. Isso vai imprimir a saída comprimida na saída padrão. Fico feliz em ajudar você a configurar a ferramenta se tiver dúvidas.
+
+## Arquitetura
+
+A arquitetura da ferramenta é realmente bem simples. Cada estágio do pipeline é implementado como uma função separada que recebe a entrada e retorna a saída transformada. Os estágios são compostos em uma ordem fixa, e o resultado é escrito no destino.
+
+De modo geral, você não precisa modificar a arquitetura da ferramenta. A configuração padrão é adequada para a maioria dos casos. Porém, se você tiver requisitos especiais, pode estender a ferramenta adicionando novos estágios ao pipeline.
+
+## Desempenho
+
+A ferramenta é projetada para ser rápida. Ela processa um arquivo markdown típico em poucos milissegundos. A taxa de compressão depende do conteúdo, mas você pode tipicamente esperar uma redução de cerca de cinquenta por cento em documentos com muita prosa.
diff --git a/src/commands/compress_md/locale.rs b/src/commands/compress_md/locale.rs
@@ -0,0 +1,22 @@
+use crate::commands::compress_md::locales;
+
+#[derive(Copy, Clone, Debug)]
+pub struct Locale {
+    #[allow(dead_code)]
+    pub code:         &'static str,
+    pub fillers:      &'static [&'static str],
+    pub articles:     &'static [&'static str],
+    pub phrases:      &'static [&'static str],
+    pub hedges:       &'static [&'static str],
+    pub conjunctions: &'static [&'static str],
+    pub ultra_subs:   &'static [(&'static str, &'static str)],
+}
+
+impl Locale {
+    pub fn from_code(code: &str) -> &'static Locale {
+        match code {
+            "pt" | "pt-BR" | "pt_BR" | "pt-br" => &locales::PT_BR,
+            _ => &locales::EN,
+        }
+    }
+}
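
`from_code` above matches four exact spellings, so a code such as `PT-BR` or `pt_br` falls through to EN unless the caller normalizes it first (the call sites are not shown in this excerpt). One possible way to accept those variants is to normalize before matching. This is a sketch only: `normalize_code` and `canonical_code` are hypothetical helpers, not part of this commit, and the sketch returns a code string rather than a `&'static Locale` to stay self-contained.

```rust
/// Hypothetical helper (not in the commit): lowercase the code and unify
/// the separator so "PT-BR", "pt_BR" and "pt-br" compare equal.
fn normalize_code(code: &str) -> String {
    code.trim().to_ascii_lowercase().replace('_', "-")
}

/// Returns a canonical locale code, mirroring the fallback-to-"en"
/// behavior of `Locale::from_code`.
fn canonical_code(code: &str) -> &'static str {
    match normalize_code(code).as_str() {
        "pt" | "pt-br" => "pt-BR",
        _ => "en",
    }
}

fn main() {
    for code in ["pt", "PT-BR", "pt_br", "en", "fr"] {
        println!("{code} -> {}", canonical_code(code));
    }
}
```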
diff --git a/src/commands/compress_md/locales/en.rs b/src/commands/compress_md/locales/en.rs
@@ -0,0 +1,37 @@
+use crate::commands::compress_md::locale::Locale;
+
+pub static EN: Locale = Locale {
+    code: "en",
+    fillers:      &["just","really","basically","actually","simply","sure","certainly"],
+    articles:     &["the","a","an"],
+    phrases:      &[
+        "of course",
+        "i'd be happy to",
+        "let me ",
+        "i'll help you",
+        "i would like to",
+        "please note that",
+        "it might be worth",
+        "you could consider",
+        "in general",
+        "as a rule",
+    ],
+    hedges:       &["perhaps","maybe"],
+    conjunctions: &[" and"," or"," but"," so"],
+    ultra_subs:   &[
+        ("without","w/o"),
+        ("with","w/"),
+        ("because","b/c"),
+        ("function","fn"),
+        ("parameter","param"),
+        ("arguments","args"),
+        ("argument","arg"),
+        ("configuration","config"),
+        ("documentation","docs"),
+        ("directory","dir"),
+        ("repository","repo"),
+        ("between","btw"),
+        ("versus","vs"),
+        ("approximately","~"),
+    ],
+};
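
The commit message describes `replace_word_boundary` as a char-stream pass with `to_lowercase`, but `mod.rs` is not shown in this excerpt. The sketch below illustrates the general idea of whole-word, case-insensitive substitution over an `ultra_subs`-style table. `apply_subs` is a hypothetical function and should not be read as the commit's implementation.

```rust
/// Hypothetical sketch (not the commit's `replace_word_boundary`):
/// replace whole words case-insensitively using an `ultra_subs`-style table.
fn apply_subs(text: &str, subs: &[(&str, &str)]) -> String {
    // Flush the pending word, substituting it if its lowercase form matches.
    fn flush(word: &mut String, out: &mut String, subs: &[(&str, &str)]) {
        if word.is_empty() {
            return;
        }
        let lower = word.to_lowercase();
        match subs.iter().find(|(from, _)| *from == lower) {
            Some((_, to)) => out.push_str(to),
            None => out.push_str(word.as_str()),
        }
        word.clear();
    }

    let mut out = String::with_capacity(text.len());
    let mut word = String::new();

    // Iterate by `char`, not byte, so accented letters stay inside words.
    for c in text.chars() {
        if c.is_alphanumeric() {
            word.push(c);
        } else {
            flush(&mut word, &mut out, subs);
            out.push(c);
        }
    }
    flush(&mut word, &mut out, subs);
    out
}

fn main() {
    let subs = [("without", "w/o"), ("with", "w/"), ("function", "fn")];
    let input = "Call the Function with args, without withdrawal.";
    println!("{}", apply_subs(input, &subs));
}
```

Because this sketch compares complete words, `withdrawal` is left alone and the order of `without` and `with` in the table does not affect the result. The ordering in `ultra_subs` would matter for a matcher that scans for substrings instead.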
diff --git a/src/commands/compress_md/locales/mod.rs b/src/commands/compress_md/locales/mod.rs
@@ -0,0 +1,4 @@
+pub mod en;
+pub mod pt_br;
+pub use en::EN;
+pub use pt_br::PT_BR;
diff --git a/src/commands/compress_md/locales/pt_br.rs b/src/commands/compress_md/locales/pt_br.rs
diff --git a/src/commands/compress_md/mod.rs b/src/commands/compress_md/mod.rs
diff --git a/src/config.rs b/src/config.rs
diff --git a/tests/test_compress_md_i18n.rs b/tests/test_compress_md_i18n.rs