Commit 2e392ec

feat(i18n): locale-aware compress-md with pt-BR support (#25)
- Convert compress_md.rs to module dir (mod.rs + locale.rs + locales/)
- Add Locale struct with per-locale word lists (articles, fillers, hedges, phrases, conjunctions, ultra_subs); EN and pt-BR ship in v1
- Unicode-correct helpers: is_clean_word (char iter), replace_word_boundary (char-stream + to_lowercase), drop_phrase_ci (dual-cursor invariant), clean_mid_orphan_punct (post-phrase-drop cleanup)
- Wire lang= config key and --lang CLI flag; resolution: CLI > config > en
- Add 28 i18n integration tests (unit, feature, EN regression, cross-locale contract) + bench_i18n binary (ratio + latency)
- PT-BR overhead: ~1.5x vs EN, still sub-millisecond per call
- Update README with i18n benchmark table and before/after example

Closes #24
1 parent b6810d3 commit 2e392ec
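
The commit message gives the locale resolution order as CLI > config > en. A minimal sketch of that precedence follows. The function name `resolve_lang` and its parameters are hypothetical and do not appear in this diff; treating blank values as unset is also an assumption.

```rust
/// Hypothetical helper: picks the language code using the precedence
/// CLI flag > config key > "en". Empty or whitespace-only values are
/// treated as unset (an assumption, not confirmed by the commit).
fn resolve_lang<'a>(cli: Option<&'a str>, config: Option<&'a str>) -> &'a str {
    cli.into_iter()
        .chain(config)
        .map(str::trim)
        .find(|s| !s.is_empty())
        .unwrap_or("en")
}
```

The returned code would then be passed to `Locale::from_code`, which falls back to EN for anything it does not recognize.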

14 files changed: +829 −132 lines changed

CLAUDE.md

Lines changed: 4 additions & 4 deletions
@@ -37,17 +37,17 @@ No Makefile — all build tooling is Cargo-native.

 Cross-call awareness across 16 recent invocations:
 - **cache.rs** — tracks seen outputs, file paths, errors from Read/Glob/Grep/Bash results
-- **redundancy.rs** — two-path dedup: exact FNV-1a hash (fast), then fuzzy bottom-k MinHash trigram Jaccard ≥0.85 (whitespace/timestamp changes don't break match). Emits `[squeez: identical to ...]` or `[squeez: ~P% similar to ...]`
+- **redundancy.rs** — two-path dedup: exact FNV-1a hash (fast), then fuzzy bottom-k MinHash trigram Jaccard ≥0.85 (whitespace/timestamp changes don't break match). Emits `[squeez: identical to ...]` `[squeez: ~P% similar to ...]`
 - **summarize.rs** — triggered at >500 lines; benign outputs (no error markers) get 2× threshold (1000 lines). Produces ≤40-line dense summary (errors, files, test status, verbatim tail)
-- **intensity.rs** — truly adaptive: **Full** (×0.6) when used < 80% of budget, **Ultra** (×0.3) when ≥80%. `[adaptive: Full]` or `[adaptive: Ultra]` in header
+- **intensity.rs** — truly adaptive: **Full** (×0.6) when used < 80% of budget, **Ultra** (×0.3) when ≥80%. `[adaptive: Full]` `[adaptive: Ultra]` in header
 - **hash.rs** — FNV-1a-64 + `shingle_minhash()` (bottom-k=96, whitespace-token trigrams) + `jaccard()` (sorted-merge O(n+m))

 ### Key files

 | File | Role |
 |------|------|
 | `src/commands/wrap.rs` | Main orchestrator: spawn subprocess, capture, compress, inject header |
-| `src/commands/compress_md.rs` | Markdown compressor: preserves code blocks, URLs, tables; compresses prose |
+| `src/commands/compress_md/` | Markdown compressor module: `mod.rs` (core logic), `locale.rs` (Locale struct + `from_code`), `locales/en.rs` + `locales/pt_br.rs` (word lists). Exposes `compress_text` (EN default) and `compress_text_with_locale`. Select locale via `lang=` in config or `--lang` CLI flag. |
 | `src/commands/init.rs` | Session start: finalize previous session memory, inject persona prompt |
 | `src/commands/benchmark.rs` | 19-scenario reproducible benchmark suite |
 | `src/config.rs` | Config struct + `~/.claude/squeez/config.ini` parser; all fields have defaults |
@@ -62,7 +62,7 @@ Cross-call awareness across 16 recent invocations:

 ### Tests

-35 integration test files under `tests/`. Each strategy and handler has dedicated test file. Notable new ones: `test_redundancy_shingle.rs` (8 fuzzy-match tests), `test_mcp_server.rs` (10 JSON-RPC tests). Benchmark fixtures live in `bench/fixtures/`; capture new ones w/ `bash bench/capture.sh`.
+35 integration test files under `tests/`. Each strategy and handler has dedicated test file. Notable new ones: `test_redundancy_shingle.rs` (8 fuzzy-match tests), `test_mcp_server.rs` (10 JSON-RPC tests). Benchmark fixtures live in `bench/fixtures/`capture new ones w/ `bash bench/capture.sh`.

 ### Release & distribution


Cargo.toml

Lines changed: 5 additions & 0 deletions
@@ -8,11 +8,16 @@ license = "MIT"
 keywords = ["claude-code", "token", "compression", "llm", "cli"]
 categories = ["command-line-utilities", "development-tools"]
 readme = "README.md"
+autobenches = false

 [[bin]]
 name = "squeez"
 path = "src/main.rs"

+[[bin]]
+name = "bench_i18n"
+path = "benches/bench_i18n.rs"
+
 [lib]
 name = "squeez"
 path = "src/lib.rs"

README.md

Lines changed: 27 additions & 3 deletions
@@ -80,7 +80,7 @@ squeez update --insecure # skip checksum (not recommended)
 | **MCP server** | `squeez mcp` runs a JSON-RPC 2.0 server over stdio exposing 6 read-only tools so any MCP-compatible LLM can query session memory directly. Hand-rolled, no `mcp.server` dependency. |
 | **Auto-teach payload** | `squeez protocol` (or the `squeez_protocol` MCP tool) prints a 2.4 KB self-describing payload — the LLM learns squeez's markers and protocol on first call. |
 | **Caveman persona** | Injects an ultra-terse prompt at session start so the model responds with fewer tokens. |
-| **Memory-file compression** | `squeez compress-md` compresses CLAUDE.md / AGENTS.md / copilot-instructions.md in-place — pure Rust, zero LLM. |
+| **Memory-file compression** | `squeez compress-md` compresses CLAUDE.md / AGENTS.md / copilot-instructions.md in-place — pure Rust, zero LLM. i18n-aware: set `lang = pt` (or `--lang pt`) for pt-BR article/filler/phrase dropping and Unicode-correct matching. |
 | **Session memory** | On `SessionStart`, injects a summary of the previous session (files touched, errors, test results, git events). Summaries carry temporal validity (`valid_from`/`valid_to`) so invalidated entries age from `valid_to`. |
 | **Token tracking** | Every `PostToolUse` result (Bash, Read, Grep, Glob) feeds a `SessionContext` so squeez knows what the agent has already seen. |

@@ -123,6 +123,28 @@ Measured on macOS (Apple Silicon). Token count = `chars / 4` (matches Claude's ~
 | Latency p50 (filter mode) | **< 0.3 ms** |
 | Latency p95 (incl. wrap/summarize) | **64 ms** |

+### compress-md i18n — EN vs pt-BR (Apple Silicon, release build)
+
+| Locale | Mode | Before | After | Reduction | Latency |
+|--------|------|--------|-------|-----------|---------|
+| EN | Full | 514 tk | 445 tk | **−14%** | 170 µs |
+| EN | Ultra | 514 tk | 434 tk | **−16%** ||
+| pt-BR | Full | 558 tk | 488 tk | **−13%** | 256 µs |
+| pt-BR | Ultra | 558 tk | 468 tk | **−17%** ||
+
+PT-BR is **~1.5× slower** than EN due to Unicode case folding — still sub-millisecond per call. Both locales produce `result.safe = true`. Run `cargo run --release --bin bench_i18n` to reproduce.
+
+**Before / after — pt-BR Full mode:**
+```
+IN: O sistema é basicamente apenas uma ferramenta para configurar o repositório.
+De modo geral, você pode considerar que a função principal inicializa a documentação do projeto.
+
+Full: sistema é ferramenta para configurar repositório. função principal inicializa documentação projeto.
+Ultra: sistema é ferramenta p/ configurar repo. fn principal inicializa docs projeto.
+```
+
+Drops: articles (`o`, `a`, `do`), fillers (`basicamente`, `apenas`), phrases (`De modo geral`, `você pode considerar que`). Ultra adds abbreviations (`repositório→repo`, `função→fn`, `documentação→docs`, `para→p/`).
+
 ### Estimated cost savings — Claude Sonnet 4.6 · $3.00 / MTok input

 | Usage | Baseline / month | Saved / month |
@@ -177,8 +199,9 @@ docker logs mycontainer 2>&1 | squeez filter docker
 Pure-Rust, zero-LLM compressor for markdown files. Preserves code blocks, inline code, URLs, headings, file paths, and tables. Compresses prose only. Always writes a backup at `<stem>.original.md`.

 ```bash
-squeez compress-md CLAUDE.md # Full mode
+squeez compress-md CLAUDE.md # Full mode (English default)
 squeez compress-md --ultra CLAUDE.md # + abbreviations (with→w/, fn, cfg, etc.)
+squeez compress-md --lang pt CLAUDE.md # pt-BR locale (articles, fillers, phrases)
 squeez compress-md --dry-run CLAUDE.md # preview, no write
 squeez compress-md --all # compress all known locations automatically
 ```
@@ -264,6 +287,7 @@ memory_retention_days = 30
 # ── Output / persona ───────────────────────────────────────────
 persona = ultra # off | lite | full | ultra
 auto_compress_md = true # run compress-md on every session start
+lang = en # compress-md locale: en | pt (pt-BR) — more languages extensible
 ```

 ### Adaptive intensity — Full / Ultra split
@@ -392,7 +416,7 @@ Requires Rust stable. Windows requires Git Bash.
 git clone https://github.com/claudioemmanuel/squeez.git
 cd squeez

-cargo test # run all tests
+cargo test # run all tests (315 tests, 38 suites)
 cargo build --release # build release binary

 bash bench/run.sh # filter-mode benchmark (14 fixtures)
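
The README hunk above describes the pt-BR pass as dropping articles and fillers word by word. A minimal sketch of that idea follows, assuming a plain whitespace tokenizer and hypothetical word lists. The real lists live in `locales/pt_br.rs`, which is not shown in this excerpt, and the real pass also handles phrases and punctuation.

```rust
/// Hypothetical word lists, for illustration only.
const ARTICLES: &[&str] = &["o", "a", "os", "as", "um", "uma"];
const FILLERS: &[&str] = &["basicamente", "apenas", "realmente"];

/// Drops whole words that match a list entry, comparing case-insensitively
/// with Unicode-aware lowercasing. A word with attached punctuation
/// (e.g. "apenas,") does not match, which this sketch accepts for simplicity.
fn drop_words(text: &str) -> String {
    text.split_whitespace()
        .filter(|w| {
            let lower = w.to_lowercase();
            !ARTICLES.contains(&lower.as_str()) && !FILLERS.contains(&lower.as_str())
        })
        .collect::<Vec<_>>()
        .join(" ")
}
```

Applied to the first clause of the README example, this keeps `sistema é ferramenta`, which agrees with the start of the documented Full output.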

bench/report.md

Lines changed: 4 additions & 4 deletions
@@ -1,17 +1,17 @@
 FIXTURE BEFORE AFTER REDUCTION LATENCY STATUS
 ──────────────────────────────────────────────────────────────────────────────
-docker_logs.txt 665tk 186tk 73% 4ms
+docker_logs.txt 665tk 186tk 73% 3ms
 env_dump.txt 441tk 287tk 35% 3ms ✅
 find_deep.txt 424tk 134tk 69% 3ms ✅
 git_copilot_session.txt 639tk 421tk 35% 3ms ✅
 git_diff.txt 502tk 317tk 37% 3ms ✅
-git_log_200.txt 2667tk 819tk 70% 4ms
+git_log_200.txt 2667tk 819tk 70% 3ms
 git_status.txt 50tk 16tk 68% 3ms ✅
-intensity_budget80.txt 4418tk 52tk 99% 4ms
+intensity_budget80.txt 4418tk 52tk 99% 3ms
 ls_la.txt 1782tk 886tk 51% 3ms ✅
 mdcompress_claude_md.txt 316tk 246tk 23% 3ms ✅
 mdcompress_prose.txt 187tk 138tk 27% 3ms ✅
-npm_install.txt 524tk 231tk 56% 4ms
+npm_install.txt 524tk 231tk 56% 3ms
 ps_aux.txt 40373tk 2352tk 95% 6ms ✅
 summarize_huge.txt 82257tk 47tk 100% 12ms ✅


benches/bench_i18n.rs

Lines changed: 75 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,75 @@
1+
// Benchmark: EN vs PT-BR compression speed + ratio.
2+
// Run with: cargo run --release --bin bench_i18n
3+
use std::hint::black_box;
4+
use squeez::commands::compress_md::{compress_text_with_locale, Locale, Mode};
5+
6+
fn tokens(bytes: usize) -> usize { bytes / 4 }
7+
8+
fn print_ratio(label: &str, input: &str, locale: &'static Locale, mode: Mode) {
9+
let r = compress_text_with_locale(input, mode, locale);
10+
let before_tk = tokens(r.stats.orig_bytes);
11+
let after_tk = tokens(r.stats.new_bytes);
12+
let pct = 100usize.saturating_sub(after_tk * 100 / before_tk.max(1));
13+
println!(" {:<24} {:>6}tk → {:>5}tk -{:>2}% safe={}", label, before_tk, after_tk, pct, r.safe);
14+
}
15+
16+
fn main() {
17+
let en = Locale::from_code("en");
18+
let pt = Locale::from_code("pt-BR");
19+
20+
let en_input = include_str!("fixtures/en_prose.txt");
21+
let pt_input = include_str!("fixtures/pt_br_prose.txt");
22+
23+
println!("── Compression ratio ────────────────────────────────────────");
24+
print_ratio("EN prose / Full", en_input, en, Mode::Full);
25+
print_ratio("EN prose / Ultra", en_input, en, Mode::Ultra);
26+
print_ratio("PT-BR prose / Full", pt_input, pt, Mode::Full);
27+
print_ratio("PT-BR prose / Ultra",pt_input, pt, Mode::Ultra);
28+
println!();
29+
30+
let iters = 1000u32;
31+
32+
println!("── Latency (×{iters} iterations) ──────────────────────────────────");
33+
let start = std::time::Instant::now();
34+
for _ in 0..iters {
35+
black_box(compress_text_with_locale(black_box(en_input), Mode::Full, en));
36+
}
37+
let en_ms = start.elapsed().as_millis();
38+
39+
let start = std::time::Instant::now();
40+
for _ in 0..iters {
41+
black_box(compress_text_with_locale(black_box(pt_input), Mode::Full, pt));
42+
}
43+
let pt_ms = start.elapsed().as_millis();
44+
45+
println!(" EN Full: {}ms ({:.0}µs/call)", en_ms, en_ms as f64 * 1000.0 / iters as f64);
46+
println!(" PT-BR Full: {}ms ({:.0}µs/call) {:.2}× vs EN", pt_ms,
47+
pt_ms as f64 * 1000.0 / iters as f64,
48+
pt_ms as f64 / en_ms.max(1) as f64);
49+
50+
assert!(pt_ms < en_ms * 3 + 100, "PT-BR too slow vs EN: {}ms vs {}ms", pt_ms, en_ms);
51+
52+
println!();
53+
println!("── Before / after example (pt-BR) ────────────────────────────");
54+
let demo = "O sistema é basicamente apenas uma ferramenta para configurar o repositório. \
55+
De modo geral, você pode considerar que a função principal inicializa a documentação do projeto.";
56+
let rf = compress_text_with_locale(demo, Mode::Full, pt);
57+
let ru = compress_text_with_locale(demo, Mode::Ultra, pt);
58+
println!(" IN: {}", demo);
59+
println!(" Full: {}", rf.output.trim());
60+
println!(" Ultra: {}", ru.output.trim());
61+
}
62+
63+
#[allow(dead_code)]
64+
fn show_example() {
65+
let pt = Locale::from_code("pt-BR");
66+
let inputs = [
67+
("O sistema é basicamente apenas uma ferramenta para configurar o repositório. De modo geral, você pode considerar que a função principal inicializa a documentação do projeto.", "pt-BR Full"),
68+
("O sistema é basicamente apenas uma ferramenta para configurar o repositório. De modo geral, você pode considerar que a função principal inicializa a documentação do projeto.", "pt-BR Ultra"),
69+
];
70+
for (input, label) in inputs {
71+
let mode = if label.contains("Ultra") { Mode::Ultra } else { Mode::Full };
72+
let r = compress_text_with_locale(input, mode, pt);
73+
println!("{}: {}", label, r.output.trim());
74+
}
75+
}
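
One detail worth noting when reading the benchmark numbers: `tokens()` divides a byte length by 4, while the README describes the estimate as `chars / 4`. The two agree for ASCII text but diverge for accented pt-BR text, where UTF-8 uses two bytes for characters such as `ç` and `ã`. The sketch below shows the difference; `tokens_from_chars` is a hypothetical alternative and is not part of this commit.

```rust
/// Token estimate from UTF-8 byte length, mirroring `tokens()` in the benchmark.
fn tokens_from_bytes(s: &str) -> usize {
    s.len() / 4
}

/// Token estimate from Unicode scalar values (hypothetical alternative).
fn tokens_from_chars(s: &str) -> usize {
    s.chars().count() / 4
}
```

For `"função é ação"` the byte-based estimate is 4 and the char-based estimate is 3, so byte-based pt-BR token counts will tend to read somewhat higher than a char-based count of the same text.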

benches/fixtures/en_prose.txt

Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
+# Project Guide
+
+This document describes the architecture of the system. The project is basically a tool for compressing markdown files. It really just provides a simple way to remove the filler words and unnecessary phrases from prose content.
+
+## Overview
+
+The compression pipeline works in several stages. First, the parser identifies code blocks, URLs, headings, and tables that must be preserved verbatim. Then the prose content is processed through a series of filters that drop articles, fillers, and hedges.
+
+## Configuration
+
+You can configure the behavior of the tool with the configuration file. The configuration is loaded from a standard location in the user home directory. Each parameter has a sensible default, so you only need to override the ones that you really care about.
+
+## Usage
+
+To use the tool, simply pass the path to the file that you want to compress. The tool will read the file, run the compression pipeline, and write the output back to the same file. A backup of the original file is created automatically.
+
+Of course, if you just want to preview the output without modifying the file, you can use the dry-run flag. This will print the compressed output to standard output instead. I'd be happy to help you configure the tool if you have any questions.
+
+## Architecture
+
+The architecture of the tool is really quite simple. Each stage of the pipeline is implemented as a separate function that takes the input and returns the transformed output. The stages are composed together in a fixed order, and the result is written to the destination.
+
+In general, you should not need to modify the architecture of the tool. The default configuration is suitable for most use cases. However, if you have special requirements, you can extend the tool by adding new stages to the pipeline.
+
+## Performance
+
+The tool is designed to be fast. It processes a typical markdown file in a few milliseconds. The compression ratio depends on the content, but you can typically expect a reduction of around fifty percent on prose-heavy documents.

benches/fixtures/pt_br_prose.txt

Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
+# Guia do Projeto
+
+Este documento descreve a arquitetura do sistema. O projeto é basicamente uma ferramenta para comprimir arquivos markdown. Ele realmente apenas fornece uma maneira simples de remover as palavras de preenchimento e frases desnecessárias do conteúdo em prosa.
+
+## Visão Geral
+
+O pipeline de compressão funciona em várias etapas. Primeiro, o parser identifica os blocos de código, URLs, títulos e tabelas que devem ser preservados literalmente. Em seguida, o conteúdo em prosa é processado através de uma série de filtros que removem os artigos, preenchimentos e atenuações.
+
+## Configuração
+
+Você pode configurar o comportamento da ferramenta com o arquivo de configuração. A configuração é carregada de um local padrão no diretório home do usuário. Cada parâmetro tem um padrão razoável, então você só precisa substituir aqueles com os quais realmente se importa.
+
+## Uso
+
+Para usar a ferramenta, simplesmente passe o caminho do arquivo que você quer comprimir. A ferramenta lerá o arquivo, executará o pipeline de compressão e escreverá a saída de volta no mesmo arquivo. Um backup do arquivo original é criado automaticamente.
+
+Claro que, se você apenas quer visualizar a saída sem modificar o arquivo, você pode usar a flag dry-run. Isso vai imprimir a saída comprimida na saída padrão. Fico feliz em ajudar você a configurar a ferramenta se tiver dúvidas.
+
+## Arquitetura
+
+A arquitetura da ferramenta é realmente bem simples. Cada estágio do pipeline é implementado como uma função separada que recebe a entrada e retorna a saída transformada. Os estágios são compostos em uma ordem fixa, e o resultado é escrito no destino.
+
+De modo geral, você não precisa modificar a arquitetura da ferramenta. A configuração padrão é adequada para a maioria dos casos. Porém, se você tiver requisitos especiais, pode estender a ferramenta adicionando novos estágios ao pipeline.
+
+## Desempenho
+
+A ferramenta é projetada para ser rápida. Ela processa um arquivo markdown típico em poucos milissegundos. A taxa de compressão depende do conteúdo, mas você pode tipicamente esperar uma redução de cerca de cinquenta por cento em documentos com muita prosa.

src/commands/compress_md/locale.rs

Lines changed: 22 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,22 @@
1+
use crate::commands::compress_md::locales;
2+
3+
#[derive(Copy, Clone, Debug)]
4+
pub struct Locale {
5+
#[allow(dead_code)]
6+
pub code: &'static str,
7+
pub fillers: &'static [&'static str],
8+
pub articles: &'static [&'static str],
9+
pub phrases: &'static [&'static str],
10+
pub hedges: &'static [&'static str],
11+
pub conjunctions: &'static [&'static str],
12+
pub ultra_subs: &'static [(&'static str, &'static str)],
13+
}
14+
15+
impl Locale {
16+
pub fn from_code(code: &str) -> &'static Locale {
17+
match code {
18+
"pt" | "pt-BR" | "pt_BR" | "pt-br" => &locales::PT_BR,
19+
_ => &locales::EN,
20+
}
21+
}
22+
}
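
As written, `from_code` matches four exact spellings, so inputs such as `PT-BR` or `pt_br` fall through to EN. A possible variant that normalizes case and separator before matching is sketched below. `LocaleId` and `locale_id_from_code` are hypothetical stand-ins so the example compiles on its own. The committed code returns `&'static Locale` and does not normalize.

```rust
/// Stand-in for the two shipped locales; the real code returns `&'static Locale`.
#[derive(Debug, PartialEq, Clone, Copy)]
enum LocaleId {
    En,
    PtBr,
}

/// Hypothetical variant of `from_code` that normalizes case and separator
/// before matching, so "PT-BR", "pt_br", and "pt-BR" all resolve alike.
/// Unknown codes still fall back to English, as in the original.
fn locale_id_from_code(code: &str) -> LocaleId {
    let normalized = code.trim().to_ascii_lowercase().replace('_', "-");
    match normalized.as_str() {
        "pt" | "pt-br" => LocaleId::PtBr,
        _ => LocaleId::En,
    }
}
```

Whether silent fallback to EN is preferable to reporting an unknown code is a design choice that this diff does not discuss.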
Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
+use crate::commands::compress_md::locale::Locale;
+
+pub static EN: Locale = Locale {
+    code: "en",
+    fillers: &["just","really","basically","actually","simply","sure","certainly"],
+    articles: &["the","a","an"],
+    phrases: &[
+        "of course",
+        "i'd be happy to",
+        "let me ",
+        "i'll help you",
+        "i would like to",
+        "please note that",
+        "it might be worth",
+        "you could consider",
+        "in general",
+        "as a rule",
+    ],
+    hedges: &["perhaps","maybe"],
+    conjunctions: &[" and"," or"," but"," so"],
+    ultra_subs: &[
+        ("without","w/o"),
+        ("with","w/"),
+        ("because","b/c"),
+        ("function","fn"),
+        ("parameter","param"),
+        ("arguments","args"),
+        ("argument","arg"),
+        ("configuration","config"),
+        ("documentation","docs"),
+        ("directory","dir"),
+        ("repository","repo"),
+        ("between","btw"),
+        ("versus","vs"),
+        ("approximately","~"),
+    ],
+};
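
The `ultra_subs` pairs are applied on word boundaries according to the commit message (`replace_word_boundary`), whose implementation is not shown in this excerpt. The following is a minimal sketch of whole-word, case-insensitive substitution under the assumption that a word is a maximal run of alphanumeric characters. `apply_subs` is a hypothetical name and not the commit's function.

```rust
/// Hypothetical sketch: replaces whole words using `subs`, matching
/// case-insensitively. A "word" is a maximal run of alphanumeric chars,
/// so "with" does not match inside "without".
fn apply_subs(text: &str, subs: &[(&str, &str)]) -> String {
    let mut out = String::with_capacity(text.len());
    let mut word = String::new();
    // Append a sentinel space so the final word is flushed by the loop.
    for ch in text.chars().chain(std::iter::once(' ')) {
        if ch.is_alphanumeric() {
            word.push(ch);
            continue;
        }
        if !word.is_empty() {
            let lower = word.to_lowercase();
            match subs.iter().find(|(from, _)| *from == lower) {
                Some((_, to)) => out.push_str(to),
                None => out.push_str(&word),
            }
            word.clear();
        }
        out.push(ch);
    }
    out.pop(); // remove the sentinel space
    out
}
```

Because matching is done on whole words, the relative order of `without` and `with` in the table does not matter in this sketch. In an implementation based on substring search, the longer entry would need to come first, which is how the list above is ordered.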
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+pub mod en;
+pub mod pt_br;
+pub use en::EN;
+pub use pt_br::PT_BR;
