diff --git a/CLAUDE.md b/CLAUDE.md index 4aada11..65fb925 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -37,9 +37,9 @@ No Makefile — all build tooling is Cargo-native. Cross-call awareness across 16 recent invocations: - **cache.rs** — tracks seen outputs, file paths, errors from Read/Glob/Grep/Bash results -- **redundancy.rs** — two-path dedup: exact FNV-1a hash (fast), then fuzzy bottom-k MinHash trigram Jaccard ≥0.85 (whitespace/timestamp changes don't break match). Emits `[squeez: identical to ...]` or `[squeez: ~P% similar to ...]` +- **redundancy.rs** — two-path dedup: exact FNV-1a hash (fast), then fuzzy bottom-k MinHash trigram Jaccard ≥0.85 (whitespace/timestamp changes don't break match). Emits `[squeez: identical to ...]` `[squeez: ~P% similar to ...]` - **summarize.rs** — triggered at >500 lines; benign outputs (no error markers) get 2× threshold (1000 lines). Produces ≤40-line dense summary (errors, files, test status, verbatim tail) -- **intensity.rs** — truly adaptive: **Full** (×0.6) when used < 80% of budget, **Ultra** (×0.3) when ≥80%. `[adaptive: Full]` or `[adaptive: Ultra]` in header +- **intensity.rs** — truly adaptive: **Full** (×0.6) when used < 80% of budget, **Ultra** (×0.3) when ≥80%. `[adaptive: Full]` `[adaptive: Ultra]` in header - **hash.rs** — FNV-1a-64 + `shingle_minhash()` (bottom-k=96, whitespace-token trigrams) + `jaccard()` (sorted-merge O(n+m)) ### Key files @@ -47,7 +47,7 @@ Cross-call awareness across 16 recent invocations: | File | Role | |------|------| | `src/commands/wrap.rs` | Main orchestrator: spawn subprocess, capture, compress, inject header | -| `src/commands/compress_md.rs` | Markdown compressor: preserves code blocks, URLs, tables; compresses prose | +| `src/commands/compress_md/` | Markdown compressor module: `mod.rs` (core logic), `locale.rs` (Locale struct + `from_code`), `locales/en.rs` + `locales/pt_br.rs` (word lists). Exposes `compress_text` (EN default) and `compress_text_with_locale`. 
Select locale via `lang=` in config or `--lang` CLI flag. | | `src/commands/init.rs` | Session start: finalize previous session memory, inject persona prompt | | `src/commands/benchmark.rs` | 19-scenario reproducible benchmark suite | | `src/config.rs` | Config struct + `~/.claude/squeez/config.ini` parser; all fields have defaults | @@ -62,7 +62,7 @@ Cross-call awareness across 16 recent invocations: ### Tests -35 integration test files under `tests/`. Each strategy and handler has dedicated test file. Notable new ones: `test_redundancy_shingle.rs` (8 fuzzy-match tests), `test_mcp_server.rs` (10 JSON-RPC tests). Benchmark fixtures live in `bench/fixtures/`; capture new ones w/ `bash bench/capture.sh`. +35 integration test files under `tests/`. Each strategy and handler has dedicated test file. Notable new ones: `test_redundancy_shingle.rs` (8 fuzzy-match tests), `test_mcp_server.rs` (10 JSON-RPC tests). Benchmark fixtures live in `bench/fixtures/`; capture new ones w/ `bash bench/capture.sh`. ### Release & distribution diff --git a/Cargo.toml b/Cargo.toml index 87d55a8..cc4b115 100644 --- a/Cargo.toml +++ b/Cargo.toml @@ -8,11 +8,16 @@ license = "MIT" keywords = ["claude-code", "token", "compression", "llm", "cli"] categories = ["command-line-utilities", "development-tools"] readme = "README.md" +autobenches = false [[bin]] name = "squeez" path = "src/main.rs" +[[bin]] +name = "bench_i18n" +path = "benches/bench_i18n.rs" + [lib] name = "squeez" path = "src/lib.rs" diff --git a/README.md b/README.md index 025402e..ff25335 100644 --- a/README.md +++ b/README.md @@ -80,7 +80,7 @@ squeez update --insecure # skip checksum (not recommended) | **MCP server** | `squeez mcp` runs a JSON-RPC 2.0 server over stdio exposing 6 read-only tools so any MCP-compatible LLM can query session memory directly. Hand-rolled, no `mcp.server` dependency. 
| | **Auto-teach payload** | `squeez protocol` (or the `squeez_protocol` MCP tool) prints a 2.4 KB self-describing payload — the LLM learns squeez's markers and protocol on first call. | | **Caveman persona** | Injects an ultra-terse prompt at session start so the model responds with fewer tokens. | -| **Memory-file compression** | `squeez compress-md` compresses CLAUDE.md / AGENTS.md / copilot-instructions.md in-place — pure Rust, zero LLM. | +| **Memory-file compression** | `squeez compress-md` compresses CLAUDE.md / AGENTS.md / copilot-instructions.md in-place — pure Rust, zero LLM. i18n-aware: set `lang = pt` (or `--lang pt`) for pt-BR article/filler/phrase dropping and Unicode-correct matching. | | **Session memory** | On `SessionStart`, injects a summary of the previous session (files touched, errors, test results, git events). Summaries carry temporal validity (`valid_from`/`valid_to`) so invalidated entries age from `valid_to`. | | **Token tracking** | Every `PostToolUse` result (Bash, Read, Grep, Glob) feeds a `SessionContext` so squeez knows what the agent has already seen. | @@ -123,6 +123,28 @@ Measured on macOS (Apple Silicon). Token count = `chars / 4` (matches Claude's ~ | Latency p50 (filter mode) | **< 0.3 ms** | | Latency p95 (incl. wrap/summarize) | **64 ms** | +### compress-md i18n — EN vs pt-BR (Apple Silicon, release build) + +| Locale | Mode | Before | After | Reduction | Latency | +|--------|------|--------|-------|-----------|---------| +| EN | Full | 514 tk | 445 tk | **−14%** | 170 µs | +| EN | Ultra | 514 tk | 434 tk | **−16%** | — | +| pt-BR | Full | 558 tk | 488 tk | **−13%** | 256 µs | +| pt-BR | Ultra | 558 tk | 468 tk | **−17%** | — | + +PT-BR is **~1.5× slower** than EN due to Unicode case folding — still sub-millisecond per call. Both locales produce `result.safe = true`. Run `cargo run --release --bin bench_i18n` to reproduce. 
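The reduction percentages in the table above follow directly from the stated `chars / 4` token heuristic. A minimal standalone sketch of that arithmetic (the `tokens` and `reduction_pct` helpers mirror the benchmark's math but are illustrative, not the crate's public API):

```rust
// Sketch of the token math behind the i18n table above.
// Assumption: token count ≈ bytes / 4, per the heuristic stated earlier.
fn tokens(bytes: usize) -> usize {
    bytes / 4
}

// Integer percentage reduction; saturating_sub guards against after > before.
fn reduction_pct(before_tk: usize, after_tk: usize) -> usize {
    100usize.saturating_sub(after_tk * 100 / before_tk.max(1))
}

fn main() {
    // EN Full row: 514 tk → 445 tk reports −14%.
    println!("EN Full:     -{}%", reduction_pct(514, 445));
    // pt-BR Ultra row: 558 tk → 468 tk reports −17%.
    println!("pt-BR Ultra: -{}%", reduction_pct(558, 468));
    // 2056 bytes of prose ≈ 514 tokens.
    println!("tokens(2056) = {}", tokens(2056));
}
```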
+ +**Before / after — pt-BR Full mode:** +``` +IN: O sistema é basicamente apenas uma ferramenta para configurar o repositório. + De modo geral, você pode considerar que a função principal inicializa a documentação do projeto. + +Full: sistema é ferramenta para configurar repositório. função principal inicializa documentação projeto. +Ultra: sistema é ferramenta p/ configurar repo. fn principal inicializa docs projeto. +``` + +Drops: articles (`o`, `a`, `do`), fillers (`basicamente`, `apenas`), phrases (`De modo geral`, `você pode considerar que`). Ultra adds abbreviations (`repositório→repo`, `função→fn`, `documentação→docs`, `para→p/`). + ### Estimated cost savings — Claude Sonnet 4.6 · $3.00 / MTok input | Usage | Baseline / month | Saved / month | @@ -177,8 +199,9 @@ docker logs mycontainer 2>&1 | squeez filter docker Pure-Rust, zero-LLM compressor for markdown files. Preserves code blocks, inline code, URLs, headings, file paths, and tables. Compresses prose only. Always writes a backup at `.original.md`. ```bash -squeez compress-md CLAUDE.md # Full mode +squeez compress-md CLAUDE.md # Full mode (English default) squeez compress-md --ultra CLAUDE.md # + abbreviations (with→w/, fn, cfg, etc.) +squeez compress-md --lang pt CLAUDE.md # pt-BR locale (articles, fillers, phrases) squeez compress-md --dry-run CLAUDE.md # preview, no write squeez compress-md --all # compress all known locations automatically ``` @@ -264,6 +287,7 @@ memory_retention_days = 30 # ── Output / persona ─────────────────────────────────────────── persona = ultra # off | lite | full | ultra auto_compress_md = true # run compress-md on every session start +lang = en # compress-md locale: en | pt (pt-BR) — extensible to more languages ``` ### Adaptive intensity — Full / Ultra split @@ -392,7 +416,7 @@ Requires Rust stable. Windows requires Git Bash. 
git clone https://github.com/claudioemmanuel/squeez.git cd squeez -cargo test # run all tests +cargo test # run all tests (315 tests, 38 suites) cargo build --release # build release binary bash bench/run.sh # filter-mode benchmark (14 fixtures) diff --git a/bench/report.md b/bench/report.md index 4b6defa..f0385dc 100644 --- a/bench/report.md +++ b/bench/report.md @@ -1,17 +1,17 @@ FIXTURE BEFORE AFTER REDUCTION LATENCY STATUS ────────────────────────────────────────────────────────────────────────────── -docker_logs.txt 665tk 186tk 73% 4ms ✅ +docker_logs.txt 665tk 186tk 73% 3ms ✅ env_dump.txt 441tk 287tk 35% 3ms ✅ find_deep.txt 424tk 134tk 69% 3ms ✅ git_copilot_session.txt 639tk 421tk 35% 3ms ✅ git_diff.txt 502tk 317tk 37% 3ms ✅ -git_log_200.txt 2667tk 819tk 70% 4ms ✅ +git_log_200.txt 2667tk 819tk 70% 3ms ✅ git_status.txt 50tk 16tk 68% 3ms ✅ -intensity_budget80.txt 4418tk 52tk 99% 4ms ✅ +intensity_budget80.txt 4418tk 52tk 99% 3ms ✅ ls_la.txt 1782tk 886tk 51% 3ms ✅ mdcompress_claude_md.txt 316tk 246tk 23% 3ms ✅ mdcompress_prose.txt 187tk 138tk 27% 3ms ✅ -npm_install.txt 524tk 231tk 56% 4ms ✅ +npm_install.txt 524tk 231tk 56% 3ms ✅ ps_aux.txt 40373tk 2352tk 95% 6ms ✅ summarize_huge.txt 82257tk 47tk 100% 12ms ✅ diff --git a/benches/bench_i18n.rs b/benches/bench_i18n.rs new file mode 100644 index 0000000..2cac49a --- /dev/null +++ b/benches/bench_i18n.rs @@ -0,0 +1,75 @@ +// Benchmark: EN vs PT-BR compression speed + ratio. 
+// Run with: cargo run --release --bin bench_i18n +use std::hint::black_box; +use squeez::commands::compress_md::{compress_text_with_locale, Locale, Mode}; + +fn tokens(bytes: usize) -> usize { bytes / 4 } + +fn print_ratio(label: &str, input: &str, locale: &'static Locale, mode: Mode) { + let r = compress_text_with_locale(input, mode, locale); + let before_tk = tokens(r.stats.orig_bytes); + let after_tk = tokens(r.stats.new_bytes); + let pct = 100usize.saturating_sub(after_tk * 100 / before_tk.max(1)); + println!(" {:<24} {:>6}tk → {:>5}tk -{:>2}% safe={}", label, before_tk, after_tk, pct, r.safe); +} + +fn main() { + let en = Locale::from_code("en"); + let pt = Locale::from_code("pt-BR"); + + let en_input = include_str!("fixtures/en_prose.txt"); + let pt_input = include_str!("fixtures/pt_br_prose.txt"); + + println!("── Compression ratio ────────────────────────────────────────"); + print_ratio("EN prose / Full", en_input, en, Mode::Full); + print_ratio("EN prose / Ultra", en_input, en, Mode::Ultra); + print_ratio("PT-BR prose / Full", pt_input, pt, Mode::Full); + print_ratio("PT-BR prose / Ultra", pt_input, pt, Mode::Ultra); + println!(); + + let iters = 1000u32; + + println!("── Latency (×{iters} iterations) ──────────────────────────────────"); + let start = std::time::Instant::now(); + for _ in 0..iters { + black_box(compress_text_with_locale(black_box(en_input), Mode::Full, en)); + } + let en_ms = start.elapsed().as_millis(); + + let start = std::time::Instant::now(); + for _ in 0..iters { + black_box(compress_text_with_locale(black_box(pt_input), Mode::Full, pt)); + } + let pt_ms = start.elapsed().as_millis(); + + println!(" EN Full: {}ms ({:.0}µs/call)", en_ms, en_ms as f64 * 1000.0 / iters as f64); + println!(" PT-BR Full: {}ms ({:.0}µs/call) {:.2}× vs EN", pt_ms, + pt_ms as f64 * 1000.0 / iters as f64, + pt_ms as f64 / en_ms.max(1) as f64); + + assert!(pt_ms < en_ms * 3 + 100, "PT-BR too slow vs EN: {}ms vs {}ms", pt_ms, en_ms); + + println!(); 
println!("── Before / after example (pt-BR) ────────────────────────────"); + let demo = "O sistema é basicamente apenas uma ferramenta para configurar o repositório. \ + De modo geral, você pode considerar que a função principal inicializa a documentação do projeto."; + let rf = compress_text_with_locale(demo, Mode::Full, pt); + let ru = compress_text_with_locale(demo, Mode::Ultra, pt); + println!(" IN: {}", demo); + println!(" Full: {}", rf.output.trim()); + println!(" Ultra: {}", ru.output.trim()); +} + +#[allow(dead_code)] +fn show_example() { + let pt = Locale::from_code("pt-BR"); + let inputs = [ + ("O sistema é basicamente apenas uma ferramenta para configurar o repositório. De modo geral, você pode considerar que a função principal inicializa a documentação do projeto.", "pt-BR Full"), + ("O sistema é basicamente apenas uma ferramenta para configurar o repositório. De modo geral, você pode considerar que a função principal inicializa a documentação do projeto.", "pt-BR Ultra"), + ]; + for (input, label) in inputs { + let mode = if label.contains("Ultra") { Mode::Ultra } else { Mode::Full }; + let r = compress_text_with_locale(input, mode, pt); + println!("{}: {}", label, r.output.trim()); + } +} diff --git a/benches/fixtures/en_prose.txt b/benches/fixtures/en_prose.txt new file mode 100644 index 0000000..92d9432 --- /dev/null +++ b/benches/fixtures/en_prose.txt @@ -0,0 +1,27 @@ +# Project Guide + +This document describes the architecture of the system. The project is basically a tool for compressing markdown files. It really just provides a simple way to remove the filler words and unnecessary phrases from prose content. + +## Overview + +The compression pipeline works in several stages. First, the parser identifies code blocks, URLs, headings, and tables that must be preserved verbatim. Then the prose content is processed through a series of filters that drop articles, fillers, and hedges. 
+ +## Configuration + +You can configure the behavior of the tool with the configuration file. The configuration is loaded from a standard location in the user home directory. Each parameter has a sensible default, so you only need to override the ones that you really care about. + +## Usage + +To use the tool, simply pass the path to the file that you want to compress. The tool will read the file, run the compression pipeline, and write the output back to the same file. A backup of the original file is created automatically. + +Of course, if you just want to preview the output without modifying the file, you can use the dry-run flag. This will print the compressed output to standard output instead. I'd be happy to help you configure the tool if you have any questions. + +## Architecture + +The architecture of the tool is really quite simple. Each stage of the pipeline is implemented as a separate function that takes the input and returns the transformed output. The stages are composed together in a fixed order, and the result is written to the destination. + +In general, you should not need to modify the architecture of the tool. The default configuration is suitable for most use cases. However, if you have special requirements, you can extend the tool by adding new stages to the pipeline. + +## Performance + +The tool is designed to be fast. It processes a typical markdown file in a few milliseconds. The compression ratio depends on the content, but you can typically expect a reduction of around fifty percent on prose-heavy documents. diff --git a/benches/fixtures/pt_br_prose.txt b/benches/fixtures/pt_br_prose.txt new file mode 100644 index 0000000..af63e08 --- /dev/null +++ b/benches/fixtures/pt_br_prose.txt @@ -0,0 +1,27 @@ +# Guia do Projeto + +Este documento descreve a arquitetura do sistema. O projeto é basicamente uma ferramenta para comprimir arquivos markdown. 
Ele realmente apenas fornece uma maneira simples de remover as palavras de preenchimento e frases desnecessárias do conteúdo em prosa. + +## Visão Geral + +O pipeline de compressão funciona em várias etapas. Primeiro, o parser identifica os blocos de código, URLs, títulos e tabelas que devem ser preservados literalmente. Em seguida, o conteúdo em prosa é processado através de uma série de filtros que removem os artigos, preenchimentos e atenuações. + +## Configuração + +Você pode configurar o comportamento da ferramenta com o arquivo de configuração. A configuração é carregada de um local padrão no diretório home do usuário. Cada parâmetro tem um padrão razoável, então você só precisa substituir aqueles com os quais realmente se importa. + +## Uso + +Para usar a ferramenta, simplesmente passe o caminho do arquivo que você quer comprimir. A ferramenta lerá o arquivo, executará o pipeline de compressão e escreverá a saída de volta no mesmo arquivo. Um backup do arquivo original é criado automaticamente. + +Claro que, se você apenas quer visualizar a saída sem modificar o arquivo, você pode usar a flag dry-run. Isso vai imprimir a saída comprimida na saída padrão. Fico feliz em ajudar você a configurar a ferramenta se tiver dúvidas. + +## Arquitetura + +A arquitetura da ferramenta é realmente bem simples. Cada estágio do pipeline é implementado como uma função separada que recebe a entrada e retorna a saída transformada. Os estágios são compostos em uma ordem fixa, e o resultado é escrito no destino. + +De modo geral, você não precisa modificar a arquitetura da ferramenta. A configuração padrão é adequada para a maioria dos casos. Porém, se você tiver requisitos especiais, pode estender a ferramenta adicionando novos estágios ao pipeline. + +## Desempenho + +A ferramenta é projetada para ser rápida. Ela processa um arquivo markdown típico em poucos milissegundos. 
A taxa de compressão depende do conteúdo, mas você pode tipicamente esperar uma redução de cerca de cinquenta por cento em documentos com muita prosa. diff --git a/src/commands/compress_md/locale.rs b/src/commands/compress_md/locale.rs new file mode 100644 index 0000000..0e797a8 --- /dev/null +++ b/src/commands/compress_md/locale.rs @@ -0,0 +1,22 @@ +use crate::commands::compress_md::locales; + +#[derive(Copy, Clone, Debug)] +pub struct Locale { + #[allow(dead_code)] + pub code: &'static str, + pub fillers: &'static [&'static str], + pub articles: &'static [&'static str], + pub phrases: &'static [&'static str], + pub hedges: &'static [&'static str], + pub conjunctions: &'static [&'static str], + pub ultra_subs: &'static [(&'static str, &'static str)], +} + +impl Locale { + pub fn from_code(code: &str) -> &'static Locale { + match code { + "pt" | "pt-BR" | "pt_BR" | "pt-br" => &locales::PT_BR, + _ => &locales::EN, + } + } +} diff --git a/src/commands/compress_md/locales/en.rs b/src/commands/compress_md/locales/en.rs new file mode 100644 index 0000000..aab029a --- /dev/null +++ b/src/commands/compress_md/locales/en.rs @@ -0,0 +1,37 @@ +use crate::commands::compress_md::locale::Locale; + +pub static EN: Locale = Locale { + code: "en", + fillers: &["just","really","basically","actually","simply","sure","certainly"], + articles: &["the","a","an"], + phrases: &[ + "of course", + "i'd be happy to", + "let me ", + "i'll help you", + "i would like to", + "please note that", + "it might be worth", + "you could consider", + "in general", + "as a rule", + ], + hedges: &["perhaps","maybe"], + conjunctions: &[" and"," or"," but"," so"], + ultra_subs: &[ + ("without","w/o"), + ("with","w/"), + ("because","b/c"), + ("function","fn"), + ("parameter","param"), + ("arguments","args"), + ("argument","arg"), + ("configuration","config"), + ("documentation","docs"), + ("directory","dir"), + ("repository","repo"), + ("between","btw"), + ("versus","vs"), + ("approximately","~"), + ], +}; 
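The `Locale` word lists above (fillers, articles, hedges) drive token-level dropping in the compressor. A hedged standalone sketch of that behavior — a simplified stand-in, not the crate's actual `compress_prose_span`, which also protects code spans and preserves whitespace runs:

```rust
// Minimal sketch of stop-list dropping: remove any standalone token whose
// lowercased, punctuation-stripped form is in the locale's stop lists.
// (Illustrative only; the real pipeline preserves code spans and spacing.)
fn strip_trailing_punct(s: &str) -> String {
    s.trim_end_matches(|c: char| matches!(c, ',' | '.' | ';' | ':' | '!' | '?'))
        .to_string()
}

fn drop_stop_words(text: &str, stops: &[&str]) -> String {
    text.split_whitespace()
        .filter(|tok| {
            let lower = strip_trailing_punct(&tok.to_lowercase());
            !stops.contains(&lower.as_str())
        })
        .collect::<Vec<_>>()
        .join(" ")
}

fn main() {
    // Subset of the EN lists above: articles + fillers.
    let stops = ["the", "a", "an", "just", "really", "basically"];
    let out = drop_stop_words("The parser is basically just a simple filter.", &stops);
    println!("{out}"); // parser is simple filter.
}
```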
diff --git a/src/commands/compress_md/locales/mod.rs b/src/commands/compress_md/locales/mod.rs new file mode 100644 index 0000000..ba2f024 --- /dev/null +++ b/src/commands/compress_md/locales/mod.rs @@ -0,0 +1,4 @@ +pub mod en; +pub mod pt_br; +pub use en::EN; +pub use pt_br::PT_BR; diff --git a/src/commands/compress_md/locales/pt_br.rs b/src/commands/compress_md/locales/pt_br.rs new file mode 100644 index 0000000..6e29cc8 --- /dev/null +++ b/src/commands/compress_md/locales/pt_br.rs @@ -0,0 +1,68 @@ +use crate::commands::compress_md::locale::Locale; + +pub static PT_BR: Locale = Locale { + code: "pt-BR", + fillers: &[ + "só", "apenas", "realmente", "basicamente", "literalmente", + "simplesmente", "certamente", "claramente", "obviamente", + "exatamente", "praticamente", "tipo", "meio", + ], + // Includes definite/indefinite articles AND preposition+article contractions + // (do/da/no/na/ao/à/pelo/pela etc.) — all should be dropped in pt-BR prose compression. + articles: &[ + "o", "a", "os", "as", + "um", "uma", "uns", "umas", + "do", "da", "dos", "das", + "no", "na", "nos", "nas", + "ao", "aos", "à", "às", + "pelo", "pela", "pelos", "pelas", + ], + phrases: &[ + "claro que sim", + "com certeza", + "fico feliz em ajudar", + "ficarei feliz em", + "deixa eu ", + "deixe-me ", + "vou te ajudar", + "gostaria de ", + "por favor, note que", + "vale a pena ", + "você pode considerar que", + "você pode considerar", + "de modo geral", + "em geral", + "como regra", + "é importante notar que", + "vale ressaltar que", + "vale lembrar que", + ], + hedges: &[ + "talvez", "possivelmente", "provavelmente", "eventualmente", + "porventura", "quiçá", + ], + conjunctions: &[ + " e", " ou", " mas", " porém", " contudo", " então", " logo", " pois", + ], + ultra_subs: &[ + ("sem", "s/"), + ("com", "c/"), + ("porque", "pq"), + ("por que", "pq"), + ("função", "fn"), + ("funções", "fns"), + ("parâmetro", "param"), + ("parâmetros", "params"), + ("argumento", "arg"), + ("argumentos", "args"), + 
("configuração", "config"), + ("configurações", "configs"), + ("documentação", "docs"), + ("diretório", "dir"), + ("repositório", "repo"), + ("aproximadamente","~"), + ("também", "tb"), + ("você", "vc"), + ("para", "p/"), + ], +}; diff --git a/src/commands/compress_md.rs b/src/commands/compress_md/mod.rs similarity index 78% rename from src/commands/compress_md.rs rename to src/commands/compress_md/mod.rs index 68948a7..6e653a4 100644 --- a/src/commands/compress_md.rs +++ b/src/commands/compress_md/mod.rs @@ -2,6 +2,10 @@ // No LLM calls. Preserves code blocks, URLs, headings, file paths, tables. // Compresses natural-language prose only. +mod locale; +mod locales; +pub use locale::Locale; + use std::path::{Path, PathBuf}; use crate::session::home_dir; @@ -39,13 +43,24 @@ pub fn run(args: &[String]) -> i32 { let mut all = false; let mut quiet = false; let mut targets: Vec = Vec::new(); + let mut lang_cli: Option = None; - for a in args { + let mut i = 0; + while i < args.len() { + let a = &args[i]; match a.as_str() { "--ultra" => mode = Mode::Ultra, "--dry-run" => dry_run = true, "--all" => all = true, "--quiet" => quiet = true, + "--lang" => { + if i + 1 >= args.len() { + eprintln!("squeez compress-md: --lang requires a value"); + return 2; + } + i += 1; + lang_cli = Some(args[i].clone()); + } "-h" | "--help" => { print_help(); return 0; @@ -56,8 +71,14 @@ pub fn run(args: &[String]) -> i32 { } s => targets.push(s.to_string()), } + i += 1; } + let locale = { + let code = lang_cli.unwrap_or_else(|| crate::config::Config::load().lang); + Locale::from_code(&code) + }; + let files: Vec = if all { all_targets() } else if targets.is_empty() { @@ -78,7 +99,7 @@ pub fn run(args: &[String]) -> i32 { continue; } any_processed = true; - match process_file(f, mode, dry_run, quiet) { + match process_file(f, mode, dry_run, quiet, locale) { Ok(()) => {} Err(e) => { eprintln!("squeez compress-md: {} — {}", f.display(), e); @@ -101,12 +122,14 @@ pub fn run(args: &[String]) -> i32 
{ /// Quiet bulk-compression entry used by `init` when auto_compress_md=true. /// Never errors out the caller; failures are silent. pub fn run_all_quietly() -> i32 { + let cfg = crate::config::Config::load(); + let locale = Locale::from_code(&cfg.lang); let files = all_targets(); for f in &files { if !f.exists() { continue; } - let _ = process_file(f, Mode::Ultra, false, true); + let _ = process_file(f, Mode::Ultra, false, true, locale); } 0 } @@ -126,6 +149,7 @@ fn print_help() { println!(" $PWD/CLAUDE.md, $PWD/AGENTS.md,"); println!(" $PWD/.github/copilot-instructions.md"); println!(" --quiet Suppress informational output"); + println!(" --lang Locale: en (default), pt-BR. Overrides config 'lang'."); println!(); println!("Preserved verbatim: code blocks (```...```), inline `code`,"); println!("URLs, file paths, headings, tables, list markers, version numbers."); @@ -147,9 +171,15 @@ fn all_targets() -> Vec { v } -fn process_file(path: &Path, mode: Mode, dry_run: bool, quiet: bool) -> Result<(), String> { +fn process_file( + path: &Path, + mode: Mode, + dry_run: bool, + quiet: bool, + locale: &'static Locale, +) -> Result<(), String> { let original = std::fs::read_to_string(path).map_err(|e| e.to_string())?; - let result = compress_text(&original, mode); + let result = compress_text_with_locale(&original, mode, locale); if !result.safe { return Err(format!( @@ -232,11 +262,21 @@ enum State { } pub fn compress_text(input: &str, mode: Mode) -> CompressResult { - let mut stats = Stats::default(); - stats.orig_bytes = input.len(); - stats.orig_code_blocks = count_code_blocks(input); - stats.orig_urls = count_urls(input); - stats.orig_headings = count_headings(input); + compress_text_with_locale(input, mode, Locale::from_code("en")) +} + +pub fn compress_text_with_locale( + input: &str, + mode: Mode, + locale: &'static Locale, +) -> CompressResult { + let mut stats = Stats { + orig_bytes: input.len(), + orig_code_blocks: count_code_blocks(input), + orig_urls: 
count_urls(input), + orig_headings: count_headings(input), + ..Default::default() + }; let mut out = String::with_capacity(input.len()); let mut state = State::Text; @@ -280,7 +320,7 @@ pub fn compress_text(input: &str, mode: Mode) -> CompressResult { out.push('\n'); i += 1; } else { - let compressed = compress_prose_line(line, mode); + let compressed = compress_prose_line(line, mode, locale); out.push_str(&compressed); out.push('\n'); i += 1; @@ -444,7 +484,7 @@ fn split_protected_spans(line: &str) -> Vec> { spans } -fn compress_prose_line(line: &str, mode: Mode) -> String { +fn compress_prose_line(line: &str, mode: Mode, locale: &Locale) -> String { // Preserve leading whitespace + list markers let leading_ws_len = line.len() - line.trim_start().len(); let leading = &line[..leading_ws_len]; @@ -458,7 +498,7 @@ fn compress_prose_line(line: &str, mode: Mode) -> String { for span in spans { match span { Span::Verbatim(v) => out.push_str(v), - Span::Prose(p) => out.push_str(&compress_prose_span(p, mode)), + Span::Prose(p) => out.push_str(&compress_prose_span(p, mode, locale)), } } @@ -497,58 +537,14 @@ fn split_list_marker(s: &str) -> (&str, &str) { ("", s) } -const FILLERS: &[&str] = &[ - "just", - "really", - "basically", - "actually", - "simply", - "sure", - "certainly", -]; - -const ARTICLES: &[&str] = &["the", "a", "an"]; - -const PHRASES: &[&str] = &[ - "of course", - "i'd be happy to", - "let me ", - "i'll help you", - "i would like to", - "please note that", - "it might be worth", - "you could consider", - "in general", - "as a rule", -]; - -const HEDGES: &[&str] = &["perhaps", "maybe"]; - -const ULTRA_SUBS: &[(&str, &str)] = &[ - ("without", "w/o"), - ("with", "w/"), - ("because", "b/c"), - ("function", "fn"), - ("parameter", "param"), - ("arguments", "args"), - ("argument", "arg"), - ("configuration", "config"), - ("documentation", "docs"), - ("directory", "dir"), - ("repository", "repo"), - ("between", "btw"), - ("versus", "vs"), - ("approximately", "~"), 
-]; - -fn compress_prose_span(text: &str, mode: Mode) -> String { +fn compress_prose_span(text: &str, mode: Mode, locale: &Locale) -> String { if text.trim().is_empty() { return text.to_string(); } let mut s = text.to_string(); // Drop multi-word phrases (case-insensitive substring) - for phrase in PHRASES { + for phrase in locale.phrases { s = drop_phrase_ci(&s, phrase); } @@ -580,9 +576,9 @@ fn compress_prose_span(text: &str, mode: Mode) -> String { // like brackets/parens/braces). Allow trailing comma/period only. if is_clean_word(tok) { let lower = strip_punct(&tok.to_lowercase()); - if FILLERS.contains(&lower.as_str()) - || HEDGES.contains(&lower.as_str()) - || ARTICLES.contains(&lower.as_str()) + if locale.fillers.contains(&lower.as_str()) + || locale.hedges.contains(&lower.as_str()) + || locale.articles.contains(&lower.as_str()) { // drop the following whitespace too if matches!(kept.last().map(|s| s.as_str()), Some(s) if s.chars().all(|c| c.is_whitespace())) { @@ -610,15 +606,18 @@ fn compress_prose_span(text: &str, mode: Mode) -> String { } // Trim trailing dangling conjunctions - let trimmed = trim_trailing_conjunction(out.trim_end()); + let trimmed = trim_trailing_conjunction(out.trim_end(), locale); // Strip stray leading punctuation left behind by dropped phrases // (e.g. "In general, you could…" → ", you could…" → "you could…"). let cleaned = strip_leading_orphan_punct(&trimmed); + // Also clean mid-string orphan commas after sentence boundaries + // (e.g. "end. , next" → "end. next" when a phrase starting mid-sentence is dropped). + let cleaned = clean_mid_orphan_punct(cleaned); // Ultra: word substitutions outside protected spans (we are inside one) let final_out = if mode == Mode::Ultra { - ultra_subs(cleaned) + ultra_subs(cleaned, locale) } else { cleaned }; @@ -634,6 +633,41 @@ fn compress_prose_span(text: &str, mode: Mode) -> String { } } +/// Remove orphan commas/semicolons that appear after sentence-boundary punctuation +/// (`. ,` → `. 
`) or after double-spaces introduced by phrase drops (` , ` → ` `). +fn clean_mid_orphan_punct(s: String) -> String { + let mut out = String::with_capacity(s.len()); + let chars: Vec = s.chars().collect(); + let mut i = 0; + while i < chars.len() { + let c = chars[i]; + // Pattern: sentence-end punct+space then comma — `. ,` or `! ,` or `? ,` + if matches!(c, '.' | '!' | '?') + && chars.get(i + 1) == Some(&' ') + && matches!(chars.get(i + 2), Some(&',') | Some(&';')) + { + out.push(c); // keep the sentence-end punct + out.push(' '); // keep one space + i += 3; // skip the orphan comma/semicolon + // also skip any space that follows the skipped comma + while i < chars.len() && chars[i] == ' ' { i += 1; } + continue; + } + // Pattern: space + orphan comma/semicolon + space → single space + if c == ' ' + && matches!(chars.get(i + 1), Some(&',') | Some(&';')) + && chars.get(i + 2) == Some(&' ') + { + out.push(' '); + i += 3; + continue; + } + out.push(c); + i += 1; + } + out +} + fn strip_leading_orphan_punct(s: &str) -> String { let trimmed = s.trim_start(); let mut chars = trimmed.chars().peekable(); @@ -664,61 +698,69 @@ fn strip_punct(s: &str) -> String { /// other structural punctuation are NEVER dropped (they may be link /// brackets or markup). fn is_clean_word(tok: &str) -> bool { - let bytes = tok.as_bytes(); - let mut i = 0; - // body: alphanumeric or apostrophe - while i < bytes.len() { - let c = bytes[i] as char; + let mut chars = tok.chars().peekable(); + let mut body_len = 0; + while let Some(&c) = chars.peek() { if c.is_alphanumeric() || c == '\'' { - i += 1; + chars.next(); + body_len += 1; } else { break; } } - if i == 0 { + if body_len == 0 { return false; } - // optional trailing punctuation - while i < bytes.len() { - let c = bytes[i] as char; - if matches!(c, ',' | '.' | ';' | ':' | '!' | '?') { - i += 1; - } else { + for c in chars { + if !matches!(c, ',' | '.' | ';' | ':' | '!' 
| '?') { return false; } } true } +/// Drop all case-insensitive occurrences of `needle` (and any immediately trailing spaces) +/// from `s`. `needle` must be pre-lowercased. +/// +/// Uses dual `(s_i, l_i)` byte cursors advanced one `s`-char at a time so that Unicode +/// case expansion (e.g. ß→ss) never desyncs the cursors. fn drop_phrase_ci(s: &str, needle: &str) -> String { - let mut result = String::with_capacity(s.len()); - let lower = s.to_lowercase(); - let mut i = 0; - while i < s.len() { - if lower[i..].starts_with(needle) { - // skip following whitespace too - let mut end = i + needle.len(); - while end < s.len() && s.as_bytes()[end] == b' ' { - end += 1; + // Build lowercase mirror of s for matching. + let lower: String = s.chars().flat_map(char::to_lowercase).collect(); + + let mut out = String::with_capacity(s.len()); + let mut s_i = 0usize; // byte cursor in s + let mut l_i = 0usize; // byte cursor in lower (invariant: lower[0..l_i] == lowercase(s[0..s_i])) + + while s_i < s.len() { + debug_assert!(l_i <= lower.len(), "l_i cursor must not exceed lower.len()"); + if lower[l_i..].starts_with(needle) { + // Advance both cursors together through the matched chars. + let l_end = l_i + needle.len(); + while l_i < l_end { + let ch = s[s_i..].chars().next().unwrap(); + s_i += ch.len_utf8(); + l_i += ch.to_lowercase().map(|c| c.len_utf8()).sum::(); + } + // Skip trailing ASCII spaces in both (space → space: 1 byte each). + while s_i < s.len() && s.as_bytes()[s_i] == b' ' { + s_i += 1; + l_i += 1; } - i = end; } else { - // copy one char - let next_boundary = s[i..] - .char_indices() - .nth(1) - .map(|(b, _)| i + b) - .unwrap_or(s.len()); - result.push_str(&s[i..next_boundary]); - i = next_boundary; + // Copy one char from s, advance both cursors. 
+            let ch = s[s_i..].chars().next().unwrap();
+            out.push(ch);
+            s_i += ch.len_utf8();
+            l_i += ch.to_lowercase().map(|c| c.len_utf8()).sum::<usize>();
         }
     }
-    result
+    out
 }
 
-fn trim_trailing_conjunction(s: &str) -> String {
+fn trim_trailing_conjunction(s: &str, locale: &Locale) -> String {
     let lower = s.to_lowercase();
-    for c in &[" and", " or", " but", " so"] {
+    for c in locale.conjunctions {
         if lower.ends_with(c) {
             return s[..s.len() - c.len()].trim_end().to_string();
         }
@@ -726,46 +768,55 @@ fn trim_trailing_conjunction(s: &str) -> String {
     s.to_string()
 }
 
-fn ultra_subs(mut s: String) -> String {
-    for (long, short) in ULTRA_SUBS {
+fn ultra_subs(mut s: String, locale: &Locale) -> String {
+    for (long, short) in locale.ultra_subs {
         s = replace_word_boundary(&s, long, short);
     }
     s
 }
 
+fn is_word_char_unicode(c: char) -> bool {
+    c.is_alphanumeric() || c == '_'
+}
+
 fn replace_word_boundary(s: &str, needle: &str, repl: &str) -> String {
+    let needle_lower: String = needle.chars().flat_map(char::to_lowercase).collect();
+    let chars: Vec<(usize, char)> = s.char_indices().collect();
     let mut out = String::with_capacity(s.len());
-    let bytes = s.as_bytes();
-    let nbytes = needle.as_bytes();
     let mut i = 0;
-    while i < bytes.len() {
-        if i + nbytes.len() <= bytes.len()
-            && bytes[i..i + nbytes.len()].eq_ignore_ascii_case(nbytes)
-        {
-            let prev_ok = i == 0 || !is_word_char(bytes[i - 1] as char);
-            let next_ok = i + nbytes.len() == bytes.len()
-                || !is_word_char(bytes[i + nbytes.len()] as char);
+    while i < chars.len() {
+        // Try to match needle_lower starting at chars[i]
+        let mut buf = String::new();
+        let mut j = i;
+        let mut matched = false;
+        while j < chars.len() {
+            for lc in chars[j].1.to_lowercase() {
+                buf.push(lc);
+            }
+            j += 1;
+            if buf == needle_lower {
+                matched = true;
+                break;
+            }
+            if !needle_lower.starts_with(&buf as &str) {
+                break;
+            }
+        }
+        if matched {
+            let prev_ok = i == 0 || !is_word_char_unicode(chars[i - 1].1);
+            let next_ok = j == chars.len() || 
!is_word_char_unicode(chars[j].1); if prev_ok && next_ok { out.push_str(repl); - i += nbytes.len(); + i = j; continue; } } - let next_boundary = s[i..] - .char_indices() - .nth(1) - .map(|(b, _)| i + b) - .unwrap_or(s.len()); - out.push_str(&s[i..next_boundary]); - i = next_boundary; + out.push(chars[i].1); + i += 1; } out } -fn is_word_char(c: char) -> bool { - c.is_alphanumeric() || c == '_' -} - #[cfg(test)] mod tests { use super::*; diff --git a/src/config.rs b/src/config.rs index 075ec0a..fae3f99 100644 --- a/src/config.rs +++ b/src/config.rs @@ -21,6 +21,7 @@ pub struct Config { // ── Output / memory-file flags ────────────────────────────────────── pub persona: Persona, pub auto_compress_md: bool, + pub lang: String, } impl Default for Config { @@ -48,6 +49,7 @@ impl Default for Config { summarize_threshold_lines: 500, persona: Persona::Ultra, auto_compress_md: true, + lang: "en".to_string(), } } } @@ -95,6 +97,7 @@ impl Config { } "persona" => c.persona = crate::commands::persona::from_str(v), "auto_compress_md" => c.auto_compress_md = v == "true", + "lang" => c.lang = v.to_string(), _ => {} } } diff --git a/tests/test_compress_md_i18n.rs b/tests/test_compress_md_i18n.rs new file mode 100644 index 0000000..ca85104 --- /dev/null +++ b/tests/test_compress_md_i18n.rs @@ -0,0 +1,354 @@ +//! i18n integration tests for compress_md. 
+ +use squeez::commands::compress_md::{compress_text, compress_text_with_locale, Locale, Mode}; +use squeez::config::Config; + +// ── Unit: Locale resolution ──────────────────────────────────────────────── + +#[test] +fn locale_from_code_aliases() { + for code in &["pt", "pt-BR", "pt_BR", "pt-br"] { + assert_eq!( + Locale::from_code(code).code, "pt-BR", + "alias '{}' should resolve to pt-BR", code + ); + } + for code in &["en", "", "xx", "fr", "de", "ja"] { + assert_eq!( + Locale::from_code(code).code, "en", + "unknown code '{}' should fall back to en", code + ); + } +} + +#[test] +fn config_lang_default_en() { + assert_eq!(Config::default().lang, "en"); +} + +#[test] +fn config_lang_parsed() { + let c = Config::from_str("lang=pt\n"); + assert_eq!(c.lang, "pt"); + let c2 = Config::from_str("lang = pt-BR\n"); + assert_eq!(c2.lang, "pt-BR"); + let c3 = Config::from_str("lang = en\n"); + assert_eq!(c3.lang, "en"); +} + +// ── Unit: Unicode-correct helpers ───────────────────────────────────────── + +#[test] +fn is_clean_word_accepts_accented_via_behavior() { + let pt = Locale::from_code("pt-BR"); + let r = compress_text_with_locale("apenas um teste\n", Mode::Full, pt); + assert!(!r.output.contains("apenas"), "filler 'apenas' must be dropped"); + assert!(r.output.contains("teste")); +} + +#[test] +fn replace_word_boundary_unicode_correct() { + let pt = Locale::from_code("pt-BR"); + let r = compress_text_with_locale( + "a função e o funcionário trabalham juntos\n", + Mode::Ultra, + pt, + ); + assert!(r.safe); + assert!(r.output.contains("fn"), "'função' must be abbreviated to 'fn'"); + assert!( + r.output.contains("funcionário"), + "'funcionário' must not be corrupted" + ); + assert!( + !r.output.contains("fnário"), + "partial word match 'fnário' must not occur" + ); +} + +#[test] +fn drop_phrase_ci_unicode_accented_haystack() { + let pt = Locale::from_code("pt-BR"); + let r = compress_text_with_locale( + "De modo geral, o sistema funciona bem\n", + Mode::Full, + pt, + 
); + assert!(r.safe); + assert!(!r.output.to_lowercase().contains("de modo geral")); + assert!(r.output.contains("sistema")); +} + +// ── Feature: pt-BR locale behavior ──────────────────────────────────────── + +#[test] +fn pt_br_articles_dropped() { + let pt = Locale::from_code("pt-BR"); + let r = compress_text_with_locale( + "o gato e a casa do João são bonitos\n", + Mode::Full, + pt, + ); + assert!(r.safe); + assert!(r.output.contains("gato")); + assert!(r.output.contains("João")); + assert!(!r.output.starts_with("o "), "leading article 'o' must be dropped"); +} + +#[test] +fn pt_br_fillers_dropped() { + let pt = Locale::from_code("pt-BR"); + let r = compress_text_with_locale( + "isso é basicamente apenas um teste simples\n", + Mode::Full, + pt, + ); + assert!(r.safe); + assert!(!r.output.contains("basicamente")); + assert!(!r.output.contains("apenas")); + assert!(r.output.contains("teste")); +} + +#[test] +fn pt_br_hedges_dropped() { + let pt = Locale::from_code("pt-BR"); + let r = compress_text_with_locale( + "talvez isso seja possível de implementar\n", + Mode::Full, + pt, + ); + assert!(r.safe); + assert!(!r.output.contains("talvez")); +} + +#[test] +fn pt_br_phrase_com_certeza_dropped() { + let pt = Locale::from_code("pt-BR"); + let r = compress_text_with_locale("Com certeza, posso ajudar você\n", Mode::Full, pt); + assert!(r.safe); + assert!(!r.output.to_lowercase().contains("com certeza")); +} + +#[test] +fn pt_br_phrase_de_modo_geral_dropped() { + let pt = Locale::from_code("pt-BR"); + let r = compress_text_with_locale("De modo geral, o sistema funciona\n", Mode::Full, pt); + assert!(r.safe); + assert!(!r.output.to_lowercase().contains("de modo geral")); +} + +#[test] +fn pt_br_ultra_subs_applied() { + let pt = Locale::from_code("pt-BR"); + let r = compress_text_with_locale( + "a configuração da função sem parâmetros\n", + Mode::Ultra, + pt, + ); + assert!(r.safe); + assert!(r.output.contains("config")); + assert!(r.output.contains("fn")); + 
assert!(r.output.contains("s/")); + assert!(r.output.contains("param")); +} + +#[test] +fn pt_br_preserves_accents_not_in_ultra_subs() { + let pt = Locale::from_code("pt-BR"); + let r_full = compress_text_with_locale("a nação precisa disso\n", Mode::Full, pt); + let r_ultra = compress_text_with_locale("a nação precisa disso\n", Mode::Ultra, pt); + assert!(r_full.output.contains("nação")); + assert!(r_ultra.output.contains("nação")); +} + +#[test] +fn pt_br_word_boundary_no_false_match() { + let pt = Locale::from_code("pt-BR"); + let r = compress_text_with_locale( + "o funcionário usa a função principal\n", + Mode::Ultra, + pt, + ); + assert!(r.safe); + assert!(r.output.contains("funcionário")); + assert!(!r.output.contains("fnário")); + assert!(r.output.contains("fn")); +} + +#[test] +fn pt_br_trim_trailing_conjunction() { + let pt = Locale::from_code("pt-BR"); + let r = compress_text_with_locale("compila o código e\n", Mode::Full, pt); + assert!(r.safe); + let trimmed = r.output.trim_end(); + assert!(!trimmed.ends_with(" e")); + assert!(!trimmed.ends_with(" ou")); +} + +#[test] +fn pt_br_url_preserved() { + let pt = Locale::from_code("pt-BR"); + let r = compress_text_with_locale( + "veja a documentação em https://example.com/docs para mais\n", + Mode::Full, + pt, + ); + assert!(r.safe); + assert!(r.output.contains("https://example.com/docs")); +} + +#[test] +fn pt_br_safety_check_realistic_fixture() { + let pt = Locale::from_code("pt-BR"); + let fixture = "\ +# Guia de Uso + +Este é um guia básico de configuração do sistema. \ +A função principal inicializa o repositório. + +```bash +cargo build --release +``` + +Veja https://example.com/docs para mais detalhes sobre a documentação. + +## Instalação + +Basicamente, você precisa apenas executar o comando acima. \ +Talvez seja necessário instalar as dependências primeiro. 
+ +| Comando | Descrição | +|---------------|---------------------| +| cargo build | Compila o projeto | +| cargo test | Executa os testes | +"; + let r = compress_text_with_locale(fixture, Mode::Full, pt); + assert!(r.safe, "safety check failed: {:?}", r.stats); + assert_eq!(r.stats.orig_headings, r.stats.new_headings); + assert_eq!(r.stats.orig_code_blocks, r.stats.new_code_blocks); + assert!(r.stats.new_urls >= r.stats.orig_urls); +} + +// ── Feature: code/table/heading preservation ────────────────────────────── + +#[test] +fn pt_br_code_block_untouched() { + let pt = Locale::from_code("pt-BR"); + let input = "Use a função\n```\nfn configuração() {}\n```\n"; + let r = compress_text_with_locale(input, Mode::Ultra, pt); + assert!(r.safe); + assert!(r.output.contains("fn configuração() {}")); +} + +#[test] +fn pt_br_table_preserved() { + let pt = Locale::from_code("pt-BR"); + let input = "Intro.\n\n| coluna | valor |\n|--------|-------|\n| a | 1 |\n\nFim.\n"; + let r = compress_text_with_locale(input, Mode::Full, pt); + assert!(r.safe); + assert!(r.output.contains("| coluna | valor |")); + assert!(r.output.contains("| a | 1 |")); +} + +#[test] +fn pt_br_headings_preserved() { + let pt = Locale::from_code("pt-BR"); + let input = "# Título\n\nconteúdo\n\n## Seção\n\nmais conteúdo\n"; + let r = compress_text_with_locale(input, Mode::Full, pt); + assert_eq!(r.stats.orig_headings, r.stats.new_headings); + assert!(r.safe); +} + +#[test] +fn pt_br_idempotent_second_pass() { + let pt = Locale::from_code("pt-BR"); + let input = "# Título\n\nO sistema funciona bem com esta configuração.\n"; + let r1 = compress_text_with_locale(input, Mode::Full, pt); + let r2 = compress_text_with_locale(&r1.output, Mode::Full, pt); + assert!(r2.safe); + assert_eq!(r2.stats.new_headings, r1.stats.new_headings); + assert_eq!(r2.stats.new_code_blocks, r1.stats.new_code_blocks); +} + +// ── EN locale regression ─────────────────────────────────────────────────── + +#[test] +fn 
en_compress_text_matches_with_locale() { + let en = Locale::from_code("en"); + let inputs = [ + "The quick brown fox really just jumps.\n", + "Configure the function with these parameters.\n", + "# Title\n\nSome prose with the article.\n```rust\nfn main() {}\n```\n", + ]; + for input in &inputs { + let legacy = compress_text(input, Mode::Full); + let with_locale = compress_text_with_locale(input, Mode::Full, en); + assert_eq!(legacy.output, with_locale.output, "input: {:?}", input); + } +} + +#[test] +fn en_articles_still_dropped() { + let en = Locale::from_code("en"); + let r = compress_text_with_locale("The quick brown fox jumped over the lazy dog.\n", Mode::Full, en); + assert!(!r.output.to_lowercase().contains(" the ")); + assert!(r.output.contains("fox")); +} + +#[test] +fn en_ultra_subs_still_work() { + let en = Locale::from_code("en"); + let r = compress_text_with_locale( + "Configure the function with these parameters.\n", + Mode::Ultra, + en, + ); + assert!(r.output.contains("fn")); + assert!(r.output.contains("w/")); + assert!(r.output.contains("param")); +} + +// ── Cross-locale contract ───────────────────────────────────────────────── + +fn assert_locale_contract(locale: &'static Locale, label: &str) { + let fixture = "# Title\n\nSome prose content here.\n\n```bash\necho hello\n```\n\nSee https://example.com for details.\n\n| col1 | col2 |\n|------|------|\n| a | b |\n"; + let r = compress_text_with_locale(fixture, Mode::Full, locale); + assert!(r.safe, "[{}] safety check failed", label); + assert_eq!(r.stats.orig_headings, r.stats.new_headings, "[{}] headings", label); + assert_eq!(r.stats.orig_code_blocks, r.stats.new_code_blocks, "[{}] code blocks", label); + assert!(r.stats.new_urls >= r.stats.orig_urls, "[{}] urls", label); + assert!(r.stats.new_bytes > 0, "[{}] output not empty", label); +} + +#[test] +fn contract_en_locale() { + assert_locale_contract(Locale::from_code("en"), "en"); +} + +#[test] +fn contract_pt_br_locale() { + 
assert_locale_contract(Locale::from_code("pt-BR"), "pt-BR"); +} + +#[test] +fn contract_unknown_locale_falls_back_to_en() { + for code in &["fr", "de", "ja", "zh", "ar", "ru", "es", "it"] { + let locale = Locale::from_code(code); + assert_eq!(locale.code, "en", "unknown '{}' should fall back to en", code); + assert_locale_contract(locale, code); + } +} + +#[test] +fn ultra_mode_contract_both_locales() { + let input = "# Section\n\nThis is some prose content with details.\n\n```rust\nfn main() {}\n```\n"; + let pt_input = "# Seção\n\nEste é o conteúdo com detalhes da configuração.\n\n```rust\nfn main() {}\n```\n"; + + let r_en = compress_text_with_locale(input, Mode::Ultra, Locale::from_code("en")); + let r_pt = compress_text_with_locale(pt_input, Mode::Ultra, Locale::from_code("pt-BR")); + + assert!(r_en.safe); + assert!(r_pt.safe); + assert!(r_en.output.contains("fn main() {}")); + assert!(r_pt.output.contains("fn main() {}")); +}
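
The dual `(s_i, l_i)` byte cursors in `drop_phrase_ci` above exist because `to_lowercase()` can change a string's byte length, so a single index cannot walk both the original and its lowercase mirror in lockstep. A minimal standalone sketch of the desync being guarded against, using only the standard library (illustrative, not squeez code):

```rust
// Why one byte index cannot serve both a string and its to_lowercase()
// mirror: Unicode case mapping is not length-preserving.
fn main() {
    let s = "İstanbul"; // 'İ' (U+0130) occupies 2 bytes in UTF-8
    let lower = s.to_lowercase(); // 'İ' lowercases to 'i' + U+0307 (3 bytes)

    assert_eq!(s.len(), 9);
    assert_eq!(lower.len(), 10); // byte lengths diverge after one char
    assert_eq!(s.chars().count(), 8);
    assert_eq!(lower.chars().count(), 9); // char counts diverge too

    println!("{} bytes vs {} bytes", s.len(), lower.len());
}
```

Indexing `lower` with a cursor computed on `s` would slice mid-codepoint (or at the wrong character) as soon as such an expansion occurs, which is why the two cursors are advanced together, one `s`-char at a time.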