🌸 Sakura Sumi - Visual Token Arbitrage Engine transforms raw source code into high-density visual tokens for LLM context maximization. Achieve 7-20x token compression, letting entire repositories fit comfortably inside a 2M-token context window with perfect retrieval.
The engine discovers files, applies deterministic arbitrage algorithms to maximize information density, and generates vision-optimized PDFs. Features include multi-worker parallelization, a beautiful Sakura-themed Web UI, DeepSeek-OCR integration, and a dedicated "Prompt Collector" for text fragments. See Arbitrage Methodology for technical details.
Only text-based files with known extensions are processed. Everything else is skipped so we don't blow up on binaries or random junk.
Source Code:
- TypeScript/JavaScript: .ts, .tsx, .js, .jsx
- Python: .py, .pyx
- Java/Kotlin: .java, .kt
- Systems Languages: .go, .rs, .cpp, .c, .h, .hpp
- Other Languages: .rb, .php, .swift, .dart
Configuration Files:
.json, .yaml, .yml, .toml, .ini, .cfg, .conf, .xml, .properties, .env, .config
Stylesheets:
.css, .scss, .sass, .less, .styl
Markup & Documentation:
.html, .htm, .xhtml, .md, .markdown, .txt
Document Formats:
.docx - Microsoft Word documents (text extraction supported)
The following are automatically excluded and will not be processed:
- Binary files: Any file without a supported extension (images, executables, archives, etc.)
  - Examples: .jpg, .png, .pdf, .exe, .dll, .so, .zip, .tar, .xlsx
- Build artifacts: node_modules, dist, build, out, bin, obj
- Version control: .git, .svn, .hg
- Cache directories: __pycache__, .pytest_cache, .mypy_cache, .next, .nuxt
- Virtual environments: .venv, venv, env
- IDE directories: .idea, .vscode, .vs
- Other: .gradle, gradle, coverage, .nyc_output, *.log, *.tmp, *.swp, *.swo, .DS_Store
You can add additional exclusion patterns using the --exclude CLI option.
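The filtering rules above can be pictured with a small sketch. This is not the project's actual discovery code: the function name `should_process`, the shortened extension and directory sets, and the choice to match extra patterns against file names only are all assumptions made for illustration.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath
from typing import Sequence

# Shortened subsets of the documented lists, for illustration only.
SUPPORTED_EXTENSIONS = {".py", ".ts", ".js", ".json", ".yaml", ".md", ".docx"}
EXCLUDED_DIRS = {"node_modules", "dist", "build", ".git", "__pycache__", ".venv", "venv"}
EXCLUDED_FILE_PATTERNS = ["*.log", "*.tmp", "*.swp", "*.swo", ".DS_Store"]


def should_process(relative_path: str, extra_excludes: Sequence[str] = ()) -> bool:
    """Return True if a repo-relative path passes this hypothetical filter."""
    path = PurePosixPath(relative_path)
    # Skip anything that lives under an excluded directory.
    if any(part in EXCLUDED_DIRS for part in path.parts[:-1]):
        return False
    # Skip files whose name matches a built-in or user-supplied glob.
    patterns = [*EXCLUDED_FILE_PATTERNS, *extra_excludes]
    if any(fnmatch(path.name, pattern) for pattern in patterns):
        return False
    # Keep only files with a known text extension.
    return path.suffix.lower() in SUPPORTED_EXTENSIONS
```

For example, `should_process("src/app.py")` is true, while `should_process("node_modules/pkg/index.js")` and `should_process("assets/logo.png")` are both false. Directory-style patterns such as `docs/` would need extra handling that this sketch leaves out.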
The system handles problematic files gracefully without crashing:
- Encoding issues: Attempts multiple encodings (UTF-8, latin-1, cp1252, iso-8859-1). If all fail, the file is skipped with a warning.
- Permission errors: Files that cannot be read are skipped with a warning.
- Very large files: Files >100MB may cause issues (see Troubleshooting section).
- Empty files: Files with 0 bytes are automatically skipped.
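As a rough illustration of the skip-with-a-warning behavior described above, here is a minimal sketch of an encoding fallback reader. The function name `read_text_with_fallback` and the warning format are hypothetical; only the encoding order comes from the list above.

```python
from pathlib import Path
from typing import Optional

# Encodings tried in the order documented above.
ENCODINGS = ("utf-8", "latin-1", "cp1252", "iso-8859-1")


def read_text_with_fallback(path: Path) -> Optional[str]:
    """Return the decoded text, or None if the file should be skipped."""
    try:
        raw = path.read_bytes()
    except OSError as exc:  # covers permission errors and missing files
        print(f"warning: skipping {path}: {exc}")
        return None
    if not raw:
        return None  # 0-byte files are skipped silently
    for encoding in ENCODINGS:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # try the next encoding in the list
    print(f"warning: skipping {path}: no encoding worked")
    return None
```

One thing worth knowing about this ordering: latin-1 can decode any byte sequence, so in a sketch like this the encodings listed after it are never actually reached.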
So: we only convert files that match the list above. If your repo has 72k files, we'll only touch the ones that look like source, config, or docs. No images, no binaries, no archives.
You need Python (from python.org if you don't have it; check "Add Python to PATH" when installing). Then:
git clone https://github.com/MichaelWeed/sakura-sumi.git
cd sakura-sumi
python3 -m venv venv
source venv/bin/activate # macOS/Linux; on Windows: venv\Scripts\activate
pip install -r requirements.txt
python scripts/compress.py "/path/to/your/codebase" -v
PDFs show up in a folder named {your_codebase}_ocr_ready/.
No terminal needed. Double-click the right file for your OS; it will create a venv and install deps if needed, then open the web portal (and your browser).
| OS | File |
|---|---|
| macOS | Sakura Sumi.command |
| Windows | Sakura Sumi.bat |
| Linux | Sakura Sumi.sh |
On Linux, make sure the script is executable (chmod +x "Sakura Sumi.sh"). To add it to your application menu, run ./install-linux-launcher.sh once.
CLI usage:
# Simple compression
python scripts/compress.py "/path/to/codebase" -v
# Parallel processing (faster for large codebases)
python scripts/compress.py "/path/to/codebase" --parallel --workers 4 -v
# Resume from checkpoint (if interrupted)
python scripts/compress.py "/path/to/codebase" --resume -v
# Generate metrics report
python scripts/compress.py "/path/to/codebase" --generate-report markdown -v
# Generate visualization charts
python scripts/compress.py "/path/to/codebase" --generate-charts -v
Full options are in CLI Options. For OCR modes see Compression Modes.
If you'd rather not use the CLI:
# Start the web server
python scripts/run_web.py
# Open http://localhost:5001 in your browser
You get a browser UI with folder picker, progress, token estimates, and job history. Same features as the CLI, just through the UI.
No local Python? Use Docker.
# Build the image
docker build -t sakura-sumi .
# Run the web portal (maps port 5001 and persists build artifacts)
docker run --rm -p 5001:5001 \
-v "$(pwd)/build:/app/build" \
-v "$(pwd)/output:/app/output" \
sakura-sumi
Or docker compose up --build. The compose setup mounts ./build and ./output so job history and results stick around. Override with FLASK_SECRET_KEY or SAKURA_TELEMETRY in .env if you need to.
In the web portal, the Prompt Collector lets you paste long text (prompts, docs, etc.) and compress it to PDF without pointing at a directory. Click the button, add prompts, then "Compress Prompts."
Diagnostic info (timing, file counts, errors, metrics) is written locally to telemetry.log in the output directory. Nothing is sent off-machine. JSONL, one object per line.
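Because the log is JSONL, it can be read with nothing but the standard library. The sketch below assumes only what is stated above (one JSON object per line in telemetry.log); the function name `load_telemetry` is hypothetical, and the keys inside each object are not documented here, so the code does not rely on any of them.

```python
import json
from pathlib import Path


def load_telemetry(log_path):
    """Parse a JSONL telemetry log into a list of dicts."""
    events = []
    with Path(log_path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            try:
                events.append(json.loads(line))
            except json.JSONDecodeError:
                continue  # tolerate a partially written last line
    return events
```

A call such as `load_telemetry("my-project_ocr_ready/telemetry.log")` would return the parsed objects in file order; inspect one entry to see which fields your version writes.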
To turn it off:
# Disable telemetry logging
export SAKURA_TELEMETRY=0
python scripts/compress.py "/path/to/codebase" -v
# Or set to false/off
export SAKURA_TELEMETRY=false
# OR
export SAKURA_TELEMETRY=off
Required:
source_dir: Source codebase directory to compress
Output:
-o, --output: Output directory for PDFs (default: {source_dir}_ocr_ready)
-v, --verbose: Verbose output with progress and statistics
Processing:
--parallel: Enable parallel processing (faster for large codebases)
--workers N: Number of parallel workers (default: CPU count - 1)
--batch-size N: Batch size for sequential processing (default: 10)
--resume: Resume from checkpoint if available (useful if interrupted)
--retry N: Number of retries for failed files (default: 3)
Compression:
--ocr: Enable DeepSeek-OCR compression (requires additional dependencies)
--ocr-mode {small|medium|large|maximum}: OCR compression mode (default: small)
    small: 7x compression, 97% accuracy (recommended)
    medium: 10x compression, 97% accuracy
    large: 15x compression, 85-90% accuracy
    maximum: 20x compression, 60% accuracy (not recommended for code)
Reporting:
--generate-report {json|markdown|html}: Generate metrics report
--generate-charts: Generate visualization charts
Filtering:
--exclude: Additional exclusion patterns (space-separated, .gitignore-style)
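If you want to drive the CLI from another Python script (a nightly job, say), one option is to assemble the argument list from the flags documented above and hand it to `subprocess`. This is a sketch under assumptions: `build_command` and `run_compression` are hypothetical helper names, and it assumes the script is run from the repository root so that `scripts/compress.py` resolves.

```python
import subprocess
import sys


def build_command(source_dir, workers=None, report=None, excludes=()):
    """Assemble an argument list using only flags documented above."""
    cmd = [sys.executable, "scripts/compress.py", source_dir, "-v"]
    if workers is not None:
        cmd += ["--parallel", "--workers", str(workers)]
    if report is not None:
        cmd += ["--generate-report", report]
    if excludes:
        cmd += ["--exclude", *excludes]
    return cmd


def run_compression(source_dir, **options):
    """Run the CLI and raise CalledProcessError on a non-zero exit."""
    return subprocess.run(build_command(source_dir, **options), check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.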
- Converts source files to dense PDFs
- Minimal margins, small fonts for maximum density
- Suitable for most use cases
Requires additional dependencies: pip install vllm transformers
- Small: 7x compression, 97% accuracy (recommended for code)
- Medium: 10x compression, 97% accuracy
- Large: 15x compression, 85-90% accuracy
- Maximum: 20x compression, 60% accuracy (not recommended for code)
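To get a feel for which mode a given codebase might need, the nominal ratios above can be turned into a back-of-the-envelope estimate. This sketch treats the listed ratios as exact, which real runs will not be, and the function names are hypothetical.

```python
# Nominal compression ratios from the mode list above.
OCR_MODE_RATIOS = {"small": 7, "medium": 10, "large": 15, "maximum": 20}
CONTEXT_LIMIT = 2_000_000  # the 2M-token window this README targets


def estimated_tokens(raw_tokens: int, mode: str) -> int:
    """Estimate the compressed token count, rounding up."""
    ratio = OCR_MODE_RATIOS[mode]
    return -(-raw_tokens // ratio)  # ceiling division


def smallest_mode_that_fits(raw_tokens: int):
    """Return the least aggressive mode that fits the limit, or None."""
    for mode in ("small", "medium", "large", "maximum"):
        if estimated_tokens(raw_tokens, mode) <= CONTEXT_LIMIT:
            return mode
    return None
```

Preferring the least aggressive mode keeps accuracy as high as possible, since the higher ratios trade accuracy for density.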
OCR Compression/
├── src/                    # Source code (Python package)
│   ├── compression/        # Core compression modules
│   ├── utils/              # Utility modules (discovery, metrics)
│   ├── web/                # Web portal (Flask app)
│   └── main.py             # CLI entry point
├── tests/                  # Test suite (mirrors src structure)
├── scripts/                # Executable scripts
│   ├── compress.py         # CLI convenience script
│   └── run_web.py          # Web portal launcher
├── config/                 # Configuration files
│   ├── deepseek_ocr.json
│   └── love_oracle_ai.yaml
├── docs/                   # Documentation (organized by audience)
│   ├── user/               # User documentation
│   ├── developer/          # Developer documentation
│   ├── api/                # API documentation
│   └── design/             # Design documentation
├── bugs/                   # Bug tracking (BUG-XXX.yaml format)
├── archive/                # Historical records and completed items
├── build/                  # Build artifacts (gitignored)
├── dist/                   # Distribution packages (gitignored)
├── setup.py                # Package setup configuration
├── requirements.txt        # Python dependencies
└── README.md               # This file
For Contributors: We maintain comprehensive test coverage (currently 81%). All tests must pass before merging PRs.
# Run all tests
pytest tests/
# Run specific test file
pytest tests/test_file_discovery.py
# Run with coverage report
pytest --cov=src tests/
# Run with detailed output
pytest tests/ -v
Test Structure:
- Unit tests for individual components
- Integration tests for pipeline workflows
- End-to-end tests for CLI and web portal
- All tests use pytest fixtures for clean setup/teardown
Why compress? Gemini has a 2M token context window. Large codebases can exceed this limit. Sakura Sumi compresses your code to fit within this window.
Step-by-step:
1. Compress your codebase:
   python scripts/compress.py "/path/to/codebase" -v
2. Check if it fits:
   python scripts/compress.py "/path/to/codebase" --generate-report markdown -v
   Look for the "Gemini Compatibility" section; it will show ✅ if the PDFs fit within 2M tokens.
3. Upload to Gemini:
   - Go to gemini.google.com
   - Click the upload button
   - Select PDFs from your output directory (usually {codebase}_ocr_ready/)
   - Gemini processes PDFs as images at ~258 tokens per image
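The ~258 tokens per image figure gives a quick way to sanity-check a page budget by hand. The sketch below assumes each PDF page counts as one image, which is an assumption rather than something stated above, and the function names are hypothetical.

```python
TOKENS_PER_IMAGE = 258      # approximate figure quoted above
CONTEXT_LIMIT = 2_000_000   # Gemini's 2M-token window


def tokens_for_pages(page_count: int) -> int:
    """Approximate token cost, assuming one image per PDF page."""
    return page_count * TOKENS_PER_IMAGE


def max_pages() -> int:
    """Largest page count whose estimated cost stays within the limit."""
    return CONTEXT_LIMIT // TOKENS_PER_IMAGE


def fits_gemini(page_count: int) -> bool:
    return tokens_for_pages(page_count) <= CONTEXT_LIMIT
```

Under these assumptions the budget works out to about 7,751 pages in total across all uploaded PDFs, before leaving any room for the prompt itself.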
Pro Tip: Use the web portal's token estimation feature to check compatibility before compressing!
Simple compression:
python scripts/compress.py "/Users/johndoe/my-project" -v
Compress and see detailed report:
python scripts/compress.py "/Users/johndoe/my-project" \
  --generate-report markdown \
  -v
Large codebase with parallel processing:
python scripts/compress.py "/path/to/large-codebase" \
  --parallel --workers 8 \
  --generate-report markdown \
  --generate-charts \
  -v
Resume interrupted compression:
# First run (gets interrupted)
python scripts/compress.py "/path/to/codebase" -v
# Resume from where it left off
python scripts/compress.py "/path/to/codebase" --resume -v
Maximum compression with OCR:
python scripts/compress.py "/path/to/codebase" \
  --ocr --ocr-mode medium \
  --parallel --workers 4 \
  -v
Exclude specific patterns:
python scripts/compress.py "/path/to/codebase" \
  --exclude "*.test.js" "docs/" "*.spec.ts" \
  -v
"No files found": The directory has no files with supported extensions. Use something that has .py, .js, .ts, etc. (see File Type Support).
No PDFs / weird failures: Look at failed_files.json in the output folder and at telemetry.log for details. Permissions and encoding can trip things up; ls -la on the source dir is a quick check.
Parallel runs failing: Try --workers 2 or drop --parallel.
OCR not available: Optional. Install with pip install vllm transformers if you want it. Plain PDF compression doesn't need it.
Out of memory: Lower --batch-size and --workers, or run on a smaller subset of the tree.
Permission denied: Fix read access on the source files; failed_files.json will list what failed.
More: docs/user/TROUBLESHOOTING.md. Right-click / Quick Action issues: docs/user/RIGHT_CLICK_COMPRESS.md.
This project is licensed under the MIT License - see the LICENSE file for details.
See CONTRIBUTING.md for setup, style, and how to send a PR. Run the test suite before submitting. More for people hacking on the code: docs/developer/DEVELOPER.md.
Bugs and questions: look in docs/ and bugs/, or open an issue on GitHub.
