Ultra-low latency, highly accurate RAG context compression pipeline.
Context Sieve v8.5 uses a 2-stage hierarchical pipeline to compress retrieved context tokens by 60-75% before sending them to the Main LLM.
- Stage 1: Parent-Child retrieval with 480-token parent chunks.
- Stage 2: NanoPruner (MiniLM-L6) token classification with adaptive POS-based dilation and entropy-based confidence fallback.
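Stage 1 can be sketched as follows. The chunk sizes match the pipeline above, but `build_index`, `retrieve_parent`, and the bag-of-words overlap score are illustrative stand-ins for the real embedding retriever, not the project's API.

```python
# Toy sketch of Stage 1 (parent-child retrieval): index a document into
# 480-token parent chunks split into smaller child chunks, score the children
# against the query, and return the best child's *parent* as context.
from dataclasses import dataclass

@dataclass
class Chunk:
    parent_id: int
    text: str

def build_index(document, parent_size=480, child_size=120):
    tokens = document.split()
    parents, children = [], []
    for pid, start in enumerate(range(0, len(tokens), parent_size)):
        ptoks = tokens[start:start + parent_size]
        parents.append(" ".join(ptoks))
        for c in range(0, len(ptoks), child_size):
            children.append(Chunk(pid, " ".join(ptoks[c:c + child_size])))
    return parents, children

def retrieve_parent(query, parents, children):
    # Bag-of-words overlap stands in for embedding similarity.
    qset = set(query.lower().split())
    best = max(children, key=lambda ch: len(qset & set(ch.text.lower().split())))
    return parents[best.parent_id]
```

Retrieving on small child chunks keeps matching precise, while returning the larger parent gives the Main LLM enough surrounding context for Stage 2 to prune.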
```mermaid
flowchart TB
    subgraph OFFLINE["Offline Training Pipeline"]
        direction TB
        C["Corpus\n(query + document pairs)"]
        C --> DG["Data Generator\n(Proxy LLM extracts key spans)"]
        DG --> AL["Label Aligner\n(RapidFuzz char-level matching)"]
        AL --> TD["Training Data\n(token-level 0/1 labels)"]
        TD --> TR["Trainer\n(Fine-tune MiniLM-L6)"]
        TR --> OX["ONNX Export + INT8 Quantize"]
        OX --> CAL["Threshold Calibrator\n(Binary search on val set)"]
        CAL --> MODEL["model_int8.onnx\n+ calibration.json"]
    end

    subgraph RL["RL Fine-tuning (Contextual Bandit)"]
        direction TB
        NIAH["NIAH Data Generator\n(needle-in-haystack)"] --> ENV["Environment\n(T=1 Bandit)"]
        CHK_PT["Cold-Start Model"] --> PPO["PPO Trainer\n(Bernoulli Action)"]
        ENV <-->|"keep_mask / reward"| PPO
        PPO --> JUDGE["LLM Judge proxy\n(+ R_linkage, R_size)"]
        JUDGE --> PPO
        PPO --> BEST_RL["best_rl_model.pt"]
    end

    subgraph INDEXER["Offline Indexer"]
        direction TB
        DOC["Raw Documents"] --> CHK["480-Token\nParent Chunks"]
        CHK --> POS["spaCy POS Tagger"]
        POS --> DIL["Boolean Dilation Array\n(Noun/Number/Negation)"]
    end

    subgraph HOTPATH["Real-Time Inference Hotpath"]
        direction TB
        Q["User Query"] --> FWD
        CTX["Retrieved Chunks"] --> FWD
        DIL -.->|"zero-cost lookup"| DILATE
        MODEL -.->|"load once"| FWD
        FWD["NanoPruner Forward Pass\n(ONNX INT8)"]
        FWD --> ENT{"Entropy\nCheck"}
        ENT -->|"High uncertainty"| BYPASS["Bypass → Return Raw"]
        ENT -->|"Confident"| THRESH["Apply Calibrated\nThreshold"]
        THRESH --> DILATE["Adaptive Dilation\n(Protect nouns/numbers)"]
        DILATE --> CAP["Sentence-Aware\nHard Cap"]
        CAP --> RECON["Offset-Based\nReconstruction"]
        RECON --> OUT["Compressed Text\n(60-75% smaller)"]
    end

    style OFFLINE fill:#1a1a2e,stroke:#e94560,color:#eee
    style RL fill:#301b3f,stroke:#fca311,color:#eee
    style INDEXER fill:#16213e,stroke:#0f3460,color:#eee
    style HOTPATH fill:#0f3460,stroke:#53d769,color:#eee
    style MODEL fill:#e94560,stroke:#e94560,color:#fff
    style OUT fill:#53d769,stroke:#53d769,color:#000
```
```bash
python -m venv .venv
.\.venv\Scripts\activate
pip install pysbd rapidfuzz sentence-transformers transformers spacy onnxruntime onnx onnxscript numpy torch python-dotenv
python -m spacy download en_core_web_sm
```

Edit `.env` with your proxy credentials:

```
API_URL=https://your-proxy/v1/chat/completions
API_KEY=your-key-here
```

```bash
# Generate 250 synthetic examples using 10 parallel threads
python tools/expand_corpus.py --count 250 --workers 10
```

Generates synthetic (query, document) pairs using your proxy LLM and appends them to `data/corpus.json`.
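Internally this is a thread-pool fan-out over the proxy LLM. A minimal sketch of the idea (the `generate_pair` stub and the `{"query", "document"}` schema are assumptions here, not the real tool):

```python
# Minimal sketch of parallel corpus expansion: fan N generation calls out over
# a thread pool. generate_pair stands in for the real proxy-LLM request
# (which reads API_URL/API_KEY from .env in the actual tool).
import json
from concurrent.futures import ThreadPoolExecutor

def generate_pair(seed):
    # Stand-in for one proxy-LLM call producing a synthetic (query, document) pair.
    return {"query": f"question {seed}?", "document": f"synthetic document {seed}"}

def expand_corpus(count, workers=10):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(generate_pair, range(count)))

pairs = expand_corpus(count=5, workers=2)
print(json.dumps(pairs, indent=2))  # entries like these get appended to data/corpus.json
```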
```bash
# Generate labels for the corpus (10-15 parallel workers is a good range)
python -m context_sieve.trainer.data_generator --corpus data/corpus.json --output data/training_data.json --workers 10
```

Supports resuming: if the process stops, simply restart it to pick up where you left off.
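The alignment step that turns extracted spans into token labels can be approximated as below. This is a sketch of the idea only: `difflib` stands in for RapidFuzz, and whitespace tokenization stands in for the MiniLM tokenizer's offset mapping.

```python
# Sketch of token-level label alignment: the proxy LLM returns an extracted
# key span as free text; we locate it in the document by character-level
# matching and label overlapping tokens 1, all others 0.
import difflib
import re

def align_labels(document, extracted_span):
    # Token character offsets from a simple whitespace tokenizer.
    tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", document)]
    # Character-level match of the extracted span against the document.
    sm = difflib.SequenceMatcher(None, document, extracted_span, autojunk=False)
    blocks = [b for b in sm.get_matching_blocks() if b.size > 0]
    if not blocks:
        return [0] * len(tokens)
    lo = min(b.a for b in blocks)
    hi = max(b.a + b.size for b in blocks)
    # A token is "key" (label 1) if it overlaps the matched character range.
    return [1 if (start < hi and end > lo) else 0 for _, start, end in tokens]
```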
```bash
# Full pipeline (train + ONNX export + INT8 quantize + calibrate)
python -m context_sieve.trainer.train --data data/training_data.json --full-pipeline --epochs 15

# Resume after a crash (skip training, just export and calibrate)
python -m context_sieve.trainer.train --data data/training_data.json --full-pipeline --skip-training
```

Once the cold-start model is trained (`best_checkpoint.pt`), start the RL pipeline to optimize for complex multi-turn coreferences using the Contextual Bandit PPO loop:

```bash
python -m context_sieve.rl.train --cold_start models/nanopruner/best_checkpoint.pt
```

Run the benchmark:

```bash
python benchmark_runner.py
```

Auto-detects the trained ONNX model; results are logged to `result.log`.
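For intuition on the bandit's reward shaping, here is an illustrative combination of the judge, linkage, and size terms. The weights and exact term definitions are assumptions for this sketch, not the project's actual reward function.

```python
# Sketch of T=1 contextual-bandit reward shaping: one Bernoulli keep/drop
# action per token, one scalar reward per episode.
def reward(keep_mask, answer_correct, w_link=0.2, w_size=0.3):
    n = len(keep_mask)
    r_judge = 1.0 if answer_correct else -1.0            # LLM-judge proxy term
    # R_linkage: penalize fragmentation (kept tokens separated by gaps).
    transitions = sum(1 for a, b in zip(keep_mask, keep_mask[1:]) if a != b)
    r_link = -transitions / max(n - 1, 1)
    # R_size: compression pressure, rewarding dropped tokens.
    r_size = 1.0 - sum(keep_mask) / n
    return r_judge + w_link * r_link + w_size * r_size
```

With this shaping, a contiguous keep mask beats an equally sized but fragmented one, which nudges PPO toward coherent spans rather than scattered tokens.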
| Module | Purpose |
|---|---|
| `context_sieve/indexer.py` | Offline indexing with POS metadata generation |
| `context_sieve/inference.py` | Real-time compression engine |
| `context_sieve/trainer/aligner.py` | GPT-4o extractive label alignment (RapidFuzz) |
| `context_sieve/trainer/calibrate.py` | INT8 threshold calibration |
| `context_sieve/trainer/data_generator.py` | LLM-powered training data generation |
| `context_sieve/trainer/train.py` | Fine-tuning + ONNX export pipeline |
| `context_sieve/rl/` | PPO Contextual Bandit, LLM Judge, and NIAH Env |
| `tools/expand_corpus.py` | Synthetic corpus expansion |
- Zero-Latency Dilation: POS tagging (Noun/Number/Negation) is performed offline during indexing. The inference engine uses a pure boolean array lookup.
- Entropy Fallback: If the model is uncertain, pruning is bypassed to preserve semantic safety.
- Sum-of-Prob Sentence Scoring: Avoids length bias in hard-capping by scoring sentences based on total token probabilities.
- Offset-Based Reconstruction: Compressed text preserves original spacing, punctuation, and casing.
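The entropy fallback, calibrated threshold, dilation lookup, and offset-based reconstruction compose roughly as below. The threshold, entropy cap, and dilation radius here are illustrative values, not the shipped calibration, and the function names are this sketch's own.

```python
# Sketch of the hotpath decision logic on precomputed per-token keep
# probabilities (the real probabilities come from the ONNX INT8 forward pass).
import math

def compress(text, token_spans, probs, protected,
             threshold=0.5, entropy_cap=0.6, dilation=1):
    # Entropy fallback: if mean Bernoulli entropy is high, bypass and return raw.
    def h(p):
        p = min(max(p, 1e-6), 1.0 - 1e-6)
        return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    if sum(h(p) for p in probs) / len(probs) > entropy_cap:
        return text
    # Calibrated threshold, then dilation: each kept token rescues protected
    # neighbors (nouns/numbers/negations flagged offline by the indexer).
    keep = [p >= threshold for p in probs]
    for i, kept in enumerate(list(keep)):
        if kept:
            for j in range(max(0, i - dilation), min(len(keep), i + dilation + 1)):
                keep[j] = keep[j] or protected[j]
    # Offset-based reconstruction: adjacent kept tokens retain their original
    # separator text (spacing/punctuation/casing); gaps collapse to one space.
    out, prev_i = [], None
    for i, ((start, end), k) in enumerate(zip(token_spans, keep)):
        if not k:
            continue
        if prev_i is not None:
            out.append(text[token_spans[prev_i][1]:start] if prev_i == i - 1 else " ")
        out.append(text[start:end])
        prev_i = i
    return "".join(out)
```

Because dilation is a boolean-array lookup over the offline `protected` flags, it adds no model calls to the hotpath, matching the zero-latency claim above.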