
# Context Sieve v8.5

Ultra-low latency, highly accurate RAG context compression pipeline.

## Overview

Context Sieve v8.5 uses a 2-stage hierarchical pipeline to compress retrieved context tokens by 60-75% before sending them to the Main LLM.

  • Stage 1: Parent-Child retrieval with 480-token parent chunks.
  • Stage 2: NanoPruner (MiniLM-L6) token classification with adaptive POS-based dilation and entropy-based confidence fallback.
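The two stages above can be pictured with a toy sketch: fixed-size parent chunking followed by token-level keep/drop classification. All names here are illustrative, not the package's actual API; the real pipeline scores tokens with MiniLM-L6 via ONNX rather than taking precomputed probabilities.

```python
# Toy sketch of the 2-stage flow: parent chunking + token-level pruning.
# Names and probabilities are illustrative, not the package's real API.

def chunk_into_parents(tokens, parent_size=480):
    """Stage 1 (simplified): split a token stream into fixed-size parent chunks."""
    return [tokens[i:i + parent_size] for i in range(0, len(tokens), parent_size)]

def prune_tokens(tokens, keep_probs, threshold=0.5):
    """Stage 2 (simplified): keep tokens whose predicted relevance clears the threshold."""
    return [t for t, p in zip(tokens, keep_probs) if p >= threshold]

tokens = ["the", "invoice", "total", "was", "42", "dollars", "thanks", "again"]
probs  = [0.1, 0.9, 0.9, 0.2, 0.95, 0.8, 0.05, 0.05]

parents = chunk_into_parents(tokens, parent_size=480)
kept = prune_tokens(parents[0], probs)
print(kept)                         # → ['invoice', 'total', '42', 'dollars']
print(1 - len(kept) / len(tokens))  # compression ratio: 0.5
```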

## Architecture

```mermaid
flowchart TB
    subgraph OFFLINE["Offline Training Pipeline"]
        direction TB
        C["Corpus\n(query + document pairs)"]
        C --> DG["Data Generator\n(Proxy LLM extracts key spans)"]
        DG --> AL["Label Aligner\n(RapidFuzz char-level matching)"]
        AL --> TD["Training Data\n(token-level 0/1 labels)"]
        TD --> TR["Trainer\n(Fine-tune MiniLM-L6)"]
        TR --> OX["ONNX Export + INT8 Quantize"]
        OX --> CAL["Threshold Calibrator\n(Binary search on val set)"]
        CAL --> MODEL["model_int8.onnx\n+ calibration.json"]
    end

    subgraph RL["RL Fine-tuning (Contextual Bandit)"]
        direction TB
        NIAH["NIAH Data Generator\n(needle-in-haystack)"] --> ENV["Environment\n(T=1 Bandit)"]
        CHK_PT["Cold-Start Model"] --> PPO["PPO Trainer\n(Bernoulli Action)"]
        ENV <-->|"keep_mask / reward"| PPO
        PPO --> JUDGE["LLM Judge proxy\n(+ R_linkage, R_size)"]
        JUDGE --> PPO
        PPO --> BEST_RL["best_rl_model.pt"]
    end

    subgraph INDEXER["Offline Indexer"]
        direction TB
        DOC["Raw Documents"] --> CHK["480-Token\nParent Chunks"]
        CHK --> POS["spaCy POS Tagger"]
        POS --> DIL["Boolean Dilation Array\n(Noun/Number/Negation)"]
    end

    subgraph HOTPATH["Real-Time Inference Hotpath"]
        direction TB
        Q["User Query"] --> FWD
        CTX["Retrieved Chunks"] --> FWD
        DIL -.->|"zero-cost lookup"| DILATE
        MODEL -.->|"load once"| FWD
        FWD["NanoPruner Forward Pass\n(ONNX INT8)"]
        FWD --> ENT{"Entropy\nCheck"}
        ENT -->|"High uncertainty"| BYPASS["Bypass → Return Raw"]
        ENT -->|"Confident"| THRESH["Apply Calibrated\nThreshold"]
        THRESH --> DILATE["Adaptive Dilation\n(Protect nouns/numbers)"]
        DILATE --> CAP["Sentence-Aware\nHard Cap"]
        CAP --> RECON["Offset-Based\nReconstruction"]
        RECON --> OUT["Compressed Text\n(60-75% smaller)"]
    end

    style OFFLINE fill:#1a1a2e,stroke:#e94560,color:#eee
    style RL fill:#301b3f,stroke:#fca311,color:#eee
    style INDEXER fill:#16213e,stroke:#0f3460,color:#eee
    style HOTPATH fill:#0f3460,stroke:#53d769,color:#eee
    style MODEL fill:#e94560,stroke:#e94560,color:#fff
    style OUT fill:#53d769,stroke:#53d769,color:#000
```

## Quick Start

### 1. Setup Environment

```shell
python -m venv .venv
.\.venv\Scripts\activate
pip install pysbd rapidfuzz sentence-transformers transformers spacy onnxruntime onnx onnxscript numpy torch python-dotenv
python -m spacy download en_core_web_sm
```

### 2. Configure Proxy API

Edit `.env` with your proxy credentials:

```
API_URL=https://your-proxy/v1/chat/completions
API_KEY=your-key-here
```

### 3. Expand Training Corpus (Parallel)

```shell
# Generate 250 synthetic examples using 10 parallel threads
python tools/expand_corpus.py --count 250 --workers 10
```

Generates synthetic (query, document) pairs using your proxy LLM and appends them to `data/corpus.json`.

### 4. Generate Training Labels (Parallel)

```shell
# Generate labels for the corpus in parallel (10-15 workers keeps it fast)
python -m context_sieve.trainer.data_generator --corpus data/corpus.json --output data/training_data.json --workers 10
```

Supports resuming: if the process stops, simply restart it to pick up where it left off.
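Resumption can be pictured as skipping corpus entries whose labels already exist in the output file. This is only a sketch of the idea, not the module's actual logic; the field name `id` and the file layout are assumptions.

```python
import json
import os

def load_done_ids(output_path):
    """Collect ids of examples already labeled, so a restart can skip them."""
    if not os.path.exists(output_path):
        return set()
    with open(output_path) as f:
        return {row["id"] for row in json.load(f)}

def pending_examples(corpus, output_path):
    """Return only the corpus entries that still need labels."""
    done = load_done_ids(output_path)
    return [ex for ex in corpus if ex["id"] not in done]

corpus = [{"id": 1}, {"id": 2}, {"id": 3}]
# With no output file written yet, everything is still pending:
print(pending_examples(corpus, "labels_not_yet_written.json"))
```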

### 5. Train + Export + Calibrate (Phase 1: Cold Start)

```shell
# Full pipeline (Train + ONNX + INT8 + Calibrate)
python -m context_sieve.trainer.train --data data/training_data.json --full-pipeline --epochs 15

# Resume after a crash (skip training, just export/calibrate)
python -m context_sieve.trainer.train --data data/training_data.json --full-pipeline --skip-training
```

### 6. RL Fine-tuning (Phase 2: Coreference-Aware PPO)

Once the cold-start model is trained (`best_checkpoint.pt`), start the RL pipeline to optimize for complex multi-turn coreferences using the Contextual Bandit PPO loop:

```shell
python -m context_sieve.rl.train --cold_start models/nanopruner/best_checkpoint.pt
```
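The architecture diagram shows the bandit reward combining an LLM-judge score with linkage and size terms (R_linkage, R_size). A hypothetical shaping function might look like the following; the weights, names, and exact form are assumptions for illustration, not the repository's actual reward.

```python
def bandit_reward(judge_score, linkage_score, kept_ratio,
                  w_linkage=0.3, w_size=0.2):
    """Single-step (T=1 bandit) reward sketch: reward answer quality and
    coreference linkage, penalize keeping too many tokens."""
    return judge_score + w_linkage * linkage_score - w_size * kept_ratio

# A confident judge with good linkage and aggressive pruning scores well:
print(bandit_reward(judge_score=1.0, linkage_score=1.0, kept_ratio=0.3))  # ≈ 1.24
```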

### 7. Run Benchmark

```shell
python benchmark_runner.py
```

Auto-detects the trained ONNX model; results are logged to `result.log`.

## Core Modules

| Module | Purpose |
| --- | --- |
| `context_sieve/indexer.py` | Offline indexing with POS metadata generation |
| `context_sieve/inference.py` | Real-time compression engine |
| `context_sieve/trainer/aligner.py` | GPT-4o extractive label alignment (RapidFuzz) |
| `context_sieve/trainer/calibrate.py` | INT8 threshold calibration |
| `context_sieve/trainer/data_generator.py` | LLM-powered training data generation |
| `context_sieve/trainer/train.py` | Fine-tuning + ONNX export pipeline |
| `context_sieve/rl/` | PPO Contextual Bandit, LLM Judge, and NIAH Env |
| `tools/expand_corpus.py` | Synthetic corpus expansion |

## Architecture Highlights

  • Zero-Latency Dilation: POS tagging (Noun/Number/Negation) is performed offline during indexing. The inference engine uses a pure boolean array lookup.
  • Entropy Fallback: If the model is uncertain, pruning is bypassed to preserve semantic safety.
  • Sum-of-Prob Sentence Scoring: Avoids length bias in hard-capping by scoring sentences based on total token probabilities.
  • Offset-Based Reconstruction: Compressed text preserves original spacing, punctuation, and casing.
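The offset-based reconstruction idea can be sketched as slicing the original string by the character spans of kept tokens, so spacing, punctuation, and casing inside each span survive untouched. This is a minimal sketch under that assumption; the real engine works from tokenizer offsets, and the join strategy here is simplified.

```python
def reconstruct(text, kept_spans):
    """Rebuild compressed text from (start, end) character offsets of the
    kept tokens, preserving original spacing, punctuation, and casing
    inside each span."""
    return " ".join(text[start:end] for start, end in sorted(kept_spans))

text = "Invoice #881: total was $42.50, due Friday."
spans = [(0, 12), (14, 31)]  # keep "Invoice #881" and "total was $42.50,"
print(reconstruct(text, spans))  # → Invoice #881 total was $42.50,
```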

## About

Enterprise-grade Context Pruning for RAG. Hierarchical 2-stage architecture (MiniLM-L6 ONNX INT8) achieving 60-75% token compression at <20ms latency.
