A zero-trust AI architecture utilizing a multi-agent system to dynamically forge local tools, featuring persistent sessions, WSL2 isolation, GPU acceleration, and auto-healing debugging loops.
Rather than relying on a static set of pre-programmed tools, this framework empowers an AI to write, debug, and execute its own Python tools on the fly using the Model Context Protocol (MCP). By separating the "Brain" (logical reasoning) from the "Coder" (code generation), the "Summarizer" (memory management), and the "Adviser" (strategic correction), the system acts as a persistent, self-expanding AI operating system. It can work as an autonomous researcher: scraping the web, downloading massive datasets, testing hypotheses, training machine learning models, and rendering publication-ready reports, all while safely bridging massive LLMs with your local hardware.
For ultimate security, this framework should be run inside a dedicated WSL2 (Windows Subsystem for Linux) instance under a heavily restricted user account. The steps below create a hardened WSL environment where the AI is completely isolated from your Windows host system.
1. Install a Dedicated WSL Instance
Install a fresh Ubuntu instance specifically for this project (named Agents). You can install from a downloaded file:
wsl --install --from-file "C:\path\to\your\download\ubuntu-24.04-wsl-amd64.wsl" --name Agents
2. Create the Restricted User (agent)
Log into your new instance as your initial setup user (who has administrative sudo rights). Then, create a restricted, non-sudo user named agent. This guarantees that the process running the AI has zero administrative rights to the underlying Linux system:
wsl -d Agents -u <your-initial-sudo-user>
sudo useradd -m -s /bin/bash agent
sudo passwd agent
3. Harden the WSL Configuration
We must sever the default connections between WSL and your Windows host. Open the WSL configuration file:
sudo nano /etc/wsl.conf
Paste the following configuration. This does three critical things: it sets agent as the default user, blocks WSL from automatically mounting your Windows hard drives (disabling /mnt/c/), and prevents WSL from executing Windows binaries (like cmd.exe or powershell.exe).
[user]
default=agent
[automount]
enabled = false
[interop]
enabled = false
appendWindowsPath = false
4. Reboot and Initialize
Save the file (Ctrl+O, Enter, Ctrl+X) and exit the terminal. Open a standard Windows Command Prompt or PowerShell and restart the WSL instance to apply the hardened settings:
wsl --terminate Agents
wsl -d Agents
You will now automatically be logged in as the restricted agent user, trapped inside a secure, air-gapped Linux bubble. You are now ready to proceed to Step 1!
This project utilizes multiple distinct Large Language Models (LLMs) to maximize efficiency and accuracy:
- The Brain (The Overseer): A fast, high-parameter reasoning model. It communicates with the user, plans out the necessary steps, and orchestrates the execution of tools.
- The Coder (The Hands): A specialized coding model that operates invisibly in the background inside the Forge sandbox. When the Brain needs a new tool, the Coder writes the Python script, validates its syntax, and saves it.
- The Summarizer (The Memory Manager): A background agent responsible for context compression using strict JSON schemas.
- The Adviser (The Strategist): A slow, extremely heavy reasoning model (e.g., DeepSeek-R1, Qwen 397B). It is invoked exclusively when the Brain encounters critical bottlenecks or requires strategic redirection.
- The Analyst (The Eyes & Data Miner): A specialized context-isolation and vision agent. When the Brain needs to read a massive 50,000-line error log, parse a raw data dump, or look at an image, it delegates the file to the Analyst. The Analyst processes the heavy payload and returns a highly concentrated summary, keeping the Brain's context window clean, fast, and immune to data-bloat.
- Universal Sub-Agents: The Brain has the ability to dynamically spawn its own temporary LLM sub-agents to delegate isolated tasks, test prompt engineering, or request second opinions using custom temperature and parameters.
LLMs sometimes make syntax errors. Instead of burdening the Brain's context window with Python tracebacks, the system contains an internal validation loop. If the newly written code fails a py_compile check, it automatically increments the generation seed and asks the Coder model to try again, completely hiding this debugging process from the main chat.
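A minimal sketch of that loop, assuming a `coder_llm` callable that takes a prompt and a sampling seed (names are illustrative, not the framework's actual API):

```python
import py_compile
import tempfile

def forge_with_retry(coder_llm, spec: str, max_attempts: int = 5) -> str:
    """Ask the Coder for code, syntax-check it, and silently retry on failure."""
    for attempt in range(max_attempts):
        # A new seed nudges the model toward a different completion.
        source = coder_llm(spec, seed=attempt)
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(source)
            path = f.name
        try:
            py_compile.compile(path, doraise=True)  # syntax check only, no execution
            return source  # valid Python: hand it to the tool registry
        except py_compile.PyCompileError:
            continue  # the traceback never reaches the Brain's context
    raise RuntimeError(f"Coder failed the syntax check {max_attempts} times")
```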
To prevent overwhelming the AI with hundreds of tools and memories, knowledge is split into compact JSON indexes and detailed physical files:
- Tool Registry: Forged tools are categorized in `tool_registry.json`. The Brain navigates this in two steps: checking high-level categories, then diving into specific category descriptions to learn how to use its tools.
- Memory Registry: Long-term facts and summaries are indexed in `memory_registry.json` with short descriptions and timestamps. The exhaustive details are saved as individual Markdown files (`.md`) in the `memories/` folder. The Brain can browse the lightweight index and only spend tokens to read the full file when explicitly needed.
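The exact schema lives in the framework's code, but you can picture a registry entry roughly like this (a hypothetical illustration of the two-step category lookup, not the real file):

```python
# Hypothetical shape of tool_registry.json, written as a Python literal.
tool_registry = {
    "math": {  # high-level category the Brain scans first
        "description": "Numeric helpers forged in earlier sessions",
        "tools": {
            "fibonacci.py": {
                "summary": "Print the Fibonacci sequence up to N",
                "usage": "python /app/workspace/forged_tools/fibonacci.py 15",
            },
        },
    },
}
```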
Workspaces are strictly isolated. The system dynamically generates unique Session_ID folders.
- State-Driven Restart: The framework operates entirely on state files, not temporary RAM. When you resume a session, the Brain instantly reloads its `current_history.json` and remembers exactly where it left off, alongside all its forged tools, stored memories, and master plans.
- Layered Micro-Environments (Base + Delta): The heavy, complex dependencies (like C++ libraries and Chromium) are locked safely in the container's base Pixi image. When the AI installs a new third-party library for a specific tool, it uses a background `pip --target` injection (see the sketch below). This places the small Python files directly into a `custom_packages` folder inside that specific session's mapped workspace, keeping your hard drive free of gigabytes of redundant bloat while retaining perfect session persistence.
- Bioconda Injection: Every new workspace is automatically configured to pull from the `bioconda` conda channel, giving the AI immediate access to thousands of pre-compiled bioinformatics tools.
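A sketch of how such a delta install might look, assuming a session-mapped `custom_packages` folder (paths and the helper name are illustrative):

```python
import subprocess
import sys

SESSION_PKGS = "/app/workspace/custom_packages"  # assumed session-mapped folder

def install_into_session(package: str) -> None:
    """Install a pure-Python dependency into the session's delta layer,
    leaving the shared base Pixi image untouched."""
    subprocess.run(
        [sys.executable, "-m", "pip", "install", "--target", SESSION_PKGS, package],
        check=True,
    )

# Forged tools then see the delta layer by prepending it to sys.path:
sys.path.insert(0, SESSION_PKGS)
```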
The system utilizes AsyncOpenAI for fully asynchronous streaming across all agents. Both the main Brain on the host and the background FastMCP tools (Coder, Summarizer, Adviser) use async/await. When a massive model takes up to 90 seconds to "think," the Python event loop remains unblocked. This ensures the background stdio connection to the Podman container never starves or drops, keeping the environment perfectly stable during long execution cycles.
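A minimal sketch of the pattern (the model name is a placeholder, and the proxy URL assumes the LiteLLM setup described later):

```python
import asyncio
from openai import AsyncOpenAI

# Dummy key: the LiteLLM proxy on the host holds the real credentials.
client = AsyncOpenAI(base_url="http://host.containers.internal:4000/v1",
                     api_key="sk-sandbox-fake-key")

async def stream_reply(prompt: str) -> str:
    """Stream tokens without blocking the event loop, so the stdio bridge
    to the Podman container keeps being serviced during long generations."""
    stream = await client.chat.completions.create(
        model="my-brain-model",  # placeholder profile name
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    chunks = []
    async for chunk in stream:
        chunks.append(chunk.choices[0].delta.content or "")
    return "".join(chunks)

# asyncio.run(stream_reply("Plan the next step."))
```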
To guarantee system stability, background agents (like the Summarizer) do not rely on standard prompting. They use OpenAI's native Structured Outputs (`"strict": True`). This enforces schema compliance at the API level, ensuring the AI cannot hallucinate a stray comma or invalid field that would corrupt the `current_history.json` or registry files.
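With the OpenAI Chat Completions API, that looks roughly like this (the schema here is a toy stand-in for the Summarizer's real one):

```python
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://host.containers.internal:4000/v1",
                     api_key="sk-sandbox-fake-key")

async def summarize(history_text: str) -> str:
    response = await client.chat.completions.create(
        model="my-summarizer-model",  # placeholder
        messages=[{"role": "user", "content": f"Summarize:\n{history_text}"}],
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "memory_entry",
                "strict": True,  # API-level guarantee the output matches the schema
                "schema": {
                    "type": "object",
                    "properties": {
                        "summary": {"type": "string"},
                        "todo": {"type": "array", "items": {"type": "string"}},
                    },
                    "required": ["summary", "todo"],
                    "additionalProperties": False,
                },
            },
        },
    )
    return response.choices[0].message.content  # always schema-valid JSON
```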
To prevent the AI from reinventing the wheel with complex Python scripts, the sandbox natively includes powerful system-level capabilities:
- Hardware Acceleration: Native GPU passthrough (via NVIDIA CDI) allows the AI to dynamically install CUDA toolkits via Pixi and execute hardware-accelerated machine learning scripts.
- Native Vector Database (Context-Bypass Architecture): Equipped with `sqlite3` and the C-based `sqlite-vec` extension. The framework uses a strict Two-Table relational schema to link metadata directly to high-speed vector tables (see the sketch after this list).
  - Zero Context Bloat: When the Brain wants to insert or search semantic data, it simply passes a `text_to_embed` parameter to the SQL tool. The host proxy automatically routes the text to a local embedding model (e.g., Qwen 4096-dim), packs it into a binary BLOB, and injects it into the database. The massive floating-point arrays never touch the LLM's context window, drastically reducing token costs and preventing hallucination.
- Native Web Scraping: Features a built-in `fetch_webpage` tool powered by headless Chromium and Playwright. It automatically renders JavaScript, bypasses basic bot protections, and strips HTML bloat, returning clean Markdown without the Brain writing a single line of scraper code.
- Native Multi-Modal Vision (VLM Support): The framework isn't just limited to text. By pointing the Analyst profile to a local Vision-Language Model (like LLaVA or Qwen-VL), the AI framework can "see." The Brain can autonomously ask the Analyst to review generated charts, describe UI screenshots, or read scanned documents, fully integrating visual feedback into its reasoning loop.
- Context-Isolation Pattern: Large files (codebases, logs, datasets) are strictly airlocked away from the Overseer's working memory. Through the `analyze_files` tool, the system utilizes a delegator pattern where heavy read-operations are offloaded to the Analyst, completely eliminating the "Context Blowout" problem common in standard LLM agents.
- Massive Data Processing: Equipped with `aria2c` for high-speed parallel downloads and `pigz`/`zstd` for instantaneous decompression of massive genomic/data payloads.
- Media & Documents: Pre-loaded with `tesseract-ocr`, `poppler-utils` (for direct PDF extraction), and `ffmpeg`.
- Scientific Rendering & Reporting: Uses `graphviz` to render complex networks and `pandoc` to compile the AI's markdown findings directly into HTML, Word, or scientific PDF formats.
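A compressed sketch of that Two-Table pattern with `sqlite-vec` (the 4-dimensional toy vectors and table names are stand-ins; in the framework the host proxy supplies the real embeddings):

```python
import sqlite3
import sqlite_vec
from sqlite_vec import serialize_float32

db = sqlite3.connect("memories.db")
db.enable_load_extension(True)
sqlite_vec.load(db)  # load the C extension
db.enable_load_extension(False)

# Table 1: human-readable metadata. Table 2: the vec0 virtual vector table.
db.execute("CREATE TABLE IF NOT EXISTS notes(id INTEGER PRIMARY KEY, text TEXT)")
db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS note_vecs USING vec0(embedding float[4])")

def fake_embed(text: str) -> list[float]:
    """Stand-in for the host-side embedding model."""
    return [float(len(text) % (i + 2)) for i in range(4)]

cur = db.execute("INSERT INTO notes(text) VALUES (?)", ("CRISPR paper notes",))
db.execute("INSERT INTO note_vecs(rowid, embedding) VALUES (?, ?)",
           (cur.lastrowid, serialize_float32(fake_embed("CRISPR paper notes"))))
db.commit()

# KNN search on the vector table, joined back to metadata via rowid.
query = serialize_float32(fake_embed("gene editing"))
for rowid, dist in db.execute(
        "SELECT rowid, distance FROM note_vecs WHERE embedding MATCH ? "
        "ORDER BY distance LIMIT 3", (query,)):
    print(db.execute("SELECT text FROM notes WHERE id = ?", (rowid,)).fetchone()[0], dist)
```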
To prevent the Brain from getting lost in the weeds of complex multi-step tasks, the framework utilizes persistent strategic states:
- The Master Plan: The Brain maintains a persistent Markdown file (`active_plan.md`) in its state folder to track overall objectives and checklists. It reads this plan upon waking up and seamlessly updates it when milestones are reached.
- The Red Team Loop (Adviser): If the Brain gets stuck on a coding error or logical dead end, it uses the `consult_adviser` tool. This packages the current plan, registries, and the encountered problem, and sends it to a Senior Adviser LLM. The Adviser generates a timestamped strategic advisory report saved to disk, enabling true autonomous self-correction without user hand-holding.
To prevent the AI from burning tokens in infinite error loops, the Overseer script tracks consecutive, uninterrupted tool calls. Every 100 tool calls, it triggers a "Soft Pause" by injecting a system prompt that forces the Brain to self-evaluate its progress. If the AI hits 300 consecutive tool chains, the system executes a "Hard Stop," protecting API limits and forcing the AI to await human intervention.
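The bookkeeping behind this is simple; a sketch with the thresholds named above (class and method names are illustrative):

```python
SOFT_PAUSE_EVERY = 100  # inject a self-evaluation prompt
HARD_STOP_AT = 300      # refuse further tool calls, await a human

class LoopGuard:
    def __init__(self) -> None:
        self.consecutive_calls = 0

    def on_tool_call(self, history: list) -> None:
        self.consecutive_calls += 1
        if self.consecutive_calls >= HARD_STOP_AT:
            raise RuntimeError("Hard Stop: too many uninterrupted tool calls")
        if self.consecutive_calls % SOFT_PAUSE_EVERY == 0:
            history.append({
                "role": "system",
                "content": "Soft Pause: re-read your plan and evaluate whether "
                           "the current approach is actually making progress.",
            })

    def on_user_message(self) -> None:
        self.consecutive_calls = 0  # human input breaks the chain
```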
Traditional agents lose track of time during long workflows. This framework utilizes a "Sweep & Replace" dynamic time injection. On every single API turn, the AI's context is injected with a Live System Clock down to the exact second. This allows the AI to accurately timestamp files and memories, without bloating the permanent current_history.json log with stale timestamps.
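One way to picture the Sweep & Replace step (an illustrative sketch, not the framework's code):

```python
from datetime import datetime

CLOCK_TAG = "[LIVE_CLOCK]"

def inject_live_clock(history: list) -> list:
    """Build the message list actually sent this turn: sweep out any stale
    clock message, then append a fresh one. The persistent history file
    never stores the injected timestamps."""
    swept = [m for m in history
             if not str(m.get("content", "")).startswith(CLOCK_TAG)]
    now = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    swept.append({"role": "system", "content": f"{CLOCK_TAG} Current time: {now}"})
    return swept
```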
Smaller or quantized models occasionally hallucinate malformed JSON when calling tools. The MCP server sits between the LLM and the tools, intercepting JSONDecodeError failures. Instead of crashing the script, it instantly returns a SYSTEM ERROR directly to the LLM, prompting it to fix its syntax and try again autonomously.
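A sketch of that interception layer (function names are illustrative):

```python
import json

def safe_tool_dispatch(tool_fn, raw_arguments: str) -> str:
    """Parse the LLM's tool-call arguments; on malformed JSON, bounce an
    error message back to the model instead of crashing the server."""
    try:
        kwargs = json.loads(raw_arguments)
    except json.JSONDecodeError as exc:
        return (f"SYSTEM ERROR: tool arguments were not valid JSON ({exc}). "
                "Fix the syntax and call the tool again.")
    return tool_fn(**kwargs)
```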
The Overseer natively supports multiple visual modes via the command line. Run it in formatted mode for a rich, color-coded terminal dashboard, text mode for raw streaming, or silent mode to run the agent entirely in the background. With piped STDIN support, you can feed massive files directly into the AI through the bash pipeline and have it automatically exit (-x) when finished, turning the framework into a pure, headless Unix utility.
CRITICAL WARNING: This AI has the ability to execute bash commands and write files. To prevent it from modifying your host system or deleting crucial files, the MCP environment runs inside a heavily hardened, rootless Podman Container.
- Non-Root Execution: The AI runs as a restricted, non-root `agent` user inside the container. Podman handles UID mapping dynamically (via the `U` volume flag) to allow safe read/write access without root privileges.
- Stripped Capabilities & Resource Limits: The container drops all Linux capabilities (`--cap-drop=ALL`) and strictly limits the AI to 4 CPUs, 16GB RAM, and a hard limit of 1000 PIDs (`--pids-limit=1000`) to instantly neutralize bash fork bombs and resource-exhaustion attacks (see the launch sketch after this list).
- Anti-Zombie Process Hooks: The container boots with the `--init` flag to properly manage PID 1 signals, and the host Python script uses `atexit` hooks. Even if you hard-crash the script via `Ctrl+D`, all nested bash processes, headless browsers, and stray daemons are instantly and cleanly killed.
- Stream Protection (Daemon Trapping): The FastMCP server natively traps grandchild daemons (using `start_new_session=True` in subprocesses) and aggressively gags Python background logging. This ensures that stray warnings or detached background servers never corrupt the JSON-RPC communication streams.
- The I/O Airlock (Surgical Mounts): The container cannot see your host filesystem. It communicates via strict bind mounts:
  - `/app/host_input`: Mapped to your local input folder. Strictly read-only (`ro`); the AI cannot delete or corrupt your source data. You must manually drop files (e.g., CSVs, scripts, or documents) into this host folder before asking the AI to analyze them.
  - `/app/workspace`: The unified execution directory mapped to the active `Session_ID` folder.
- Workspace Isolation: The `/app/workspace` is split to protect the AI's "mind" from its "hands":
  - `/sandbox`: The scratchpad where the AI actually executes bash commands and tests tools.
  - `/outputs`: The dedicated folder where the AI saves finished artifacts and generated files.
  - `/state`: Contains the critical JSON registries, the master plan, adviser reports, and the live `current_history.json` file.
  - `/forged_tools`: Where the generated Python scripts live.
  - `/memories`: Where the detailed Markdown files live.
  - `/archive`: The dedicated "Soft-Delete" trash bin.
- Zero-Trust Architecture: The container contains absolutely NO sensitive data, API keys, or `.env` files. The AI uses hardcoded dummy keys (e.g., `sk-sandbox-fake-key`) and routes all requests to a local `host.containers.internal` gateway. Authentication and routing are securely handled by a LiteLLM proxy running safely on the Windows/WSL host, so there are simply no credentials inside the sandbox to steal.
- Context Preservation (I/O Truncation): If the AI executes a bash command that floods the terminal with thousands of lines, the system automatically intercepts and truncates the output at 5,000 characters. It warns the AI to gracefully redirect large data to files (`> output.txt`), preventing server crashes and context-window exhaustion.
- Network Isolation: The container runs on an isolated bridge network (`slirp4netns`). It communicates with local/tunneled LLMs strictly via the `host.containers.internal` gateway.
- Anti-Deletion Protocol (Soft-Deletes): The AI is strictly prompted against using destructive commands like `rm` or `DROP TABLE`. Instead, it is trained to use a robust "Soft-Delete" protocol. If it needs to rewrite a Python script or rebuild a database schema, it automatically uses the native `write_file` tool to create timestamped `.bak` copies, or uses bash `mv` to send old databases to the `/archive` folder. This provides a clean audit trail and guards against data loss during autonomous operations.
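Pulled together, the host launcher assembles a command along these lines. This is a sketch of the flags described above, not the framework's exact invocation; paths, limits, and the mounted script are assumptions:

```python
import subprocess

session_dir = "sessions/Session_20260429180500"  # example session workspace

podman_cmd = [
    "podman", "run", "--rm", "-i",
    "--init",                                 # clean PID 1 signal handling
    "--cap-drop=ALL",                         # strip every Linux capability
    "--pids-limit=1000",                      # neutralize fork bombs
    "--cpus=4", "--memory=16g",               # resource ceiling
    "--network=slirp4netns",                  # isolated bridge network
    "--device", "nvidia.com/gpu=all",         # CDI GPU passthrough
    "-v", "./host_input:/app/host_input:ro",  # read-only input airlock
    "-v", f"./{session_dir}:/app/workspace:U",    # session workspace, UID-mapped
    "-v", "./god_tools.py:/app/god_tools.py:ro",  # script mapped in, not baked in
    "ai-forge",
    "pixi", "run", "python", "/app/god_tools.py",
]
# The Overseer then speaks MCP (JSON-RPC over stdio) to this process.
proc = subprocess.Popen(podman_cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
```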
This project uses pixi for environment management and podman for containerized execution.
Your freshly installed Ubuntu instance needs a few core tools before we can begin. Log into your hardened WSL terminal and run the following commands to install Podman, Pixi, and the NVIDIA toolkit for GPU passthrough:
# Update the system and install Podman with rootless networking tools
sudo apt update && sudo apt install -y curl podman slirp4netns uidmap
# Add NVIDIA Container Toolkit repository and install (For GPU Passthrough)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
# Generate the CDI configuration so Podman can find the GPU
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
# Install Pixi
curl -fsSL https://pixi.sh/install.sh | bash
# Restart your terminal or source your profile to apply Pixi to your path
source ~/.bashrc
Ensure your local LLM server (like Ollama or vLLM) is running and you have pulled your designated Brain and Coder models.
(If using SSH tunnels to access remote GPUs, bind the tunnel to 0.0.0.0 so the Podman bridge can see it, e.g., ssh -L 0.0.0.0:64100:127.0.0.1:8000 user@remote).
Create a new directory for your project, initialize Pixi, and add the required libraries. Run these commands in your terminal:
mkdir ai_workspace
cd ai_workspace
pixi init
pixi add python openai mcp fastmcp prompt_toolkit rich
To keep secrets completely out of the sandbox, we run a LiteLLM proxy on the host machine. The AI uses fake keys, and the proxy attaches the real ones.
Open a terminal on your host machine (outside the sandbox) and initialize a dedicated routing environment:
cd ~
mkdir litellm_proxy && cd litellm_proxy
pixi init
pixi add python=3.12 litellm pip
pixi run pip install 'litellm[proxy]'
Create a config.yaml file in that folder to map your models to their actual endpoints (Ollama, vLLM, a remote HPC cluster) and pull your real API keys from the host's environment variables.
Example (the `drop_params` setting tells LiteLLM to silently drop request parameters a backend doesn't support):
litellm_settings:
  drop_params: true

model_list:
  # [1] Local Model (Fast Brain / Coder)
  - model_name: Qwen/Qwen3.6-35B-A3B-FP8
    litellm_params:
      model: openai/Qwen/Qwen3.6-35B-A3B-FP8
      api_base: http://localhost:64100/v1
      api_key: local-vllm-key # vLLM just needs a dummy string

  # [2] Remote Model (Heavy Adviser / Strategist)
  - model_name: qwen35-397b-a17b-fp8
    litellm_params:
      model: openai/qwen35-397b-a17b-fp8
      # LiteLLM automatically pulls these from your ~/.bashrc exports!
      api_base: os.environ/LITELLM_API_BASE
      api_key: os.environ/LITELLM_API_KEY
Run this in your standard WSL2 terminal:
nano ~/.bashrc
Add your secret variables like this:
export LITELLM_API_KEY="<your-key>"
export LITELLM_API_BASE="<your-address>"
Add this auto-start script to the bottom of the .bashrc file:
# ==========================================
# ZERO-TRUST AI PROXY AUTO-START
# ==========================================
# Check if the litellm proxy is already running
if ! pgrep -f "litellm --config config.yaml" > /dev/null; then
    echo "🛡️ Starting Zero-Trust LiteLLM Proxy in the background..."
    # Move to the proxy folder, start it silently, and drop the logs into proxy.log
    cd ~/litellm_proxy
    nohup pixi run litellm --config config.yaml --port 4000 > proxy.log 2>&1 &
    # Return to the home directory silently
    cd ~
fi
Then, reload your profile by running this in your standard WSL2 terminal:
source ~/.bashrc
Save the core scripts into the root of your ai_workspace folder:
- `config.py` (Your settings, system prompts, folder paths, and Session ID).
- `god_tools.py` (The FastMCP server handling tool forging and execution).
- `chat_overseer.py` (The main interactive async loop).
Create a file named Containerfile in your workspace root. Notice that it contains no secrets, environment variables, or python files.
(Note: We deliberately do not COPY the Python scripts into this image. They are mapped dynamically via volumes at runtime. This allows you to tweak your prompts and Python code on the host machine endlessly without ever needing to rebuild the container cache!)
FROM ghcr.io/prefix-dev/pixi:latest
# ROOT LEVEL: Install system-level scientific & media dependencies
RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y \
chromium xvfb git git-lfs curl wget unzip aria2 file jq pigz zstd \
poppler-utils tesseract-ocr ffmpeg imagemagick graphviz pandoc sqlite3 \
build-essential cmake gfortran libgl1 libglib2.0-0 libxml2-dev libxslt-dev \
&& rm -rf /var/lib/apt/lists/*
# USER SETUP
RUN useradd -m -s /bin/bash agent
WORKDIR /app
# Disable the FastMCP ASCII Banner
ENV FASTMCP_SHOW_SERVER_BANNER=0
RUN mkdir /app/workspace && chown -R agent:agent /app
# Switch to non-root user before installing Python tools
USER agent
# We add conda-forge and bioconda to give the AI maximum scientific reach
RUN pixi init && \
pixi project channel add conda-forge && \
pixi project channel add bioconda && \
pixi add python pip openai mcp fastmcp \
pandas numpy scipy matplotlib pyarrow \
requests beautifulsoup4 lxml \
pypdf2 python-docx pillow tiktoken \
biopython rdkit sqlalchemy networkx && \
pixi add --pypi sqlite-vec playwright
# PRE-FETCH BROWSER BINARIES
# This downloads the headless chromium binary into the agent's cache permanently
RUN pixi run playwright install chromium
Build the rootless image by running:
podman build -t ai-forge .
The Overseer can be launched as a standard interactive chat or driven entirely via command-line arguments. To start a basic interactive session:
pixi run python chat_overseer.py
CLI Arguments & Headless Execution: You can override configurations, pipe in files, and trigger headless background tasks using the built-in CLI flags:
- `-p`, `--prompt`: Pass an initial prompt directly (e.g., `-p "Analyze the data"`).
- `-f`, `--format`: Override the UI mode. Options: `formatted` (Rich markdown panels), `text` (classic streaming), `silent` (pure background execution with zero console output).
- `-s`, `--session`: Override the `SESSION_ID` to load or create a specific workspace (e.g., `-s "Project_X"`).
- `-x`, `--exit`: Auto-exit back to your bash terminal when the AI finishes its tasks (perfect for single-shot scripts).
- `--brain`, `--coder`, `--summarizer`, `--adviser`, `--analyst`: Pass the index number of your LLM profiles to override the active models dynamically.
Advanced Usage Examples:
- Rich Interactive Mode with Specific Models:
  `pixi run python chat_overseer.py -f formatted -s "Test_2" --brain 2 --coder 3`
- Headless Background Task (Auto-Exits when done):
  `pixi run python chat_overseer.py -f silent -s "Research" -x -p "Scrape the latest papers on CRISPR and save to outputs/summary.md"`
- Unix Pipeline (Pipe files directly into the AI):
  `cat error_log.txt | pixi run python chat_overseer.py -f text -x -p "Find the bug and write a python fix to outputs/fix.py"`
If you don't use the `-s` flag, the framework falls back to your `config.py`. Always use strings for Session IDs.
- `SESSION_ID = None` -> Starts a brand new, uniquely timestamped workspace.
- `SESSION_ID = "20260429180500"` -> Restores that specific session, loading all previously forged tools and state memory.
Once the chat starts, try issuing a complex command:
YOU: "Create a tool in the 'math' category that calculates the Fibonacci sequence up to a given number. Then use it to find the sequence up to 15."
What happens behind the scenes:
- The Brain checks `view_tool_registry` and realizes the tool doesn't exist.
- The Brain calls `forge_and_register_tool` via the secure Podman MCP connection.
- The MCP Server pings the Coder LLM through the `host.containers.internal` gateway to write `fibonacci.py`.
- The system sanitizes the filename and runs a syntax check. If it passes, it updates `tool_registry.json`.
- The Brain reads the usage instructions, then uses `execute_bash` to run it: `python /app/workspace/forged_tools/fibonacci.py`.
- The result is streamed back to your color-coded console.
If the session grows too long, the system will inject a dynamic warning prompting the Brain to use the `compress_and_store_context` tool.
- The Background Summarizer LLM activates.
- It executes a strict 2-step JSON schema pipeline:
  - Step 1: Extracts facts into the Memory Registry and saves detailed `.md` files.
  - Step 2: Compresses the chat history and injects an active to-do list.
- The system backs up the bloated history to the `histories/` folder for your observation, and overwrites the active `current_history.json`.
- The Brain reloads instantly with a clean context window, reads its next steps, and continues working automatically.
Because the framework strictly isolates state, memory, and Podman execution environments (dynamically generating container names via Session IDs and OS Process IDs), you can run multiple autonomous agents simultaneously on the same machine without them colliding.
You can use tmux to launch a multi-agent swarm in a 2x2 split-pane dashboard. Paste this chained command into your terminal to boot 4 independent researchers at once:
tmux new-session -d -s forge_swarm \; \
send-keys "pixi run python chat_overseer.py -f text -s 'Swarm_Text_1'" C-m \; \
split-window -h \; \
send-keys "pixi run python chat_overseer.py -f text -s 'Swarm_Text_2'" C-m \; \
split-window -v \; \
send-keys "pixi run python chat_overseer.py -f formatted -s 'Swarm_MD_2'" C-m \; \
select-pane -t 0 \; \
split-window -v \; \
send-keys "pixi run python chat_overseer.py -f formatted -s 'Swarm_MD_1'" C-m \; \
select-layout tiled \; \
attach-session -t forge_swarm
Pro-Tip for Swarm Management: Enable mouse support in your tmux configuration (echo "set -g mouse on" >> ~/.tmux.conf) so you can simply click between your agents, drag the window borders to resize their panels, and use your mouse wheel to independently scroll through their individual reasoning logs!
Every interaction is mirrored to a timestamped log file inside your active session folder (e.g., sessions/Session_ID_.../logs/). This includes the Brain's reasoning tokens, tool payloads, and the Coder's intercepted internal thoughts which are hidden from the Brain to save context space.
You can safely type `/exit` or `quit` at any time to shut down the system and instantly destroy the temporary container.