Sub-500ms voice agent with speculative pre-generation. The more you talk, the faster it gets.
Precog is an open-source framework for building ultra-low-latency phone agents. It uses speculative response generation — predicting what you'll say while you're still talking — to achieve near-zero perceived latency. Combined with TTS phrase caching, common responses become instant.
```bash
pip install -e .
cp config.example.yaml config.yaml  # Edit with your API keys
python main.py
```
That's it. Point your Twilio number's webhook at https://your-server/twilio/inbound and call it.
The median human-to-human turn delay is 0ms. Our brains predict what the other person will say and pre-compute responses. Precog does the same thing with LLMs.
| Feature | Precog | Vapi | Bland | Retell | Shuo |
|---|---|---|---|---|---|
| Speculative pre-generation | ✅ | ❌ | ❌ | ❌ | ❌ |
| TTS phrase caching | ✅ | ❌ | ❌ | ❌ | ❌ |
| Multi-provider LLM | ✅ | ✅ | ❌ | ✅ | ❌ |
| Tool calling | ✅ | ✅ | ✅ | ✅ | ❌ |
| Conversation memory | ✅ | ✅ | ✅ | ✅ | ❌ |
| Open source | ✅ | ❌ | ❌ | ❌ | ✅ |
| Self-hosted | ✅ | ❌ | ❌ | ❌ | ✅ |
| Pure state machine | ✅ | ? | ? | ? | ✅ |
| Local TTS fallback | ✅ | ❌ | ❌ | ❌ | ❌ |
| Per-call cost | ~$0 | ~$0.10/min | ~$0.09/min | ~$0.10/min | ~$0 |
```
┌─────────────────────────────────┐
│ Precog Server │
│ │
┌──────────┐ WebSocket ┌─────┴─────┐ │
│ Twilio │◄──────────────►│ Twilio WS │ │
│ (Phone) │ µ-law audio │ Handler │ │
└──────────┘ └─────┬─────┘ │
│ │
┌─────▼─────┐ ┌──────────────────┐ │
│ Deepgram │───►│ State Machine │ │
│ Flux │ │ (pure function) │ │
│ (STT) │ └────────┬─────────┘ │
└───────────┘ │ actions │
┌─────▼─────┐ │
┌──────────┐ │ Dispatch │ │
│ TTS Cache│◄────────────│ Layer │ │
│ (SQLite) │ cache hit? │ (I/O) │ │
└──────────┘ └──┬──┬──┬───┘ │
│ │ │ │
┌────────────────────────────────┘ │ └──────┐ │
│ │ │ │
┌─────▼──────┐ ┌──────────▼┐ ┌─────▼┐ │
│ LLM Service│ │ElevenLabs │ │Tools │ │
│ Groq/OpenAI│ │ TTS Pool │ │Engine│ │
│ Claude/Olla│ │(warm conn)│ └──────┘ │
└────────────┘ └───────────┘ │
│ │
┌─────▼──────┐ Speculative Pre-Generation │
│ Speculation│ ───────────────────────── │
│ Engine │ While user talks, predict │
│(fast model)│ intent → pre-generate reply │
└────────────┘ │
│
┌──────────────┐ ┌──────────────┐ │
│ Memory DB │ │ Prometheus │ │
│ (SQLite) │ │ Metrics │ │
└──────────────┘ └──────────────┘ │
└─────────────────────────────────┘
```
The core of Precog is a pure functional state machine. Given a state and an event, it returns a new state and a list of actions. No side effects, no I/O, no exceptions for control flow.
```
(ConversationState, Event) → (ConversationState, list[Action])
```

This means:
- 100% testable — no mocks needed for the core logic
- Deterministic — same input always produces same output
- Easy to reason about — all I/O lives in the dispatch layer
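As a sketch of what such a transition function can look like (the type names, the `Say` action, and the event shape here are illustrative assumptions, not Precog's actual API):

```python
from dataclasses import dataclass

# Illustrative sketch of a pure (state, event) -> (state, actions) core.
# "State", "Say", and the dict-shaped events are assumptions for this example.

@dataclass(frozen=True)
class State:
    phase: str = "LISTENING"   # LISTENING | RESPONDING
    transcript: str = ""

@dataclass(frozen=True)
class Say:
    text: str

def step(state: State, event: dict) -> tuple[State, list]:
    """Pure transition: no I/O, no exceptions for control flow."""
    if event["type"] == "final_transcript":
        new_state = State(phase="RESPONDING", transcript=event["text"])
        return new_state, [Say(f"You said: {event['text']}")]
    return state, []   # unrecognized events leave the state unchanged

# Deterministic and testable with no mocks:
s1, actions = step(State(), {"type": "final_transcript", "text": "hi"})
```

Because `step` touches no I/O, the whole conversation logic can be unit-tested by asserting on the returned states and action lists.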
The headline feature. While the user is still speaking, Precog:
- Receives interim transcripts from Deepgram (partial, real-time)
- Feeds them to a fast, cheap LLM (e.g., Groq Llama 3.1 8B at ~100ms TTFT)
- Generates a speculative response based on predicted intent
- When the user finishes speaking:
- Hit → The speculative response matches! Play it immediately. Near-0ms latency.
- Miss → Discard and generate normally. No worse than not speculating.
The state machine adds a SPECULATING phase between LISTENING and RESPONDING:
```
LISTENING → interim transcript → SPECULATING → end of turn → RESPONDING
                                      ↓                          ↑
                              speculation ready ──── matches? ───┘
                                      ↓ no
                          discard, generate normally
```
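One plausible way to implement the "matches?" check is fuzzy similarity between the interim transcript the speculation was generated from and the final transcript, gated by a confidence threshold. This sketch uses `difflib`; Precog's actual matcher may differ.

```python
from difflib import SequenceMatcher

# Hedged sketch: decide hit vs. miss by comparing the interim transcript
# the speculation was based on against the final transcript.
def speculation_hit(interim: str, final: str, threshold: float = 0.7) -> bool:
    ratio = SequenceMatcher(None, interim.lower(), final.lower()).ratio()
    return ratio >= threshold

# A near-identical ending counts as a hit; a different intent is a miss.
hit = speculation_hit("what's the weather", "what's the weather?")
miss = speculation_hit("book a table for two", "cancel my order")
```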
Configure in config.yaml:
```yaml
speculation:
  enabled: true
  provider: "groq"
  model: "llama-3.1-8b-instant"
  confidence_threshold: 0.7
  min_transcript_length: 15
```

The more you talk, the faster Precog gets.
Common phrases like "Sure, let me check that for you" or "Is there anything else I can help with?" get synthesized by TTS over and over. Precog caches the audio output:
- Cache hit → Skip TTS entirely. Instant playback. Zero API cost.
- Cache miss → Normal TTS, then cache the result for next time.
- Frequency-aware eviction — high-frequency phrases survive LRU eviction
- Pre-warm on startup — synthesize common phrases before any calls arrive
The cache grows organically with usage. After a few hundred calls, most filler phrases are cached, and your effective TTS latency drops toward zero for a significant portion of responses.
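A minimal sketch of such a cache, assuming a SQLite table keyed by a hash of the voice and normalized text (Precog's actual schema also tracks hit frequency for eviction):

```python
import hashlib
import sqlite3

# Illustrative phrase-keyed TTS cache, not Precog's real schema.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE cache (key TEXT PRIMARY KEY, audio BLOB, hits INT DEFAULT 0)")

def cache_key(voice_id: str, text: str) -> str:
    # Normalize so "One moment please." hashes the same every call.
    return hashlib.sha256(f"{voice_id}:{text.strip().lower()}".encode()).hexdigest()

def get_or_synthesize(voice_id: str, text: str, synthesize) -> bytes:
    key = cache_key(voice_id, text)
    row = db.execute("SELECT audio FROM cache WHERE key = ?", (key,)).fetchone()
    if row:  # hit: no TTS API call, instant playback
        db.execute("UPDATE cache SET hits = hits + 1 WHERE key = ?", (key,))
        return row[0]
    audio = synthesize(text)  # miss: call TTS, then store for next time
    db.execute("INSERT INTO cache (key, audio) VALUES (?, ?)", (key, audio))
    return audio

calls = []
def fake_tts(text: str) -> bytes:  # stand-in for a real TTS provider call
    calls.append(text)
    return b"AUDIO:" + text.encode()

a1 = get_or_synthesize("v1", "One moment please.", fake_tts)
a2 = get_or_synthesize("v1", "One moment please.", fake_tts)  # served from cache
```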
```yaml
tts_cache:
  enabled: true
  db_path: "./tts_cache.db"
  max_size_mb: 500
  pre_warm:
    - "Sure, let me check that for you."
    - "One moment please."
    - "Is there anything else I can help with?"
```

Monitor cache performance:
```bash
curl http://localhost:3040/cache/stats
# {"entries": 847, "size_mb": 23.4, "hit_rate": 43.2, "estimated_savings_seconds": 127.5}
```

Swap between providers without changing code. Fallback chain for resilience.
| Provider | TTFT (median) | Cost | Best For |
|---|---|---|---|
| Groq | ~100ms | Free tier | Speed, speculation |
| OpenAI | ~300ms | $$$ | Quality, tool calling |
| Anthropic | ~400ms | $$$ | Quality, safety |
| Ollama | ~200ms* | Free | Privacy, offline |
*Local hardware dependent.
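The fallback chain boils down to "try each provider in order until one answers". A minimal sketch of that pattern (`generate_with_fallback` and the provider callables are hypothetical names, not Precog's dispatch code):

```python
# Hedged sketch of a provider fallback chain: try each in order,
# moving on when one fails.
def generate_with_fallback(prompt: str, providers) -> tuple[str, str]:
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # a real impl would catch provider-specific errors
            errors.append((name, exc))
    raise RuntimeError(f"all providers failed: {errors}")

def flaky(prompt: str) -> str:      # simulates the primary provider timing out
    raise TimeoutError("primary provider down")

def ok(prompt: str) -> str:         # simulates a healthy fallback provider
    return "fallback reply"

name, reply = generate_with_fallback("hi", [("groq", flaky), ("openai", ok)])
```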
```yaml
llm:
  provider: "groq"
  model: "llama-3.3-70b-versatile"
  fallback:
    - provider: "openai"
      model: "gpt-4o-mini"
    - provider: "ollama"
      model: "llama3.1:8b"
```

Agents can execute tools mid-conversation. Tool results stream back into the response naturally.
```yaml
tools:
  - name: "get_weather"
    description: "Get current weather for a location"
    handler: "precog.tools:get_weather"
    parameters:
      location: { type: "string", required: true }
```

Built-in tools: `get_weather`, `get_time`. Add your own:
```python
from precog.tools import register_tool

@register_tool("check_order")
async def check_order(order_id: str) -> str:
    # Your logic here
    return f"Order {order_id} ships tomorrow"
```

Knows who's calling and what you discussed before. SQLite-backed, persists across calls.
```yaml
memory:
  enabled: true
  context_turns: 20
  caller_identification: true
```

Prometheus-compatible endpoint for monitoring:
- `precog_ttft_seconds` — Time to first LLM token
- `precog_turn_latency_seconds` — End-to-end turn latency
- `precog_speculation_hits_total` / `precog_speculation_misses_total`
- `precog_tts_cache_hits_total` / `precog_tts_cache_misses_total`
- `precog_active_calls` — Current concurrent calls
- `precog_call_duration_seconds`
Auto-save full conversation transcripts with timestamps:
```json
{
  "call_sid": "CA123...",
  "caller": "+1555...",
  "duration_seconds": 47.3,
  "entries": [
    {"role": "user", "text": "What's the weather?", "timestamp_ms": 1234567890},
    {"role": "assistant", "text": "It's 15°C and sunny in London.", "timestamp_ms": 1234567891}
  ]
}
```

See `config.example.yaml` for the full reference with comments.
| Section | Key Settings |
|---|---|
| `agent` | name, system_prompt, personality |
| `voice` | provider (elevenlabs/piper), voice_id, stability |
| `llm` | provider, model, temperature, fallback chain |
| `speculation` | enabled, model, confidence_threshold |
| `tts_cache` | enabled, max_size_mb, pre_warm phrases |
| `tools` | Array of tool definitions with handlers |
| `memory` | enabled, context_turns, caller_identification |
| `recording` | enabled, save_path |
| `webhooks` | call_start, call_end, tool_used URLs |
| `server` | port, api_key |
| `metrics` | enabled, port |
```dockerfile
FROM python:3.12-slim
WORKDIR /app
COPY . .
RUN pip install -e .
EXPOSE 3040
CMD ["python", "main.py"]
```

- Set environment variables from `.env.example`
- Set `PRECOG_CONFIG=config.yaml`
- Start command: `python main.py`
- Get a Twilio phone number
- Set the Voice webhook URL to `https://your-server/twilio/inbound` (POST)
- Call the number
For outbound calls:
```bash
curl -X POST https://your-server/calls \
  -H "X-API-Key: your-key" \
  -H "Content-Type: application/json" \
  -d '{"to": "+1555123456"}'
```

```bash
# Install dev dependencies
pip install -e ".[dev]"

# Run tests (no API keys needed!)
pytest

# Run with coverage
pytest --cov=precog

# Lint
ruff check .

# Type check
mypy precog/

# Benchmark TTFT across providers
python scripts/bench_ttft.py --providers groq,openai --rounds 10
```

Traditional voice agent:
```
User stops speaking → STT final → LLM generate (300-800ms) → TTS (200-500ms) → Play
Total: 500-1300ms
```

Precog with speculation hit + cache hit:

```
User stops speaking → STT final → speculation matches! → cache hit! → Play
Total: ~50ms (STT finalization only)
```

Precog with speculation miss + cache miss (worst case = same as traditional):

```
User stops speaking → STT final → discard speculation → LLM generate → TTS → Play
Total: 500-1300ms
```
The key insight: a speculation miss costs only a cheap extra LLM call (the draft is simply discarded), while a hit eliminates the generation latency entirely. And over time, the TTS cache absorbs more and more common phrases, so even turns where speculation misses play back faster.
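A back-of-envelope model of what this means for average turn latency, using the numbers above (~50ms on a hit, and an assumed 900ms midpoint of the 500-1300ms miss range):

```python
# Expected turn latency as the combined speculation + cache hit rate grows.
# HIT_MS and MISS_MS come from the README's own ranges; the 900ms midpoint
# is an assumption for illustration, not a measurement.
HIT_MS, MISS_MS = 50, 900

def expected_latency_ms(hit_rate: float) -> float:
    return hit_rate * HIT_MS + (1 - hit_rate) * MISS_MS

for rate in (0.0, 0.3, 0.6):
    print(f"hit rate {rate:.0%}: ~{expected_latency_ms(rate):.0f}ms")
```

Even a 30% hit rate cuts the expected latency by roughly a quarter under these assumptions, and the hit rate climbs as the cache warms.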
- Shuo — Pure functional state machine architecture, streaming pipeline design
- Pipecat — Multi-provider approach
- The insight from an ex-Amazon Alexa engineer: "Median human-to-human turn delay is 0ms"
MIT — see LICENSE.