jhammant/precog

🔮 Precog

Sub-500ms voice agent with speculative pre-generation. The more you talk, the faster it gets.

Precog is an open-source framework for building ultra-low-latency phone agents. It uses speculative response generation (predicting what you'll say while you're still talking) to achieve near-zero perceived latency. Combined with TTS phrase caching, common responses become instant.

pip install -e .
cp config.example.yaml config.yaml  # Edit with your API keys
python main.py

That's it. Point your Twilio number's webhook at https://your-server/twilio/inbound and call it.


Why Precog?

The median human-to-human turn delay is 0ms. Our brains predict what the other person will say and pre-compute responses. Precog does the same thing with LLMs.

| Feature | Precog | Vapi | Bland | Retell | Shuo |
|---|---|---|---|---|---|
| Speculative pre-generation | ✅ | ❌ | ❌ | ❌ | ❌ |
| TTS phrase caching | ✅ | ❌ | ❌ | ❌ | ❌ |
| Multi-provider LLM | ✅ | ✅ | ❌ | ✅ | ❌ |
| Tool calling | ✅ | ✅ | ✅ | ✅ | ❌ |
| Conversation memory | ✅ | ✅ | ✅ | ✅ | ❌ |
| Open source | ✅ | ❌ | ❌ | ❌ | ✅ |
| Self-hosted | ✅ | ❌ | ❌ | ❌ | ✅ |
| Pure state machine | ✅ | ? | ? | ? | ✅ |
| Local TTS fallback | ✅ | ❌ | ❌ | ❌ | ❌ |
| Per-call cost | ~$0 | ~$0.10/min | ~$0.09/min | ~$0.10/min | ~$0 |

Architecture

                                    ┌─────────────────────────────────┐
                                    │        Precog Server            │
                                    │                                 │
  ┌──────────┐   WebSocket    ┌─────┴─────┐                           │
  │  Twilio  │◄──────────────►│ Twilio WS │                           │
  │  (Phone) │   µ-law audio  │  Handler  │                           │
  └──────────┘                └─────┬─────┘                           │
                                    │                                 │
                              ┌─────▼─────┐    ┌──────────────────┐   │
                              │ Deepgram  │───►│  State Machine   │   │
                              │   Flux    │    │  (pure function) │   │
                              │   (STT)   │    └────────┬─────────┘   │
                              └───────────┘             │ actions     │
                                                  ┌─────▼──────┐      │
                         ┌──────────┐             │  Dispatch  │      │
                         │ TTS Cache│◄────────────│   Layer    │      │
                         │ (SQLite) │  cache hit? │  (I/O)     │      │
                         └──────────┘             └──┬──┬──┬───┘      │
                                                     │  │  │          │
                    ┌────────────────────────────────┘  │  └──────┐   │
                    │                                   │         │   │
              ┌─────▼──────┐                 ┌──────────▼┐  ┌─────▼┐  │
              │ LLM Service│                 │ElevenLabs │  │Tools │  │
              │ Groq/OpenAI│                 │  TTS Pool │  │Engine│  │
              │ Claude/Olla│                 │(warm conn)│  └──────┘  │
              └────────────┘                 └───────────┘            │
                    │                                                 │
              ┌─────▼──────┐    Speculative Pre-Generation            │
              │ Speculation│    ──────────────────────────            │
              │   Engine   │    While user talks, predict             │
              │(fast model)│    intent → pre-generate reply           │
              └────────────┘                                          │
                                                                      │
                              ┌──────────────┐  ┌──────────────┐      │
                              │  Memory DB   │  │  Prometheus  │      │
                              │  (SQLite)    │  │   Metrics    │      │
                              └──────────────┘  └──────────────┘      │
                                    └─────────────────────────────────┘

Pure State Machine

The core of Precog is a pure functional state machine. Given a state and an event, it returns a new state and a list of actions. No side effects, no I/O, no exceptions for control flow.

(ConversationState, Event) → (ConversationState, list[Action])

This means:

  • 100% testable: no mocks needed for the core logic
  • Deterministic: same input always produces same output
  • Easy to reason about: all I/O lives in the dispatch layer
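
To make the signature above concrete, here is a minimal sketch of a pure transition function. It is not Precog's actual code: the `Phase` members, the event names (`final_transcript`, `playback_done`) and the action name (`generate_reply`) are assumptions chosen for illustration.

```python
from dataclasses import dataclass, replace
from enum import Enum, auto


class Phase(Enum):
    LISTENING = auto()
    RESPONDING = auto()


@dataclass(frozen=True)
class ConversationState:
    phase: Phase = Phase.LISTENING
    transcript: str = ""


@dataclass(frozen=True)
class Event:
    kind: str  # hypothetical event names: "final_transcript", "playback_done"
    text: str = ""


@dataclass(frozen=True)
class Action:
    kind: str  # hypothetical action name: "generate_reply"
    payload: str = ""


def transition(
    state: ConversationState, event: Event
) -> tuple[ConversationState, list[Action]]:
    """Return the next state and the actions to dispatch; performs no I/O."""
    if state.phase is Phase.LISTENING and event.kind == "final_transcript":
        new_state = replace(state, phase=Phase.RESPONDING, transcript=event.text)
        return new_state, [Action("generate_reply", event.text)]
    if state.phase is Phase.RESPONDING and event.kind == "playback_done":
        return replace(state, phase=Phase.LISTENING, transcript=""), []
    # Unknown or out-of-order events leave the state untouched.
    return state, []
```

Because the states are frozen dataclasses and the function only returns values, a test can call `transition` directly and compare results with `==`, with no network or audio involved.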

Key Features

🔮 Speculative Pre-Generation

The headline feature. While the user is still speaking, Precog:

  1. Receives interim transcripts from Deepgram (partial, real-time)
  2. Feeds them to a fast, cheap LLM (e.g., Groq Llama 3.1 8B at ~100ms TTFT)
  3. Generates a speculative response based on predicted intent
  4. When the user finishes speaking:
    • Hit → The speculative response matches! Play it immediately. Near-0ms latency.
    • Miss → Discard and generate normally. No worse than not speculating.

The state machine adds a SPECULATING phase between LISTENING and RESPONDING:

LISTENING → interim transcript → SPECULATING → end of turn → RESPONDING
                                      ↓                           ↑
                              speculation ready ──── matches? ────┘
                                      ↓ no
                              discard, generate normally

Configure in config.yaml:

speculation:
  enabled: true
  provider: "groq"
  model: "llama-3.1-8b-instant"
  confidence_threshold: 0.7
  min_transcript_length: 15
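
The README does not say how a "match" is decided, so the following is only a sketch under one plausible assumption: a speculative reply is accepted when the interim transcript it was generated from is textually close to the final transcript. The function names are hypothetical; the two parameters mirror `min_transcript_length` and `confidence_threshold` from the config above.

```python
from difflib import SequenceMatcher
from typing import Optional


def should_speculate(interim: str, min_transcript_length: int = 15) -> bool:
    """Only speculate once the partial transcript is long enough to carry intent."""
    return len(interim.strip()) >= min_transcript_length


def accept_speculation(
    interim: str,
    final: str,
    speculative_reply: Optional[str],
    confidence_threshold: float = 0.7,
) -> Optional[str]:
    """Return the pre-generated reply on a hit, or None on a miss.

    Assumption: similarity between the interim and final transcripts stands in
    for "confidence"; a real system might compare predicted intents instead.
    """
    if speculative_reply is None:
        return None
    similarity = SequenceMatcher(None, interim.lower(), final.lower()).ratio()
    return speculative_reply if similarity >= confidence_threshold else None
```

On a miss the caller falls through to normal generation, which is why a wrong guess costs no extra latency.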

💾 TTS Phrase Caching

The more you talk, the faster Precog gets.

Common phrases like "Sure, let me check that for you" or "Is there anything else I can help with?" get synthesized by TTS over and over. Precog caches the audio output:

  • Cache hit → Skip TTS entirely. Instant playback. Zero API cost.
  • Cache miss → Normal TTS, then cache the result for next time.
  • Frequency-aware eviction: high-frequency phrases survive LRU eviction
  • Pre-warm on startup: synthesize common phrases before any calls arrive

The cache grows organically with usage. After a few hundred calls, most filler phrases are cached, and your effective TTS latency drops toward zero for a significant portion of responses.

tts_cache:
  enabled: true
  db_path: "./tts_cache.db"
  max_size_mb: 500
  pre_warm:
    - "Sure, let me check that for you."
    - "One moment please."
    - "Is there anything else I can help with?"

Monitor cache performance:

curl http://localhost:3040/cache/stats
# {"entries": 847, "size_mb": 23.4, "hit_rate": 43.2, "estimated_savings_seconds": 127.5}
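
As a rough illustration of the hit/miss flow and frequency-aware eviction described above, here is a small SQLite-backed sketch. It is not Precog's cache: the `PhraseCache` class, its schema, and the use of an entry limit (`max_entries`) instead of `max_size_mb` are all simplifying assumptions.

```python
import hashlib
import sqlite3
from typing import Optional


class PhraseCache:
    """Toy audio cache keyed by a hash of the normalised phrase."""

    def __init__(self, db_path: str = ":memory:", max_entries: int = 1000) -> None:
        self.max_entries = max_entries
        # isolation_level=None puts the connection in autocommit mode.
        self.db = sqlite3.connect(db_path, isolation_level=None)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS phrases ("
            "key TEXT PRIMARY KEY, audio BLOB NOT NULL, "
            "hits INTEGER NOT NULL DEFAULT 0, last_used INTEGER NOT NULL)"
        )
        self._clock = 0  # logical clock, so ordering does not depend on wall time

    @staticmethod
    def _key(text: str) -> str:
        # Lower-case and collapse whitespace so trivially different phrasings share a key.
        return hashlib.sha256(" ".join(text.lower().split()).encode()).hexdigest()

    def get(self, text: str) -> Optional[bytes]:
        self._clock += 1
        key = self._key(text)
        row = self.db.execute("SELECT audio FROM phrases WHERE key = ?", (key,)).fetchone()
        if row is None:
            return None  # cache miss: the caller runs TTS, then calls put()
        self.db.execute(
            "UPDATE phrases SET hits = hits + 1, last_used = ? WHERE key = ?",
            (self._clock, key),
        )
        return row[0]

    def put(self, text: str, audio: bytes) -> None:
        self._clock += 1
        key = self._key(text)
        self.db.execute(
            "INSERT OR REPLACE INTO phrases (key, audio, hits, last_used) VALUES (?, ?, 0, ?)",
            (key, audio, self._clock),
        )
        count = self.db.execute("SELECT COUNT(*) FROM phrases").fetchone()[0]
        overflow = count - self.max_entries
        if overflow > 0:
            # Evict the least-used entries first, oldest first on ties,
            # but never the phrase that was just stored.
            self.db.execute(
                "DELETE FROM phrases WHERE key IN ("
                "SELECT key FROM phrases WHERE key != ? "
                "ORDER BY hits ASC, last_used ASC LIMIT ?)",
                (key, overflow),
            )
```

Ordering eviction by `hits` before `last_used` is what lets a frequently played phrase outlive newer phrases that have never been reused.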

🔧 Multi-Provider LLM

Swap between providers without changing code. Fallback chain for resilience.

| Provider | TTFT (median) | Cost | Best For |
|---|---|---|---|
| Groq | ~100ms | Free tier | Speed, speculation |
| OpenAI | ~300ms | $$$ | Quality, tool calling |
| Anthropic | ~400ms | $$$ | Quality, safety |
| Ollama | ~200ms* | Free | Privacy, offline |

*Local hardware dependent.

llm:
  provider: "groq"
  model: "llama-3.3-70b-versatile"
  fallback:
    - provider: "openai"
      model: "gpt-4o-mini"
    - provider: "ollama"
      model: "llama3.1:8b"
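
A fallback chain like the one configured above can be pictured as "try each provider in order, return the first success". The sketch below shows that idea only; the `Provider` callable type and `generate_with_fallback` are hypothetical names, and real provider clients would be async and would stream tokens.

```python
from typing import Callable, Sequence

# Hypothetical signature: each provider takes a prompt and returns the reply text.
Provider = Callable[[str], str]


def generate_with_fallback(
    prompt: str, chain: Sequence[tuple[str, Provider]]
) -> tuple[str, str]:
    """Try each (name, provider) in order; return (name, reply) from the first success."""
    errors: list[str] = []
    for name, provider in chain:
        try:
            return name, provider(prompt)
        except Exception as exc:  # a real client would catch narrower error types
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Returning the provider name alongside the reply makes it easy to record which link in the chain actually served a turn.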

🛠️ Tool Calling

Agents can execute tools mid-conversation. Tool results stream back into the response naturally.

tools:
  - name: "get_weather"
    description: "Get current weather for a location"
    handler: "precog.tools:get_weather"
    parameters:
      location: { type: "string", required: true }

Built-in tools: get_weather, get_time. Add your own:

from precog.tools import register_tool

@register_tool("check_order")
async def check_order(order_id: str) -> str:
    # Your logic here
    return f"Order {order_id} ships tomorrow"

🧠 Conversation Memory

Knows who's calling and what you discussed before. SQLite-backed, persists across calls.

memory:
  enabled: true
  context_turns: 20
  caller_identification: true
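
One way to picture SQLite-backed memory keyed by caller is shown below. This is a sketch with an assumed schema, not Precog's storage layer: the `turns` table and the `open_memory`, `remember` and `recall` helpers are hypothetical, while `context_turns` mirrors the config key above.

```python
import sqlite3


def open_memory(db_path: str = ":memory:") -> sqlite3.Connection:
    db = sqlite3.connect(db_path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS turns ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, caller TEXT, role TEXT, text TEXT)"
    )
    return db


def remember(db: sqlite3.Connection, caller: str, role: str, text: str) -> None:
    """Append one turn for this caller's phone number."""
    db.execute(
        "INSERT INTO turns (caller, role, text) VALUES (?, ?, ?)", (caller, role, text)
    )
    db.commit()


def recall(
    db: sqlite3.Connection, caller: str, context_turns: int = 20
) -> list[tuple[str, str]]:
    """Return the caller's most recent turns, oldest first, for the LLM context."""
    rows = db.execute(
        "SELECT role, text FROM turns WHERE caller = ? ORDER BY id DESC LIMIT ?",
        (caller, context_turns),
    ).fetchall()
    return rows[::-1]
```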

📊 Metrics

Prometheus-compatible endpoint for monitoring:

  • precog_ttft_seconds: Time to first LLM token
  • precog_turn_latency_seconds: End-to-end turn latency
  • precog_speculation_hits_total / precog_speculation_misses_total
  • precog_tts_cache_hits_total / precog_tts_cache_misses_total
  • precog_active_calls: Current concurrent calls
  • precog_call_duration_seconds

📝 Call Transcripts

Auto-save full conversation transcripts with timestamps:

{
  "call_sid": "CA123...",
  "caller": "+1555...",
  "duration_seconds": 47.3,
  "entries": [
    {"role": "user", "text": "What's the weather?", "timestamp_ms": 1234567890},
    {"role": "assistant", "text": "It's 15°C and sunny in London.", "timestamp_ms": 1234567891}
  ]
}

Configuration Reference

See config.example.yaml for the full reference with comments.

| Section | Key Settings |
|---|---|
| agent | name, system_prompt, personality |
| voice | provider (elevenlabs/piper), voice_id, stability |
| llm | provider, model, temperature, fallback chain |
| speculation | enabled, model, confidence_threshold |
| tts_cache | enabled, max_size_mb, pre_warm phrases |
| tools | Array of tool definitions with handlers |
| memory | enabled, context_turns, caller_identification |
| recording | enabled, save_path |
| webhooks | call_start, call_end, tool_used URLs |
| server | port, api_key |
| metrics | enabled, port |

Deployment

Docker

FROM python:3.12-slim
WORKDIR /app
COPY . .
RUN pip install -e .
EXPOSE 3040
CMD ["python", "main.py"]

Railway / Render / Fly.io

  1. Set environment variables from .env.example
  2. Set PRECOG_CONFIG=config.yaml
  3. Start command: python main.py

Twilio Setup

  1. Get a Twilio phone number
  2. Set the Voice webhook URL to https://your-server/twilio/inbound (POST)
  3. Call the number

For outbound calls:

curl -X POST https://your-server/calls \
  -H "X-API-Key: your-key" \
  -H "Content-Type: application/json" \
  -d '{"to": "+1555123456"}'

Development

# Install dev dependencies
pip install -e ".[dev]"

# Run tests (no API keys needed!)
pytest

# Run with coverage
pytest --cov=precog

# Lint
ruff check .

# Type check
mypy precog/

# Benchmark TTFT across providers
python scripts/bench_ttft.py --providers groq,openai --rounds 10

How It Works: Latency Breakdown

Traditional voice agent:

User stops speaking → STT final → LLM generate (300-800ms) → TTS (200-500ms) → Play
Total: 500-1300ms

Precog with speculation hit + cache hit:

User stops speaking → STT final → speculation matches! → cache hit! → Play
Total: ~50ms (STT finalization only)

Precog with speculation miss + cache miss (worst case = same as traditional):

User stops speaking → STT final → discard speculation → LLM generate → TTS → Play
Total: 500-1300ms

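The key insight: a speculation miss adds no latency (the speculative reply is simply discarded, at the cost of one extra call to the fast model), while a hit removes the LLM step from the critical path. Over time, the TTS cache absorbs more and more common phrases, so even turns where speculation misses increasingly skip the TTS step.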
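
The three scenarios above can be folded into a back-of-the-envelope expected latency. The sketch below assumes speculation hits and cache hits are independent, uses the midpoints of the quoted ranges (550 ms for the LLM, 350 ms for TTS), and adds the ~50 ms STT finalization to every path. The function name and defaults are illustrative, not measured values.

```python
def expected_latency_ms(
    spec_hit_rate: float,
    cache_hit_rate: float,
    stt_ms: float = 50.0,
    llm_ms: float = 550.0,
    tts_ms: float = 350.0,
) -> float:
    """Average turn latency when the LLM and TTS steps are each skipped on a hit."""
    llm_cost = (1.0 - spec_hit_rate) * llm_ms    # paid only on speculation misses
    tts_cost = (1.0 - cache_hit_rate) * tts_ms   # paid only on cache misses
    return stt_ms + llm_cost + tts_cost


baseline = expected_latency_ms(0.0, 0.0)   # no hits at all
best = expected_latency_ms(1.0, 1.0)       # both hit every time
mixed = expected_latency_ms(0.4, 0.5)      # hypothetical hit rates
```

Under these assumptions the baseline is 950 ms, the best case is 50 ms, and a 40% speculation hit rate with a 50% cache hit rate averages roughly 555 ms.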


Inspired By

  • Shuo: Pure functional state machine architecture, streaming pipeline design
  • Pipecat: Multi-provider approach
  • The insight from an ex-Amazon Alexa engineer: "Median human-to-human turn delay is 0ms"

License

MIT. See LICENSE.
