Sub-500ms voice agent with speculative pre-generation. The more you talk, the faster it gets.
Precog is an open-source framework for building ultra-low-latency phone agents. It uses speculative response generation — predicting what you'll say while you're still talking — to achieve near-zero perceived latency. Combined with TTS phrase caching, common responses become instant.
```bash
pip install -e .
cp config.example.yaml config.yaml  # Edit with your API keys
python main.py
```
That's it. Point your Twilio number's webhook at https://your-server/twilio/inbound and call it.
The median human-to-human turn delay is 0ms. Our brains predict what the other person will say and pre-compute responses. Precog does the same thing with LLMs.
| Feature | Precog | Vapi | Bland | Retell | Shuo |
|---|---|---|---|---|---|
| Speculative pre-generation | ✅ | ❌ | ❌ | ❌ | ❌ |
| TTS phrase caching | ✅ | ❌ | ❌ | ❌ | ❌ |
| Multi-provider LLM | ✅ | ✅ | ❌ | ✅ | ❌ |
| Tool calling | ✅ | ✅ | ✅ | ✅ | ❌ |
| Conversation memory | ✅ | ✅ | ✅ | ✅ | ❌ |
| Open source | ✅ | ❌ | ❌ | ❌ | ✅ |
| Self-hosted | ✅ | ❌ | ❌ | ❌ | ✅ |
| Pure state machine | ✅ | ? | ? | ? | ✅ |
| Local TTS fallback | ✅ | ❌ | ❌ | ❌ | ❌ |
| Per-call cost | ~$0 | ~$0.10/min | ~$0.09/min | ~$0.10/min | ~$0 |
```
┌─────────────────────────────────┐
│ Precog Server │
│ │
┌──────────┐ WebSocket ┌─────┴─────┐ │
│ Twilio │◄──────────────►│ Twilio WS │ │
│ (Phone) │ µ-law audio │ Handler │ │
└──────────┘ └─────┬─────┘ │
│ │
┌─────▼─────┐ ┌──────────────────┐ │
│ Deepgram │───►│ State Machine │ │
│ Flux │ │ (pure function) │ │
│ (STT) │ └────────┬─────────┘ │
└───────────┘ │ actions │
┌─────▼─────┐ │
┌──────────┐ │ Dispatch │ │
│ TTS Cache│◄────────────│ Layer │ │
│ (SQLite) │ cache hit? │ (I/O) │ │
└──────────┘ └──┬──┬──┬───┘ │
│ │ │ │
┌────────────────────────────────┘ │ └──────┐ │
│ │ │ │
┌─────▼──────┐ ┌──────────▼┐ ┌─────▼┐ │
│ LLM Service│ │ElevenLabs │ │Tools │ │
│ Groq/OpenAI│ │ TTS Pool │ │Engine│ │
│ Claude/Olla│ │(warm conn)│ └──────┘ │
└────────────┘ └───────────┘ │
│ │
┌─────▼──────┐ Speculative Pre-Generation │
│ Speculation│ ───────────────────────── │
│ Engine │ While user talks, predict │
│(fast model)│ intent → pre-generate reply │
└────────────┘ │
│
┌──────────────┐ ┌──────────────┐ │
│ Memory DB │ │ Prometheus │ │
│ (SQLite) │ │ Metrics │ │
└──────────────┘ └──────────────┘ │
└─────────────────────────────────┘
```
The core of Precog is a pure functional state machine. Given a state and an event, it returns a new state and a list of actions. No side effects, no I/O, no exceptions for control flow.
```
(ConversationState, Event) → (ConversationState, list[Action])
```

This means:
- 100% testable — no mocks needed for the core logic
- Deterministic — same input always produces same output
- Easy to reason about — all I/O lives in the dispatch layer
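As a sketch of what such a transition function can look like (the type names, the `Say` action, and the event shape here are illustrative assumptions, not Precog's actual API):

```python
from dataclasses import dataclass

# Illustrative sketch of a pure (state, event) -> (state, actions) core.
# "State", "Say", and the dict-shaped events are assumptions for this example.

@dataclass(frozen=True)
class State:
    phase: str = "LISTENING"   # LISTENING | RESPONDING
    transcript: str = ""

@dataclass(frozen=True)
class Say:
    text: str

def step(state: State, event: dict) -> tuple[State, list]:
    """Pure transition: no I/O, no exceptions for control flow."""
    if event["type"] == "final_transcript":
        new_state = State(phase="RESPONDING", transcript=event["text"])
        return new_state, [Say(f"You said: {event['text']}")]
    return state, []   # unrecognized events leave the state unchanged

# Deterministic and testable with no mocks:
s1, actions = step(State(), {"type": "final_transcript", "text": "hi"})
```

Because `step` touches no I/O, the whole conversation logic can be unit-tested by asserting on the returned states and action lists.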
The headline feature. While the user is still speaking, Precog:
- Receives interim transcripts from Deepgram (partial, real-time)
- Feeds them to a fast, cheap LLM (e.g., Groq Llama 3.1 8B at ~100ms TTFT)
- Generates a speculative response based on predicted intent
- When the user finishes speaking:
- Hit → The speculative response matches! Play it immediately. Near-0ms latency.
- Miss → Discard and generate normally. No worse than not speculating.
The state machine adds a SPECULATING phase between LISTENING and RESPONDING:
```
LISTENING → interim transcript → SPECULATING → end of turn → RESPONDING
                                      ↓                          ↑
                              speculation ready ──── matches? ───┘
                                      ↓ no
                          discard, generate normally
```
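One plausible way to implement the "matches?" check is fuzzy similarity between the interim transcript the speculation was generated from and the final transcript, gated by a confidence threshold. This sketch uses `difflib`; Precog's actual matcher may differ.

```python
from difflib import SequenceMatcher

# Hedged sketch: decide hit vs. miss by comparing the interim transcript
# the speculation was based on against the final transcript.
def speculation_hit(interim: str, final: str, threshold: float = 0.7) -> bool:
    ratio = SequenceMatcher(None, interim.lower(), final.lower()).ratio()
    return ratio >= threshold

# A near-identical ending counts as a hit; a different intent is a miss.
hit = speculation_hit("what's the weather", "what's the weather?")
miss = speculation_hit("book a table for two", "cancel my order")
```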
Configure in config.yaml:
```yaml
speculation:
  enabled: true
  provider: "groq"
  model: "llama-3.1-8b-instant"
  confidence_threshold: 0.7
  min_transcript_length: 15
```

The more you talk, the faster Precog gets.
Common phrases like "Sure, let me check that for you" or "Is there anything else I can help with?" get synthesized by TTS over and over. Precog caches the audio output:
- Cache hit → Skip TTS entirely. Instant playback. Zero API cost.
- Cache miss → Normal TTS, then cache the result for next time.
- Frequency-aware eviction — high-frequency phrases survive LRU eviction
- Pre-warm on startup — synthesize common phrases before any calls arrive
The cache grows organically with usage. After a few hundred calls, most filler phrases are cached, and your effective TTS latency drops toward zero for a significant portion of responses.
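A minimal sketch of such a cache, assuming a SQLite table keyed by a hash of the voice and normalized text (Precog's actual schema also tracks hit frequency for eviction):

```python
import hashlib
import sqlite3

# Illustrative phrase-keyed TTS cache, not Precog's real schema.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE cache (key TEXT PRIMARY KEY, audio BLOB, hits INT DEFAULT 0)")

def cache_key(voice_id: str, text: str) -> str:
    # Normalize so "One moment please." hashes the same every call.
    return hashlib.sha256(f"{voice_id}:{text.strip().lower()}".encode()).hexdigest()

def get_or_synthesize(voice_id: str, text: str, synthesize) -> bytes:
    key = cache_key(voice_id, text)
    row = db.execute("SELECT audio FROM cache WHERE key = ?", (key,)).fetchone()
    if row:  # hit: no TTS API call, instant playback
        db.execute("UPDATE cache SET hits = hits + 1 WHERE key = ?", (key,))
        return row[0]
    audio = synthesize(text)  # miss: call TTS, then store for next time
    db.execute("INSERT INTO cache (key, audio) VALUES (?, ?)", (key, audio))
    return audio

calls = []
def fake_tts(text: str) -> bytes:  # stand-in for a real TTS provider call
    calls.append(text)
    return b"AUDIO:" + text.encode()

a1 = get_or_synthesize("v1", "One moment please.", fake_tts)
a2 = get_or_synthesize("v1", "One moment please.", fake_tts)  # served from cache
```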
```yaml
tts_cache:
  enabled: true
  db_path: "./tts_cache.db"
  max_size_mb: 500
  pre_warm:
    - "Sure, let me check that for you."
    - "One moment please."
    - "Is there anything else I can help with?"
```

Monitor cache performance:
```bash
curl http://localhost:3040/cache/stats
# {"entries": 847, "size_mb": 23.4, "hit_rate": 43.2, "estimated_savings_seconds": 127.5}
```

Swap between providers without changing code. Fallback chain for resilience.
| Provider | TTFT (median) | Cost | Best For |
|---|---|---|---|
| Groq | ~100ms | Free tier | Speed, speculation |
| OpenAI | ~300ms | $$$ | Quality, tool calling |
| Anthropic | ~400ms | $$$ | Quality, safety |
| Ollama | ~200ms* | Free | Privacy, offline |
*Local hardware dependent.
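The fallback chain boils down to "try each provider in order until one answers". A minimal sketch of that pattern (`generate_with_fallback` and the provider callables are hypothetical names, not Precog's dispatch code):

```python
# Hedged sketch of a provider fallback chain: try each in order,
# moving on when one fails.
def generate_with_fallback(prompt: str, providers) -> tuple[str, str]:
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # a real impl would catch provider-specific errors
            errors.append((name, exc))
    raise RuntimeError(f"all providers failed: {errors}")

def flaky(prompt: str) -> str:      # simulates the primary provider timing out
    raise TimeoutError("primary provider down")

def ok(prompt: str) -> str:         # simulates a healthy fallback provider
    return "fallback reply"

name, reply = generate_with_fallback("hi", [("groq", flaky), ("openai", ok)])
```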
```yaml
llm:
  provider: "groq"
  model: "llama-3.3-70b-versatile"
  fallback:
    - provider: "openai"
      model: "gpt-4o-mini"
    - provider: "ollama"
      model: "llama3.1:8b"
```

Agents can execute tools mid-conversation. Tool results stream back into the response naturally.
```yaml
tools:
  - name: "get_weather"
    description: "Get current weather for a location"
    handler: "precog.tools:get_weather"
    parameters:
      location: { type: "string", required: true }
```

Built-in tools: `get_weather`, `get_time`. Add your own:
```python
from precog.tools import register_tool

@register_tool("check_order")
async def check_order(order_id: str) -> str:
    # Your logic here
    return f"Order {order_id} ships tomorrow"
```

Knows who's calling and what you discussed before. SQLite-backed, persists across calls.
```yaml
memory:
  enabled: true
  context_turns: 20
  caller_identification: true
```

Prometheus-compatible endpoint for monitoring:
- `precog_ttft_seconds` — Time to first LLM token
- `precog_turn_latency_seconds` — End-to-end turn latency
- `precog_speculation_hits_total` / `precog_speculation_misses_total`
- `precog_tts_cache_hits_total` / `precog_tts_cache_misses_total`
- `precog_active_calls` — Current concurrent calls
- `precog_call_duration_seconds`
Auto-save full conversation transcripts with timestamps:
```json
{
  "call_sid": "CA123...",
  "caller": "+1555...",
  "duration_seconds": 47.3,
  "entries": [
    {"role": "user", "text": "What's the weather?", "timestamp_ms": 1234567890},
    {"role": "assistant", "text": "It's 15°C and sunny in London.", "timestamp_ms": 1234567891}
  ]
}
```

See `config.example.yaml` for the full reference with comments.
| Section | Key Settings |
|---|---|
| `agent` | name, system_prompt, personality |
| `voice` | provider (elevenlabs/piper), voice_id, stability |
| `llm` | provider, model, temperature, fallback chain |
| `speculation` | enabled, model, confidence_threshold |
| `tts_cache` | enabled, max_size_mb, pre_warm phrases |
| `tools` | Array of tool definitions with handlers |
| `memory` | enabled, context_turns, caller_identification |
| `recording` | enabled, save_path |
| `webhooks` | call_start, call_end, tool_used URLs |
| `server` | port, api_key |
| `metrics` | enabled, port |
```dockerfile
FROM python:3.12-slim
WORKDIR /app
COPY . .
RUN pip install -e .
EXPOSE 3040
CMD ["python", "main.py"]
```

- Set environment variables from `.env.example`
- Set `PRECOG_CONFIG=config.yaml`
- Start command: `python main.py`
- Get a Twilio phone number
- Set the Voice webhook URL to `https://your-server/twilio/inbound` (POST)
- Call the number
For outbound calls:
```bash
curl -X POST https://your-server/calls \
  -H "X-API-Key: your-key" \
  -H "Content-Type: application/json" \
  -d '{"to": "+1555123456"}'
```

```bash
# Install dev dependencies
pip install -e ".[dev]"

# Run tests (no API keys needed!)
pytest

# Run with coverage
pytest --cov=precog

# Lint
ruff check .

# Type check
mypy precog/

# Benchmark TTFT across providers
python scripts/bench_ttft.py --providers groq,openai --rounds 10
```

Traditional voice agent:
```
User stops speaking → STT final → LLM generate (300-800ms) → TTS (200-500ms) → Play
Total: 500-1300ms
```

Precog with speculation hit + cache hit:

```
User stops speaking → STT final → speculation matches! → cache hit! → Play
Total: ~50ms (STT finalization only)
```

Precog with speculation miss + cache miss (worst case = same as traditional):

```
User stops speaking → STT final → discard speculation → LLM generate → TTS → Play
Total: 500-1300ms
```
The key insight: a speculation miss costs only a cheap extra LLM call (the draft is simply discarded), while a hit eliminates the generation latency entirely. And over time, the TTS cache absorbs more and more common phrases, so even turns where speculation misses play back faster.
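A back-of-envelope model of what this means for average turn latency, using the numbers above (~50ms on a hit, and an assumed 900ms midpoint of the 500-1300ms miss range):

```python
# Expected turn latency as the combined speculation + cache hit rate grows.
# HIT_MS and MISS_MS come from the README's own ranges; the 900ms midpoint
# is an assumption for illustration, not a measurement.
HIT_MS, MISS_MS = 50, 900

def expected_latency_ms(hit_rate: float) -> float:
    return hit_rate * HIT_MS + (1 - hit_rate) * MISS_MS

for rate in (0.0, 0.3, 0.6):
    print(f"hit rate {rate:.0%}: ~{expected_latency_ms(rate):.0f}ms")
```

Even a 30% hit rate cuts the expected latency by roughly a quarter under these assumptions, and the hit rate climbs as the cache warms.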
- Shuo — Pure functional state machine architecture, streaming pipeline design
- Pipecat — Multi-provider approach
- The insight from an ex-Amazon Alexa engineer: "Median human-to-human turn delay is 0ms"
MIT — see LICENSE.