Sub-500ms voice agent with speculative pre-generation. The more you talk, the faster it gets.
Precog is an open-source framework for building ultra-low-latency phone agents. It uses speculative response generation (predicting what you'll say while you're still talking) to achieve near-zero perceived latency. Combined with TTS phrase caching, common responses become instant.
pip install -e .
cp config.example.yaml config.yaml # Edit with your API keys
python main.py
That's it. Point your Twilio number's webhook at https://your-server/twilio/inbound and call it.
The median human-to-human turn delay is 0ms. Our brains predict what the other person will say and pre-compute responses. Precog does the same thing with LLMs.
| Feature | Precog | Vapi | Bland | Retell | Shuo |
|---|---|---|---|---|---|
| Speculative pre-generation | โ | โ | โ | โ | โ |
| TTS phrase caching | โ | โ | โ | โ | โ |
| Multi-provider LLM | โ | โ | โ | โ | โ |
| Tool calling | โ | โ | โ | โ | โ |
| Conversation memory | โ | โ | โ | โ | โ |
| Open source | โ | โ | โ | โ | โ |
| Self-hosted | โ | โ | โ | โ | โ |
| Pure state machine | โ | ? | ? | ? | โ |
| Local TTS fallback | โ | โ | โ | โ | โ |
| Per-call cost | ~$0 | ~$0.10/min | ~$0.09/min | ~$0.10/min | ~$0 |
                           +------------------------------------------------------+
                           |                    Precog Server                     |
                           |                                                      |
+----------+   WebSocket   |  +-----------+                                       |
|  Twilio  |<------------->|  | Twilio WS |                                       |
| (Phone)  |  µ-law audio  |  |  Handler  |                                       |
+----------+               |  +-----+-----+                                       |
                           |        |                                             |
                           |  +-----v-----+         +------------------+          |
                           |  | Deepgram  |-------->|  State Machine   |          |
                           |  |   Flux    |         | (pure function)  |          |
                           |  |   (STT)   |         +--------+---------+          |
                           |  +-----------+                  | actions            |
                           |                           +-----v-----+              |
                           |  +-----------+            | Dispatch  |              |
                           |  | TTS Cache |<---------->|   Layer   |              |
                           |  | (SQLite)  | cache hit? |   (I/O)   |              |
                           |  +-----------+            +--+--+--+--+              |
                           |                              |  |  |                 |
                           |        +---------------------+  |  +---------+       |
                           |        |                        |            |       |
                           |  +-----v------+           +-----v-----+   +--v---+   |
                           |  | LLM Service|           |ElevenLabs |   |Tools |   |
                           |  | Groq/OpenAI|           | TTS Pool  |   |Engine|   |
                           |  | Claude/Olla|           |(warm conn)|   +------+   |
                           |  +-----+------+           +-----------+              |
                           |        |                                             |
                           |  +-----v------+  Speculative Pre-Generation          |
                           |  | Speculation|  --------------------------          |
                           |  |   Engine   |  While user talks, predict           |
                           |  |(fast model)|  intent → pre-generate reply         |
                           |  +------------+                                      |
                           |                                                      |
                           |  +--------------+  +--------------+                  |
                           |  |  Memory DB   |  |  Prometheus  |                  |
                           |  |   (SQLite)   |  |   Metrics    |                  |
                           |  +--------------+  +--------------+                  |
                           +------------------------------------------------------+
The core of Precog is a pure functional state machine. Given a state and an event, it returns a new state and a list of actions. No side effects, no I/O, no exceptions for control flow.
(ConversationState, Event) → (ConversationState, list[Action])

This means:
- 100% testable: no mocks needed for the core logic
- Deterministic: same input always produces same output
- Easy to reason about: all I/O lives in the dispatch layer
The headline feature. While the user is still speaking, Precog:
- Receives interim transcripts from Deepgram (partial, real-time)
- Feeds them to a fast, cheap LLM (e.g., Groq Llama 3.1 8B at ~100ms TTFT)
- Generates a speculative response based on predicted intent
- When the user finishes speaking:
  - Hit: The speculative response matches! Play it immediately. Near-0ms latency.
  - Miss: Discard and generate normally. Latency is no worse than without speculation.
The state machine adds a SPECULATING phase between LISTENING and RESPONDING:
LISTENING → interim transcript → SPECULATING → end of turn → RESPONDING
                                      ↓                          ↑
                              speculation ready ---- matches? ---+
                                                        ↓ no
                                           discard, generate normally
Configure in config.yaml:
speculation:
  enabled: true
  provider: "groq"
  model: "llama-3.1-8b-instant"
  confidence_threshold: 0.7
  min_transcript_length: 15

The more you talk, the faster Precog gets.
Common phrases like "Sure, let me check that for you" or "Is there anything else I can help with?" get synthesized by TTS over and over. Precog caches the audio output:
- Cache hit: Skip TTS entirely. Instant playback. Zero API cost.
- Cache miss: Normal TTS, then cache the result for next time.
- Frequency-aware eviction: high-frequency phrases survive LRU eviction
- Pre-warm on startup: synthesize common phrases before any calls arrive
The cache grows organically with usage. After a few hundred calls, most filler phrases are cached, and your effective TTS latency drops toward zero for a significant portion of responses.
tts_cache:
  enabled: true
  db_path: "./tts_cache.db"
  max_size_mb: 500
  pre_warm:
    - "Sure, let me check that for you."
    - "One moment please."
    - "Is there anything else I can help with?"

Monitor cache performance:
curl http://localhost:3040/cache/stats
# {"entries": 847, "size_mb": 23.4, "hit_rate": 43.2, "estimated_savings_seconds": 127.5}

Swap between providers without changing code. Fallback chain for resilience.
| Provider | TTFT (median) | Cost | Best For |
|---|---|---|---|
| Groq | ~100ms | Free tier | Speed, speculation |
| OpenAI | ~300ms | $$$ | Quality, tool calling |
| Anthropic | ~400ms | $$$ | Quality, safety |
| Ollama | ~200ms* | Free | Privacy, offline |
*Local hardware dependent.
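A fallback chain of this kind could look roughly like the sketch below. The provider callables are stand-ins rather than real client calls, and `complete_with_fallback` and its `timeout_s` parameter are hypothetical names chosen for illustration: each provider is tried in order, and the first one to answer within the timeout wins.

```python
import asyncio
from collections.abc import Awaitable, Callable

# Hypothetical provider signature: prompt in, completion text out.
Provider = Callable[[str], Awaitable[str]]


async def complete_with_fallback(
    prompt: str,
    chain: list[tuple[str, Provider]],
    timeout_s: float = 2.0,
) -> tuple[str, str]:
    """Try providers in order; return (provider_name, reply) from the first success."""
    errors: list[str] = []
    for name, provider in chain:
        try:
            reply = await asyncio.wait_for(provider(prompt), timeout=timeout_s)
            return name, reply
        except Exception as exc:  # timeouts, rate limits, network errors, ...
            errors.append(f"{name}: {exc!r}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


async def flaky_groq(prompt: str) -> str:  # stand-in for a real client call
    raise ConnectionError("rate limited")


async def fake_openai(prompt: str) -> str:  # stand-in for a real client call
    return f"echo: {prompt}"


result = asyncio.run(
    complete_with_fallback("hello", [("groq", flaky_groq), ("openai", fake_openai)])
)
```

Returning the provider name alongside the reply makes it easy to record which link in the chain actually served each turn.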
llm:
  provider: "groq"
  model: "llama-3.3-70b-versatile"
  fallback:
    - provider: "openai"
      model: "gpt-4o-mini"
    - provider: "ollama"
      model: "llama3.1:8b"

Agents can execute tools mid-conversation. Tool results stream back into the response naturally.
tools:
  - name: "get_weather"
    description: "Get current weather for a location"
    handler: "precog.tools:get_weather"
    parameters:
      location: { type: "string", required: true }

Built-in tools: get_weather, get_time. Add your own:
from precog.tools import register_tool

@register_tool("check_order")
async def check_order(order_id: str) -> str:
    # Your logic here
    return f"Order {order_id} ships tomorrow"

Knows who's calling and what you discussed before. SQLite-backed, persists across calls.
memory:
  enabled: true
  context_turns: 20
  caller_identification: true

Prometheus-compatible endpoint for monitoring:
- precog_ttft_seconds: Time to first LLM token
- precog_turn_latency_seconds: End-to-end turn latency
- precog_speculation_hits_total / precog_speculation_misses_total
- precog_tts_cache_hits_total / precog_tts_cache_misses_total
- precog_active_calls: Current concurrent calls
- precog_call_duration_seconds
Auto-save full conversation transcripts with timestamps:
{
  "call_sid": "CA123...",
  "caller": "+1555...",
  "duration_seconds": 47.3,
  "entries": [
    {"role": "user", "text": "What's the weather?", "timestamp_ms": 1234567890},
    {"role": "assistant", "text": "It's 15°C and sunny in London.", "timestamp_ms": 1234567891}
  ]
}

See config.example.yaml for the full reference with comments.
| Section | Key Settings |
|---|---|
| agent | name, system_prompt, personality |
| voice | provider (elevenlabs/piper), voice_id, stability |
| llm | provider, model, temperature, fallback chain |
| speculation | enabled, model, confidence_threshold |
| tts_cache | enabled, max_size_mb, pre_warm phrases |
| tools | Array of tool definitions with handlers |
| memory | enabled, context_turns, caller_identification |
| recording | enabled, save_path |
| webhooks | call_start, call_end, tool_used URLs |
| server | port, api_key |
| metrics | enabled, port |
FROM python:3.12-slim
WORKDIR /app
COPY . .
RUN pip install -e .
EXPOSE 3040
CMD ["python", "main.py"]

- Set environment variables from .env.example
- Set PRECOG_CONFIG=config.yaml
- Start command: python main.py

- Get a Twilio phone number
- Set the Voice webhook URL to https://your-server/twilio/inbound (POST)
- Call the number
For outbound calls:
curl -X POST https://your-server/calls \
  -H "X-API-Key: your-key" \
  -H "Content-Type: application/json" \
  -d '{"to": "+1555123456"}'

# Install dev dependencies
pip install -e ".[dev]"
# Run tests (no API keys needed!)
pytest
# Run with coverage
pytest --cov=precog
# Lint
ruff check .
# Type check
mypy precog/
# Benchmark TTFT across providers
python scripts/bench_ttft.py --providers groq,openai --rounds 10

Traditional voice agent:
User stops speaking → STT final → LLM generate (300-800ms) → TTS (200-500ms) → Play
Total: 500-1300ms
Precog with speculation hit + cache hit:
User stops speaking → STT final → speculation matches! → cache hit! → Play
Total: ~50ms (STT finalization only)
Precog with speculation miss + cache miss (worst case = same as traditional):
User stops speaking → STT final → discard speculation → LLM generate → TTS → Play
Total: 500-1300ms
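The key insight: a speculation miss adds no latency (the speculative response is simply discarded), while a hit removes LLM generation from the wait after the user stops speaking. Over time, the TTS cache absorbs more and more common phrases, so even turns that miss on speculation get faster whenever their response audio is already cached.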
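As a rough back-of-the-envelope model of these paths, the expected latency per turn can be estimated from the two hit rates. This is a simplification under stated assumptions: a speculation hit is assumed to skip LLM generation entirely, a cache hit is assumed to skip TTS entirely, and the ~50ms STT finalization is paid on every turn. The function name and parameters are illustrative, not part of Precog.

```python
def expected_turn_latency_ms(
    p_spec_hit: float,
    p_cache_hit: float,
    llm_ms: float,
    tts_ms: float,
    stt_final_ms: float = 50.0,
) -> float:
    """Expected wait after the user stops speaking, under the additive model above."""
    llm_cost = (1.0 - p_spec_hit) * llm_ms   # paid only on a speculation miss
    tts_cost = (1.0 - p_cache_hit) * tts_ms  # paid only on a cache miss
    return stt_final_ms + llm_cost + tts_cost


best = expected_turn_latency_ms(1.0, 1.0, llm_ms=600, tts_ms=400)   # both hit
mixed = expected_turn_latency_ms(0.5, 0.5, llm_ms=600, tts_ms=400)  # half the turns hit
worst = expected_turn_latency_ms(0.0, 0.0, llm_ms=600, tts_ms=400)  # both miss
```

The 600ms and 400ms inputs are illustrative values picked from inside the ranges quoted above. Plugging in measured hit rates from the metrics endpoint would give a more realistic estimate.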
- Shuo: Pure functional state machine architecture, streaming pipeline design
- Pipecat: Multi-provider approach
- The insight from an ex-Amazon Alexa engineer: "Median human-to-human turn delay is 0ms"
MIT. See LICENSE.