Flink-runtime showcase — exercises Flink-only capabilities (Flink-state keyed HNSW vector memory). The portable cores ship their own in-process HNSW index, so RAG itself runs everywhere (see
banking-rag.yamland the banking agent on every runtime); what's Flink-specific here is binding the index to Flink's checkpointed state.
Source:
src/main/java/org/agentic/flink/example/rag/RagResearchExample.javaInline README:src/main/java/org/agentic/flink/example/rag/README.md
A retrieval-augmented assistant has four levers that all matter:
- Embedder quality — bad embeddings ≡ retrieving the wrong passages.
- Vector store latency — every query embeds + searches; cost compounds.
- Rerank precision — embedding similarity is a coarse signal. A cross-encoder over the top-k catches "looks similar, is wrong" cases.
- Citations — the LLM has to ground its answer or the system is just a confident summarizer.
The example wires all four with the framework's primitives:
Query
│
▼ keyBy(topic) — each topic gets its own KB scope
│
▼ EmbeddingClient.embed(question)
│
▼ VectorMemory.search(queryVec, k=4) — brute-force over MapState
│
▼ Scorer.scorePair(passage, question) — cross-encoder rerank
│
▼ LLM answers using top-3 passages, cites [1] [2] [3]
The framework ships brute-force KNN backed by MapState<String, VectorEntry>.
For a few thousand vectors per key (the typical "conversation-local recall"
use case), exact search at d=384 is sub-millisecond. The state itself rides
in Flink checkpoints, so a job restart picks up where it left off — no
"rebuild the index" boot sequence.
When you outgrow that (10⁵+ vectors per key), drop in an HNSW-backed
VectorMemorySpec via the SPI. The agent code doesn't change.
Topics are independent corpora. Keying by topic gives each topic its own
VectorMemory scope, which means:
- A noisy topic doesn't drown a clean one in the same brute-force scan.
- Different topics can use different embedders or dimensions if you want (one operator per topic, different specs).
- Eviction policies (TTL, max-items) apply per topic.
The demo seeds 4 documents per topic on the first event using a
ValueState<Boolean> flag. In production you'd either:
- Hydrate from a
LongTermMemoryStoreon cold-start — the framework's standard pattern. - Stream new documents in via a
Channel<KeyedContextItem>—KafkaContextChannel,PostgresChangeChannel,RedisPubSubChannel, or any custom transport. The channel producesKeyedContextItemrecords that the agent operator union-connects with its main input.
Both approaches are covered in docs/memory.md. For the
unified channel SPI see docs/channels.md. For a more
ambitious example with two inputs sharing a corpus, see
live-research.
The demo keeps the runtime self-contained; to let the LLM call out to file or
web tools through Model Context Protocol, register an McpServerSpec and
discover its tools at job startup:
McpServerSpec everything = McpServerSpec.stdio(
"everything", "npx", "-y", "@modelcontextprotocol/server-everything");
Agent agent = Agent.builder()
.withId("rag")
.withChatConnection(chat)
.withChatSetup(chatSetup)
.withMcpServer(everything)
.build();
// In the operator's open():
try (McpClient client = new McpClient(everything)) {
client.initialize();
for (McpToolExecutor t : McpToolRegistry.discover(everything, client)) {
toolRegistry.register(t.getToolId(), t);
}
}Each discovered tool surfaces as a regular ToolExecutor — the LLM doesn't
know or care that it's MCP under the hood.
On a recent laptop with the suggested models:
- Embed: ~12 ms (MiniLM-L6-v2, batch 1, CPU)
- Brute-force search (k=4 over 4 vectors): ~0.1 ms
- Rerank top-4: ~80 ms (cross-encoder, batch 4)
- LLM answer (qwen2.5:3b): ~600 ms
The LLM dominates; everything else is overhead. Scale the embedder + vector search up to 10k vectors per key before the search side starts to matter.
- Cold first run: first call downloads model weights (~250 MB across the three HF models). Subsequent runs are cache-warm.
- Dimension mismatch:
FlinkStateVectorMemory.spec(384)must match the embedder's actual dimension. The framework throwsIllegalArgumentExceptionat insert time if they disagree. - No-op rerank when k=1: with a single hit, the cross-encoder still runs. Skip it conditionally if you care about the savings.