Skip to content

Evaluate database for Metis (move off JSON files) #8

@0b00101111

Description

@0b00101111

Context

Metis currently stores atoms, entities, graph index, and embeddings as JSON files on disk. This works for the pipeline/batch-processing phase but won't scale for retrieval APIs or multi-user access.

Requirements

Requirement Source
Store atoms with variable-length roles (JSONB) 17 core frame types, each with different role schemas
Vector search over atom embeddings Hybrid retrieval: BM25 + cosine similarity
Graph traversal Entity links, reinforces/contradicts/extends edges
Full-text search (EN + CJK) BM25 tokenizer already handles CJK bigrams
Batch writes, read-heavy after ingestion Pipeline processes books in bulk, then serves queries
Multi-source knowledge graphs 8 books now, growing — cross-domain entity resolution
Checkpoint/resumable state Pipeline stages save intermediate results

Candidates Evaluated

1. Supabase (Postgres + pgvector) — Recommended

  • JSONB for atoms, pgvector for embeddings, recursive CTEs for graph traversal, tsvector for full-text search (including CJK)
  • Row Level Security if we go multi-user
  • Mature, documented, free tier generous, self-hostable
  • Includes auth, storage, realtime — batteries we'll want later
  • Gap: graph traversal via recursive CTEs works but gets awkward at depth. No built-in embedding generation.

2. DB9 (db9.ai)

  • Built-in embedding() eliminates external embedding pipeline. CHUNK_TEXT() for RAG. CJK full-text search supported. Sub-second provisioning.
  • Gaps: single-node, no SERIALIZABLE isolation, no realtime, no auth. Young product, limited community. No GIN index support. Extensions compiled in, not loadable.
  • Interesting for the embedding integration. Risky as sole storage layer.

3. Neon (serverless Postgres)

  • Same Postgres story as Supabase. Branching for dev/test. Scale-to-zero.
  • Less batteries-included than Supabase. No bundled auth/storage.
  • Good if we want Postgres without Supabase's platform overhead.

4. SQLite / Turso

  • Wrong tool. No native vector search, weak JSONB querying, CJK full-text search is painful.

5. Neo4j

  • Best graph queries (Cypher), worst everything else. Overkill unless graph traversal becomes the primary access pattern. Expensive hosting.

6. Postgres + pgvector + Apache AGE

  • Architecturally perfect — relational + graph (Cypher) + vector in one DB. AGE is less mature than Neo4j. Not available on Supabase (self-host or find compatible managed provider).

Recommendation: Supabase

Why:

  1. Pipeline iteration is the bottleneck, not the database — DB should stay out of the way
  2. Query patterns are 80% relational, 20% graph — recursive CTEs handle current graph needs
  3. pgvector is good enough — embedding pipeline is ~20 lines in embedding-service.ts
  4. CJK matters — processing Chinese books, Supabase Postgres supports CJK full-text configs
  5. Auth needed eventually — Supabase includes it, DB9 doesn't

Proposed schema:

atoms        — id, frame, roles (jsonb), conditions, confidence,
               source (jsonb), domain[], flags[], embedding vector(1536)
entities     — id, canonical_name, aliases[], domain[], atom_ids[]
edges        — source_atom, target_atom, type, confidence
checkpoints  — book_slug, stage, status, data (jsonb), completed_at

Migration Path

Don't migrate everything at once. Start with the retrieval layer:

  1. Move atoms + embeddings into Supabase
  2. Keep pipeline writing JSON locally (checkpoints and resumability already work)
  3. Add a load-to-db step after pipeline completion
  4. Query side gets a real database; pipeline stays file-based

Future consideration

If graph traversal becomes central, Apache AGE can be added to Postgres later — same engine, adds Cypher query language.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions