Skip to content

Latest commit

 

History

History
68 lines (53 loc) · 2.35 KB

File metadata and controls

68 lines (53 loc) · 2.35 KB

PRD: MVP Vector Store on Raft-Based Distributed System

Objective

Build a minimalist Cassandra-inspired vector store system using the Nexus-Mind consensus implementation for metadata coordination while supporting efficient vector similarity search.

Core Features

  1. Vector Storage & Retrieval

    • Store embedding vectors (float arrays) with unique IDs
    • Support efficient similarity search (cosine, dot product, Euclidean)
    • Optional document chunk storage alongside vectors
  2. Distributed Architecture

    • Raft consensus for cluster metadata and topology
    • Consistent hashing for vector partitioning
    • Configurable replication factor
  3. Query Capabilities

    • Vector similarity search with k-nearest neighbors
    • Basic filtering on metadata
    • Batch vector operations
  4. Client Interface

    • Simple REST API for CRUD operations
    • Bulk import/export functionality
    • Basic authentication

Technical Components

  • Vector index module (HNSW or similar algorithm)
  • Raft-based metadata coordination
  • Storage engine for vectors and documents
  • Query processor with similarity calculation

Implementation Phases

MVP Phase (v0.1)

  • Data Consistency: Simple Quorum-based consistency mechanism for read/write operations

    • Configurable read/write quorum values
    • Basic conflict resolution using timestamps
  • Vector Indexing: HNSW implementation for efficient similarity search

    • Focus on single index per node initially
    • Optimize for 1536-dimension vectors common in embedding models
  • Scalability: Support basic 3-node cluster configuration

    • Manual node addition/removal process
    • Simple consistent hashing implementation
  • Security: Basic authentication with API keys

    • Simple token validation
    • No encryption in initial phase
  • Monitoring: Fundamental logging system

    • Operation logs for debugging
    • Basic performance metrics collection

Future Enhancements (Post-MVP)

  • Advanced data consistency with CRDT support
  • Multiple vector index algorithm options (IVF, PQ, etc.)
  • Automatic cluster scaling and rebalancing
  • Data encryption and fine-grained access control
  • Comprehensive monitoring with alerts and dashboards

Metrics

  • Query latency < 100ms for 10K vectors
  • Linear scalability to 3+ nodes
  • Support for vectors up to 1536 dimensions
  • 99.9% availability target for MVP