xdb: a distributed, lightweight vector database built from scratch.
- full-text search using proximity ranking
- semantic search via HNSW + Cosine distance
- integrated basic text embedding service (Python http API around a sentence transformer)
- (naive) weighted full-text + semantic search hybrid
- in-memory serving + disk persistence
- fault-tolerance with segment replication using Raft
- not production-ready (the code is pretty sus right now)
- not perfect (because perfect is the enemy of good)
Install the dependencies for the basic text embedding service in the third_party folder using pip:
pip install -r requirements.txt
Then start the service like so:
uvicorn main:app
Set the EmbeddingHost environment variable to the address of the embedding service:
export EmbeddingHost="http://127.0.0.1:8000/embeddings"
Then start one or more instances of the vector db. The server accepts these flags:
- httpAddr: address of HTTP API service
- joinAddr: HTTP API service address of primary node to join
- nodeId: unique identifier for node
- raftAddr: raft address for node
go run cmd/server/main.go -httpAddr 127.0.0.1:8111 -nodeId 0 -raftAddr 127.0.0.1:9000
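The flags above could be parsed with Go's standard `flag` package along these lines. This is a sketch, not the actual contents of cmd/server/main.go: the `config` struct, the `parseFlags` helper, and the rule that httpAddr, nodeId and raftAddr are required are assumptions.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
)

// config mirrors the four command-line flags listed above.
type config struct {
	HTTPAddr string
	JoinAddr string
	NodeID   string
	RaftAddr string
}

// parseFlags parses args (without the program name) into a config.
func parseFlags(args []string) (config, error) {
	var c config
	fs := flag.NewFlagSet("server", flag.ContinueOnError)
	fs.StringVar(&c.HTTPAddr, "httpAddr", "", "address of HTTP API service")
	fs.StringVar(&c.JoinAddr, "joinAddr", "", "HTTP API service address of primary node to join")
	fs.StringVar(&c.NodeID, "nodeId", "", "unique identifier for node")
	fs.StringVar(&c.RaftAddr, "raftAddr", "", "raft address for node")
	if err := fs.Parse(args); err != nil {
		return config{}, err
	}
	// Assumption: every node needs these three flags, while joinAddr
	// is only set on replicas that join an existing primary.
	if c.HTTPAddr == "" || c.NodeID == "" || c.RaftAddr == "" {
		return config{}, errors.New("httpAddr, nodeId and raftAddr are required")
	}
	return c, nil
}

func main() {
	// Sample arguments for a replica, hard-coded so the sketch runs as is.
	c, err := parseFlags([]string{
		"-httpAddr", "127.0.0.1:8112",
		"-nodeId", "1",
		"-raftAddr", "127.0.0.1:9001",
		"-joinAddr", "127.0.0.1:8111",
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", c)
}
```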
Run a search:
curl --location --request GET '127.0.0.1:8111/search' \
--header 'Content-Type: application/json' \
--data '{"query": "some text"}'
Index a document:
curl --location '127.0.0.1:8111/index' --header 'Content-Type: application/json' --data '{"text": "some text"}'
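The same two calls can be made from Go. The sketch below sends the request bodies from the curl commands above (`{"text": ...}` to /index with POST, which is what curl uses when `--data` is given, and `{"query": ...}` to /search with GET). The response format is not documented here, so the helpers just return the raw body. The function names are hypothetical, and `main` talks to a local stub server that stands in for a running xdb node.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// call sends payload as JSON and returns the raw response body.
func call(client *http.Client, method, url string, payload any) (string, error) {
	body, err := json.Marshal(payload)
	if err != nil {
		return "", err
	}
	req, err := http.NewRequest(method, url, bytes.NewReader(body))
	if err != nil {
		return "", err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := client.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	out, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	if resp.StatusCode >= 300 {
		return "", fmt.Errorf("%s %s: %s", method, url, resp.Status)
	}
	return string(out), nil
}

// indexDocument posts a document to <base>/index.
func indexDocument(client *http.Client, base, text string) (string, error) {
	return call(client, http.MethodPost, base+"/index", map[string]string{"text": text})
}

// search sends a GET with a JSON body to <base>/search, as the curl example does.
func search(client *http.Client, base, query string) (string, error) {
	return call(client, http.MethodGet, base+"/search", map[string]string{"query": query})
}

// newStubServer echoes method, path and body. It is a stand-in for xdb
// so the sketch runs without a live node.
func newStubServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		b, _ := io.ReadAll(r.Body)
		fmt.Fprintf(w, "%s %s %s", r.Method, r.URL.Path, b)
	}))
}

func main() {
	srv := newStubServer()
	defer srv.Close()
	// Against a real node, replace srv.URL with "http://127.0.0.1:8111".
	fmt.Println(indexDocument(srv.Client(), srv.URL, "some text"))
	fmt.Println(search(srv.Client(), srv.URL, "some text"))
}
```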
Run the commands below on different machines (or, to simulate a cluster locally, as separate instances of the project). Start the primary first:
go run cmd/server/main.go -httpAddr 127.0.0.1:8111 -nodeId 0 -raftAddr 127.0.0.1:9000
Replicas join the primary through its HTTP API address, 127.0.0.1:8111 (the value passed as -joinAddr):
go run cmd/server/main.go -httpAddr 127.0.0.1:8112 -nodeId 1 -raftAddr 127.0.0.1:9001 -joinAddr 127.0.0.1:8111
go run cmd/server/main.go -httpAddr 127.0.0.1:8113 -nodeId 2 -raftAddr 127.0.0.1:9002 -joinAddr 127.0.0.1:8111
- Indexing
- Concurrent indexing using goroutines to process terms
- Retrieval
- Boolean queries
- Concurrent memtable search
- Ranking
- API
- Bulk index
- Document deletion
- Storage
- Segment compaction
- Replication
- Snapshot working?
- Deployment
- Containerisation
- Code quality
- Penance for all the atrocities I committed.