Skip to content

Conversation

@joelamming
Copy link

Hugely appreciative of the Dgraph team’s work. Native vector search integrated directly into a graph database is kind of a no brainer today. Deployed Dgraph (both vanilla and customised) in systems with 1M+ vectors guiding deep traversal queries across 10M+ nodes -- tight coupling of vector search with graph traversal at massive scale gets us closer to something that could represent the fuzzy nuances of everything in an enterprise. Certainly not the biggest deployment your team will have seen, but this PR fixes an under‑recall edge case in HNSW and introduces opt‑in, per‑query controls that let users dial recall vs latency safely and predictably. I’ve had this running in production for a while and thought it worth proposing to main.

  • Summary

    • Fix incorrect early termination in the HNSW bottom layer that could stop before collecting k neighbours.
    • Extend similar_to with optional per‑query ef and distance_threshold (string or JSON‑like fourth argument).
    • Backwards compatible: default 3‑arg behaviour of similar_to is unchanged.
  • Motivation

    • In narrow probes, the bottom‑layer search could exit at a local minimum before collecting k, hurting recall.
    • No per‑query ef meant recall vs latency trade‑offs required global tuning or inflating k (and downstream work).
    • This PR corrects the termination logic and adds opt‑in knobs so users can increase exploration only when needed.
  • Changes (key files)

    • tok/hnsw/persistent_hnsw.go: fix early termination, add SearchWithOptions/SearchWithUidAndOptions, apply ef override at upper layers and max(k, ef) at bottom layer, apply distance_threshold in the metric domain (Euclidean squared internally, cosine as 1 − sim).
    • tok/index/index.go: add VectorIndexOptions and OptionalSearchOptions (non‑breaking).
    • worker/task.go: parse optional fourth argument to similar_to (ef, distance_threshold), thread options, route to optional methods when provided, guard zero/negative k.
    • tok/index/search_path.go: add SearchPathResult helper.
    • Tests: tok/hnsw/ef_recall_test.go adds
      • TestHNSWSearchEfOverrideImprovesRecall
      • TestHNSWDistanceThreshold_Euclidean
      • TestHNSWDistanceThreshold_Cosine
    • CHANGELOG.md: Unreleased entry for HNSW fix and per‑query options.
  • Backwards compatibility

    • No default behaviour changes. The three‑argument similar_to(attr, k, vector_or_uid) is unchanged.
    • ef and distance_threshold are optional, unsupported metrics safely ignore the threshold.
  • Performance

    • No overhead without options.
    • With ef, bottom‑layer candidate size becomes max(k, ef) (as in HNSW), cost scales accordingly.
    • Threshold filtering is a cheap pass over candidates, squaring Euclidean thresholds avoids extra roots.
  • Rationale and alignment

    • Matches HNSW semantics: ef_search controls exploration/recall, k controls output size.
    • Aligns with Typesense’s per‑query ef and distance_threshold semantics for familiarity.

Checklist

  • Code compiles correctly and linting passes locally
  • For all code changes, an entry added to the CHANGELOG.md describing this PR
  • Tests added for new functionality / regression tests for the bug fix
  • For public APIs/new features, docs PR will be prepared and linked here after initial review

@joelamming joelamming requested a review from a team October 19, 2025 20:30
@darkcoderrises
Copy link

Hey, Could you send a copy of the PR to https://github.com/predictable-labs/dgraph/pulls also? I am the author of the bugs that you have fixed. I have made a fork of the repo to be able to make changes myself. I have fixed some bugs in the vector index that makes it much faster.

@joelamming
Copy link
Author

Hey, Could you send a copy of the PR to https://github.com/predictable-labs/dgraph/pulls also? I am the author of the bugs that you have fixed. I have made a fork of the repo to be able to make changes myself. I have fixed some bugs in the vector index that makes it much faster.

Cheers for the quick look. I’ve opened a mirror PR against your fork here: predictable-labs#17

Your branch is ahead of upstream so it shows conflicts in tok/hnsw/persistent_hnsw.go and worker/task.go. You're welcome to cherrypick from my branch into your tree. I could otherwise rebase onto predictable-labs/dgraph:main and resolve the conflicts on my side (will be tomorrow).

Tests for both parts are in tok/hnsw/ef_recall_test.go, and CHANGELOG.md has an Unreleased entry. Let me know which route you prefer and I'll see about implementing tomorrow.

@darkcoderrises
Copy link

@joelamming Thanks a lot for sending your changes here also. I can take a look and convert your changes. Basically we have introduced a new Partioned HNSW which is supposed to be much faster. We also figured out a race condition in the hnsw tree which led to less recall. We were able to improve accuracy significantly even without your changes in the new hnsw.

@matthewmcneely
Copy link
Contributor

@joelamming Thanks for this contribution, truly amazing. Could I ask that you create an integration test in /query/vector/vector_test.go that illustrates the new arguments/functionality? This will help with documentation and general understanding of this new functionality.

Also, a few nits:

  • Can you not use the term "neighbour", instead "neighbor"? I know we do it wrong on this side of the pond, but other places in the codebase use the North American version
  • Can you install/run trunk fmt against your changes? That's the only (required) failing check

Thanks!

@joelamming
Copy link
Author

Thanks for the nudge -- I’ve pushed a few follow-up commits:

  • Added TestSimilarToOptionsIntegration in query/vector/vector_test.go. It tries both the "ef=…" string syntax and the JSON-style {distance_threshold: …} options and passes against a fresh compose cluster (go test ./query/vector -run TestSimilarToOptionsIntegration -tags=integration -count=1)
  • Normalised the spelling to “neighbor” in the new vector/HNSW paths (tok/hnsw/persistent_hnsw.go, tok/index/search_path.go)
  • Ran trunk fmt over the tree, checks are clean now

While tracking down the integration failure I also found that Euclidean distance_threshold was being compared to the squared distance. Never surfaced in production as we've rarely used it but it can be an important part of full featured vector search. I fixed the filtering to use the raw metric distance and refreshed the regression tests (TestHNSWDistanceThreshold_Euclidean). Everything including the integration test now lines up with the expected semantics

@joelamming
Copy link
Author

joelamming commented Oct 26, 2025

@matthewmcneely I see the CI failures from the GitHub Actions runs. I've identified and fixed the parser regression introduced by the similar_to JSON argument support.

The lexer changes to support similar_to(pred, k, vec, {ef: 64, distance_threshold: 0.45}) altered error handling for a specific class of malformed queries, causing three parser tests to fail. Not one that surfaced for us in production but makes sense to keep in dgraph main.

Yesterday:

  • Fixed the lexer/parser interaction to properly validate braces
  • Updated affected tests with documentation explaining the (minor) error message trade-off
  • Added comprehensive test coverage for edge cases
  • All local tests passing: go test ./dql

Today I'm running the full CI-equivalent suite on an ubuntu-noble-24.04-amd64 EC2 runner (same config as the Actions hosts). For each harness run I'm doing the standard ./t -r setup/teardown and recopying dgraph into $(go env GOPATH)/bin. As we speak:

  • go clean -testcache
  • ./t -r
  • cp dgraph/dgraph "$(go env GOPATH)/bin/dgraph"
  • ./t --suite=core
  • ./t --suite=vector
  • ./t -r
  • go clean -testcache
  • go test -v -timeout=90m -failfast -tags=integration2 ./...

Please hold off on reviewing until I've confirmed everything is clean on the EC2 instance and I've finished pushing the fixes. Should have results later today.

Will provide detailed docs on the architectural tradeoff in detail in the next push -- happy to discuss the approach once you have a chance to review.

Thanks for your patience!

@joelamming
Copy link
Author

@matthewmcneely CI tests all pass on my end after bumping to the 32GB instance

  • Guarded similar_to's JSON literal parsing in the parser so only the 4th argument can hold
    {...} while the lexer stays strict elsewhere
  • Restored legacy “Unrecognized character…” errors for non-similar_to stray braces
  • Updated parser tests to reflect the restored messaging and added coverage for the
    supported/unsupported call shapes

Ran tests:

  • go test ./dql
  • ./t -r && cp dgraph/dgraph "$(go env GOPATH)/bin/dgraph"
  • ./t --suite=core
  • ./t --suite=vector
  • ./t --suite=load
  • ./t --suite=systest
  • ./t --suite=ldbc
  • go test -v -timeout=90m -failfast -tags=integration2 ./...
  • go test -v -timeout=120m -failfast -tags=upgrade
    github.com/hypermodeinc/dgraph/v25/{acl,worker,query}
  • go clean -testcache between suites and recopying dgraph each time per EC2 workflow

Given the runner matches Actions (Ubuntu 24.04, 16 vCPU/32 GiB) I don’t anticipate surprises on CI

Many thanks!

@joelamming
Copy link
Author

Hi @matthewmcneely

Just to follow up on this -- the new commits are pushed and it looks like the CI workflows are just waiting for approval to run

Thanks for your time!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

3 participants