feat(vector/hnsw): add per‑query ef and distance_threshold to similar_to, fix early termination #9514

joelamming · 2025-10-19T20:30:12Z

Hugely appreciative of the Dgraph team’s work. Native vector search integrated directly into a graph database is kind of a no brainer today. Deployed Dgraph (both vanilla and customised) in systems with 1M+ vectors guiding deep traversal queries across 10M+ nodes -- tight coupling of vector search with graph traversal at massive scale gets us closer to something that could represent the fuzzy nuances of everything in an enterprise. Certainly not the biggest deployment your team will have seen, but this PR fixes an under‑recall edge case in HNSW and introduces opt‑in, per‑query controls that let users dial recall vs latency safely and predictably. I’ve had this running in production for a while and thought it worth proposing to main.

Summary
- Fix incorrect early termination in the HNSW bottom layer that could stop before collecting k neighbours.
- Extend similar_to with optional per‑query ef and distance_threshold (string or JSON‑like fourth argument).
- Backwards compatible: default 3‑arg behaviour of similar_to is unchanged.
Motivation
- In narrow probes, the bottom‑layer search could exit at a local minimum before collecting k, hurting recall.
- No per‑query ef meant recall vs latency trade‑offs required global tuning or inflating k (and downstream work).
- This PR corrects the termination logic and adds opt‑in knobs so users can increase exploration only when needed.
Changes (key files)
- tok/hnsw/persistent_hnsw.go: fix early termination, add SearchWithOptions/SearchWithUidAndOptions, apply ef override at upper layers and max(k, ef) at bottom layer, apply distance_threshold in the metric domain (Euclidean squared internally, cosine as 1 − sim).
- tok/index/index.go: add VectorIndexOptions and OptionalSearchOptions (non‑breaking).
- worker/task.go: parse optional fourth argument to similar_to (ef, distance_threshold), thread options, route to optional methods when provided, guard zero/negative k.
- tok/index/search_path.go: add SearchPathResult helper.
- Tests: tok/hnsw/ef_recall_test.go adds
  - TestHNSWSearchEfOverrideImprovesRecall
  - TestHNSWDistanceThreshold_Euclidean
  - TestHNSWDistanceThreshold_Cosine
- CHANGELOG.md: Unreleased entry for HNSW fix and per‑query options.
Backwards compatibility
- No default behaviour changes. The three‑argument similar_to(attr, k, vector_or_uid) is unchanged.
- ef and distance_threshold are optional, unsupported metrics safely ignore the threshold.
Performance
- No overhead without options.
- With ef, bottom‑layer candidate size becomes max(k, ef) (as in HNSW), cost scales accordingly.
- Threshold filtering is a cheap pass over candidates, squaring Euclidean thresholds avoids extra roots.
Rationale and alignment
- Matches HNSW semantics: ef_search controls exploration/recall, k controls output size.
- Aligns with Typesense’s per‑query ef and distance_threshold semantics for familiarity.

Checklist

Code compiles correctly and linting passes locally
For all code changes, an entry added to the CHANGELOG.md describing this PR
Tests added for new functionality / regression tests for the bug fix
For public APIs/new features, docs PR will be prepared and linked here after initial review

darkcoderrises · 2025-10-19T20:38:52Z

Hey, Could you send a copy of the PR to https://github.com/predictable-labs/dgraph/pulls also? I am the author of the bugs that you have fixed. I have made a fork of the repo to be able to make changes myself. I have fixed some bugs in the vector index that makes it much faster.

joelamming · 2025-10-19T21:10:39Z

Hey, Could you send a copy of the PR to https://github.com/predictable-labs/dgraph/pulls also? I am the author of the bugs that you have fixed. I have made a fork of the repo to be able to make changes myself. I have fixed some bugs in the vector index that makes it much faster.

Cheers for the quick look. I’ve opened a mirror PR against your fork here: predictable-labs#17

Your branch is ahead of upstream so it shows conflicts in tok/hnsw/persistent_hnsw.go and worker/task.go. You're welcome to cherrypick from my branch into your tree. I could otherwise rebase onto predictable-labs/dgraph:main and resolve the conflicts on my side (will be tomorrow).

Tests for both parts are in tok/hnsw/ef_recall_test.go, and CHANGELOG.md has an Unreleased entry. Let me know which route you prefer and I'll see about implementing tomorrow.

darkcoderrises · 2025-10-20T09:10:19Z

@joelamming Thanks a lot for sending your changes here also. I can take a look and convert your changes. Basically we have introduced a new Partioned HNSW which is supposed to be much faster. We also figured out a race condition in the hnsw tree which led to less recall. We were able to improve accuracy significantly even without your changes in the new hnsw.

matthewmcneely · 2025-10-23T21:33:23Z

@joelamming Thanks for this contribution, truly amazing. Could I ask that you create an integration test in /query/vector/vector_test.go that illustrates the new arguments/functionality? This will help with documentation and general understanding of this new functionality.

Also, a few nits:

Can you not use the term "neighbour", instead "neighbor"? I know we do it wrong on this side of the pond, but other places in the codebase use the North American version
Can you install/run trunk fmt against your changes? That's the only (required) failing check

Thanks!

joelamming · 2025-10-24T23:16:09Z

Thanks for the nudge -- I’ve pushed a few follow-up commits:

Added TestSimilarToOptionsIntegration in query/vector/vector_test.go. It tries both the "ef=…" string syntax and the JSON-style {distance_threshold: …} options and passes against a fresh compose cluster (go test ./query/vector -run TestSimilarToOptionsIntegration -tags=integration -count=1)
Normalised the spelling to “neighbor” in the new vector/HNSW paths (tok/hnsw/persistent_hnsw.go, tok/index/search_path.go)
Ran trunk fmt over the tree, checks are clean now

While tracking down the integration failure I also found that Euclidean distance_threshold was being compared to the squared distance. Never surfaced in production as we've rarely used it but it can be an important part of full featured vector search. I fixed the filtering to use the raw metric distance and refreshed the regression tests (TestHNSWDistanceThreshold_Euclidean). Everything including the integration test now lines up with the expected semantics

joelamming · 2025-10-26T10:03:34Z

@matthewmcneely I see the CI failures from the GitHub Actions runs. I've identified and fixed the parser regression introduced by the similar_to JSON argument support.

The lexer changes to support similar_to(pred, k, vec, {ef: 64, distance_threshold: 0.45}) altered error handling for a specific class of malformed queries, causing three parser tests to fail. Not one that surfaced for us in production but makes sense to keep in dgraph main.

Yesterday:

Fixed the lexer/parser interaction to properly validate braces
Updated affected tests with documentation explaining the (minor) error message trade-off
Added comprehensive test coverage for edge cases
All local tests passing: go test ./dql

Today I'm running the full CI-equivalent suite on an ubuntu-noble-24.04-amd64 EC2 runner (same config as the Actions hosts). For each harness run I'm doing the standard ./t -r setup/teardown and recopying dgraph into $(go env GOPATH)/bin. As we speak:

go clean -testcache
./t -r
cp dgraph/dgraph "$(go env GOPATH)/bin/dgraph"
./t --suite=core
./t --suite=vector
./t -r
go clean -testcache
go test -v -timeout=90m -failfast -tags=integration2 ./...

Please hold off on reviewing until I've confirmed everything is clean on the EC2 instance and I've finished pushing the fixes. Should have results later today.

Will provide detailed docs on the architectural tradeoff in detail in the next push -- happy to discuss the approach once you have a chance to review.

Thanks for your patience!

joelamming · 2025-10-26T18:24:52Z

@matthewmcneely CI tests all pass on my end after bumping to the 32GB instance

Guarded similar_to's JSON literal parsing in the parser so only the 4th argument can hold
{...} while the lexer stays strict elsewhere
Restored legacy “Unrecognized character…” errors for non-similar_to stray braces
Updated parser tests to reflect the restored messaging and added coverage for the
supported/unsupported call shapes

Ran tests:

go test ./dql
./t -r && cp dgraph/dgraph "$(go env GOPATH)/bin/dgraph"
./t --suite=core
./t --suite=vector
./t --suite=load
./t --suite=systest
./t --suite=ldbc
go test -v -timeout=90m -failfast -tags=integration2 ./...
go test -v -timeout=120m -failfast -tags=upgrade
github.com/hypermodeinc/dgraph/v25/{acl,worker,query}
go clean -testcache between suites and recopying dgraph each time per EC2 workflow

Given the runner matches Actions (Ubuntu 24.04, 16 vCPU/32 GiB) I don’t anticipate surprises on CI

Many thanks!

joelamming · 2025-11-01T19:34:56Z

Hi @matthewmcneely

Just to follow up on this -- the new commits are pushed and it looks like the CI workflows are just waiting for approval to run

Thanks for your time!

HNSW fixes and tests

f31851a

joelamming requested a review from a team October 19, 2025 20:30

Update comment in persistent_hnsw.go

16e8ead

Merge branch 'hypermodeinc:main' into main

408ec42

Fix HNSW distance threshold filtering

69c1eb0

Working on CI test failures

2c9c391

Merge branch 'dgraph-io:main' into main

b508ec2

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(vector/hnsw): add per‑query ef and distance_threshold to similar_to, fix early termination #9514

feat(vector/hnsw): add per‑query ef and distance_threshold to similar_to, fix early termination #9514

joelamming commented Oct 19, 2025

Uh oh!

darkcoderrises commented Oct 19, 2025

Uh oh!

joelamming commented Oct 19, 2025

Uh oh!

darkcoderrises commented Oct 20, 2025

Uh oh!

matthewmcneely commented Oct 23, 2025

Uh oh!

joelamming commented Oct 24, 2025

Uh oh!

joelamming commented Oct 26, 2025 •

edited

Loading

Uh oh!

joelamming commented Oct 26, 2025

Uh oh!

joelamming commented Nov 1, 2025

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

3 participants

feat(vector/hnsw): add per‑query ef and distance_threshold to similar_to, fix early termination #9514

Are you sure you want to change the base?

feat(vector/hnsw): add per‑query ef and distance_threshold to similar_to, fix early termination #9514

Conversation

joelamming commented Oct 19, 2025

Uh oh!

darkcoderrises commented Oct 19, 2025

Uh oh!

joelamming commented Oct 19, 2025

Uh oh!

darkcoderrises commented Oct 20, 2025

Uh oh!

matthewmcneely commented Oct 23, 2025

Uh oh!

joelamming commented Oct 24, 2025

Uh oh!

joelamming commented Oct 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

joelamming commented Oct 26, 2025

Ran tests:

Uh oh!

joelamming commented Nov 1, 2025

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

3 participants

joelamming commented Oct 26, 2025 •

edited

Loading