fix for indexing issue with knowledge bases #274

Merged
vizsatiz merged 7 commits into develop from embedding-index-issue on Apr 21, 2026
@vizsatiz vizsatiz commented Apr 10, 2026

Member

Summary by CodeRabbit

  • Performance Improvements

    • Added new vector and token indexes and refactored search to a two-stage approximate nearest-neighbor flow, significantly speeding up vector search and retrieval.
  • Bug Fixes

    • Tightened token update to only modify missing tokens, preventing unintended overwrites.
  • Improvements

    • More reliable image retrieval: embedding extraction now validates required embeddings and uses a staged retrieval to improve relevance and reduce empty results.
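
The token fix in the summary above amounts to guarding the update with `token IS NULL`. A minimal sketch of that guard, using an in-memory SQLite table as a stand-in for the Postgres table; the table name `embeddings` and the `lower(chunk_text)` tokenizer are illustrative assumptions, not the project's actual query:

```python
import sqlite3


def backfill_missing_tokens(conn: sqlite3.Connection) -> int:
    """Fill `token` only where it is NULL and return the number of rows changed."""
    cur = conn.execute(
        # The IS NULL guard is the point: rows that already have a token are skipped.
        "UPDATE embeddings SET token = lower(chunk_text) WHERE token IS NULL"
    )
    conn.commit()
    return cur.rowcount


# Hypothetical schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE embeddings (id INTEGER PRIMARY KEY, chunk_text TEXT, token TEXT)"
)
conn.executemany(
    "INSERT INTO embeddings (id, chunk_text, token) VALUES (?, ?, ?)",
    [(1, "Hello World", "curated-token"), (2, "Second Chunk", None)],
)

changed = backfill_missing_tokens(conn)
rows = conn.execute("SELECT id, token FROM embeddings ORDER BY id").fetchall()
```

Row 1 keeps its existing token and only row 2 is filled in, so re-running the update cannot overwrite tokens that were already set.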

coderabbitai Bot commented Apr 10, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds an Alembic migration to create HNSW and GIN indexes; refactors vector search to use HNSW candidate CTEs with explicit vector casts and candidate limiting; tightens token updates to NULL-only; and changes image retrieval to require both clip and dino embeddings and run sequential retrieval.

Changes

| Cohort | File(s) | Summary |
|---|---|---|
| Database Migration | `wavefront/server/modules/db_repo_module/db_repo_module/alembic/versions/2026_04_10_1000-e8f2a1c3b5d9_add_hnsw_index_on_embeddings.py` | New Alembic migration that switches to AUTOCOMMIT, sets `maintenance_work_mem`, and runs raw SQL (`CREATE INDEX CONCURRENTLY IF NOT EXISTS`) to create three indexes: two HNSW indexes on `embedding_vector::vector(512)` and `embedding_vector_1::vector(1024)` (cosine ops, `m=16`, `ef_construction=64`) and a GIN index on `token`. `downgrade()` drops them concurrently. |
| Query Generation / Vector Search | `wavefront/server/modules/knowledge_base_module/knowledge_base_module/queries/generate_query.py` | Refactors vector retrieval into two-stage CTEs: `hnsw_candidates` (approximate neighbor selection using explicit `::vector(512)` / `::vector(1024)` casts and a larger candidate set), then `vector_results` for join, scoring, and the final `LIMIT`. Moves some metadata filtering into the appropriate stage(s). `get_update_tokens_query` now updates only rows where `token IS NULL`. |
| Image Retrieval Refactor | `wavefront/server/modules/knowledge_base_module/knowledge_base_module/services/image_rag_retrieve.py` | Removes the persistent `self.reranked_image`; extracts `clip_embedding` and `dino_embedding` conditionally from the embedding response. Proceeds only when both embeddings are present: runs clip retrieval, returns `[]` if there are no clip results, then runs dino retrieval constrained to the reference IDs from the clip results. Otherwise returns `[]`. |
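
The migration described in the Database Migration entry above could look roughly like the sketch below. It assumes pgvector's `hnsw` index syntax and Alembic's `autocommit_block()`; the table name `knowledge_base_embeddings` is a hypothetical stand-in for `KnowledgeBaseEmbeddings.__tablename__`, the index names are taken from the review comments on this PR, and the merged migration may switch to AUTOCOMMIT by a different mechanism.

```python
from typing import List

# Hypothetical table name; the real one comes from KnowledgeBaseEmbeddings.__tablename__.
TABLE = "knowledge_base_embeddings"


def cast_expr(column: str, dim: int) -> str:
    """Cast expression shared by the index definition and the search query."""
    return f"({column}::vector({dim}))"


def upgrade_statements() -> List[str]:
    """Build the three CREATE INDEX statements described in the walkthrough."""
    prefix = "CREATE INDEX CONCURRENTLY IF NOT EXISTS"
    hnsw_opts = "WITH (m = 16, ef_construction = 64)"
    return [
        f"{prefix} ix_kbe_embedding_vector_hnsw_cosine ON {TABLE} "
        f"USING hnsw ({cast_expr('embedding_vector', 512)} vector_cosine_ops) {hnsw_opts}",
        f"{prefix} ix_kbe_embedding_vector_1_hnsw_cosine ON {TABLE} "
        f"USING hnsw ({cast_expr('embedding_vector_1', 1024)} vector_cosine_ops) {hnsw_opts}",
        f"{prefix} ix_kbe_token_gin ON {TABLE} USING gin (token)",
    ]


def upgrade() -> None:
    # Imported lazily so the helpers above can be exercised without Alembic installed.
    from alembic import op

    # CONCURRENTLY cannot run inside a transaction block, so step outside it.
    with op.get_context().autocommit_block():
        for statement in upgrade_statements():
            op.execute(statement)
```

Because these are expression indexes, the search query has to order by the same cast expression (for example `cast_expr('embedding_vector', 512)`) for the planner to consider them; sharing one helper between the migration and the query builder is one way to keep the two in step.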

Sequence Diagram(s)

mermaid
sequenceDiagram
participant Client as Client
participant Service as ImageRAGService
participant QueryGen as QueryGenerator
participant DB as Postgres(HNSW)
participant DocStore as DocumentStore

Client->>Service: image retrieval request (image)
Service->>Service: extract clip_embedding and dino_embedding
alt both embeddings present
    Service->>QueryGen: build clip hnsw_candidates CTE (::vector(512), limit * 20)
    QueryGen->>DB: run approximate neighbor query (HNSW index scan)
    DB-->>QueryGen: candidate ids + distances
    QueryGen->>DocStore: join candidates -> vector_results (compute scores)
    DocStore-->>Service: clip retrieval results (reference IDs)
    Service->>QueryGen: build dino hnsw_candidates CTE (::vector(1024), constrained by reference IDs)
    QueryGen->>DB: run constrained neighbor query
    DB-->>Service: dino-ranked results
    Service-->>Client: combined retrieval results
else missing embeddings or empty clip results
    Service-->>Client: []
end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: adding HNSW indexes to improve knowledge base embedding search performance, which addresses the indexing issue referenced. |

# revision identifiers, used by Alembic.
revision: str = 'e8f2a1c3b5d9'
down_revision: Union[str, None] = 'c7a9e2f4b1d0'
branch_labels: Union[str, Sequence[str], None] = None

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@wavefront/server/modules/db_repo_module/db_repo_module/alembic/versions/2026_04_10_1000-e8f2a1c3b5d9_add_hnsw_index_on_embeddings.py`:
- Around line 21-69: The migration uses CREATE/DROP INDEX CONCURRENTLY inside
upgrade() and downgrade() via op.execute, which fails because CONCURRENTLY
cannot run inside a transaction; wrap each op.execute that creates or drops an
index in both upgrade() and downgrade() with an autocommit block using
op.get_context().autocommit_block() so each CREATE INDEX CONCURRENTLY / DROP
INDEX CONCURRENTLY runs outside the surrounding transaction (i.e., replace
direct op.execute(...) calls for ix_kbe_embedding_vector_hnsw_cosine,
ix_kbe_embedding_vector_hnsw_l2, ix_kbe_embedding_vector_1_hnsw_cosine, and
ix_kbe_token_gin with the same SQL executed inside with
op.get_context().autocommit_block():).

In
`@wavefront/server/modules/knowledge_base_module/knowledge_base_module/queries/generate_query.py`:
- Around line 92-123: The ANN candidate CTE hnsw_candidates is built across the
entire KnowledgeBaseEmbeddings table before applying the KB and metadata filters
in vector_results, so its LIMIT (:limit * 10) can drop all matches for the
requested KB; modify the query so the KB and metadata filters are applied inside
hnsw_candidates (e.g., join hnsw_candidates to KnowledgeBaseDocuments or
restrict by document_id IN (SELECT id FROM KnowledgeBaseDocuments WHERE
knowledge_base_id = :kb_id AND <metadata_filter_clause_inner>)) so the pre-LIMIT
candidate set is already scoped to the requested KB and
metadata_filter_clause_inner, then keep the subsequent vector_results CTE and
its final LIMIT :limit unchanged.
- Around line 98-103: The query uses raw columns (embedding_vector /
embedding_vector_1) in the distance expression and ORDER BY, so PostgreSQL can't
use the HNSW expression indexes; update the distance calculation and ORDER BY to
use the exact casted expressions used by the indexes (e.g.
(embedding_vector::vector(512)) and (embedding_vector_1::vector(1024))) wherever
KnowledgeBaseEmbeddings.embedding_vector or embedding_vector_1 are referenced in
generate_query.py so the planner can utilize the HNSW indexes and avoid seq
scans.

In
`@wavefront/server/modules/knowledge_base_module/knowledge_base_module/services/image_rag_retrieve.py`:
- Around line 47-59: The CLIP->DINO flow should not call image_retrieve_dino
when the CLIP stage produced no candidates; detect when clip_results is empty in
the block inside image_rag_retrieve (where clip_results is built) and
short-circuit (e.g., return an empty list/result or the appropriate no-match
response) instead of calling image_retrieve_dino, because image_retrieve_dino
(and get_image_embedding_dino) will drop the reference_ids filter when given an
empty reference_id_list and perform an unwanted full-KB DINO search.
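
The short-circuit asked for in the last finding above can be sketched as follows. This is an illustration under assumptions, not the service's actual code: the response keys `clip_embedding` and `dino_embedding`, the `document_id` field, and the injected retrieval callables are hypothetical stand-ins for the real methods in `image_rag_retrieve.py`.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, object]


def staged_image_retrieve(
    embeddings: Dict[str, Optional[List[float]]],
    retrieve_clip: Callable[[List[float]], List[Row]],
    retrieve_dino: Callable[[List[float], List[object]], List[Row]],
) -> List[Row]:
    """Run CLIP first, then DINO restricted to the CLIP hits."""
    clip = embeddings.get("clip_embedding")
    dino = embeddings.get("dino_embedding")
    if clip is None or dino is None:
        return []  # both embeddings are required

    clip_results = retrieve_clip(clip)
    if not clip_results:
        # Stop here: an empty reference list would otherwise drop the filter
        # and turn the DINO stage into a full knowledge-base search.
        return []

    reference_ids = [row["document_id"] for row in clip_results]
    return retrieve_dino(dino, reference_ids)


calls: List[str] = []


def fake_clip_empty(vec: List[float]) -> List[Row]:
    calls.append("clip")
    return []


def fake_dino(vec: List[float], ids: List[object]) -> List[Row]:
    calls.append("dino")
    return [{"document_id": i} for i in ids]


result = staged_image_retrieve(
    {"clip_embedding": [0.1], "dino_embedding": [0.2]}, fake_clip_empty, fake_dino
)
```

With an empty CLIP stage the DINO stub is never called, which is the behavior the review asks for.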
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b5efc6c6-4532-43af-9115-3815f5e884fe

📥 Commits

Reviewing files that changed from the base of the PR and between a3a7752 and e5dbd45.

📒 Files selected for processing (3)
  • wavefront/server/modules/db_repo_module/db_repo_module/alembic/versions/2026_04_10_1000-e8f2a1c3b5d9_add_hnsw_index_on_embeddings.py
  • wavefront/server/modules/knowledge_base_module/knowledge_base_module/queries/generate_query.py
  • wavefront/server/modules/knowledge_base_module/knowledge_base_module/services/image_rag_retrieve.py

@vizsatiz changed the title from "fix for migration" to "fix for indexing issue with knowledge bases" on Apr 10, 2026

@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
wavefront/server/modules/knowledge_base_module/knowledge_base_module/queries/generate_query.py (2)

252-272: ⚠️ Potential issue | 🟠 Major

Scale the DINO ANN window with the requested page.

hnsw_candidates only pulls :limit * 10 rows, then the outer query applies OFFSET :offset. Once users page deeper, valid candidates are dropped before the final sort. image_rag_retrieve.py:47-63 forwards offset/limit into this query, so this will surface as incomplete later pages.

💡 Suggested change
         effective_limit = limit if limit is not None else int(params.get('top_k', 10))
         reference_id_list: List[Any] = params.get('reference_id_list', [])
         effective_offset = offset if offset is not None else 0
+        candidate_limit = (effective_offset + effective_limit) * 10
...
         params = {
             'query_embedding': query_embeddings,
             'kb_id': kb_id,
-            'top_k': effective_limit,
             'reference_ids': processed_reference_ids,
             'offset': effective_offset,
             'limit': effective_limit,
+            'candidate_limit': candidate_limit,
         }
...
-            LIMIT :limit * 10
+            LIMIT :candidate_limit

Also applies to: 290-322
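
As a quick arithmetic check on the suggestion above, the scaled window can be isolated in a small helper. The function name and the `oversample` parameter are illustrative; the factor of 10 mirrors the `* 10` in the suggested change.

```python
def candidate_limit(offset: int, limit: int, oversample: int = 10) -> int:
    """Size the ANN candidate window so it covers every row up to the requested page."""
    if offset < 0 or limit <= 0 or oversample < 1:
        raise ValueError("offset must be >= 0, limit > 0, oversample >= 1")
    return (offset + limit) * oversample


fixed_window = 10 * 10                          # LIMIT :limit * 10, same for every page
page_1 = candidate_limit(offset=0, limit=10)    # first page: identical to the fixed window
page_3 = candidate_limit(offset=20, limit=10)   # third page: window grows with the offset
```

The trade-off is that deeper pages ask the index for more candidates, so very large offsets become more expensive.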

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@wavefront/server/modules/knowledge_base_module/knowledge_base_module/queries/generate_query.py`
around lines 252 - 272, The DINO ANN candidate window is fixed to ":limit * 10"
causing valid results to be dropped when paginating; update generate_query.py to
scale the ANN retrieval window by page so hnsw_candidates fetches enough rows
for the requested offset. Concretely, compute an ann_window (e.g., ann_window =
(effective_offset + effective_limit) * 10 or ann_window =
(math.ceil((effective_offset + 1)/effective_limit) * effective_limit * 10))
using the existing effective_offset and effective_limit variables, pass that
ann_window into the query parameters used by hnsw_candidates (instead of plain
limit*10), and ensure the params dict (where 'top_k'/'reference_ids' are set)
supplies this scaled window so deeper pages retain enough candidates before the
outer OFFSET is applied.

53-70: ⚠️ Potential issue | 🟠 Major

Don't page after trimming both rerank sources to one page.

vector_results and keyword_results are capped at :limit before the final OFFSET :offset, so page 2+ is computed from page 1 candidates only. The new ANN CTE has the same problem. At minimum, size those inner windows from offset + limit and oversample that value for hnsw_candidates.

💡 Suggested change
         effective_offset = offset or 0
+        page_window = effective_offset + effective_limit

         # Prepare query parameters
         query_params = {
             'query_embed': str(query_embeddings[0]),
             'threshold': threshold,
             'kb_id': kb_id,
             'vector_weight': vector_weight,
             'keyword_weight': keyword_weight,
             'query': query,
             'offset': effective_offset,
             'limit': effective_limit,
+            'page_window': page_window,
+            'candidate_limit': page_window * 10,
         }
...
-                LIMIT :limit * 10
+                LIMIT :candidate_limit
...
-                LIMIT :limit
+                LIMIT :page_window
...
-                LIMIT :limit
+                LIMIT :page_window

Also applies to: 98-157
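
The symptom described above is easy to reproduce with plain lists. The sketch below is a toy model, not the real query: it mimics an inner `LIMIT` followed by an outer `LIMIT ... OFFSET ...` over an already-ranked list of ids.

```python
from typing import List


def page_with_inner_cap(
    ranked_ids: List[int], offset: int, limit: int, inner_cap: int
) -> List[int]:
    """Apply an inner LIMIT (inner_cap), then the outer OFFSET/LIMIT."""
    capped = ranked_ids[:inner_cap]
    return capped[offset:offset + limit]


ranked = list(range(1, 31))  # 30 hits, already sorted best-first

# Inner CTEs capped at :limit, so page 2 has nothing left to read.
page_2_capped = page_with_inner_cap(ranked, offset=10, limit=10, inner_cap=10)

# Inner CTEs capped at offset + limit, so page 2 sees ranks 11 through 20.
page_2_windowed = page_with_inner_cap(ranked, offset=10, limit=10, inner_cap=20)
```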

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@wavefront/server/modules/knowledge_base_module/knowledge_base_module/queries/generate_query.py`
around lines 53 - 70, The current query trims vector_results and keyword_results
to limit before applying OFFSET, causing page N>1 to only consider page 1
candidates; update the logic in generate_query (references: vector_results,
keyword_results, hnsw_candidates and the ANN CTE) to request/compute inner
windows sized at least offset + limit (and preferably oversample, e.g. multiply
by a small factor) so the subsequent OFFSET/:offset and final LIMIT/:limit
operate on a superset of candidates; adjust any places building query_params or
CTE limits to use this expanded window size instead of just effective_limit.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7e83758b-d51a-49dd-b1de-1e44c90a7ce5

📥 Commits

Reviewing files that changed from the base of the PR and between 82ec696 and da2062b.

📒 Files selected for processing (2)
  • wavefront/server/modules/knowledge_base_module/knowledge_base_module/queries/generate_query.py
  • wavefront/server/modules/knowledge_base_module/knowledge_base_module/services/image_rag_retrieve.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • wavefront/server/modules/knowledge_base_module/knowledge_base_module/services/image_rag_retrieve.py



# revision identifiers, used by Alembic.
revision: str = 'e8f2a1c3b5d9'
down_revision: Union[str, None] = 'c7a9e2f4b1d0'
branch_labels: Union[str, Sequence[str], None] = None
depends_on: Union[str, Sequence[str], None] = None

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (1)
wavefront/server/modules/knowledge_base_module/knowledge_base_module/queries/generate_query.py (1)

92-123: ⚠️ Potential issue | 🟠 Major

Reapply KB/metadata scoping inside hnsw_candidates.

This reintroduces the earlier multi-KB bug: LIMIT :limit * 20 is still taken from the whole embeddings table, so rows for the requested KB/filter can be dropped before vector_results narrows them.

Possible fix
             WITH hnsw_candidates AS (
                 SELECT
                     id,
                     document_id,
                     chunk_text,
                     chunk_index,
                     (embedding_vector::vector(512)) <=> :query_embed ::vector(512) AS distance
                 FROM
                     {KnowledgeBaseEmbeddings.__tablename__}
+                WHERE
+                    document_id IN (
+                        SELECT d.id
+                        FROM {KnowledgeBaseDocuments.__tablename__} d
+                        WHERE d.knowledge_base_id = :kb_id
+                        {'AND (' + metadata_filter_clause_inner + ')' if metadata_filter_clause_inner else ''}
+                    )
                 ORDER BY
                     (embedding_vector::vector(512)) <=> :query_embed ::vector(512)
                 LIMIT :limit * 20
             ),
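
To see why the ordering of filter and `LIMIT` matters, here is a toy in-memory comparison. It uses exact sorting rather than an HNSW index, and the row layout and knowledge-base labels are made up for illustration.

```python
from typing import List, Tuple

Row = Tuple[int, str, float]  # (embedding_id, kb_id, distance)


def global_then_filter(rows: List[Row], kb_id: str, limit: int, oversample: int) -> List[Row]:
    """Take the global top candidates first, then filter by knowledge base."""
    candidates = sorted(rows, key=lambda r: r[2])[: limit * oversample]
    return [r for r in candidates if r[1] == kb_id][:limit]


def filter_then_rank(rows: List[Row], kb_id: str, limit: int, oversample: int) -> List[Row]:
    """Scope to the knowledge base first, then take the top candidates."""
    scoped = [r for r in rows if r[1] == kb_id]
    candidates = sorted(scoped, key=lambda r: r[2])[: limit * oversample]
    return candidates[:limit]


# 100 close rows from knowledge base "a" crowd out 5 farther rows from "b".
rows: List[Row] = [(i, "a", i / 1000) for i in range(100)]
rows += [(100 + i, "b", 0.5 + i / 1000) for i in range(5)]

missed = global_then_filter(rows, "b", limit=2, oversample=20)  # 40 candidates, all from "a"
found = filter_then_rank(rows, "b", limit=2, oversample=20)
```

The first variant returns nothing for knowledge base "b" even though it has five rows, which is the silent truncation the comment describes.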
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@wavefront/server/modules/knowledge_base_module/knowledge_base_module/queries/generate_query.py`
around lines 92 - 123, The hnsw_candidates CTE currently reads from
KnowledgeBaseEmbeddings and applies LIMIT :limit * 20 before scoping by
KB/metadata, causing relevant rows to be dropped; move the KB and metadata
filtering and the join to KnowledgeBaseDocuments into the hnsw_candidates CTE
(apply the same d.knowledge_base_id = :kb_id and the
metadata_filter_clause_inner there and join on
KnowledgeBaseDocuments.__tablename__), compute distance there, ORDER BY
distance, and then LIMIT :limit * 20 so the prefiltered candidate set respects
KB/metadata before vector_results selects the final :limit results.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c302abfe-8c2e-4e27-b501-bd636c4a3946

📥 Commits

Reviewing files that changed from the base of the PR and between 988f23f and 37acaf2.

📒 Files selected for processing (2)
  • wavefront/server/modules/db_repo_module/db_repo_module/alembic/versions/2026_04_10_1000-e8f2a1c3b5d9_add_hnsw_index_on_embeddings.py
  • wavefront/server/modules/knowledge_base_module/knowledge_base_module/queries/generate_query.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • wavefront/server/modules/db_repo_module/db_repo_module/alembic/versions/2026_04_10_1000-e8f2a1c3b5d9_add_hnsw_index_on_embeddings.py

Comment on lines +196 to 224
+ WITH hnsw_candidates AS (
  SELECT
- e.id AS embedding_id,
- e.chunk_text,
- e.chunk_index,
- d.id AS document_id,
- d.file_path,
- d.file_name,
- d.knowledge_base_id,
- d.metadata_value,
- e.embedding_vector <-> :query_embedding ::vector AS distance
+ id,
+ document_id,
+ chunk_text,
+ chunk_index,
+ (embedding_vector::vector(512)) <=> :query_embedding ::vector(512) AS distance
  FROM
- {KnowledgeBaseEmbeddings.__tablename__} e
- JOIN
- {KnowledgeBaseDocuments.__tablename__} d ON e.document_id = d.id
- WHERE
- d.knowledge_base_id = :kb_id {'AND (' + metadata_filter_clause_final + ')' if metadata_filter_clause_final else ''}
- ORDER BY distance ASC
+ {KnowledgeBaseEmbeddings.__tablename__}
+ ORDER BY
+ (embedding_vector::vector(512)) <=> :query_embedding ::vector(512)
+ LIMIT :top_k * 20
  )
  SELECT
- *
- FROM
- ranked_embeddings
+ hc.id AS embedding_id,
+ hc.chunk_text,
+ hc.chunk_index,
+ d.id AS document_id,
+ d.file_path,
+ d.file_name,
+ d.knowledge_base_id,
+ d.metadata_value,
+ hc.distance
+ FROM hnsw_candidates hc
+ JOIN {KnowledgeBaseDocuments.__tablename__} d ON hc.document_id = d.id
+ WHERE d.knowledge_base_id = :kb_id
+ {'AND (' + metadata_filter_clause + ')' if metadata_filter_clause else ''}
+ ORDER BY hc.distance ASC
+ LIMIT :top_k

⚠️ Potential issue | 🟠 Major

Scope the CLIP ANN candidate set before LIMIT.

get_image_embedding() has the same truncation problem: the HNSW CTE is global, and the KB/metadata filters only run after LIMIT :top_k * 20. In a multi-KB or selective-filter dataset this can silently return too few results.

Possible fix
         WITH hnsw_candidates AS (
             SELECT
                 id,
                 document_id,
                 chunk_text,
                 chunk_index,
                 (embedding_vector::vector(512)) <=> :query_embedding ::vector(512) AS distance
             FROM
                 {KnowledgeBaseEmbeddings.__tablename__}
+            WHERE
+                document_id IN (
+                    SELECT d.id
+                    FROM {KnowledgeBaseDocuments.__tablename__} d
+                    WHERE d.knowledge_base_id = :kb_id
+                    {'AND (' + metadata_filter_clause + ')' if metadata_filter_clause else ''}
+                )
             ORDER BY
                 (embedding_vector::vector(512)) <=> :query_embedding ::vector(512)
             LIMIT :top_k * 20
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@wavefront/server/modules/knowledge_base_module/knowledge_base_module/queries/generate_query.py`
around lines 196 - 224, The HNSW CTE (hnsw_candidates) in generate_query.py
builds a global ANN candidate set and applies the knowledge-base (KB) and
metadata filters only after LIMIT, which can truncate results incorrectly;
update the SQL generation so the WHERE clause (kb_id = :kb_id and the
metadata_filter_clause) is applied inside the hnsw_candidates CTE (i.e., add
"WHERE d.knowledge_base_id = :kb_id" and the metadata filter when selecting from
KnowledgeBaseEmbeddings or by joining KnowledgeBaseDocuments inside the CTE)
before the ORDER BY / LIMIT to ensure the top_k*20 candidates are scoped
correctly; also apply the same change to the analogous get_image_embedding()
query builder to fix the identical truncation bug.

Comment on lines +270 to 299
+ WITH hnsw_candidates AS (
  SELECT
- e.id AS embedding_id,
- e.chunk_text,
- e.chunk_index,
- d.id AS document_id,
- d.file_path,
- d.file_name,
- d.knowledge_base_id,
- d.metadata_value,
- (1 - (e.embedding_vector_1 <=> :query_embedding ::vector)) AS similarity
- FROM {KnowledgeBaseEmbeddings.__tablename__} e
- JOIN {KnowledgeBaseDocuments.__tablename__} d ON e.document_id = d.id
- WHERE
- d.knowledge_base_id = :kb_id {reference_filter} {'AND (' + metadata_filter_clause_final + ')' if metadata_filter_clause_final else ''}
- ORDER BY similarity DESC
+ id,
+ document_id,
+ chunk_text,
+ chunk_index,
+ (embedding_vector_1::vector(1024)) <=> :query_embedding ::vector(1024) AS distance
+ FROM
+ {KnowledgeBaseEmbeddings.__tablename__}
+ ORDER BY
+ (embedding_vector_1::vector(1024)) <=> :query_embedding ::vector(1024)
+ LIMIT :limit * 20
  )
  SELECT
- *
- FROM
- ranked_embeddings
+ hc.id AS embedding_id,
+ hc.chunk_text,
+ hc.chunk_index,
+ d.id AS document_id,
+ d.file_path,
+ d.file_name,
+ d.knowledge_base_id,
+ d.metadata_value,
+ 1 - hc.distance AS similarity
+ FROM hnsw_candidates hc
+ JOIN {KnowledgeBaseDocuments.__tablename__} d ON hc.document_id = d.id
+ WHERE d.knowledge_base_id = :kb_id
+ {('AND d.id = ANY(:reference_ids)' if processed_reference_ids else '')}
+ {'AND (' + metadata_filter_clause + ')' if metadata_filter_clause else ''}
+ ORDER BY similarity DESC
+ LIMIT :limit OFFSET :offset

⚠️ Potential issue | 🟠 Major

Filter reference_ids inside the DINO candidate CTE.

image_rag_retrieve.py uses this query to rerank the CLIP-selected reference_id_list, but d.id = ANY(:reference_ids) is applied only after LIMIT :limit * 20. That means the global DINO top-N can exclude every referenced document and return empty/incomplete results even when CLIP found good matches.

Possible fix
         WITH hnsw_candidates AS (
             SELECT
                 id,
                 document_id,
                 chunk_text,
                 chunk_index,
                 (embedding_vector_1::vector(1024)) <=> :query_embedding ::vector(1024) AS distance
             FROM
                 {KnowledgeBaseEmbeddings.__tablename__}
+            WHERE
+                document_id IN (
+                    SELECT d.id
+                    FROM {KnowledgeBaseDocuments.__tablename__} d
+                    WHERE d.knowledge_base_id = :kb_id
+                    {('AND d.id = ANY(:reference_ids)' if processed_reference_ids else '')}
+                    {'AND (' + metadata_filter_clause + ')' if metadata_filter_clause else ''}
+                )
             ORDER BY
                 (embedding_vector_1::vector(1024)) <=> :query_embedding ::vector(1024)
             LIMIT :limit * 20
         )
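
One way to keep the scoping logic readable is to assemble the inner `WHERE` from a list of conditions. The builder below is a sketch of the possible fix above, not the project's `generate_query.py`; the function name and the two table names are hypothetical, and `metadata_filter_clause` is assumed to be a trusted SQL fragment built elsewhere, as in the original code.

```python
from typing import Optional, Sequence

# Hypothetical table names standing in for the ORM __tablename__ values.
EMBEDDINGS = "knowledge_base_embeddings"
DOCUMENTS = "knowledge_base_documents"


def build_dino_candidates_cte(
    reference_ids: Optional[Sequence[object]],
    metadata_filter_clause: str = "",
) -> str:
    """Build the hnsw_candidates CTE with scoping applied before the LIMIT."""
    scope = ["d.knowledge_base_id = :kb_id"]
    if reference_ids:
        # Only added for a non-empty list; callers should short-circuit on an
        # empty CLIP result rather than fall back to a full-KB search.
        scope.append("d.id = ANY(:reference_ids)")
    if metadata_filter_clause:
        scope.append(f"({metadata_filter_clause})")

    distance = "(embedding_vector_1::vector(1024)) <=> :query_embedding ::vector(1024)"
    return (
        "WITH hnsw_candidates AS (\n"
        f"    SELECT id, document_id, chunk_text, chunk_index, {distance} AS distance\n"
        f"    FROM {EMBEDDINGS}\n"
        "    WHERE document_id IN (\n"
        f"        SELECT d.id FROM {DOCUMENTS} d\n"
        f"        WHERE {' AND '.join(scope)}\n"
        "    )\n"
        f"    ORDER BY {distance}\n"
        "    LIMIT :limit * 20\n"
        ")"
    )


scoped_sql = build_dino_candidates_cte(reference_ids=[1, 2, 3])
unscoped_sql = build_dino_candidates_cte(reference_ids=None)
```

In both variants the knowledge-base filter sits inside the CTE, ahead of `ORDER BY` and `LIMIT`, so the candidate window is drawn only from rows that can appear in the final result.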
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@wavefront/server/modules/knowledge_base_module/knowledge_base_module/queries/generate_query.py`
around lines 270 - 299, The hnsw_candidates CTE currently applies the d.id =
ANY(:reference_ids) filter only after LIMIT, causing referenced documents to be
excluded before reranking; modify the query in generate_query.py so the
reference_ids (when processed_reference_ids is truthy) and the
kb_id/metadata_filter_clause are applied inside the hnsw_candidates CTE (either
by joining KnowledgeBaseDocuments within the CTE or by filtering
KnowledgeBaseEmbeddings.document_id against :reference_ids and :kb_id) so the
LIMIT :limit * 20 is computed after narrowing candidates; update the CTE
selection logic that uses KnowledgeBaseEmbeddings, hnsw_candidates,
:reference_ids, :kb_id, and metadata_filter_clause accordingly.

@vizsatiz vizsatiz merged commit b8d14e9 into develop Apr 21, 2026
10 checks passed
@vizsatiz vizsatiz deleted the embedding-index-issue branch April 21, 2026 05:00
thomastomy5 pushed a commit that referenced this pull request Apr 27, 2026
* fix for migration

* fix for migration

* fix for migration

* fix for migration

* fix for embedding index creation

* fix migration commit issue

* fix migration commit issue
@coderabbitai coderabbitai Bot mentioned this pull request Jun 16, 2026