fix for indexing issue with knowledge bases by vizsatiz · Pull Request #274 · rootflo/wavefront

vizsatiz · 2026-04-10T12:03:21Z

Summary by CodeRabbit

Performance Improvements
- Added new vector and token indexes and refactored search to a two-stage approximate nearest-neighbor flow, significantly speeding up vector search and retrieval.
Bug Fixes
- Tightened token update to only modify missing tokens, preventing unintended overwrites.
Improvements
- More reliable image retrieval: embedding extraction now validates required embeddings and uses a staged retrieval to improve relevance and reduce empty results.

coderabbitai · 2026-04-10T12:03:39Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.


Walkthrough

Adds an Alembic migration to create HNSW and GIN indexes; refactors vector search to use HNSW candidate CTEs with explicit vector casts and candidate limiting; tightens token updates to NULL-only; and changes image retrieval to require both clip and dino embeddings and run sequential retrieval.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Database Migration `wavefront/server/modules/db_repo_module/db_repo_module/alembic/versions/2026_04_10_1000-e8f2a1c3b5d9_add_hnsw_index_on_embeddings.py` | New Alembic migration that switches to AUTOCOMMIT, sets `maintenance_work_mem`, and runs raw SQL with `CREATE INDEX CONCURRENTLY IF NOT EXISTS` to build three indexes: two HNSW indexes on `embedding_vector::vector(512)` and `embedding_vector_1::vector(1024)` (cosine ops, `m=16`, `ef_construction=64`) and a GIN index on `token`. `downgrade()` drops them concurrently. |
| Query Generation / Vector Search `wavefront/server/modules/knowledge_base_module/knowledge_base_module/queries/generate_query.py` | Refactors vector retrieval into two-stage CTEs: `hnsw_candidates` selects approximate neighbors using explicit `::vector(512)` / `::vector(1024)` casts and a larger candidate set, then `vector_results` handles the join, scoring, and final LIMIT. Some metadata filtering moves into the appropriate stage(s). `get_update_tokens_query` now updates only rows where `token IS NULL`. |
| Image Retrieval Refactor `wavefront/server/modules/knowledge_base_module/knowledge_base_module/services/image_rag_retrieve.py` | Removes the persistent `self.reranked_image` and extracts `clip_embedding` and `dino_embedding` conditionally from the embedding response. Retrieval proceeds only when both embeddings are present: it runs clip retrieval, returns `[]` if there are no clip results, then runs dino retrieval constrained to the reference IDs from the clip results. In every other case it returns `[]`. |
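
The migration row above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual file: the table name `knowledge_base_embeddings`, the helper `run_outside_transaction`, and the assumption that `token` is a GIN-indexable column (for example `tsvector`) are all hypothetical. The index names are taken from the review comments in this thread, and `op.get_context().autocommit_block()` is the Alembic call CodeRabbit suggests for running `CONCURRENTLY` statements outside a transaction.

```python
"""Sketch of a concurrent index migration body (names partly assumed)."""

# Hypothetical table name; this page never shows the real one.
TABLE = "knowledge_base_embeddings"

UPGRADE_SQL = [
    # HNSW expression index on the 512-dimensional column (cosine distance).
    "CREATE INDEX CONCURRENTLY IF NOT EXISTS ix_kbe_embedding_vector_hnsw_cosine "
    f"ON {TABLE} USING hnsw ((embedding_vector::vector(512)) vector_cosine_ops) "
    "WITH (m = 16, ef_construction = 64)",
    # HNSW expression index on the 1024-dimensional column (cosine distance).
    "CREATE INDEX CONCURRENTLY IF NOT EXISTS ix_kbe_embedding_vector_1_hnsw_cosine "
    f"ON {TABLE} USING hnsw ((embedding_vector_1::vector(1024)) vector_cosine_ops) "
    "WITH (m = 16, ef_construction = 64)",
    # GIN index on the token column (assumed to be a tsvector-like type).
    f"CREATE INDEX CONCURRENTLY IF NOT EXISTS ix_kbe_token_gin ON {TABLE} USING gin (token)",
]

DOWNGRADE_SQL = [
    "DROP INDEX CONCURRENTLY IF EXISTS ix_kbe_token_gin",
    "DROP INDEX CONCURRENTLY IF EXISTS ix_kbe_embedding_vector_1_hnsw_cosine",
    "DROP INDEX CONCURRENTLY IF EXISTS ix_kbe_embedding_vector_hnsw_cosine",
]


def run_outside_transaction(op, statements):
    """Execute each statement inside Alembic's autocommit block.

    `op` is expected to be `alembic.op`; it is passed in as an argument only
    so that this sketch can be exercised with a stand-in object.
    """
    with op.get_context().autocommit_block():
        for sql in statements:
            op.execute(sql)
```

In a real revision file, `upgrade()` would call `run_outside_transaction(op, UPGRADE_SQL)` and `downgrade()` would call it with `DOWNGRADE_SQL`, after `from alembic import op`.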

Sequence Diagram(s)

mermaid
sequenceDiagram
participant Client as Client
participant Service as ImageRAGService
participant QueryGen as QueryGenerator
participant DB as Postgres(HNSW)
participant DocStore as DocumentStore

Client->>Service: image retrieval request (image)
Service->>Service: extract clip_embedding and dino_embedding
alt both embeddings present
    Service->>QueryGen: build clip hnsw_candidates CTE (::vector(512), limit * 20)
    QueryGen->>DB: run approximate neighbor query (HNSW)
    DB-->>QueryGen: candidate ids + distances
    QueryGen->>DocStore: join candidates -> vector_results (compute scores)
    DocStore-->>Service: clip retrieval results (reference IDs)
    Service->>QueryGen: build dino hnsw_candidates CTE (::vector(1024), constrained by reference IDs)
    QueryGen->>DB: run constrained neighbor query
    DB-->>Service: dino-ranked results
    Service-->>Client: combined retrieval results
else missing embeddings or empty clip results
    Service-->>Client: []
end
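
The control flow in the diagram can be sketched in a few lines of Python. This is an illustration only: `staged_image_retrieve`, the injected `retrieve_clip` / `retrieve_dino` callables, and the `document_id` key are hypothetical stand-ins, not the service's real method names or row shape.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def staged_image_retrieve(
    embeddings: Dict[str, Any],
    retrieve_clip: Callable[[Any], List[Row]],
    retrieve_dino: Callable[[Any, List[Any]], List[Row]],
) -> List[Row]:
    """Run CLIP retrieval first, then DINO retrieval restricted to CLIP's hits."""
    clip_embedding = embeddings.get("clip_embedding")
    dino_embedding = embeddings.get("dino_embedding")

    # Both embeddings are required; otherwise there is nothing to retrieve.
    if clip_embedding is None or dino_embedding is None:
        return []

    clip_results = retrieve_clip(clip_embedding)

    # Short-circuit: an empty reference list must not fall through to DINO,
    # where it could turn into an unfiltered search over the whole KB.
    if not clip_results:
        return []

    reference_ids = [row["document_id"] for row in clip_results]
    return retrieve_dino(dino_embedding, reference_ids)
```

Passing the two retrieval steps in as callables keeps the sketch independent of the database layer and makes the empty-result branch easy to test.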

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hop through vectors under moonlit code,
I scatter CTE seeds along the road,
Clip finds the crowd, Dino trims the line,
Indexes hum so neighbors align,
I munch on results — carrots and bytes.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: adding HNSW indexes to improve knowledge base embedding search performance, which addresses the indexing issue referenced. |



+# revision identifiers, used by Alembic.
+revision: str = 'e8f2a1c3b5d9'
+down_revision: Union[str, None] = 'c7a9e2f4b1d0'
+branch_labels: Union[str, Sequence[str], None] = None


coderabbitai

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@wavefront/server/modules/db_repo_module/db_repo_module/alembic/versions/2026_04_10_1000-e8f2a1c3b5d9_add_hnsw_index_on_embeddings.py`:
- Around line 21-69: The migration uses CREATE/DROP INDEX CONCURRENTLY inside
upgrade() and downgrade() via op.execute, which fails because CONCURRENTLY
cannot run inside a transaction; wrap each op.execute that creates or drops an
index in both upgrade() and downgrade() with an autocommit block using
op.get_context().autocommit_block() so each CREATE INDEX CONCURRENTLY / DROP
INDEX CONCURRENTLY runs outside the surrounding transaction (i.e., replace
direct op.execute(...) calls for ix_kbe_embedding_vector_hnsw_cosine,
ix_kbe_embedding_vector_hnsw_l2, ix_kbe_embedding_vector_1_hnsw_cosine, and
ix_kbe_token_gin with the same SQL executed inside with
op.get_context().autocommit_block():).

In
`@wavefront/server/modules/knowledge_base_module/knowledge_base_module/queries/generate_query.py`:
- Around line 92-123: The ANN candidate CTE hnsw_candidates is built across the
entire KnowledgeBaseEmbeddings table before applying the KB and metadata filters
in vector_results, so its LIMIT (:limit * 10) can drop all matches for the
requested KB; modify the query so the KB and metadata filters are applied inside
hnsw_candidates (e.g., join hnsw_candidates to KnowledgeBaseDocuments or
restrict by document_id IN (SELECT id FROM KnowledgeBaseDocuments WHERE
knowledge_base_id = :kb_id AND <metadata_filter_clause_inner>)) so the pre-LIMIT
candidate set is already scoped to the requested KB and
metadata_filter_clause_inner, then keep the subsequent vector_results CTE and
its final LIMIT :limit unchanged.
- Around line 98-103: The query uses raw columns (embedding_vector /
embedding_vector_1) in the distance expression and ORDER BY, so PostgreSQL can't
use the HNSW expression indexes; update the distance calculation and ORDER BY to
use the exact casted expressions used by the indexes (e.g.
(embedding_vector::vector(512)) and (embedding_vector_1::vector(1024))) wherever
KnowledgeBaseEmbeddings.embedding_vector or embedding_vector_1 are referenced in
generate_query.py so the planner can utilize the HNSW indexes and avoid seq
scans.

In
`@wavefront/server/modules/knowledge_base_module/knowledge_base_module/services/image_rag_retrieve.py`:
- Around line 47-59: The CLIP->DINO flow should not call image_retrieve_dino
when the CLIP stage produced no candidates; detect when clip_results is empty in
the block inside image_rag_retrieve (where clip_results is built) and
short-circuit (e.g., return an empty list/result or the appropriate no-match
response) instead of calling image_retrieve_dino, because image_retrieve_dino
(and get_image_embedding_dino) will drop the reference_ids filter when given an
empty reference_id_list and perform an unwanted full-KB DINO search.


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b5efc6c6-4532-43af-9115-3815f5e884fe

📥 Commits

Reviewing files that changed from the base of the PR and between a3a7752 and e5dbd45.

📒 Files selected for processing (3)

wavefront/server/modules/db_repo_module/db_repo_module/alembic/versions/2026_04_10_1000-e8f2a1c3b5d9_add_hnsw_index_on_embeddings.py
wavefront/server/modules/knowledge_base_module/knowledge_base_module/queries/generate_query.py
wavefront/server/modules/knowledge_base_module/knowledge_base_module/services/image_rag_retrieve.py

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

wavefront/server/modules/knowledge_base_module/knowledge_base_module/queries/generate_query.py (2)

252-272: ⚠️ Potential issue | 🟠 Major

Scale the DINO ANN window with the requested page.

hnsw_candidates only pulls :limit * 10 rows, then the outer query applies OFFSET :offset. Once users page deeper, valid candidates are dropped before the final sort. image_rag_retrieve.py:47-63 forwards offset/limit into this query, so this will surface as incomplete later pages.

💡 Suggested change

         effective_limit = limit if limit is not None else int(params.get('top_k', 10))
         reference_id_list: List[Any] = params.get('reference_id_list', [])
         effective_offset = offset if offset is not None else 0
+        candidate_limit = (effective_offset + effective_limit) * 10
...
         params = {
             'query_embedding': query_embeddings,
             'kb_id': kb_id,
-            'top_k': effective_limit,
             'reference_ids': processed_reference_ids,
             'offset': effective_offset,
             'limit': effective_limit,
+            'candidate_limit': candidate_limit,
         }
...
-            LIMIT :limit * 10
+            LIMIT :candidate_limit

Also applies to: 290-322
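
The arithmetic behind the suggested change can be isolated in a small helper. This is a sketch under stated assumptions: `ann_candidate_limit` and `build_dino_params` are hypothetical names, and the oversampling factor of 10 simply mirrors the `:limit * 10` in the comment above.

```python
from typing import Any, Dict, List


def ann_candidate_limit(offset: int, limit: int, oversample: int = 10) -> int:
    """Size the ANN candidate window so the outer OFFSET still has rows to skip."""
    if offset < 0:
        raise ValueError("offset must be >= 0")
    if limit <= 0 or oversample <= 0:
        raise ValueError("limit and oversample must be positive")
    return (offset + limit) * oversample


def build_dino_params(
    query_embedding: Any,
    kb_id: Any,
    reference_ids: List[Any],
    offset: int,
    limit: int,
) -> Dict[str, Any]:
    """Bind parameters for the query, including the scaled candidate window."""
    return {
        "query_embedding": query_embedding,
        "kb_id": kb_id,
        "reference_ids": reference_ids,
        "offset": offset,
        "limit": limit,
        # Bound to `LIMIT :candidate_limit` inside the hnsw_candidates CTE.
        "candidate_limit": ann_candidate_limit(offset, limit),
    }
```

With `limit=10`, the first page fetches 100 candidates and the third page (`offset=20`) fetches 300, so the candidate set grows with the page depth instead of staying fixed.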

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In
`@wavefront/server/modules/knowledge_base_module/knowledge_base_module/queries/generate_query.py`
around lines 252 - 272, The DINO ANN candidate window is fixed to ":limit * 10"
causing valid results to be dropped when paginating; update generate_query.py to
scale the ANN retrieval window by page so hnsw_candidates fetches enough rows
for the requested offset. Concretely, compute an ann_window (e.g., ann_window =
(effective_offset + effective_limit) * 10 or ann_window =
(math.ceil((effective_offset + 1)/effective_limit) * effective_limit * 10))
using the existing effective_offset and effective_limit variables, pass that
ann_window into the query parameters used by hnsw_candidates (instead of plain
limit*10), and ensure the params dict (where 'top_k'/'reference_ids' are set)
supplies this scaled window so deeper pages retain enough candidates before the
outer OFFSET is applied.

53-70: ⚠️ Potential issue | 🟠 Major

Don't page after trimming both rerank sources to one page.

vector_results and keyword_results are capped at :limit before the final OFFSET :offset, so page 2+ is computed from page 1 candidates only. The new ANN CTE has the same problem. At minimum, size those inner windows from offset + limit and oversample that value for hnsw_candidates.

💡 Suggested change

         effective_offset = offset or 0
+        page_window = effective_offset + effective_limit

         # Prepare query parameters
         query_params = {
             'query_embed': str(query_embeddings[0]),
             'threshold': threshold,
             'kb_id': kb_id,
             'vector_weight': vector_weight,
             'keyword_weight': keyword_weight,
             'query': query,
             'offset': effective_offset,
             'limit': effective_limit,
+            'page_window': page_window,
+            'candidate_limit': page_window * 10,
         }
...
-                LIMIT :limit * 10
+                LIMIT :candidate_limit
...
-                LIMIT :limit
+                LIMIT :page_window
...
-                LIMIT :limit
+                LIMIT :page_window

Also applies to: 98-157
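
A toy simulation makes the paging failure concrete. It is a deliberate simplification: a single in-memory list ranked by score stands in for the `vector_results` / `keyword_results` CTEs, and the function name `page` is hypothetical.

```python
from typing import List, Tuple

ScoredDoc = Tuple[str, float]


def page(scored: List[ScoredDoc], offset: int, limit: int, inner_window: int) -> List[str]:
    """Trim to `inner_window` rows by score, then apply OFFSET/LIMIT like the outer query."""
    ranked = sorted(scored, key=lambda row: row[1], reverse=True)[:inner_window]
    return [doc for doc, _ in ranked[offset:offset + limit]]


def sample_docs(count: int = 30) -> List[ScoredDoc]:
    """Build `count` documents with strictly decreasing scores."""
    return [(f"doc{i}", 1.0 - i / 100) for i in range(count)]
```

Requesting the second page with `page(sample_docs(), offset=10, limit=10, inner_window=10)` yields an empty list, because the inner trim kept only the ten rows the offset then skips. Sizing the inner window to `offset + limit` (here 20) gives back `doc10` through `doc19`.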

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In
`@wavefront/server/modules/knowledge_base_module/knowledge_base_module/queries/generate_query.py`
around lines 53 - 70, The current query trims vector_results and keyword_results
to limit before applying OFFSET, causing page N>1 to only consider page 1
candidates; update the logic in generate_query (references: vector_results,
keyword_results, hnsw_candidates and the ANN CTE) to request/compute inner
windows sized at least offset + limit (and preferably oversample, e.g. multiply
by a small factor) so the subsequent OFFSET/:offset and final LIMIT/:limit
operate on a superset of candidates; adjust any places building query_params or
CTE limits to use this expanded window size instead of just effective_limit.


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7e83758b-d51a-49dd-b1de-1e44c90a7ce5

📥 Commits

Reviewing files that changed from the base of the PR and between 82ec696 and da2062b.

📒 Files selected for processing (2)

wavefront/server/modules/knowledge_base_module/knowledge_base_module/queries/generate_query.py
wavefront/server/modules/knowledge_base_module/knowledge_base_module/services/image_rag_retrieve.py

🚧 Files skipped from review as they are similar to previous changes (1)

wavefront/server/modules/knowledge_base_module/knowledge_base_module/services/image_rag_retrieve.py


+revision: str = 'e8f2a1c3b5d9'
+down_revision: Union[str, None] = 'c7a9e2f4b1d0'
+branch_labels: Union[str, Sequence[str], None] = None
+depends_on: Union[str, Sequence[str], None] = None


coderabbitai

Actionable comments posted: 2

♻️ Duplicate comments (1)

wavefront/server/modules/knowledge_base_module/knowledge_base_module/queries/generate_query.py (1)

92-123: ⚠️ Potential issue | 🟠 Major

Reapply KB/metadata scoping inside hnsw_candidates.

This reintroduces the earlier multi-KB bug: LIMIT :limit * 20 is still taken from the whole embeddings table, so rows for the requested KB/filter can be dropped before vector_results narrows them.

Possible fix

             WITH hnsw_candidates AS (
                 SELECT
                     id,
                     document_id,
                     chunk_text,
                     chunk_index,
                     (embedding_vector::vector(512)) <=> :query_embed ::vector(512) AS distance
                 FROM
                     {KnowledgeBaseEmbeddings.__tablename__}
+                WHERE
+                    document_id IN (
+                        SELECT d.id
+                        FROM {KnowledgeBaseDocuments.__tablename__} d
+                        WHERE d.knowledge_base_id = :kb_id
+                        {'AND (' + metadata_filter_clause_inner + ')' if metadata_filter_clause_inner else ''}
+                    )
                 ORDER BY
                     (embedding_vector::vector(512)) <=> :query_embed ::vector(512)
                 LIMIT :limit * 20
             ),

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In
`@wavefront/server/modules/knowledge_base_module/knowledge_base_module/queries/generate_query.py`
around lines 92 - 123, The hnsw_candidates CTE currently reads from
KnowledgeBaseEmbeddings and applies LIMIT :limit * 20 before scoping by
KB/metadata, causing relevant rows to be dropped; move the KB and metadata
filtering and the join to KnowledgeBaseDocuments into the hnsw_candidates CTE
(apply the same d.knowledge_base_id = :kb_id and the
metadata_filter_clause_inner there and join on
KnowledgeBaseDocuments.__tablename__), compute distance there, ORDER BY
distance, and then LIMIT :limit * 20 so the prefiltered candidate set respects
KB/metadata before vector_results selects the final :limit results.
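
The truncation described here can be reproduced with a brute-force stand-in for the index. This is an illustrative sketch only: exact sorting replaces the HNSW scan, and the row layout (`id`, `kb_id`, `distance`) and function names are hypothetical.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def top_n_then_filter(rows: List[Row], kb_id: str, n: int, k: int) -> List[Row]:
    """Current shape: take the global top-n, then scope to the KB, then keep k."""
    candidates = sorted(rows, key=lambda r: r["distance"])[:n]
    scoped = [r for r in candidates if r["kb_id"] == kb_id]
    return scoped[:k]


def filter_then_top_n(rows: List[Row], kb_id: str, n: int, k: int) -> List[Row]:
    """Suggested shape: scope to the KB first, then take top-n, then keep k."""
    scoped = [r for r in rows if r["kb_id"] == kb_id]
    candidates = sorted(scoped, key=lambda r: r["distance"])[:n]
    return candidates[:k]


def sample_rows() -> List[Row]:
    """Fifty close rows in KB 'a' and five more distant rows in KB 'b'."""
    kb_a = [{"id": i, "kb_id": "a", "distance": 0.01 * (i + 1)} for i in range(50)]
    kb_b = [{"id": 100 + i, "kb_id": "b", "distance": 0.60 + 0.01 * i} for i in range(5)]
    return kb_a + kb_b
```

With `k=2` and `n=k * 20`, `top_n_then_filter(sample_rows(), "b", 40, 2)` returns nothing because all 40 global candidates belong to KB `a`, while `filter_then_top_n` returns the two nearest rows from KB `b`. In PostgreSQL itself, a `WHERE` clause combined with an approximate index is applied after the index scan, so the scoped CTE may still need a larger `hnsw.ef_search` or iterative scanning to fill the candidate set.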

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@wavefront/server/modules/knowledge_base_module/knowledge_base_module/queries/generate_query.py`:
- Around line 196-224: The HNSW CTE (hnsw_candidates) in generate_query.py
builds a global ANN candidate set and applies the knowledge-base (KB) and
metadata filters only after LIMIT, which can truncate results incorrectly;
update the SQL generation so the WHERE clause (kb_id = :kb_id and the
metadata_filter_clause) is applied inside the hnsw_candidates CTE (i.e., add
"WHERE d.knowledge_base_id = :kb_id" and the metadata filter when selecting from
KnowledgeBaseEmbeddings or by joining KnowledgeBaseDocuments inside the CTE)
before the ORDER BY / LIMIT to ensure the top_k*20 candidates are scoped
correctly; also apply the same change to the analogous get_image_embedding()
query builder to fix the identical truncation bug.
- Around line 270-299: The hnsw_candidates CTE currently applies the d.id =
ANY(:reference_ids) filter only after LIMIT, causing referenced documents to be
excluded before reranking; modify the query in generate_query.py so the
reference_ids (when processed_reference_ids is truthy) and the
kb_id/metadata_filter_clause are applied inside the hnsw_candidates CTE (either
by joining KnowledgeBaseDocuments within the CTE or by filtering
KnowledgeBaseEmbeddings.document_id against :reference_ids and :kb_id) so the
LIMIT :limit * 20 is computed after narrowing candidates; update the CTE
selection logic that uses KnowledgeBaseEmbeddings, hnsw_candidates,
:reference_ids, :kb_id, and metadata_filter_clause accordingly.



ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c302abfe-8c2e-4e27-b501-bd636c4a3946

📥 Commits

Reviewing files that changed from the base of the PR and between 988f23f and 37acaf2.

📒 Files selected for processing (2)

wavefront/server/modules/db_repo_module/db_repo_module/alembic/versions/2026_04_10_1000-e8f2a1c3b5d9_add_hnsw_index_on_embeddings.py
wavefront/server/modules/knowledge_base_module/knowledge_base_module/queries/generate_query.py

🚧 Files skipped from review as they are similar to previous changes (1)

wavefront/server/modules/db_repo_module/db_repo_module/alembic/versions/2026_04_10_1000-e8f2a1c3b5d9_add_hnsw_index_on_embeddings.py

coderabbitai · 2026-04-14T04:17:34Z

+        WITH hnsw_candidates AS (
            SELECT
-                e.id AS embedding_id,
-                e.chunk_text,
-                e.chunk_index,
-                d.id AS document_id,
-                d.file_path,
-                d.file_name,
-                d.knowledge_base_id,
-                d.metadata_value,
-                e.embedding_vector <-> :query_embedding ::vector AS distance
+                id,
+                document_id,
+                chunk_text,
+                chunk_index,
+                (embedding_vector::vector(512)) <=> :query_embedding ::vector(512) AS distance
            FROM
-                {KnowledgeBaseEmbeddings.__tablename__} e
-            JOIN
-                {KnowledgeBaseDocuments.__tablename__} d ON e.document_id = d.id
-            WHERE
-                d.knowledge_base_id = :kb_id {'AND (' + metadata_filter_clause_final + ')' if metadata_filter_clause_final else ''}
-            ORDER BY distance ASC
+                {KnowledgeBaseEmbeddings.__tablename__}
+            ORDER BY
+                (embedding_vector::vector(512)) <=> :query_embedding ::vector(512)
+            LIMIT :top_k * 20
        )
        SELECT
-            *
-        FROM
-            ranked_embeddings
+            hc.id AS embedding_id,
+            hc.chunk_text,
+            hc.chunk_index,
+            d.id AS document_id,
+            d.file_path,
+            d.file_name,
+            d.knowledge_base_id,
+            d.metadata_value,
+            hc.distance
+        FROM hnsw_candidates hc
+        JOIN {KnowledgeBaseDocuments.__tablename__} d ON hc.document_id = d.id
+        WHERE d.knowledge_base_id = :kb_id
+            {'AND (' + metadata_filter_clause + ')' if metadata_filter_clause else ''}
+        ORDER BY hc.distance ASC
        LIMIT :top_k


⚠️ Potential issue | 🟠 Major

Scope the CLIP ANN candidate set before LIMIT.

get_image_embedding() has the same truncation problem: the HNSW CTE is global, and the KB/metadata filters only run after LIMIT :top_k * 20. In a multi-KB or selective-filter dataset this can silently return too few results.

Possible fix

         WITH hnsw_candidates AS (
             SELECT
                 id,
                 document_id,
                 chunk_text,
                 chunk_index,
                 (embedding_vector::vector(512)) <=> :query_embedding ::vector(512) AS distance
             FROM
                 {KnowledgeBaseEmbeddings.__tablename__}
+            WHERE
+                document_id IN (
+                    SELECT d.id
+                    FROM {KnowledgeBaseDocuments.__tablename__} d
+                    WHERE d.knowledge_base_id = :kb_id
+                    {'AND (' + metadata_filter_clause + ')' if metadata_filter_clause else ''}
+                )
             ORDER BY
                 (embedding_vector::vector(512)) <=> :query_embedding ::vector(512)
             LIMIT :top_k * 20
         )


coderabbitai · 2026-04-14T04:17:34Z

+        WITH hnsw_candidates AS (
            SELECT
-                e.id AS embedding_id,
-                e.chunk_text,
-                e.chunk_index,
-                d.id AS document_id,
-                d.file_path,
-                d.file_name,
-                d.knowledge_base_id,
-                d.metadata_value,
-                (1 - (e.embedding_vector_1 <=> :query_embedding ::vector)) AS similarity
-            FROM {KnowledgeBaseEmbeddings.__tablename__} e
-            JOIN {KnowledgeBaseDocuments.__tablename__} d ON e.document_id = d.id
-            WHERE
-                d.knowledge_base_id = :kb_id {reference_filter} {'AND (' + metadata_filter_clause_final + ')' if metadata_filter_clause_final else ''}
-            ORDER BY similarity DESC
+                id,
+                document_id,
+                chunk_text,
+                chunk_index,
+                (embedding_vector_1::vector(1024)) <=> :query_embedding ::vector(1024) AS distance
+            FROM
+                {KnowledgeBaseEmbeddings.__tablename__}
+            ORDER BY
+                (embedding_vector_1::vector(1024)) <=> :query_embedding ::vector(1024)
+            LIMIT :limit * 20
        )
        SELECT
-            *
-        FROM
-            ranked_embeddings
+            hc.id AS embedding_id,
+            hc.chunk_text,
+            hc.chunk_index,
+            d.id AS document_id,
+            d.file_path,
+            d.file_name,
+            d.knowledge_base_id,
+            d.metadata_value,
+            1 - hc.distance AS similarity
+        FROM hnsw_candidates hc
+        JOIN {KnowledgeBaseDocuments.__tablename__} d ON hc.document_id = d.id
+        WHERE d.knowledge_base_id = :kb_id
+            {('AND d.id = ANY(:reference_ids)' if processed_reference_ids else '')}
+            {'AND (' + metadata_filter_clause + ')' if metadata_filter_clause else ''}
+        ORDER BY similarity DESC
        LIMIT :limit OFFSET :offset


⚠️ Potential issue | 🟠 Major

Filter reference_ids inside the DINO candidate CTE.

image_rag_retrieve.py uses this query to rerank the CLIP-selected reference_id_list, but d.id = ANY(:reference_ids) is applied only after LIMIT :limit * 20. That means the global DINO top-N can exclude every referenced document and return empty/incomplete results even when CLIP found good matches.

Possible fix

         WITH hnsw_candidates AS (
             SELECT
                 id,
                 document_id,
                 chunk_text,
                 chunk_index,
                 (embedding_vector_1::vector(1024)) <=> :query_embedding ::vector(1024) AS distance
             FROM
                 {KnowledgeBaseEmbeddings.__tablename__}
+            WHERE
+                document_id IN (
+                    SELECT d.id
+                    FROM {KnowledgeBaseDocuments.__tablename__} d
+                    WHERE d.knowledge_base_id = :kb_id
+                    {('AND d.id = ANY(:reference_ids)' if processed_reference_ids else '')}
+                    {'AND (' + metadata_filter_clause + ')' if metadata_filter_clause else ''}
+                )
             ORDER BY
                 (embedding_vector_1::vector(1024)) <=> :query_embedding ::vector(1024)
             LIMIT :limit * 20
         )


* fix for migration
* fix for migration
* fix for migration
* fix for migration
* fix for embedding index creation
* fix migration commit issue
* fix migration commit issue

fix for migration

e5dbd45

github-code-quality Bot found potential problems Apr 10, 2026

fix for migration

82ec696

coderabbitai Bot reviewed Apr 10, 2026

vizsatiz changed the title ~~fix for migration~~ fix for indexing issue with knowledge bases Apr 10, 2026

fix for migration

da2062b

coderabbitai Bot reviewed Apr 10, 2026

fix for migration

988f23f

github-code-quality Bot found potential problems Apr 10, 2026

fix for embedding index creation

37acaf2

coderabbitai Bot reviewed Apr 14, 2026

vizsatiz added 2 commits April 15, 2026 14:12

fix migration commit issue

762e812

fix migration commit issue

d5dd2ee

vizsatiz merged commit b8d14e9 into develop Apr 21, 2026
10 checks passed

vizsatiz deleted the embedding-index-issue branch April 21, 2026 05:00

coderabbitai Bot mentioned this pull request Jun 16, 2026

floware init script #295

Merged


