fix: backfill abstract from file content in vectorize_file #1343

Open
yc111233 wants to merge 1 commit into volcengine:main from yc111233:fix/vectorize-file-backfill-abstract

yc111233 commented Apr 9, 2026

Summary

  • When index_resource calls vectorize_file without a summary in summary_dict, the abstract field on the Context is set to an empty string
  • This means L2 (leaf) records in the vector database have empty abstract fields
  • Downstream, hierarchical_retriever passes these empty abstracts as documents to the rerank API, causing rerank providers (e.g. DashScope qwen3-rerank) to return HTTP 400 because they reject empty document strings

Root cause

index_resource (line 371) calls vectorize_file(summary_dict={"name": file_name}) — no "summary" key. Inside vectorize_file, summary = summary_dict.get("summary", "") resolves to "", which becomes Context(abstract=""). The file content IS read for embedding but never used to populate abstract.
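
The lookup described above can be reproduced in isolation. This is a minimal sketch, not the project's code: `Context` here is a hypothetical stand-in dataclass with only the `abstract` field, and `build_context` is an illustrative helper name that does not appear in the PR.

```python
from dataclasses import dataclass


@dataclass
class Context:
    # Hypothetical stand-in for the real Context; only the relevant field.
    abstract: str = ""


def build_context(summary_dict: dict) -> Context:
    # Same lookup as described in the root cause: a missing "summary" key
    # silently falls back to the empty string.
    summary = summary_dict.get("summary", "")
    return Context(abstract=summary)


# index_resource passes only a "name" key, so the abstract ends up empty.
ctx = build_context({"name": "notes.txt"})
print(repr(ctx.abstract))  # prints ''
```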

Fix

When vectorize_file reads raw file content for embedding and the abstract is still empty, backfill it with the first 200 characters of the file content:

content = _truncate_text(content)
if not context.abstract and content:
    context.abstract = content[:200]
context.set_vectorize(Vectorize(text=content))
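
To see how the backfill behaves on its own, here is a self-contained sketch. `Context`, `Vectorize`, and `_truncate_text` are hypothetical stand-ins (the real helper's truncation limit is not shown in this PR, so the 1000-character default below is an assumption), and `attach_content` is an illustrative wrapper name around the three lines of the fix.

```python
from dataclasses import dataclass
from typing import Optional

ABSTRACT_BACKFILL_CHARS = 200  # matches the slice used in the PR


@dataclass
class Vectorize:
    # Hypothetical stand-in for the real Vectorize type.
    text: str


@dataclass
class Context:
    # Hypothetical stand-in for the real Context type.
    abstract: str = ""
    vectorize: Optional[Vectorize] = None

    def set_vectorize(self, vectorize: Vectorize) -> None:
        self.vectorize = vectorize


def _truncate_text(text: str, max_chars: int = 1000) -> str:
    # Assumed behavior: cap the text at a fixed number of characters.
    return text[:max_chars]


def attach_content(context: Context, content: str) -> Context:
    content = _truncate_text(content)
    # Backfill only when no abstract was supplied and there is content.
    if not context.abstract and content:
        context.abstract = content[:ABSTRACT_BACKFILL_CHARS]
    context.set_vectorize(Vectorize(text=content))
    return context
```

Two properties follow from the condition: an abstract that was already set is never overwritten, and an empty file still yields an empty abstract. Note also that `content[:200]` slices by characters, not by bytes or tokens.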

Impact

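L2 records created via index_resource will now have a non-empty abstract whenever the file has non-empty content, preventing the rerank 400 errors described above. A file whose content is empty still produces an empty abstract, because the backfill only runs when content is non-empty.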

Test plan

  • Run index_resource on a directory with text files
  • Verify L2 records in vectordb have non-empty abstract
  • Run a search query that triggers rerank — confirm no 400 errors
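
The second step of the test plan could be automated with a small check like the one below. This is only a sketch: the record shape (dicts with `uri`, `level`, and `abstract` keys) is an assumption made for illustration and is not taken from the project's vector database schema, and `find_empty_abstracts` is a hypothetical helper.

```python
from typing import Iterable, List


def find_empty_abstracts(records: Iterable[dict]) -> List[str]:
    """Return the URIs of L2 records whose abstract is missing or blank."""
    offenders = []
    for record in records:
        if record.get("level") != 2:
            continue  # only leaf (L2) records are in scope
        abstract = record.get("abstract") or ""
        if not abstract.strip():
            offenders.append(record["uri"])
    return offenders


# Hypothetical sample data standing in for rows read back from the vector DB.
sample = [
    {"uri": "docs/a.txt", "level": 2, "abstract": "First lines of a.txt"},
    {"uri": "docs/b.txt", "level": 2, "abstract": ""},
    {"uri": "docs", "level": 1, "abstract": ""},
]
print(find_empty_abstracts(sample))  # prints ['docs/b.txt']
```

An empty result on a freshly indexed directory would indicate that the backfill took effect for every file with content.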

🤖 Generated with Claude Code

When index_resource calls vectorize_file without a summary in
summary_dict, the abstract field on the Context is set to an empty
string. This means leaf (L2) records in the vector database end up
with an empty abstract. Downstream, hierarchical_retriever passes
these empty abstracts as documents to the rerank API, which causes
rerank providers (e.g. DashScope qwen3-rerank) to return HTTP 400
because they reject empty document strings.

Fix: when vectorize_file reads raw file content for embedding and the
abstract is still empty, backfill it with the first 200 characters of
the file content. This ensures every L2 record has a non-empty
abstract for reranking.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-actions bot commented Apr 9, 2026

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 1 🔵⚪⚪⚪⚪
🏅 Score: 90
🧪 No relevant tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ No major issues detected


github-actions bot commented Apr 9, 2026

PR Code Suggestions ✨

No code suggestions found for the PR.

