fix: backfill abstract from file content in vectorize_file by yc111233 · Pull Request #1343 · volcengine/OpenViking

yc111233 · 2026-04-09T18:20:22Z

Summary

When index_resource calls vectorize_file without a "summary" key in summary_dict, the abstract field on the Context is set to an empty string.
As a result, L2 (leaf) records in the vector database are stored with an empty abstract.
Downstream, hierarchical_retriever passes these empty abstracts as documents to the rerank API. Rerank providers that reject empty document strings (e.g. DashScope qwen3-rerank) then return HTTP 400.

Root cause

index_resource (line 371) calls vectorize_file(summary_dict={"name": file_name}) — no "summary" key. Inside vectorize_file, summary = summary_dict.get("summary", "") resolves to "", which becomes Context(abstract=""). The file content IS read for embedding but never used to populate abstract.
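The fallback described above can be reproduced in isolation. This is a minimal sketch, not the project's code: the file name is a hypothetical placeholder, and only the summary_dict.get call is taken from the description.

```python
# Same shape as the dict that index_resource passes, per the description above.
# "notes.txt" is a hypothetical file name used only for illustration.
summary_dict = {"name": "notes.txt"}

# There is no "summary" key, so dict.get falls back to the default "".
summary = summary_dict.get("summary", "")

# This is the value that ends up as Context(abstract=...).
abstract = summary
print(repr(abstract))
```

Because the default is an empty string rather than None, nothing fails at indexing time; the problem only surfaces later, when the rerank provider rejects the empty document.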

Fix

When vectorize_file reads raw file content for embedding and the abstract is still empty, backfill it with the first 200 characters of the file content:

content = _truncate_text(content)
if not context.abstract and content:
    context.abstract = content[:200]
context.set_vectorize(Vectorize(text=content))
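To make the behavior of the backfill easier to check, here is a self-contained sketch of the same logic. Context, Vectorize, and _truncate_text below are simplified stand-ins written for this example (the real OpenViking classes are not shown in this PR), and attach_content and the 1000-character truncation limit are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

ABSTRACT_MAX_CHARS = 200  # matches the 200-character slice used in the fix


@dataclass
class Vectorize:
    """Stand-in for the real Vectorize type (assumption)."""
    text: str


@dataclass
class Context:
    """Stand-in for the real Context type (assumption)."""
    abstract: str = ""
    vectorize: Optional[Vectorize] = None

    def set_vectorize(self, vectorize: Vectorize) -> None:
        self.vectorize = vectorize


def _truncate_text(text: str, max_chars: int = 1000) -> str:
    """Stand-in truncation helper; the real limit is not given in the PR."""
    return text[:max_chars]


def attach_content(context: Context, content: str) -> Context:
    """Hypothetical wrapper around the four lines of the fix."""
    content = _truncate_text(content)
    # Backfill only when no summary-derived abstract exists and there is content.
    if not context.abstract and content:
        context.abstract = content[:ABSTRACT_MAX_CHARS]
    context.set_vectorize(Vectorize(text=content))
    return context


backfilled = attach_content(Context(), "x" * 500)
preserved = attach_content(Context(abstract="existing summary"), "raw file text")
still_empty = attach_content(Context(), "")
```

Two properties follow from the condition: an abstract that already came from a summary is never overwritten, and an empty file still produces an empty abstract. Whitespace-only content would also yield a whitespace-only abstract; whether rerank providers accept that is not covered by this PR.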

Impact

Every L2 record created via index_resource will now have a non-empty abstract, preventing rerank 400 errors.

Test plan

Run index_resource on a directory with text files
Verify L2 records in vectordb have non-empty abstract
Run a search query that triggers rerank — confirm no 400 errors
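The second step of the test plan could be automated with a small check like the one below. It is only a sketch: it assumes the L2 records have already been fetched from the vector database as plain dicts with "uri" and "abstract" keys, which is a hypothetical shape rather than the actual storage schema.

```python
from typing import Iterable, List, Mapping


def find_empty_abstracts(records: Iterable[Mapping[str, object]]) -> List[str]:
    """Return the URIs of records whose abstract is missing, empty, or blank."""
    offenders = []
    for record in records:
        abstract = record.get("abstract") or ""
        if not str(abstract).strip():
            offenders.append(str(record.get("uri", "<unknown>")))
    return offenders


# Hypothetical sample records, for illustration only.
sample_records = [
    {"uri": "a", "abstract": "first 200 characters of the file"},
    {"uri": "b", "abstract": ""},
    {"uri": "c", "abstract": "   "},
    {"uri": "d"},
]
offenders = find_empty_abstracts(sample_records)
```

After re-indexing with the fix applied, the expectation is that this list is empty for every non-empty text file.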

🤖 Generated with Claude Code

When index_resource calls vectorize_file without a summary in summary_dict, the abstract field on the Context is set to an empty string. This means leaf (L2) records in the vector database end up with an empty abstract. Downstream, hierarchical_retriever passes these empty abstracts as documents to the rerank API, which causes rerank providers (e.g. DashScope qwen3-rerank) to return HTTP 400 because they reject empty document strings.

Fix: when vectorize_file reads raw file content for embedding and the abstract is still empty, backfill it with the first 200 characters of the file content. This ensures every L2 record has a non-empty abstract for reranking.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-actions · 2026-04-09T18:21:14Z

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 1 🔵⚪⚪⚪⚪
🏅 Score: 90
🧪 No relevant tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ No major issues detected

github-actions · 2026-04-09T18:21:36Z

PR Code Suggestions ✨

No code suggestions found for the PR.

github-project-automation bot added this to OpenViking project Apr 9, 2026

github-project-automation bot moved this to Backlog in OpenViking project Apr 9, 2026

github-actions bot added the Review effort 1/5 label Apr 9, 2026

yc111233 mentioned this pull request Apr 9, 2026

fix: filter empty documents in OpenAI rerank client #1345 (Open, 4 tasks)

yc111233 wants to merge 1 commit into volcengine:main from yc111233:fix/vectorize-file-backfill-abstract
