Add incremental update mode to Box crawler #326

Merged
adeelehsan merged 3 commits into main from box-incremental-update
Mar 10, 2026

@adeelehsan
Contributor

Summary

  • Adds incremental_update and hours_back config options to the Box crawler
  • When enabled, only processes files modified within the hours_back window (single Box folder traversal)
  • Detects files removed from Box and deletes them from Vectara, updating indexed.csv
  • Uses the full indexing pipeline (docling, tables, OCR, permissions) — no shortcuts

Config

box_crawler:
  incremental_update: true
  hours_back: 8h  # supports: 6h, 24h, 2d

vectara:
  reindex: true  # needed for updated files
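
The hours_back values above ("6h", "24h", "2d") imply a small duration parser. A minimal sketch of how such a value could be turned into a time window follows. The helper name parse_hours_back and its error handling are assumptions made for illustration; the PR does not show the crawler's actual parsing code.

```python
import re
from datetime import timedelta

# Hypothetical helper, not taken from the PR: maps the unit suffix
# used in the config ("h" or "d") to a timedelta keyword argument.
_UNITS = {"h": "hours", "d": "days"}


def parse_hours_back(value: str) -> timedelta:
    """Convert a string such as '8h' or '2d' into a timedelta."""
    match = re.fullmatch(r"\s*(\d+)\s*([hd])\s*", str(value).lower())
    if match is None:
        raise ValueError(f"unsupported hours_back value: {value!r}")
    amount, unit = int(match.group(1)), match.group(2)
    return timedelta(**{_UNITS[unit]: amount})
```

Rejecting anything outside the documented h and d suffixes keeps a typo in the config from silently widening or narrowing the window.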

How it works

  1. Scans all Box folders once → gets file IDs + modified_at
  2. Filters to files modified within hours_back
  3. Downloads and indexes only those files (new + updated)
  4. Compares full Box file list vs indexed.csv → deletes orphaned docs from Vectara
  5. If incremental_update is not set, crawler runs exactly as before (zero behavior change)
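
Steps 2 and 4 above come down to a timestamp filter and a set difference. The sketch below illustrates that logic under stated assumptions: the function name plan_incremental_run, the dict mapping file IDs to ISO 8601 modified_at strings, and the set of IDs read from indexed.csv are all hypothetical shapes chosen for the example, not the crawler's real data structures.

```python
from datetime import datetime, timedelta, timezone


def plan_incremental_run(box_files, indexed_ids, window, now=None):
    """Decide which files to (re)index and which indexed docs are orphaned.

    box_files:   dict of file ID -> modified_at as an ISO 8601 string with
                 a UTC offset (assumed shape of the single folder scan).
    indexed_ids: iterable of file IDs already recorded in indexed.csv.
    window:      timedelta derived from hours_back.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - window

    # Step 2: keep only files modified inside the hours_back window.
    to_index = sorted(
        file_id
        for file_id, modified_at in box_files.items()
        if datetime.fromisoformat(modified_at) >= cutoff
    )

    # Step 4: anything indexed earlier but absent from the full Box
    # listing is treated as removed and should be deleted downstream.
    orphaned = sorted(set(indexed_ids) - set(box_files))
    return to_index, orphaned
```

Note that orphan detection uses the full Box listing rather than the filtered one; comparing against only the recently modified files would flag every unchanged document as removed.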

Test plan

  • Run with incremental_update: true, hours_back: 24h on existing corpus with indexed.csv
  • Verify only recently modified files are processed
  • Verify deleted files are removed from Vectara
  • Verify normal crawl mode still works when incremental_update is not set

🤖 Generated with Claude Code

adeelehsan and others added 3 commits March 10, 2026 00:21
When incremental_update: true and hours_back are set, the crawler:
- Scans Box folders once to get all files with modified_at timestamps
- Filters to only files modified within the hours_back window
- Downloads and indexes only those files through the full pipeline
- Compares Box files vs indexed.csv to detect removed files
- Deletes removed files from Vectara and updates indexed.csv

Config:
  box_crawler:
    incremental_update: true
    hours_back: 8h  # supports: 6h, 24h, 2d

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Switch base image from python:3.11-slim to python:3.12-slim
  (eliminates CRITICAL CVE-2026-3731 in libssh-4, reduces vendored CVEs)
- Upgrade Authlib 1.6.5 → 1.6.7 (CVE-2026-28802)
- Upgrade ray 2.53.0 → 2.54.0
- Recompile requirements.txt with --universal for Python 3.12

Trivy results: 0 CRITICAL, 15 HIGH (all NO FIX available upstream)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
adeelehsan merged commit ee0ec39 into main on Mar 10, 2026
2 checks passed