feat(indexer): SMI-5286 1c facet driver — size-faceted resumable backfill crawl#1489

Merged: wrsmith108 merged 3 commits into main from fix/smi-5286-1c-facet-driver on Jun 18, 2026

@wrsmith108

Member

SMI-5286 sub-wave 1c — the wide backfill crawl engine

Implements the size-faceted facet driver for the out-of-band indexer backfill (indexer-backfill.yml), the engine that lets the SMI-5174 "Indexer Scale Hardening" initiative crawl GitHub's ~245k-file SKILL.md back-catalog past the 1000-results-per-query code-search cap. Code only — the live (DRY_RUN=false) crawl remains gated on explicit operator sign-off; this PR ships the engine + a DRY_RUN-ready workflow.

What it does

  • code-search.facets.ts (NEW): fixed 9-bucket size: byte-range ladder (so facets_total is static) + bisectFacet. size: is the exhaustive primary partitioner — the SMI-5176 probe proved created:/pushed: are tokenized (non-functional) on /search/code.
  • subdirectory-search.helpers.ts (NEW): runBackfillFacetCrawl — a depth-first size-faceted crawl with adaptive bisect-on-saturation: any facet whose total_count exceeds the cap is split and its halves crawled before the next facet, so every file is reachable. Shared processSearchResults extracted here (500-line gate).
  • backfill-checkpoint.ts: BackfillCursor extended with facet_index + pending_subranges (Infinity persisted as null) so a dispatch boundary mid-bisection resumes losslessly across the 6h GHA cap; full cursor↔frontier state machine.
  • run.ts / discovery-orchestrator.ts: build the plan from the resumed checkpoint, thread it into Phase 3b (extracted to runSubdirectorySearchPhase, orchestrator back under 500), write the advanced cursor checkpoint, emit real facet counters.
  • parse-env.ts (C-5): BACKFILL_PATH_PREFIX (one-ecosystem DRY_RUN) + BACKFILL_MAX_RANGES; cap defaults raised only under BACKFILL_MODE. per_page 30→100.
  • trees-search.ts (C-4): root SKILL.md no longer dropped — emit path:''. High-trust active-cron path verified unaffected (independent root discovery).
  • indexer-backfill.yml: DISCOVERY_PHASE=3 (focus each dispatch on the facet crawl + finalize), max_ranges input, terminal condition keyed on the authoritative current_facet=='done'.
  • docs/internal → a88e016 carries the 1c-prep SPARC + plan-review re-confirm (GO).
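
The bisect-on-saturation idea in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in code-search.facets.ts: the `SizeFacet` shape, inclusive byte bounds, the doubling strategy for the open-ended tail, and the `OPEN_ENDED_CEILING` constant are all hypothetical, and the real 9-bucket boundaries are not shown in this PR. Only the `size:` qualifier syntax and the name `bisectFacet` come from the description.

```typescript
// Hypothetical facet shape: an inclusive byte range; hi = Infinity marks an open-ended tail.
interface SizeFacet {
  lo: number;
  hi: number;
}

// Assumed ceiling past which an open-ended facet is no longer split.
const OPEN_ENDED_CEILING = 4 * 1024 * 1024;

// Render a facet as a code-search `size:` qualifier.
function sizeQualifier(f: SizeFacet): string {
  return Number.isFinite(f.hi) ? `size:${f.lo}..${f.hi}` : `size:>=${f.lo}`;
}

// Split a saturated facet into two halves, or return null when it cannot be narrowed.
function bisectFacet(f: SizeFacet): [SizeFacet, SizeFacet] | null {
  if (!Number.isFinite(f.hi)) {
    // Open-ended: carve off a bounded lower part and keep the tail open,
    // giving up once the pivot passes the ceiling so the crawl terminates.
    const pivot = Math.max(f.lo * 2, 1);
    if (pivot > OPEN_ENDED_CEILING) return null;
    return [
      { lo: f.lo, hi: pivot },
      { lo: pivot + 1, hi: Infinity },
    ];
  }
  if (f.lo >= f.hi) return null; // a single byte size cannot be split further
  const mid = Math.floor((f.lo + f.hi) / 2);
  return [
    { lo: f.lo, hi: mid },
    { lo: mid + 1, hi: f.hi },
  ];
}
```

Because the two halves never overlap and together cover the parent range, a file that was unreachable behind the 1000-result cap in the parent query appears in exactly one child query. A facet that returns null here would be recorded as truncated rather than retried.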

Quality

  • 2 SPARC plan-review passes (GO) + 2 governance reviews (the first found a Critical facet-retirement bug C-1 — fixed with 2 regression tests; the re-review = GO).
  • 44 new tests (facet math, state machine incl. JSON Infinity round-trip, saturation→bisection, persistent-saturation termination, budget+resume round-trip, errored-range handling, parse-env levers). Full indexer suite 471 green.
  • lint / typecheck / format / audit:standards (95%, 0 failed) clean; every changed file < 500 lines.
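
The "JSON Infinity round-trip" covered by the tests above matters because JSON has no literal for Infinity. A minimal sketch of such a round-trip follows. The field names `facet_index` and `pending_subranges` come from this PR, while the `Subrange` shape and the helpers `serializeCursor` and `parseCursor` are hypothetical stand-ins for whatever backfill-checkpoint.ts actually does.

```typescript
// Hypothetical in-memory shapes; only the two field names are taken from the PR.
interface Subrange {
  lo: number;
  hi: number; // Infinity for an open-ended tail
}
interface BackfillCursorSketch {
  facet_index: number;
  pending_subranges: Subrange[];
}
// On disk, an open-ended upper bound is stored as null.
interface PersistedSubrange {
  lo: number;
  hi: number | null;
}

function serializeCursor(c: BackfillCursorSketch): string {
  const pending: PersistedSubrange[] = c.pending_subranges.map((r) => ({
    lo: r.lo,
    hi: Number.isFinite(r.hi) ? r.hi : null, // make the null explicit
  }));
  return JSON.stringify({ facet_index: c.facet_index, pending_subranges: pending });
}

function parseCursor(json: string): BackfillCursorSketch {
  const raw = JSON.parse(json) as {
    facet_index: number;
    pending_subranges: PersistedSubrange[];
  };
  return {
    facet_index: raw.facet_index,
    // Restore null to Infinity so range comparisons behave after a resume.
    pending_subranges: raw.pending_subranges.map((r) => ({ lo: r.lo, hi: r.hi ?? Infinity })),
  };
}
```

`JSON.stringify` already writes a bare `Infinity` as `null`, so the explicit mapping on the way out is mostly documentation. The mapping on the way back in is the part that is easy to forget: without it, a resumed dispatch would see `hi: null` and treat the open-ended tail as an invalid or empty range.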

Plan: docs/internal/implementation/smi-5286-wave1-backfill-engine-sparc.md#3 (Wave 1c).

🤖 Generated with Ruflo

wrsmith108 and others added 2 commits June 18, 2026 12:47
…fill crawl

Sub-wave 1c of the Indexer Scale Hardening initiative (SMI-5174). The
out-of-band backfill now crawls the full filename:SKILL.md universe past
GitHub code-search's 1000-result-per-query cap.

- code-search.facets.ts (NEW): fixed 9-bucket size: ladder (facets_total
  static) + bisectFacet; size: is the exhaustive primary partitioner (the
  SMI-5176 probe proved created:/pushed: are tokenized on /search/code).
- subdirectory-search.helpers.ts (NEW): runBackfillFacetCrawl — depth-first
  size-faceted crawl with adaptive bisect-on-saturation; a facet whose
  total_count exceeds the cap is split and its halves crawled before the next
  facet. Shared processSearchResults extracted here (500-line gate).
- backfill-checkpoint.ts: BackfillCursor extended with facet_index +
  pending_subranges (Infinity persisted as null) so a dispatch boundary
  mid-bisection resumes losslessly; cursor<->frontier state machine.
- run.ts: builds the facet plan from the resumed checkpoint, threads it into
  Phase 3b, writes the advanced cursor checkpoint, emits real facet counters.
- discovery-orchestrator.ts: Phase 3b extracted to runSubdirectorySearchPhase
  (orchestrator back under 500); threads backfillFacetPlan + result.backfill_crawl.
- parse-env.ts (C-5): BACKFILL_PATH_PREFIX + BACKFILL_MAX_RANGES; cap defaults
  raised only when BACKFILL_MODE. subdirectory-search.ts: per_page 30->100.
- trees-search.ts (C-4): root SKILL.md no longer dropped — emit path:''
  (high-trust active-cron path verified unaffected: independent root discovery).
- indexer-backfill.yml: DISCOVERY_PHASE=3 (focus each dispatch on the facet
  crawl + finalize) + max_ranges input.
- docs/internal -> a88e016 (carries the 1c-prep SPARC + plan-review re-confirm).

39 new tests (facet math, state machine incl. JSON Infinity round-trip,
saturation->bisection, budget+resume round-trip, parse-env levers). Full
indexer suite 466->505 green; lint/typecheck/format/audit:standards clean.
The live (DRY_RUN=false) crawl remains gated on explicit operator sign-off.

Co-Authored-By: claude-flow <ruv@ruv.net>
Co-Authored-By: Claude <noreply@anthropic.com>
…arden bisection

Resolves all governance-review findings on 8ddf7d3:

- C-1 (Critical): bisectCurrentFacet now RETIRES a saturated top-level facet
  (facetIndex++) before pushing its halves. Previously a top-level facet that
  stayed saturated was re-queried after its sub-ranges drained → infinite
  re-crawl with facets_completed stuck at 0. Two regression tests added
  (unit retirement + integration persistent-saturation, budget-bounded).
- Open-ended bisection ceiling (4 MiB): an always-saturating open-ended facet
  doubled forever (never reaching lo===hi); bisectFacet now returns null past
  the ceiling so the tail terminates as truncated. Test added.
- M-1: a page error on a range is now counted in truncated_repo_count + logged
  (was silently skipped) and the crawl advances. Test added.
- M-2: BackfillSummary gains current_facet + pending_subrange_count; the
  workflow terminal condition now keys on current_facet=='done' (authoritative)
  instead of facets_remaining==0, which reads 0 while the last facet's
  bisected sub-ranges still drain.
- L-2: split the facet state-machine tests into
  backfill-checkpoint.statemachine.test.ts (both files now <500 lines).

Indexer suite 471 green; lint/typecheck/format/audit:standards clean.
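
The termination properties described in this commit can be illustrated with a small model of the crawl loop. It is a sketch under assumptions, not the code in runBackfillFacetCrawl: `Range`, `Frontier`, `crawl`, and the injected `totalCount` callback are hypothetical, and the model retires a top-level facet at the moment it is picked, which is a simplification of retiring it just before its halves are pushed.

```typescript
interface Range { lo: number; hi: number; }
interface Frontier {
  facetIndex: number; // next top-level facet to visit
  pending: Range[];   // stack of bisected sub-ranges, drained before the next facet
}
interface CrawlStats { crawled: Range[]; truncated: Range[]; }

const CAP = 1000; // results reachable per code-search query

function splitRange(r: Range): [Range, Range] | null {
  if (r.lo >= r.hi) return null;
  const mid = Math.floor((r.lo + r.hi) / 2);
  return [{ lo: r.lo, hi: mid }, { lo: mid + 1, hi: r.hi }];
}

// totalCount stands in for a search call; maxRanges is the per-dispatch budget.
function crawl(
  ladder: Range[],
  frontier: Frontier,
  totalCount: (r: Range) => number,
  maxRanges: number,
): CrawlStats {
  const stats: CrawlStats = { crawled: [], truncated: [] };
  for (let budget = maxRanges; budget > 0; budget--) {
    let range = frontier.pending.pop();
    if (range === undefined) {
      const next = ladder[frontier.facetIndex];
      if (next === undefined) break; // ladder exhausted and nothing pending: done
      frontier.facetIndex++;         // retire the facet so it is never re-queried
      range = next;
    }
    if (totalCount(range) <= CAP) {
      stats.crawled.push(range);
      continue;
    }
    const halves = splitRange(range);
    if (halves === null) {
      stats.truncated.push(range); // still saturated but cannot be narrowed
      continue;
    }
    frontier.pending.push(halves[1], halves[0]); // lower half is popped first
  }
  return stats;
}
```

Since the frontier is mutated in place, calling `crawl` again with the same frontier resumes where the previous budget ran out. In this model the crawl is finished only when the ladder is exhausted and `pending` is empty, which mirrors why a condition like facets_remaining==0 alone is not a safe terminal check while sub-ranges of the last facet are still queued.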

Co-Authored-By: claude-flow <ruv@ruv.net>
Co-Authored-By: Claude <noreply@anthropic.com>
@github-actions

E2E Test Results

E2E Test Results - June 18, 2026

Summary

  • Status: ✅ PASSED
  • Total Duration: 0.00s
  • Generated: 2026-06-18T20:40:07.466Z

Test Results

| Phase | Status | Duration |
| --- | --- | --- |
| CLI E2E | ⏭️ Skipped | - |
| MCP E2E | ⏭️ Skipped | - |

Generated by skillsmith E2E test suite

@github-actions

E2E Test Results

E2E Test Results - June 18, 2026

Summary

  • Status: ✅ PASSED
  • Total Duration: 0.00s
  • Generated: 2026-06-18T21:07:24.123Z

Test Results

| Phase | Status | Duration |
| --- | --- | --- |
| CLI E2E | ⏭️ Skipped | - |
| MCP E2E | ⏭️ Skipped | - |

Generated by skillsmith E2E test suite

@wrsmith108 merged commit 1b825c7 into main on Jun 18, 2026
30 checks passed