feat(indexer): SMI-5286 1c facet driver — size-faceted resumable backfill crawl #1489
Merged
…fill crawl

Sub-wave 1c of the Indexer Scale Hardening initiative (SMI-5174). The out-of-band backfill now crawls the full filename:SKILL.md universe past GitHub code-search's 1000-result-per-query cap.

- code-search.facets.ts (NEW): fixed 9-bucket size: ladder (facets_total static) + bisectFacet; size: is the exhaustive primary partitioner (the SMI-5176 probe proved created:/pushed: are tokenized on /search/code).
- subdirectory-search.helpers.ts (NEW): runBackfillFacetCrawl — depth-first size-faceted crawl with adaptive bisect-on-saturation; a facet whose total_count exceeds the cap is split and its halves crawled before the next facet. Shared processSearchResults extracted here (500-line gate).
- backfill-checkpoint.ts: BackfillCursor extended with facet_index + pending_subranges (Infinity persisted as null) so a dispatch boundary mid-bisection resumes losslessly; cursor<->frontier state machine.
- run.ts: builds the facet plan from the resumed checkpoint, threads it into Phase 3b, writes the advanced cursor checkpoint, emits real facet counters.
- discovery-orchestrator.ts: Phase 3b extracted to runSubdirectorySearchPhase (orchestrator back under 500); threads backfillFacetPlan + result.backfill_crawl.
- parse-env.ts (C-5): BACKFILL_PATH_PREFIX + BACKFILL_MAX_RANGES; cap defaults raised only when BACKFILL_MODE.
- subdirectory-search.ts: per_page 30->100.
- trees-search.ts (C-4): root SKILL.md no longer dropped — emit path:'' (high-trust active-cron path verified unaffected: independent root discovery).
- indexer-backfill.yml: DISCOVERY_PHASE=3 (focus each dispatch on the facet crawl + finalize) + max_ranges input.
- docs/internal -> a88e016 (carries the 1c-prep SPARC + plan-review re-confirm).

39 new tests (facet math, state machine incl. JSON Infinity round-trip, saturation->bisection, budget+resume round-trip, parse-env levers). Full indexer suite 466->505 green; lint/typecheck/format/audit:standards clean.

The live (DRY_RUN=false) crawl remains gated on explicit operator sign-off.

Co-Authored-By: claude-flow <ruv@ruv.net>
Co-Authored-By: Claude <noreply@anthropic.com>
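The depth-first, bisect-on-saturation crawl described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in `subdirectory-search.helpers.ts`: `SizeRange`, `CountFn`, `bisect`, and `crawl` are hypothetical names, the search call is replaced by a plain counting function, and only closed (finite) ranges are split.

```typescript
// Hypothetical shapes for illustration; the real types are not shown in this PR text.
interface SizeRange { lo: number; hi: number } // inclusive byte bounds
type CountFn = (r: SizeRange) => number;       // stand-in for a search returning total_count

const RESULT_CAP = 1000; // code search returns at most 1000 results per query

// Split a closed range into two halves; null when it cannot be split further.
function bisect(r: SizeRange): [SizeRange, SizeRange] | null {
  if (!Number.isFinite(r.hi) || r.lo >= r.hi) return null;
  const mid = Math.floor((r.lo + r.hi) / 2);
  return [{ lo: r.lo, hi: mid }, { lo: mid + 1, hi: r.hi }];
}

function crawl(
  ladder: SizeRange[],
  count: CountFn,
): { crawled: SizeRange[]; truncated: SizeRange[] } {
  const crawled: SizeRange[] = [];
  const truncated: SizeRange[] = [];
  for (const facet of ladder) {
    // Depth-first: a saturated facet's halves drain before the next facet starts.
    const stack: SizeRange[] = [facet];
    while (stack.length > 0) {
      const r = stack.pop()!;
      if (count(r) <= RESULT_CAP) {
        crawled.push(r); // fully reachable within the cap
        continue;
      }
      const halves = bisect(r);
      if (halves === null) {
        truncated.push(r); // cannot split further; results past the cap are lost
        continue;
      }
      stack.push(halves[1], halves[0]); // lower half is popped first
    }
  }
  return { crawled, truncated };
}
```

With a synthetic count function that places one file at every byte size from 0 to 3999, a two-facet ladder of `0..1023` and `1024..4095` ends up crawled as six sub-ranges, each under the cap, with nothing truncated.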
…arden bisection

Resolves all governance-review findings on 8ddf7d3:

- C-1 (Critical): bisectCurrentFacet now RETIRES a saturated top-level facet (facetIndex++) before pushing its halves. Previously a top-level facet that stayed saturated was re-queried after its sub-ranges drained → infinite re-crawl with facets_completed stuck at 0. Two regression tests added (unit retirement + integration persistent-saturation, budget-bounded).
- Open-ended bisection ceiling (4 MiB): an always-saturating open-ended facet doubled forever (never reaching lo===hi); bisectFacet now returns null past the ceiling so the tail terminates as truncated. Test added.
- M-1: a page error on a range is now counted in truncated_repo_count + logged (was silently skipped) and the crawl advances. Test added.
- M-2: BackfillSummary gains current_facet + pending_subrange_count; the workflow terminal condition now keys on current_facet=='done' (authoritative) instead of facets_remaining==0, which reads 0 while the last facet's bisected sub-ranges still drain.
- L-2: split the facet state-machine tests into backfill-checkpoint.statemachine.test.ts (both files now <500 lines).

Indexer suite 471 green; lint/typecheck/format/audit:standards clean.

Co-Authored-By: claude-flow <ruv@ruv.net>
Co-Authored-By: Claude <noreply@anthropic.com>
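The two termination fixes above (retire-before-split and the open-ended ceiling) might look something like the sketch below. The function names echo the commit message, but the bodies are assumptions: the `Frontier` shape, the doubling pivot for open-ended ranges, and `retireAndSplit` are illustrative guesses, not the merged implementation.

```typescript
interface SizeRange { lo: number; hi: number } // hi may be Infinity for the open-ended tail
interface Frontier { facetIndex: number; pending: SizeRange[] } // hypothetical shape

const OPEN_ENDED_CEILING = 4 * 1024 * 1024; // 4 MiB, per the commit message

// Returns two sub-ranges, or null when the range must be given up as truncated.
function bisectFacet(r: SizeRange): [SizeRange, SizeRange] | null {
  if (r.lo === r.hi) return null; // single-size range, nothing left to split
  if (Number.isFinite(r.hi)) {
    const mid = Math.floor((r.lo + r.hi) / 2);
    return [{ lo: r.lo, hi: mid }, { lo: mid + 1, hi: r.hi }];
  }
  // Open-ended range: without a ceiling the lower bound would double forever.
  if (r.lo >= OPEN_ENDED_CEILING) return null;
  const pivot = Math.max(r.lo * 2, 1); // assumed doubling strategy
  return [{ lo: r.lo, hi: pivot }, { lo: pivot + 1, hi: Infinity }];
}

// Retire the saturated top-level facet BEFORE queueing its halves, so it is
// never re-queried once the halves have drained.
function retireAndSplit(f: Frontier, ladder: SizeRange[]): Frontier {
  const halves = bisectFacet(ladder[f.facetIndex]);
  return {
    facetIndex: f.facetIndex + 1,
    pending: halves === null ? f.pending : [...halves, ...f.pending],
  };
}
```

Advancing `facetIndex` at split time is what keeps a persistently saturated facet from being revisited; the ceiling bounds how many times an open-ended tail can be split before it is recorded as truncated.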
E2E Test Results - June 18, 2026
Generated by skillsmith E2E test suite
…-driver

# Conflicts:
#	docs/internal
SMI-5286 sub-wave 1c — the wide backfill crawl engine
Implements the size-faceted facet driver for the out-of-band indexer backfill (`indexer-backfill.yml`), the engine that lets the SMI-5174 "Indexer Scale Hardening" initiative crawl GitHub's ~245k-file `SKILL.md` back-catalog past the 1000-results-per-query code-search cap. Code only — the live (`DRY_RUN=false`) crawl remains gated on explicit operator sign-off; this PR ships the engine + a DRY_RUN-ready workflow.

What it does
- `code-search.facets.ts` (NEW): fixed 9-bucket `size:` byte-range ladder (so `facets_total` is static) + `bisectFacet`. `size:` is the exhaustive primary partitioner — the SMI-5176 probe proved `created:`/`pushed:` are tokenized (non-functional) on `/search/code`.
- `subdirectory-search.helpers.ts` (NEW): `runBackfillFacetCrawl` — a depth-first size-faceted crawl with adaptive bisect-on-saturation: any facet whose `total_count` exceeds the cap is split and its halves crawled before the next facet, so every file is reachable. Shared `processSearchResults` extracted here (500-line gate).
- `backfill-checkpoint.ts`: `BackfillCursor` extended with `facet_index` + `pending_subranges` (`Infinity` persisted as `null`) so a dispatch boundary mid-bisection resumes losslessly across the 6h GHA cap; full cursor↔frontier state machine.
- `run.ts` / `discovery-orchestrator.ts`: build the plan from the resumed checkpoint, thread it into Phase 3b (extracted to `runSubdirectorySearchPhase`, orchestrator back under 500), write the advanced cursor checkpoint, emit real facet counters.
- `parse-env.ts` (C-5): `BACKFILL_PATH_PREFIX` (one-ecosystem DRY_RUN) + `BACKFILL_MAX_RANGES`; cap defaults raised only under `BACKFILL_MODE`. `per_page` 30→100.
- `trees-search.ts` (C-4): root `SKILL.md` no longer dropped — emit `path:''`. High-trust active-cron path verified unaffected (independent root discovery).
- `indexer-backfill.yml`: `DISCOVERY_PHASE=3` (focus each dispatch on the facet crawl + finalize), `max_ranges` input, terminal condition keyed on the authoritative `current_facet=='done'`.

Quality

- Tests cover `Infinity` round-trip, saturation→bisection, persistent-saturation termination, budget+resume round-trip, errored-range handling, and parse-env levers. Full indexer suite 471 green.
- `audit:standards` (95%, 0 failed) clean; every changed file < 500 lines.

Plan: `docs/internal/implementation/smi-5286-wave1-backfill-engine-sparc.md` (§#3, Wave 1c).

🤖 Generated with Ruflo
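The `Infinity`-persisted-as-`null` detail matters because `JSON.stringify` already turns a non-finite number into `null`, and a naive `JSON.parse` would then hand back `null` rather than `Infinity`. A minimal round-trip sketch, assuming a `{ lo, hi }` sub-range shape: only the `facet_index` and `pending_subranges` field names come from this PR, and `serializeCursor` / `parseCursor` are hypothetical helpers.

```typescript
interface SizeRange { lo: number; hi: number }                 // in-memory; hi may be Infinity
interface StoredRange { lo: number; hi: number | null }        // persisted; null means open-ended
interface BackfillCursor { facet_index: number; pending_subranges: SizeRange[] }

// Make the Infinity -> null mapping explicit rather than relying on
// JSON.stringify's implicit conversion of non-finite numbers.
function serializeCursor(c: BackfillCursor): string {
  const stored = {
    facet_index: c.facet_index,
    pending_subranges: c.pending_subranges.map(
      (r): StoredRange => ({ lo: r.lo, hi: Number.isFinite(r.hi) ? r.hi : null }),
    ),
  };
  return JSON.stringify(stored);
}

// Restore null upper bounds to Infinity so resumed bisection sees the same ranges.
function parseCursor(json: string): BackfillCursor {
  const raw = JSON.parse(json) as { facet_index: number; pending_subranges: StoredRange[] };
  return {
    facet_index: raw.facet_index,
    pending_subranges: raw.pending_subranges.map((r) => ({ lo: r.lo, hi: r.hi ?? Infinity })),
  };
}
```

Parsing what `serializeCursor` wrote yields an open-ended sub-range whose `hi` is `Infinity` again, which is the lossless resume the checkpoint change is after.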