Skip to content

feat(pipeline): index-time importance scoring (weighted-degree)#879

Open
petercoxphoto wants to merge 4 commits into
DeusData:mainfrom
petercoxphoto:feat/importance-ranking
Open

feat(pipeline): index-time importance scoring (weighted-degree)#879
petercoxphoto wants to merge 4 commits into
DeusData:mainfrom
petercoxphoto:feat/importance-ranking

Conversation

@petercoxphoto

Copy link
Copy Markdown
Contributor

Index-time importance scoring (weighted-degree, Aider-style)

Split out of #776 (now closed) as an atomic feature PR, per @DeusData's request to break that bundle into single-concern PRs. The RSS/backpressure stability fix from that bundle already landed as 610531a; the memory-safety fixes are superseded by upstream's own leak/slab work, so this PR carries only the importance-ranking feature. The repo_map tool is a separate stacked PR that builds on this.

What it does

Adds a 7th predump pass (pass_importance.c, registered after all edge-producing passes) that persists a numeric importance key per Function/Method/Class node on properties_json (append pattern from pass_complexity.c). It does not touch compute_search_score — it's index-time only.

importance = sqrt(num_refs) x priv x generic x distinct x test_penalty
  num_refs      = incoming CALLS + USAGE edges
  priv     x0.1   leading-underscore
  generic  x0.1   name defined in >=5 distinct files
  distinct x10    snake_case|camelCase and len>=8
  test_penalty x0.1 when the graph's own canonical cbm_is_test_path()
                    flags the file, OR the symbol is the TARGET of an
                    incoming TESTS edge (prod helpers in non-test files)

Weighted-degree only; PageRank deferred (measured decision).

Test-penalty note

The original spec used an edge-only TESTS/TESTS_FILE penalty. Real-corpus verification showed that under-fires (TESTS_FILE had ~15 edges; is_test stamped on <2% of nodes), leaving 33/40 top-ranked symbols as test scaffolding. Switched to the graph's canonical cbm_is_test_path() classifier (the one pass_tests uses) OR an incoming TESTS edge — not a naive strstr (it leaves testutil_helpers-style prod helpers unpenalised). After re-index: top-40 = 0 test-file symbols (was 33).

Tests

tests/test_importance.c — 10 cases (registration/dispatch, the formula factors, and the test-penalty inversion). Rebased onto current main; the only merge work was test-harness wiring (RUN_SELECTED_SUITE).

@DeusData

DeusData commented Jul 5, 2026

Copy link
Copy Markdown
Owner

Thanks for splitting this out of #776. Triage: normal-priority feature work for ranking/index quality.

Review will focus on score semantics, index-time overhead, properties_json mutation safety, and whether the tests prove stable ordering without making the ranking formula too rigid. This is the right size for a standalone review.

AC1 tests (tests/test_importance.c) + suite wiring + a no-op pass stub
(pass_importance.c) registered as the 7th predump pass in run_predump_passes
(after all edge-producing passes). Tests fail against the stub (no importance
key persisted) — the red-test boundary. Formula lands in the impl commit.

Red evidence: full ASan test-runner run captured at $ARTIFACT_DIR/red-run.log —
all 6 importance tests FAIL (test_importance.c:91/167/215/279/331/398). The
pre-existing lang_contract ASan abort (vendored ts_runtime lexer UAF) is a
baseline flake unrelated to this change.

Boundary committed by the orchestrator (recovery: the builder agent authored
these files but died on an API overload before committing).

Signed-off-by: Peter Cox <info@petercox.ie>
pass_importance.c: weighted-degree Aider model persisted per Function/Method/
Class node as a numeric "importance" key on properties_json (append pattern
from pass_complexity.c). importance = sqrt(num_refs) x priv x generic x
distinct x test_penalty:
  num_refs = incoming CALLS + USAGE edges
  priv     x0.1 leading-underscore
  generic  x0.1 name defined in >=5 distinct files
  distinct x10  snake_case|camelCase and len>=8
  test_penalty x0.1 edge-based (TARGET of a TESTS edge, or file is the SOURCE
                    of a TESTS_FILE edge) -- replaces the filename strstr; a
                    'test'-substring path with no test edge is NOT penalised.
Weighted-degree only; PageRank deferred (measured decision, builder-notes).
All 10 fixture importance tests pass. Does not touch compute_search_score.

Signed-off-by: Peter Cox <info@petercox.ie>
Real-data verification (full connectors re-index) showed the spec's edge-only
TESTS/TESTS_FILE penalty does NOT achieve AC1's exclusion goal: TESTS_FILE has
only 15 edges (needs a test-file->prod-file mapping), TESTS never targets a
test-resident symbol, and node is_test is stamped on <2% of nodes and none of
the test fixtures/doubles. Edge-only left 33/40 of the top ranked symbols as
test scaffolding (session_dir, FakeConnectorClient, ...).

Fix: penalise when the graph's OWN canonical classifier cbm_is_test_path()
(the one pass_tests uses: test_ prefix, _test.<ext>, /tests/, ...) flags the
file, OR the symbol is the target of an incoming TESTS edge (prod helpers in
non-test files). This is NOT the naive strstr the spec warned against — it
leaves src/testutil_helpers.go unpenalised (test-plan DeusData#6 inversion still green).

Result after re-index: top-40 = 0 test-file symbols (was 33); named exclusions
_run/now/run_hook/client and make_proposals/make_golden/seed_recording all rank
>100 (well outside top-40); top is real domain symbols. All 10 unit tests pass.
Deviation from the spec's stated mechanism — surfaced to Peter.

Signed-off-by: Peter Cox <info@petercox.ie>
Signed-off-by: Peter Cox <info@petercox.ie>
@petercoxphoto petercoxphoto force-pushed the feat/importance-ranking branch from ffcead9 to cce0f68 Compare July 5, 2026 10:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request parsing/quality Graph extraction bugs, false positives, missing edges priority/normal Standard review queue; useful PR with ordinary maintainer urgency.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants