fix(core): deterministic node ID normalization for parallel batch output#65

Merged
Lum1104 merged 2 commits into Lum1104:main from fishinakleinbottle:fix/normalization-for-generated-graph
Apr 1, 2026

Conversation

@fishinakleinbottle (Collaborator)

Resolves #59

Problem

Parallel file-analyzer subagents can produce inconsistent node IDs (project-name prefixed, double-prefixed, bare paths) and numeric complexity values. The existing autoFixGraph/COMPLEXITY_ALIASES pipeline handles string alias correction but does not address malformed IDs or numeric complexity — causing cascading edge drops at assembly and potential dashboard load failures.

What this PR adds

A pre-assembly normalization pass that runs before the existing sanitize/autoFix/validate pipeline, handling the things it doesn't cover.
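As a rough sketch of where this pass sits, the assembly order could look like the following. The names `normalizeBatchOutput`, `sanitizeGraph`, `autoFixGraph`, and `validateGraph` are taken from the PR text; the signatures here are simplified assumptions, not the real API.

```typescript
// Simplified stand-in types; the real Graph shape lives in @understand-anything/core.
type Graph = { nodes: { id: string }[]; edges: { source: string; target: string }[] };

function assembleGraph(
  batches: Graph[],
  normalizeBatchOutput: (g: Graph) => Graph, // new pre-assembly pass from this PR
  sanitizeGraph: (g: Graph) => Graph,        // existing pipeline, unchanged
  autoFixGraph: (g: Graph) => Graph,
  validateGraph: (g: Graph) => Graph,
): Graph {
  // 1. Merge parallel batch outputs, then run the new normalization pass
  //    (ID fix -> edge rewriting -> dedup) before anything else sees the graph.
  const merged = normalizeBatchOutput({
    nodes: batches.flatMap((b) => b.nodes),
    edges: batches.flatMap((b) => b.edges),
  });
  // 2-4. Hand off to the existing sanitize/autoFix/validate pipeline.
  return validateGraph(autoFixGraph(sanitizeGraph(merged)));
}
```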

Changed files

| File | What changed |
| --- | --- |
| `packages/core/src/analyzer/normalize-graph.ts` | New. `normalizeNodeId` (strips project-name/double prefixes, adds missing type prefixes), `normalizeComplexity` (maps the numeric 1-10 scale), `normalizeBatchOutput` (orchestrates ID fix → edge rewriting → dedup) |
| `packages/core/src/__tests__/normalize-graph.test.ts` | New. 31 tests covering all normalization paths |
| `packages/core/src/index.ts` | Exports the new module |
| `packages/core/src/__tests__/schema.test.ts` | 1 test confirming the schema accepts bare string IDs (documents the design decision that normalization, not the schema, enforces prefix format) |
| `skills/understand/SKILL.md` | Rewrites Phase 3 ASSEMBLE from 3 bullets to a 6-step normalization sequence |
| `skills/understand/file-analyzer-prompt.md` | Adds a warning box about ID format requirements |
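For intuition, the three malformed ID shapes from the issue and their intended fixes can be sketched like this. This is a simplified stand-in, not the real `normalizeNodeId` in `normalize-graph.ts`: the actual prefix set, signature, and func/class handling differ.

```typescript
// Toy prefix set; the real implementation covers all node types.
const VALID_PREFIXES = new Set(["file", "func", "class"]);

function normalizeNodeId(id: string, expectedPrefix: string): string {
  let remaining = id;
  // Peel colon-separated segments until a valid type prefix (or nothing) remains.
  while (true) {
    const colonIdx = remaining.indexOf(":");
    if (colonIdx <= 0) break;
    const segment = remaining.slice(0, colonIdx);
    if (VALID_PREFIXES.has(segment)) {
      const rest = remaining.slice(colonIdx + 1);
      // Collapse double prefixes like "file:file:src/foo.ts".
      if (rest.startsWith(segment + ":")) {
        remaining = rest;
        continue;
      }
      return remaining;
    }
    // Drop a project-name prefix like "my-project:".
    remaining = remaining.slice(colonIdx + 1);
  }
  // Bare path: add the missing type prefix.
  return `${expectedPrefix}:${remaining}`;
}
```

So `my-project:file:src/foo.ts`, `file:file:src/foo.ts`, and `src/foo.ts` would all land on `file:src/foo.ts`.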

Not changed

  • Zod schema: id stays as z.string(). Normalization fixes IDs upstream of validation.
  • Dashboard — no changes needed.
  • String complexity aliases — already handled by COMPLEXITY_ALIASES in autoFixGraph.

Reviewer checklist

  • stripToValidPrefix() loop in normalize-graph.ts: verify it correctly peels project-name prefixes, double-prefixes, and bottoms out on bare paths
  • normalizeBatchOutput() in normalize-graph.ts: confirm edge rewriting happens after all node IDs are normalized (order matters)
  • SKILL.md Phase 3: check the 6-step sequence aligns with the code logic
  • Verify no overlap with autoFixGraph/sanitizeGraph/normalizeGraph in schema.ts — this PR intentionally avoids duplicating that work

Test plan

  • pnpm --filter @understand-anything/core build — passes
  • pnpm --filter @understand-anything/core test — 189 tests pass
  • Run /understand --full on a multi-file project and verify no dropped nodes/edges in dashboard

…r output

Parallel file-analyzer subagents can produce inconsistent node IDs
(project-name prefixed, double-prefixed, bare paths) and invalid
complexity values. Phase 3 ASSEMBLE now normalizes these deterministically
before merging, preventing cascading edge drops and dashboard load failures.

- Add normalize-graph.ts with normalizeNodeId, normalizeComplexity,
  and normalizeBatchOutput utilities
- Rewrite SKILL.md Phase 3 with 6-step normalization sequence
- Strengthen file-analyzer prompt with ID format warnings
- Add 32 normalization tests and 2 schema boundary tests

Copilot AI left a comment


Pull request overview

This PR introduces a deterministic normalization step for merged parallel batch outputs to prevent malformed node IDs and invalid complexity values from breaking graph assembly and dashboard loading (resolving #59).

Changes:

  • Adds normalizeNodeId, normalizeComplexity, and normalizeBatchOutput utilities in core, plus exports from packages/core/src/index.ts.
  • Adds comprehensive unit tests for normalization behavior and a schema test documenting that id remains a plain string.
  • Updates /understand skill docs and the file-analyzer prompt to emphasize ID requirements and the new assembly sequence.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| `skills/understand/SKILL.md` | Documents a new ordered normalization + cleanup sequence in Phase 3 (ASSEMBLE). |
| `skills/understand/file-analyzer-prompt.md` | Adds an explicit warning about correct node ID formatting. |
| `packages/core/src/index.ts` | Exports the new normalization utilities/types. |
| `packages/core/src/analyzer/normalize-graph.ts` | Implements node ID normalization, complexity normalization, edge rewriting, and dedup/drop logic with stats. |
| `packages/core/src/__tests__/schema.test.ts` | Adds a test asserting schema leniency for id format. |
| `packages/core/src/__tests__/normalize-graph.test.ts` | Adds unit tests covering normalization and stats behavior. |


Comment on lines +14 to +42

```typescript
 * Strips all non-valid prefixes from an ID, returning the bare path
 * and the first valid prefix found (if any).
 */
function stripToValidPrefix(id: string): { prefix: string | null; path: string } {
  let remaining = id;

  // Peel off colon-separated segments until we find a valid prefix or run out
  while (true) {
    const colonIdx = remaining.indexOf(":");
    if (colonIdx <= 0) break;

    const segment = remaining.slice(0, colonIdx);
    if (VALID_PREFIXES.has(segment)) {
      // Check for double valid prefix (e.g., "file:file:src/foo.ts")
      const rest = remaining.slice(colonIdx + 1);
      const innerColonIdx = rest.indexOf(":");
      if (innerColonIdx > 0 && VALID_PREFIXES.has(rest.slice(0, innerColonIdx))) {
        // Double-prefixed — skip the outer, recurse on inner
        remaining = rest;
        continue;
      }
      return { prefix: segment, path: rest };
    }

    // Not a valid prefix — strip it and continue
    remaining = remaining.slice(colonIdx + 1);
  }

  return { prefix: null, path: remaining };
```

Copilot AI Mar 31, 2026


stripToValidPrefix() currently strips every leading colon-delimited segment until it finds a valid prefix, which means IDs like my-project:service:docker-compose.yml normalize to file:docker-compose.yml (dropping service). Given the issue examples, the extra segment is likely part of the relative path, and dropping it risks ID collisions and incorrect edge rewrites. Consider only stripping a single project-name prefix (first segment) and then treating the remainder as the path (or otherwise preserving non-type segments) instead of repeatedly discarding unknown segments.

Suggested change

Current:

```typescript
 * Strips all non-valid prefixes from an ID, returning the bare path
 * and the first valid prefix found (if any).
 */
function stripToValidPrefix(id: string): { prefix: string | null; path: string } {
  let remaining = id;
  // Peel off colon-separated segments until we find a valid prefix or run out
  while (true) {
    const colonIdx = remaining.indexOf(":");
    if (colonIdx <= 0) break;
    const segment = remaining.slice(0, colonIdx);
    if (VALID_PREFIXES.has(segment)) {
      // Check for double valid prefix (e.g., "file:file:src/foo.ts")
      const rest = remaining.slice(colonIdx + 1);
      const innerColonIdx = rest.indexOf(":");
      if (innerColonIdx > 0 && VALID_PREFIXES.has(rest.slice(0, innerColonIdx))) {
        // Double-prefixed — skip the outer, recurse on inner
        remaining = rest;
        continue;
      }
      return { prefix: segment, path: rest };
    }
    // Not a valid prefix — strip it and continue
    remaining = remaining.slice(colonIdx + 1);
  }
  return { prefix: null, path: remaining };
```

Suggested:

```typescript
 * Strips at most one non-valid leading segment (project prefix) from an ID,
 * then returns the first valid type prefix found (if any) and the remaining path.
 * Non-type segments after the optional project prefix are preserved as part of the path.
 */
function stripToValidPrefix(id: string): { prefix: string | null; path: string } {
  if (!id) {
    return { prefix: null, path: id };
  }

  /**
   * Helper: given a string that may start with a valid prefix, extract
   * the prefix and path, including handling double-prefixes like
   * "file:file:src/foo.ts".
   */
  const extractPrefix = (value: string): { prefix: string | null; path: string } => {
    const colonIdx = value.indexOf(":");
    if (colonIdx <= 0) {
      return { prefix: null, path: value };
    }
    const segment = value.slice(0, colonIdx);
    const rest = value.slice(colonIdx + 1);
    if (!VALID_PREFIXES.has(segment)) {
      // Does not start with a valid prefix
      return { prefix: null, path: value };
    }
    // Check for double valid prefix (e.g., "file:file:src/foo.ts")
    const innerColonIdx = rest.indexOf(":");
    if (innerColonIdx > 0 && VALID_PREFIXES.has(rest.slice(0, innerColonIdx))) {
      const innerPrefix = rest.slice(0, innerColonIdx);
      const innerPath = rest.slice(innerColonIdx + 1);
      return { prefix: innerPrefix, path: innerPath };
    }
    return { prefix: segment, path: rest };
  };

  const firstColonIdx = id.indexOf(":");
  if (firstColonIdx <= 0) {
    // No colon or colon at position 0 — no prefix information
    return { prefix: null, path: id };
  }
  const firstSegment = id.slice(0, firstColonIdx);
  const rest = id.slice(firstColonIdx + 1);

  // If the ID already starts with a valid type prefix, extract directly.
  if (VALID_PREFIXES.has(firstSegment)) {
    return extractPrefix(id);
  }

  // Treat the first non-valid segment as an optional project prefix.
  // Only inspect the remainder once for a valid type prefix.
  const { prefix, path } = extractPrefix(rest);
  if (prefix) {
    return { prefix, path };
  }

  // No valid prefix after the (optional) project segment — preserve the rest as path.
  return { prefix: null, path: rest };
```

Comment on lines +170 to +173

```typescript
// Only fix numeric complexity here — string aliases are handled by upstream's
// COMPLEXITY_ALIASES in autoFixGraph
if (typeof raw.complexity === "number") {
  result.complexity = normalizeComplexity(raw.complexity);
```

Copilot AI Mar 31, 2026


normalizeBatchOutput() only normalizes numeric complexity values, but the schema requires the enum and the motivating issue includes string values like "low"/"high"/free-text. In this repo there is no autoFixGraph/COMPLEXITY_ALIASES implementation to catch those later, so invalid string complexity values will still fail validateGraph() and prevent dashboard load. Suggest applying normalizeComplexity() whenever raw.complexity is not already one of the valid enum values (not just when it’s a number), and updating complexityFixed accordingly.

Suggested change

Current:

```typescript
// Only fix numeric complexity here — string aliases are handled by upstream's
// COMPLEXITY_ALIASES in autoFixGraph
if (typeof raw.complexity === "number") {
  result.complexity = normalizeComplexity(raw.complexity);
```

Suggested:

```typescript
// Normalize any provided complexity value here; this repo does not rely on
// an upstream autoFixGraph/COMPLEXITY_ALIASES step to handle string aliases.
if (raw.complexity !== undefined && raw.complexity !== null) {
  // normalizeComplexity is responsible for mapping numbers, string aliases,
  // and other inputs to the valid complexity enum expected by the schema.
  result.complexity = normalizeComplexity(raw.complexity as any);
```

Comment on lines +84 to +91

```typescript
const COMPLEXITY_STRING_MAP: Record<string, string> = {
  low: "simple",
  easy: "simple",
  medium: "moderate",
  intermediate: "moderate",
  high: "complex",
  hard: "complex",
  difficult: "complex",
```

Copilot AI Mar 31, 2026


COMPLEXITY_STRING_MAP doesn’t match the normalization rules documented in Phase 3 (e.g., missing "trivial"/"basic", "mid"/"average", "advanced"). If Phase 3 is the source of truth, please align the mapping (or update the docs) so that assembly behavior is deterministic and matches the documented contract.

Suggested change

Current:

```typescript
const COMPLEXITY_STRING_MAP: Record<string, string> = {
  low: "simple",
  easy: "simple",
  medium: "moderate",
  intermediate: "moderate",
  high: "complex",
  hard: "complex",
  difficult: "complex",
```

Suggested:

```typescript
const COMPLEXITY_STRING_MAP: Record<string, string> = {
  // simple
  low: "simple",
  easy: "simple",
  trivial: "simple",
  basic: "simple",
  // moderate
  medium: "moderate",
  intermediate: "moderate",
  mid: "moderate",
  average: "moderate",
  // complex
  high: "complex",
  hard: "complex",
  difficult: "complex",
  advanced: "complex",
```

Comment on lines +65 to +77

```typescript
if (expectedPrefix) {
  // For func/class, reconstruct from filePath + name if available
  if (
    (node.type === "function" || node.type === "class") &&
    node.filePath &&
    node.name
  ) {
    return `${expectedPrefix}:${node.filePath}:${node.name}`;
  }
  return `${expectedPrefix}:${path}`;
}

return trimmed;
```

Copilot AI Mar 31, 2026


For function/class nodes, when there’s no valid prefix and filePath/name aren’t both present, normalizeNodeId() returns ${expectedPrefix}:${path} (e.g., func:formatDate). That format is non-canonical (missing file path) and can easily collide across files, which can lead to incorrect edge rewrites/deduplication. Consider only reconstructing func/class IDs when both filePath and name are available; otherwise leave the ID unchanged (or log it for review) rather than manufacturing a likely-wrong canonical ID.

Suggested change

Current:

```typescript
if (expectedPrefix) {
  // For func/class, reconstruct from filePath + name if available
  if (
    (node.type === "function" || node.type === "class") &&
    node.filePath &&
    node.name
  ) {
    return `${expectedPrefix}:${node.filePath}:${node.name}`;
  }
  return `${expectedPrefix}:${path}`;
}

return trimmed;
```

Suggested:

```typescript
if (!expectedPrefix) {
  return trimmed;
}

// For func/class, reconstruct from filePath + name if available; otherwise, leave ID unchanged
if (node.type === "function" || node.type === "class") {
  if (node.filePath && node.name) {
    return `${expectedPrefix}:${node.filePath}:${node.name}`;
  }
  // Avoid manufacturing a non-canonical func/class ID like "func:formatDate"
  return trimmed;
}

return `${expectedPrefix}:${path}`;
```

Comment on lines +172 to +175

```markdown
5. **Remove dangling edges:** Remove any edge whose `source` or `target` references a node ID that does not exist in the merged node set.

6. **Log changes:** Record counts of IDs corrected, complexity values fixed, edges rewritten, duplicates removed, and dangling edges dropped. Include these counts in the Phase warnings list passed to the reviewer.
```

Copilot AI Mar 31, 2026


Phase 3 step 6 lists what to log, but the implementation of normalizeBatchOutput() also deduplicates edges (composite key) and doesn’t currently mention logging that count. Either document edge deduplication as part of the normalization sequence/logging, or remove the behavior if it’s not intended for assembly reporting.

Suggested change

Current:

```markdown
5. **Remove dangling edges:** Remove any edge whose `source` or `target` references a node ID that does not exist in the merged node set.
6. **Log changes:** Record counts of IDs corrected, complexity values fixed, edges rewritten, duplicates removed, and dangling edges dropped. Include these counts in the Phase warnings list passed to the reviewer.
```

Suggested:

```markdown
5. **Remove duplicate edges:** After node IDs are normalized and edge references are rewritten, deduplicate edges using a composite key of `(source, target, type)` (or equivalent edge kind identifier). For any set of duplicate edges sharing this composite key, keep a single representative and drop the rest.
6. **Remove dangling edges:** Remove any edge whose `source` or `target` references a node ID that does not exist in the merged node set.
7. **Log changes:** Record counts of IDs corrected, complexity values fixed, edges rewritten, duplicate node IDs removed, duplicate edges removed, and dangling edges dropped. Include these counts in the Phase warnings list passed to the reviewer.
```

```typescript
  normalizeNodeId("my-project:service:docker-compose.yml", {
    type: "file",
  }),
).toBe("file:docker-compose.yml");
```

Copilot AI Mar 31, 2026


This test asserts that my-project:service:docker-compose.yml normalizes to file:docker-compose.yml, which drops the service segment entirely. If service is intended to be part of the relative path (as suggested by the issue’s double-prefix examples), this expectation should be updated to preserve it; otherwise normalization can collapse distinct files into the same ID and cause incorrect edge rewrites.

Suggested change

Current:

```typescript
).toBe("file:docker-compose.yml");
```

Suggested:

```typescript
).toBe("file:service/docker-compose.yml");
```

@Lum1104 (Owner) left a comment

In normalize-graph.ts, I think the edge rewrite is still a bit too strict for the inconsistency this PR is trying to handle.

Right now source/target are rewritten only through exact idMap lookup from the original node IDs. That works when the edge uses the same malformed ID string as the node did, but it misses cases where parallel outputs produce different malformed variants for the same logical node.

Example:

  • node ID: src/bare.ts -> normalizes to file:src/bare.ts
  • edge source: my-project:file:src/bare.ts

Those refer to the same node, but since the edge string does not exactly match the node’s original id, it won’t be rewritten and will be dropped as dangling.

I think the bigger issue is that this becomes a silent relationship loss: once the edge is dropped here, later validation no longer sees a dangling reference to report, so the graph just loses the connection with only the aggregate counter incremented.

Would it make sense to canonicalize edge endpoints through the same normalization logic before the dangling-edge check, or otherwise emit enough detail here for the caller to surface which edges were dropped?
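One possible shape for that canonicalize-before-drop fallback, as a hedged sketch only: the `idMap` and `DroppedEdge` details come from the follow-up commit description, and the signature below is an assumption, not the merged code.

```typescript
// Rewrite a single edge endpoint. If the exact original ID isn't in idMap,
// run the endpoint through the same ID normalization before declaring it dangling.
function rewriteEdgeEndpoint(
  endpoint: string,
  idMap: Map<string, string>, // original node ID -> normalized node ID
  nodeIds: Set<string>,       // all normalized node IDs in the merged set
  normalizeNodeId: (id: string) => string,
): { id: string; dangling: boolean } {
  const mapped = idMap.get(endpoint);
  if (mapped) return { id: mapped, dangling: false };
  // Fallback: canonicalize the endpoint itself and see if it lands on a node.
  const canonical = normalizeNodeId(endpoint);
  if (nodeIds.has(canonical)) return { id: canonical, dangling: false };
  // Still unresolved: caller records a DroppedEdge with source/target/reason
  // instead of silently incrementing a counter.
  return { id: endpoint, dangling: true };
}
```

With this shape, the `my-project:file:src/bare.ts` edge endpoint from the example above would resolve to `file:src/bare.ts` even though the node's original ID was the bare path.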

…nd dropped edge traceability

- Add all 13 node types (including non-code) to VALID_PREFIXES and TYPE_TO_PREFIX
  to prevent valid IDs like config:tsconfig.json from being stripped
- Add fallback normalizeNodeId on edge endpoints not found in idMap, fixing
  silent relationship loss when edges use different malformed variants than nodes
- Add DroppedEdge interface with source, target, type, and reason fields so
  callers can surface exactly which edges were lost and why
- Use honest Record<string, unknown>[] return types instead of unsafe type casts
- Align SKILL.md complexity aliases with COMPLEXITY_STRING_MAP
- Add 5 new tests for non-code types, cross-variant edges, dropped edge detail,
  and validateGraph integration

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Lum1104 Lum1104 merged commit efd4c48 into Lum1104:main Apr 1, 2026
1 check passed

Development

Successfully merging this pull request may close these issues.

Parallel file-analyzer subagents can produce inconsistent node IDs and invalid complexity values

3 participants