fix(core): deterministic node ID normalization for parallel batch output #65
Conversation
…r output

Parallel file-analyzer subagents can produce inconsistent node IDs (project-name prefixed, double-prefixed, bare paths) and invalid complexity values. Phase 3 ASSEMBLE now normalizes these deterministically before merging, preventing cascading edge drops and dashboard load failures.

- Add normalize-graph.ts with normalizeNodeId, normalizeComplexity, and normalizeBatchOutput utilities
- Rewrite SKILL.md Phase 3 with 6-step normalization sequence
- Strengthen file-analyzer prompt with ID format warnings
- Add 32 normalization tests and 2 schema boundary tests
Pull request overview
This PR introduces a deterministic normalization step for merged parallel batch outputs to prevent malformed node IDs and invalid complexity values from breaking graph assembly and dashboard loading (resolving #59).
Changes:
- Adds `normalizeNodeId`, `normalizeComplexity`, and `normalizeBatchOutput` utilities in core, plus exports from `packages/core/src/index.ts`.
- Adds comprehensive unit tests for normalization behavior and a schema test documenting that `id` remains a plain string.
- Updates `/understand` skill docs and the file-analyzer prompt to emphasize ID requirements and the new assembly sequence.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| `skills/understand/SKILL.md` | Documents a new ordered normalization + cleanup sequence in Phase 3 (ASSEMBLE). |
| `skills/understand/file-analyzer-prompt.md` | Adds an explicit warning about correct node ID formatting. |
| `packages/core/src/index.ts` | Exports the new normalization utilities/types. |
| `packages/core/src/analyzer/normalize-graph.ts` | Implements node ID normalization, complexity normalization, edge rewriting, and dedup/drop logic with stats. |
| `packages/core/src/__tests__/schema.test.ts` | Adds a test asserting schema leniency for id format. |
| `packages/core/src/__tests__/normalize-graph.test.ts` | Adds unit tests covering normalization and stats behavior. |
```typescript
 * Strips all non-valid prefixes from an ID, returning the bare path
 * and the first valid prefix found (if any).
 */
function stripToValidPrefix(id: string): { prefix: string | null; path: string } {
  let remaining = id;

  // Peel off colon-separated segments until we find a valid prefix or run out
  while (true) {
    const colonIdx = remaining.indexOf(":");
    if (colonIdx <= 0) break;

    const segment = remaining.slice(0, colonIdx);
    if (VALID_PREFIXES.has(segment)) {
      // Check for double valid prefix (e.g., "file:file:src/foo.ts")
      const rest = remaining.slice(colonIdx + 1);
      const innerColonIdx = rest.indexOf(":");
      if (innerColonIdx > 0 && VALID_PREFIXES.has(rest.slice(0, innerColonIdx))) {
        // Double-prefixed — skip the outer, recurse on inner
        remaining = rest;
        continue;
      }
      return { prefix: segment, path: rest };
    }

    // Not a valid prefix — strip it and continue
    remaining = remaining.slice(colonIdx + 1);
  }

  return { prefix: null, path: remaining };
```
`stripToValidPrefix()` currently strips every leading colon-delimited segment until it finds a valid prefix, which means IDs like `my-project:service:docker-compose.yml` normalize to `file:docker-compose.yml` (dropping `service`). Given the issue examples, the extra segment is likely part of the relative path, and dropping it risks ID collisions and incorrect edge rewrites. Consider only stripping a single project-name prefix (the first segment) and then treating the remainder as the path (or otherwise preserving non-type segments) instead of repeatedly discarding unknown segments.
Suggested change:

```typescript
 * Strips at most one non-valid leading segment (project prefix) from an ID,
 * then returns the first valid type prefix found (if any) and the remaining path.
 * Non-type segments after the optional project prefix are preserved as part of the path.
 */
function stripToValidPrefix(id: string): { prefix: string | null; path: string } {
  if (!id) {
    return { prefix: null, path: id };
  }

  /**
   * Helper: given a string that may start with a valid prefix, extract
   * the prefix and path, including handling double-prefixes like
   * "file:file:src/foo.ts".
   */
  const extractPrefix = (value: string): { prefix: string | null; path: string } => {
    const colonIdx = value.indexOf(":");
    if (colonIdx <= 0) {
      return { prefix: null, path: value };
    }
    const segment = value.slice(0, colonIdx);
    const rest = value.slice(colonIdx + 1);
    if (!VALID_PREFIXES.has(segment)) {
      // Does not start with a valid prefix
      return { prefix: null, path: value };
    }
    // Check for double valid prefix (e.g., "file:file:src/foo.ts")
    const innerColonIdx = rest.indexOf(":");
    if (innerColonIdx > 0 && VALID_PREFIXES.has(rest.slice(0, innerColonIdx))) {
      const innerPrefix = rest.slice(0, innerColonIdx);
      const innerPath = rest.slice(innerColonIdx + 1);
      return { prefix: innerPrefix, path: innerPath };
    }
    return { prefix: segment, path: rest };
  };

  const firstColonIdx = id.indexOf(":");
  if (firstColonIdx <= 0) {
    // No colon or colon at position 0 — no prefix information
    return { prefix: null, path: id };
  }
  const firstSegment = id.slice(0, firstColonIdx);
  const rest = id.slice(firstColonIdx + 1);
  // If the ID already starts with a valid type prefix, extract directly.
  if (VALID_PREFIXES.has(firstSegment)) {
    return extractPrefix(id);
  }
  // Treat the first non-valid segment as an optional project prefix.
  // Only inspect the remainder once for a valid type prefix.
  const { prefix, path } = extractPrefix(rest);
  if (prefix) {
    return { prefix, path };
  }
  // No valid prefix after the (optional) project segment — preserve the rest as path.
  return { prefix: null, path: rest };
```
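As a quick check on the intent of the suggestion, here is a condensed, hypothetical sketch of the same "at most one project segment" policy. The `VALID_PREFIXES` set below is a stand-in for the real one in `normalize-graph.ts`; the split-based implementation is illustrative, not the PR's actual code:

```typescript
// Illustrative stand-in for the real prefix set in normalize-graph.ts.
const VALID_PREFIXES = new Set(["file", "func", "class", "config"]);

// Strip at most one leading non-valid (project) segment, collapse a doubled
// valid prefix, and otherwise preserve non-type segments as part of the path.
function stripToValidPrefix(id: string): { prefix: string | null; path: string } {
  let parts = id.split(":");
  // Treat a single leading non-valid segment as an optional project prefix.
  if (parts.length > 1 && !VALID_PREFIXES.has(parts[0])) {
    parts = parts.slice(1);
  }
  // Collapse a doubled valid prefix ("file:file:...") down to one.
  while (parts.length > 2 && VALID_PREFIXES.has(parts[0]) && VALID_PREFIXES.has(parts[1])) {
    parts = parts.slice(1);
  }
  if (parts.length > 1 && VALID_PREFIXES.has(parts[0])) {
    return { prefix: parts[0], path: parts.slice(1).join(":") };
  }
  // No valid type prefix found: keep remaining segments intact as the path.
  return { prefix: null, path: parts.join(":") };
}
```

Under this policy, `my-project:service:docker-compose.yml` keeps `service` in the path instead of discarding it.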
```typescript
  // Only fix numeric complexity here — string aliases are handled by upstream's
  // COMPLEXITY_ALIASES in autoFixGraph
  if (typeof raw.complexity === "number") {
    result.complexity = normalizeComplexity(raw.complexity);
```
`normalizeBatchOutput()` only normalizes numeric complexity values, but the schema requires the enum, and the motivating issue includes string values like `"low"`/`"high"`/free-text. In this repo there is no `autoFixGraph`/`COMPLEXITY_ALIASES` implementation to catch those later, so invalid string complexity values will still fail `validateGraph()` and prevent dashboard load. Suggest applying `normalizeComplexity()` whenever `raw.complexity` is not already one of the valid enum values (not just when it's a number), and updating `complexityFixed` accordingly.
Suggested change:

```typescript
  // Normalize any provided complexity value here; this repo does not rely on
  // an upstream autoFixGraph/COMPLEXITY_ALIASES step to handle string aliases.
  if (raw.complexity !== undefined && raw.complexity !== null) {
    // normalizeComplexity is responsible for mapping numbers, string aliases,
    // and other inputs to the valid complexity enum expected by the schema.
    result.complexity = normalizeComplexity(raw.complexity as any);
```
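A minimal sketch of what an alias-aware `normalizeComplexity` could look like, assuming a `simple`/`moderate`/`complex` enum and a 1-10 numeric scale. The alias table, bucket boundaries, and fallback value here are illustrative assumptions, not the repo's actual implementation:

```typescript
// Valid enum values assumed by this sketch.
const VALID_COMPLEXITY = new Set(["simple", "moderate", "complex"]);

// Illustrative alias table (the real one would live in COMPLEXITY_STRING_MAP).
const COMPLEXITY_ALIASES: Record<string, string> = {
  low: "simple",
  easy: "simple",
  medium: "moderate",
  intermediate: "moderate",
  high: "complex",
  hard: "complex",
};

function normalizeComplexity(raw: unknown): string {
  if (typeof raw === "number") {
    // Assume a 1-10 scale: 1-3 simple, 4-7 moderate, 8-10 complex.
    if (raw <= 3) return "simple";
    if (raw <= 7) return "moderate";
    return "complex";
  }
  if (typeof raw === "string") {
    const lower = raw.trim().toLowerCase();
    if (VALID_COMPLEXITY.has(lower)) return lower; // already valid
    const alias = COMPLEXITY_ALIASES[lower];
    if (alias) return alias; // mapped string alias
  }
  // Conservative default for free-text or unrecognized input.
  return "moderate";
}
```

The key point matching the review comment: the same function handles numbers, string aliases, and garbage input, so callers never need a type check before invoking it.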
```typescript
const COMPLEXITY_STRING_MAP: Record<string, string> = {
  low: "simple",
  easy: "simple",
  medium: "moderate",
  intermediate: "moderate",
  high: "complex",
  hard: "complex",
  difficult: "complex",
```
`COMPLEXITY_STRING_MAP` doesn't match the normalization rules documented in Phase 3 (e.g., missing `"trivial"`/`"basic"`, `"mid"`/`"average"`, `"advanced"`). If Phase 3 is the source of truth, please align the mapping (or update the docs) so that assembly behavior is deterministic and matches the documented contract.
Suggested change:

```typescript
const COMPLEXITY_STRING_MAP: Record<string, string> = {
  // simple
  low: "simple",
  easy: "simple",
  trivial: "simple",
  basic: "simple",
  // moderate
  medium: "moderate",
  intermediate: "moderate",
  mid: "moderate",
  average: "moderate",
  // complex
  high: "complex",
  hard: "complex",
  difficult: "complex",
  advanced: "complex",
```
```typescript
  if (expectedPrefix) {
    // For func/class, reconstruct from filePath + name if available
    if (
      (node.type === "function" || node.type === "class") &&
      node.filePath &&
      node.name
    ) {
      return `${expectedPrefix}:${node.filePath}:${node.name}`;
    }
    return `${expectedPrefix}:${path}`;
  }

  return trimmed;
```
For function/class nodes, when there's no valid prefix and `filePath`/`name` aren't both present, `normalizeNodeId()` returns `${expectedPrefix}:${path}` (e.g., `func:formatDate`). That format is non-canonical (missing the file path) and can easily collide across files, which can lead to incorrect edge rewrites/deduplication. Consider only reconstructing func/class IDs when both `filePath` and `name` are available; otherwise leave the ID unchanged (or log it for review) rather than manufacturing a likely-wrong canonical ID.
Suggested change:

```typescript
  if (!expectedPrefix) {
    return trimmed;
  }
  // For func/class, reconstruct from filePath + name if available; otherwise, leave ID unchanged
  if (node.type === "function" || node.type === "class") {
    if (node.filePath && node.name) {
      return `${expectedPrefix}:${node.filePath}:${node.name}`;
    }
    // Avoid manufacturing a non-canonical func/class ID like "func:formatDate"
    return trimmed;
  }
  return `${expectedPrefix}:${path}`;
```
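To make the conservative policy concrete, here is a self-contained sketch using a hypothetical `canonicalId` helper. The `TYPE_TO_PREFIX` subset and the helper's name are illustrative assumptions, not the PR's actual API:

```typescript
type NodeLike = { type: string; filePath?: string; name?: string };

// Illustrative subset of the real TYPE_TO_PREFIX map.
const TYPE_TO_PREFIX: Record<string, string> = {
  function: "func",
  class: "class",
  file: "file",
};

function canonicalId(rawId: string, node: NodeLike): string {
  const trimmed = rawId.trim();
  const expectedPrefix = TYPE_TO_PREFIX[node.type];
  if (!expectedPrefix) return trimmed;
  if (node.type === "function" || node.type === "class") {
    // Only rebuild when both parts are known; otherwise keep the original ID
    // rather than manufacturing a collision-prone "func:<name>".
    return node.filePath && node.name
      ? `${expectedPrefix}:${node.filePath}:${node.name}`
      : trimmed;
  }
  // For other node types, prepend the type prefix only when it is missing.
  return trimmed.startsWith(`${expectedPrefix}:`) ? trimmed : `${expectedPrefix}:${trimmed}`;
}
```

With this shape, a bare `formatDate` function ID passes through unchanged and can be flagged for review, instead of silently becoming `func:formatDate`.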
```markdown
5. **Remove dangling edges:** Remove any edge whose `source` or `target` references a node ID that does not exist in the merged node set.

6. **Log changes:** Record counts of IDs corrected, complexity values fixed, edges rewritten, duplicates removed, and dangling edges dropped. Include these counts in the Phase warnings list passed to the reviewer.
```
Phase 3 step 6 lists what to log, but the implementation of `normalizeBatchOutput()` also deduplicates edges (composite key) and doesn't currently mention logging that count. Either document edge deduplication as part of the normalization sequence/logging, or remove the behavior if it's not intended for assembly reporting.
Suggested change:

```markdown
5. **Remove duplicate edges:** After node IDs are normalized and edge references are rewritten, deduplicate edges using a composite key of `(source, target, type)` (or equivalent edge kind identifier). For any set of duplicate edges sharing this composite key, keep a single representative and drop the rest.

6. **Remove dangling edges:** Remove any edge whose `source` or `target` references a node ID that does not exist in the merged node set.

7. **Log changes:** Record counts of IDs corrected, complexity values fixed, edges rewritten, duplicate node IDs removed, duplicate edges removed, and dangling edges dropped. Include these counts in the Phase warnings list passed to the reviewer.
```
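The composite-key deduplication in the suggested step 5 can be sketched as follows (the edge shape is an assumption; the real type lives in core):

```typescript
// Assumed minimal edge shape for illustration.
interface GraphEdge {
  source: string;
  target: string;
  type: string;
}

// Keep the first edge seen for each (source, target, type) composite key,
// and report how many duplicates were dropped so the count can be logged.
function dedupeEdges(edges: GraphEdge[]): { kept: GraphEdge[]; duplicatesRemoved: number } {
  const seen = new Set<string>();
  const kept: GraphEdge[] = [];
  for (const e of edges) {
    // NUL separator avoids accidental key collisions from colons in IDs.
    const key = `${e.source}\u0000${e.target}\u0000${e.type}`;
    if (seen.has(key)) continue;
    seen.add(key);
    kept.push(e);
  }
  return { kept, duplicatesRemoved: edges.length - kept.length };
}
```

Returning the removed count directly makes the step 7 logging requirement trivial to satisfy.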
```typescript
  normalizeNodeId("my-project:service:docker-compose.yml", {
    type: "file",
  }),
).toBe("file:docker-compose.yml");
```
This test asserts that `my-project:service:docker-compose.yml` normalizes to `file:docker-compose.yml`, which drops the `service` segment entirely. If `service` is intended to be part of the relative path (as suggested by the issue's double-prefix examples), this expectation should be updated to preserve it; otherwise normalization can collapse distinct files into the same ID and cause incorrect edge rewrites.
Suggested change:

```typescript
).toBe("file:service/docker-compose.yml");
```
**Lum1104** left a comment:
In `normalize-graph.ts`, I think the edge rewrite is still a bit too strict for the inconsistency this PR is trying to handle.
Right now `source`/`target` are rewritten only through an exact `idMap` lookup keyed by the original node IDs. That works when the edge uses the same malformed ID string as the node did, but it misses cases where parallel outputs produce different malformed variants for the same logical node.
Example:
- node ID: `src/bare.ts` -> normalizes to `file:src/bare.ts`
- edge source: `my-project:file:src/bare.ts`
Those refer to the same node, but since the edge string does not exactly match the node's original `id`, it won't be rewritten and will be dropped as dangling.
I think the bigger issue is that this becomes a silent relationship loss: once the edge is dropped here, later validation no longer sees a dangling reference to report, so the graph just loses the connection with only the aggregate counter incremented.
Would it make sense to canonicalize edge endpoints through the same normalization logic before the dangling-edge check, or otherwise emit enough detail here for the caller to surface which edges were dropped?
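One possible shape for that, sketched under the assumption of a `normalizeId` fallback and a `DroppedEdge` report. All names here are illustrative, not the PR's actual code:

```typescript
interface RawEdge {
  source: string;
  target: string;
  type: string;
}

interface DroppedEdge extends RawEdge {
  reason: string;
}

// Rewrite edge endpoints via the exact idMap first, falling back to the same
// normalization used for nodes, and report every edge that is still dangling.
function rewriteEdges(
  edges: RawEdge[],
  idMap: Map<string, string>,              // original node ID -> canonical ID
  nodeIds: Set<string>,                    // canonical node IDs after normalization
  normalizeId: (id: string) => string,     // same logic applied to node IDs
): { kept: RawEdge[]; dropped: DroppedEdge[] } {
  const kept: RawEdge[] = [];
  const dropped: DroppedEdge[] = [];
  const resolve = (ref: string): string | null => {
    // Exact lookup first, then canonicalize the variant the edge actually used.
    const mapped = idMap.get(ref) ?? normalizeId(ref);
    return nodeIds.has(mapped) ? mapped : null;
  };
  for (const e of edges) {
    const source = resolve(e.source);
    const target = resolve(e.target);
    if (source && target) {
      kept.push({ ...e, source, target });
    } else {
      // Surface exactly which edge was lost and why, instead of a bare counter.
      dropped.push({ ...e, reason: source ? "unknown target" : "unknown source" });
    }
  }
  return { kept, dropped };
}
```

With the fallback, an edge endpoint like `my-project:file:src/bare.ts` resolves to the same canonical ID as the node `src/bare.ts`, and anything that still fails is returned with a reason rather than silently dropped.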
…nd dropped edge traceability

- Add all 13 node types (including non-code) to VALID_PREFIXES and TYPE_TO_PREFIX to prevent valid IDs like `config:tsconfig.json` from being stripped
- Add fallback normalizeNodeId on edge endpoints not found in idMap, fixing silent relationship loss when edges use different malformed variants than nodes
- Add DroppedEdge interface with source, target, type, and reason fields so callers can surface exactly which edges were lost and why
- Use honest `Record<string, unknown>[]` return types instead of unsafe type casts
- Align SKILL.md complexity aliases with COMPLEXITY_STRING_MAP
- Add 5 new tests for non-code types, cross-variant edges, dropped edge detail, and validateGraph integration

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Resolves #59
Problem
Parallel file-analyzer subagents can produce inconsistent node IDs (project-name prefixed, double-prefixed, bare paths) and numeric complexity values. The existing `autoFixGraph`/`COMPLEXITY_ALIASES` pipeline handles string alias correction but does not address malformed IDs or numeric complexity, causing cascading edge drops at assembly and potential dashboard load failures.
What this PR adds
A pre-assembly normalization pass that runs before the existing sanitize/autoFix/validate pipeline, handling the things it doesn't cover.
Changed files
- `packages/core/src/analyzer/normalize-graph.ts`: `normalizeNodeId` (strips project-name/double prefixes, adds missing type prefixes), `normalizeComplexity` (maps numeric 1-10 scale), `normalizeBatchOutput` (orchestrates ID fix → edge rewriting → dedup)
- `packages/core/src/__tests__/normalize-graph.test.ts`
- `packages/core/src/index.ts`
- `packages/core/src/__tests__/schema.test.ts`
- `skills/understand/SKILL.md`
- `skills/understand/file-analyzer-prompt.md`
Not changed
- `id` stays as `z.string()`. Normalization fixes IDs upstream of validation.
- `COMPLEXITY_ALIASES` in `autoFixGraph`.
Reviewer checklist
- `normalize-graph.ts` — `stripToValidPrefix()` loop: verify it correctly peels project-name prefixes, double-prefixes, and bottoms out on bare paths
- `normalize-graph.ts` — `normalizeBatchOutput()`: confirm edge rewriting happens after all node IDs are normalized (order matters)
- `SKILL.md` Phase 3: check the 6-step sequence aligns with the code logic
- `autoFixGraph`/`sanitizeGraph`/`normalizeGraph` in `schema.ts` — this PR intentionally avoids duplicating that work
Test plan
- `pnpm --filter @understand-anything/core build` — passes
- `pnpm --filter @understand-anything/core test` — 189 tests pass
- Run `/understand --full` on a multi-file project and verify no dropped nodes/edges in the dashboard