fix(core): deterministic node ID normalization for parallel batch output#65

Merged
Lum1104 merged 2 commits into Lum1104:main from fishinakleinbottle:fix/normalization-for-generated-graph
Apr 1, 2026

Conversation

@fishinakleinbottle (Collaborator)

Resolves #59

Problem

Parallel file-analyzer subagents can produce inconsistent node IDs (project-name prefixed, double-prefixed, bare paths) and numeric complexity values. The existing autoFixGraph/COMPLEXITY_ALIASES pipeline handles string alias correction but does not address malformed IDs or numeric complexity — causing cascading edge drops at assembly and potential dashboard load failures.

What this PR adds

A pre-assembly normalization pass that runs before the existing sanitize/autoFix/validate pipeline, handling the things it doesn't cover.
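As a rough sketch of where this pass sits, the assembly order could look like the following. The names `normalizeBatchOutput`, `sanitizeGraph`, `autoFixGraph`, and `validateGraph` are taken from the PR text; the signatures here are simplified assumptions, not the real API.

```typescript
// Simplified stand-in types; the real Graph shape lives in @understand-anything/core.
type Graph = { nodes: { id: string }[]; edges: { source: string; target: string }[] };

function assembleGraph(
  batches: Graph[],
  normalizeBatchOutput: (g: Graph) => Graph, // new pre-assembly pass from this PR
  sanitizeGraph: (g: Graph) => Graph,        // existing pipeline, unchanged
  autoFixGraph: (g: Graph) => Graph,
  validateGraph: (g: Graph) => Graph,
): Graph {
  // 1. Merge parallel batch outputs, then run the new normalization pass
  //    (ID fix -> edge rewriting -> dedup) before anything else sees the graph.
  const merged = normalizeBatchOutput({
    nodes: batches.flatMap((b) => b.nodes),
    edges: batches.flatMap((b) => b.edges),
  });
  // 2-4. Hand off to the existing sanitize/autoFix/validate pipeline.
  return validateGraph(autoFixGraph(sanitizeGraph(merged)));
}
```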

Changed files

| File | What changed |
| --- | --- |
| `packages/core/src/analyzer/normalize-graph.ts` | New. `normalizeNodeId` (strips project-name/double prefixes, adds missing type prefixes), `normalizeComplexity` (maps the numeric 1-10 scale), `normalizeBatchOutput` (orchestrates ID fix → edge rewriting → dedup) |
| `packages/core/src/__tests__/normalize-graph.test.ts` | New. 31 tests covering all normalization paths |
| `packages/core/src/index.ts` | Exports the new module |
| `packages/core/src/__tests__/schema.test.ts` | 1 test confirming the schema accepts bare string IDs (documents the design decision that normalization, not the schema, enforces prefix format) |
| `skills/understand/SKILL.md` | Rewrites Phase 3 ASSEMBLE from 3 bullets to a 6-step normalization sequence |
| `skills/understand/file-analyzer-prompt.md` | Adds a warning box about ID format requirements |
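For intuition, the three malformed ID shapes from the issue and their intended fixes can be sketched like this. This is a simplified stand-in, not the real `normalizeNodeId` in `normalize-graph.ts`: the actual prefix set, signature, and func/class handling differ.

```typescript
// Toy prefix set; the real implementation covers all node types.
const VALID_PREFIXES = new Set(["file", "func", "class"]);

function normalizeNodeId(id: string, expectedPrefix: string): string {
  let remaining = id;
  // Peel colon-separated segments until a valid type prefix (or nothing) remains.
  while (true) {
    const colonIdx = remaining.indexOf(":");
    if (colonIdx <= 0) break;
    const segment = remaining.slice(0, colonIdx);
    if (VALID_PREFIXES.has(segment)) {
      const rest = remaining.slice(colonIdx + 1);
      // Collapse double prefixes like "file:file:src/foo.ts".
      if (rest.startsWith(segment + ":")) {
        remaining = rest;
        continue;
      }
      return remaining;
    }
    // Drop a project-name prefix like "my-project:".
    remaining = remaining.slice(colonIdx + 1);
  }
  // Bare path: add the missing type prefix.
  return `${expectedPrefix}:${remaining}`;
}
```

So `my-project:file:src/foo.ts`, `file:file:src/foo.ts`, and `src/foo.ts` would all land on `file:src/foo.ts`.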

Not changed

  • Zod schema: id stays as z.string(). Normalization fixes IDs upstream of validation.
  • Dashboard — no changes needed.
  • String complexity aliases — already handled by COMPLEXITY_ALIASES in autoFixGraph.

Reviewer checklist

  • stripToValidPrefix() loop in normalize-graph.ts: verify it correctly peels project-name prefixes, double-prefixes, and bottoms out on bare paths
  • normalizeBatchOutput() in normalize-graph.ts: confirm edge rewriting happens after all node IDs are normalized (order matters)
  • SKILL.md Phase 3: check the 6-step sequence aligns with the code logic
  • Verify no overlap with autoFixGraph/sanitizeGraph/normalizeGraph in schema.ts — this PR intentionally avoids duplicating that work

Test plan

  • pnpm --filter @understand-anything/core build — passes
  • pnpm --filter @understand-anything/core test — 189 tests pass
  • Run /understand --full on a multi-file project and verify no dropped nodes/edges in dashboard

…r output

Parallel file-analyzer subagents can produce inconsistent node IDs
(project-name prefixed, double-prefixed, bare paths) and invalid
complexity values. Phase 3 ASSEMBLE now normalizes these deterministically
before merging, preventing cascading edge drops and dashboard load failures.

- Add normalize-graph.ts with normalizeNodeId, normalizeComplexity,
  and normalizeBatchOutput utilities
- Rewrite SKILL.md Phase 3 with 6-step normalization sequence
- Strengthen file-analyzer prompt with ID format warnings
- Add 32 normalization tests and 2 schema boundary tests

Copilot AI left a comment


Pull request overview

This PR introduces a deterministic normalization step for merged parallel batch outputs to prevent malformed node IDs and invalid complexity values from breaking graph assembly and dashboard loading (resolving #59).

Changes:

  • Adds normalizeNodeId, normalizeComplexity, and normalizeBatchOutput utilities in core, plus exports from packages/core/src/index.ts.
  • Adds comprehensive unit tests for normalization behavior and a schema test documenting that id remains a plain string.
  • Updates /understand skill docs and the file-analyzer prompt to emphasize ID requirements and the new assembly sequence.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| `skills/understand/SKILL.md` | Documents a new ordered normalization + cleanup sequence in Phase 3 (ASSEMBLE). |
| `skills/understand/file-analyzer-prompt.md` | Adds an explicit warning about correct node ID formatting. |
| `packages/core/src/index.ts` | Exports the new normalization utilities/types. |
| `packages/core/src/analyzer/normalize-graph.ts` | Implements node ID normalization, complexity normalization, edge rewriting, and dedup/drop logic with stats. |
| `packages/core/src/__tests__/schema.test.ts` | Adds a test asserting schema leniency for id format. |
| `packages/core/src/__tests__/normalize-graph.test.ts` | Adds unit tests covering normalization and stats behavior. |


Comment on lines +14 to +42

```typescript
 * Strips all non-valid prefixes from an ID, returning the bare path
 * and the first valid prefix found (if any).
 */
function stripToValidPrefix(id: string): { prefix: string | null; path: string } {
  let remaining = id;

  // Peel off colon-separated segments until we find a valid prefix or run out
  while (true) {
    const colonIdx = remaining.indexOf(":");
    if (colonIdx <= 0) break;

    const segment = remaining.slice(0, colonIdx);
    if (VALID_PREFIXES.has(segment)) {
      // Check for double valid prefix (e.g., "file:file:src/foo.ts")
      const rest = remaining.slice(colonIdx + 1);
      const innerColonIdx = rest.indexOf(":");
      if (innerColonIdx > 0 && VALID_PREFIXES.has(rest.slice(0, innerColonIdx))) {
        // Double-prefixed — skip the outer, recurse on inner
        remaining = rest;
        continue;
      }
      return { prefix: segment, path: rest };
    }

    // Not a valid prefix — strip it and continue
    remaining = remaining.slice(colonIdx + 1);
  }

  return { prefix: null, path: remaining };
```

Copilot AI Mar 31, 2026


stripToValidPrefix() currently strips every leading colon-delimited segment until it finds a valid prefix, which means IDs like my-project:service:docker-compose.yml normalize to file:docker-compose.yml (dropping service). Given the issue examples, the extra segment is likely part of the relative path, and dropping it risks ID collisions and incorrect edge rewrites. Consider only stripping a single project-name prefix (first segment) and then treating the remainder as the path (or otherwise preserving non-type segments) instead of repeatedly discarding unknown segments.

Suggested change

Current:

```typescript
 * Strips all non-valid prefixes from an ID, returning the bare path
 * and the first valid prefix found (if any).
 */
function stripToValidPrefix(id: string): { prefix: string | null; path: string } {
  let remaining = id;
  // Peel off colon-separated segments until we find a valid prefix or run out
  while (true) {
    const colonIdx = remaining.indexOf(":");
    if (colonIdx <= 0) break;
    const segment = remaining.slice(0, colonIdx);
    if (VALID_PREFIXES.has(segment)) {
      // Check for double valid prefix (e.g., "file:file:src/foo.ts")
      const rest = remaining.slice(colonIdx + 1);
      const innerColonIdx = rest.indexOf(":");
      if (innerColonIdx > 0 && VALID_PREFIXES.has(rest.slice(0, innerColonIdx))) {
        // Double-prefixed — skip the outer, recurse on inner
        remaining = rest;
        continue;
      }
      return { prefix: segment, path: rest };
    }
    // Not a valid prefix — strip it and continue
    remaining = remaining.slice(colonIdx + 1);
  }
  return { prefix: null, path: remaining };
```

Suggested:

```typescript
 * Strips at most one non-valid leading segment (project prefix) from an ID,
 * then returns the first valid type prefix found (if any) and the remaining path.
 * Non-type segments after the optional project prefix are preserved as part of the path.
 */
function stripToValidPrefix(id: string): { prefix: string | null; path: string } {
  if (!id) {
    return { prefix: null, path: id };
  }

  /**
   * Helper: given a string that may start with a valid prefix, extract
   * the prefix and path, including handling double-prefixes like
   * "file:file:src/foo.ts".
   */
  const extractPrefix = (value: string): { prefix: string | null; path: string } => {
    const colonIdx = value.indexOf(":");
    if (colonIdx <= 0) {
      return { prefix: null, path: value };
    }
    const segment = value.slice(0, colonIdx);
    const rest = value.slice(colonIdx + 1);
    if (!VALID_PREFIXES.has(segment)) {
      // Does not start with a valid prefix
      return { prefix: null, path: value };
    }
    // Check for double valid prefix (e.g., "file:file:src/foo.ts")
    const innerColonIdx = rest.indexOf(":");
    if (innerColonIdx > 0 && VALID_PREFIXES.has(rest.slice(0, innerColonIdx))) {
      const innerPrefix = rest.slice(0, innerColonIdx);
      const innerPath = rest.slice(innerColonIdx + 1);
      return { prefix: innerPrefix, path: innerPath };
    }
    return { prefix: segment, path: rest };
  };

  const firstColonIdx = id.indexOf(":");
  if (firstColonIdx <= 0) {
    // No colon or colon at position 0 — no prefix information
    return { prefix: null, path: id };
  }
  const firstSegment = id.slice(0, firstColonIdx);
  const rest = id.slice(firstColonIdx + 1);

  // If the ID already starts with a valid type prefix, extract directly.
  if (VALID_PREFIXES.has(firstSegment)) {
    return extractPrefix(id);
  }

  // Treat the first non-valid segment as an optional project prefix.
  // Only inspect the remainder once for a valid type prefix.
  const { prefix, path } = extractPrefix(rest);
  if (prefix) {
    return { prefix, path };
  }

  // No valid prefix after the (optional) project segment — preserve the rest as path.
  return { prefix: null, path: rest };
```

Comment on lines +170 to +173

```typescript
// Only fix numeric complexity here — string aliases are handled by upstream's
// COMPLEXITY_ALIASES in autoFixGraph
if (typeof raw.complexity === "number") {
  result.complexity = normalizeComplexity(raw.complexity);
```

Copilot AI Mar 31, 2026


normalizeBatchOutput() only normalizes numeric complexity values, but the schema requires the enum and the motivating issue includes string values like "low"/"high"/free-text. In this repo there is no autoFixGraph/COMPLEXITY_ALIASES implementation to catch those later, so invalid string complexity values will still fail validateGraph() and prevent dashboard load. Suggest applying normalizeComplexity() whenever raw.complexity is not already one of the valid enum values (not just when it’s a number), and updating complexityFixed accordingly.

Suggested change

Current:

```typescript
// Only fix numeric complexity here — string aliases are handled by upstream's
// COMPLEXITY_ALIASES in autoFixGraph
if (typeof raw.complexity === "number") {
  result.complexity = normalizeComplexity(raw.complexity);
```

Suggested:

```typescript
// Normalize any provided complexity value here; this repo does not rely on
// an upstream autoFixGraph/COMPLEXITY_ALIASES step to handle string aliases.
if (raw.complexity !== undefined && raw.complexity !== null) {
  // normalizeComplexity is responsible for mapping numbers, string aliases,
  // and other inputs to the valid complexity enum expected by the schema.
  result.complexity = normalizeComplexity(raw.complexity as any);
```

Comment on lines +84 to +91

```typescript
const COMPLEXITY_STRING_MAP: Record<string, string> = {
  low: "simple",
  easy: "simple",
  medium: "moderate",
  intermediate: "moderate",
  high: "complex",
  hard: "complex",
  difficult: "complex",
```

Copilot AI Mar 31, 2026


COMPLEXITY_STRING_MAP doesn’t match the normalization rules documented in Phase 3 (e.g., missing "trivial"/"basic", "mid"/"average", "advanced"). If Phase 3 is the source of truth, please align the mapping (or update the docs) so that assembly behavior is deterministic and matches the documented contract.

Suggested change

Current:

```typescript
const COMPLEXITY_STRING_MAP: Record<string, string> = {
  low: "simple",
  easy: "simple",
  medium: "moderate",
  intermediate: "moderate",
  high: "complex",
  hard: "complex",
  difficult: "complex",
```

Suggested:

```typescript
const COMPLEXITY_STRING_MAP: Record<string, string> = {
  // simple
  low: "simple",
  easy: "simple",
  trivial: "simple",
  basic: "simple",
  // moderate
  medium: "moderate",
  intermediate: "moderate",
  mid: "moderate",
  average: "moderate",
  // complex
  high: "complex",
  hard: "complex",
  difficult: "complex",
  advanced: "complex",
```

Comment on lines +65 to +77

```typescript
if (expectedPrefix) {
  // For func/class, reconstruct from filePath + name if available
  if (
    (node.type === "function" || node.type === "class") &&
    node.filePath &&
    node.name
  ) {
    return `${expectedPrefix}:${node.filePath}:${node.name}`;
  }
  return `${expectedPrefix}:${path}`;
}

return trimmed;
```

Copilot AI Mar 31, 2026


For function/class nodes, when there’s no valid prefix and filePath/name aren’t both present, normalizeNodeId() returns ${expectedPrefix}:${path} (e.g., func:formatDate). That format is non-canonical (missing file path) and can easily collide across files, which can lead to incorrect edge rewrites/deduplication. Consider only reconstructing func/class IDs when both filePath and name are available; otherwise leave the ID unchanged (or log it for review) rather than manufacturing a likely-wrong canonical ID.

Suggested change

Current:

```typescript
if (expectedPrefix) {
  // For func/class, reconstruct from filePath + name if available
  if (
    (node.type === "function" || node.type === "class") &&
    node.filePath &&
    node.name
  ) {
    return `${expectedPrefix}:${node.filePath}:${node.name}`;
  }
  return `${expectedPrefix}:${path}`;
}

return trimmed;
```

Suggested:

```typescript
if (!expectedPrefix) {
  return trimmed;
}

// For func/class, reconstruct from filePath + name if available; otherwise, leave ID unchanged
if (node.type === "function" || node.type === "class") {
  if (node.filePath && node.name) {
    return `${expectedPrefix}:${node.filePath}:${node.name}`;
  }
  // Avoid manufacturing a non-canonical func/class ID like "func:formatDate"
  return trimmed;
}

return `${expectedPrefix}:${path}`;
```

Comment on lines +172 to +175

```markdown
5. **Remove dangling edges:** Remove any edge whose `source` or `target` references a node ID that does not exist in the merged node set.

6. **Log changes:** Record counts of IDs corrected, complexity values fixed, edges rewritten, duplicates removed, and dangling edges dropped. Include these counts in the Phase warnings list passed to the reviewer.
```

Copilot AI Mar 31, 2026


Phase 3 step 6 lists what to log, but the implementation of normalizeBatchOutput() also deduplicates edges (composite key) and doesn’t currently mention logging that count. Either document edge deduplication as part of the normalization sequence/logging, or remove the behavior if it’s not intended for assembly reporting.

Suggested change

Current:

```markdown
5. **Remove dangling edges:** Remove any edge whose `source` or `target` references a node ID that does not exist in the merged node set.
6. **Log changes:** Record counts of IDs corrected, complexity values fixed, edges rewritten, duplicates removed, and dangling edges dropped. Include these counts in the Phase warnings list passed to the reviewer.
```

Suggested:

```markdown
5. **Remove duplicate edges:** After node IDs are normalized and edge references are rewritten, deduplicate edges using a composite key of `(source, target, type)` (or equivalent edge kind identifier). For any set of duplicate edges sharing this composite key, keep a single representative and drop the rest.
6. **Remove dangling edges:** Remove any edge whose `source` or `target` references a node ID that does not exist in the merged node set.
7. **Log changes:** Record counts of IDs corrected, complexity values fixed, edges rewritten, duplicate node IDs removed, duplicate edges removed, and dangling edges dropped. Include these counts in the Phase warnings list passed to the reviewer.
```

```typescript
  normalizeNodeId("my-project:service:docker-compose.yml", {
    type: "file",
  }),
).toBe("file:docker-compose.yml");
```

Copilot AI Mar 31, 2026


This test asserts that my-project:service:docker-compose.yml normalizes to file:docker-compose.yml, which drops the service segment entirely. If service is intended to be part of the relative path (as suggested by the issue’s double-prefix examples), this expectation should be updated to preserve it; otherwise normalization can collapse distinct files into the same ID and cause incorrect edge rewrites.

Suggested change

Current:

```typescript
).toBe("file:docker-compose.yml");
```

Suggested:

```typescript
).toBe("file:service/docker-compose.yml");
```

@Lum1104 (Owner) left a comment

In normalize-graph.ts, I think the edge rewrite is still a bit too strict for the inconsistency this PR is trying to handle.

Right now source/target are rewritten only through exact idMap lookup from the original node IDs. That works when the edge uses the same malformed ID string as the node did, but it misses cases where parallel outputs produce different malformed variants for the same logical node.

Example:

  • node ID: src/bare.ts -> normalizes to file:src/bare.ts
  • edge source: my-project:file:src/bare.ts

Those refer to the same node, but since the edge string does not exactly match the node’s original id, it won’t be rewritten and will be dropped as dangling.

I think the bigger issue is that this becomes a silent relationship loss: once the edge is dropped here, later validation no longer sees a dangling reference to report, so the graph just loses the connection with only the aggregate counter incremented.

Would it make sense to canonicalize edge endpoints through the same normalization logic before the dangling-edge check, or otherwise emit enough detail here for the caller to surface which edges were dropped?
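One possible shape for that canonicalize-before-drop fallback, as a hedged sketch only: the `idMap` and `DroppedEdge` details come from the follow-up commit description, and the signature below is an assumption, not the merged code.

```typescript
// Rewrite a single edge endpoint. If the exact original ID isn't in idMap,
// run the endpoint through the same ID normalization before declaring it dangling.
function rewriteEdgeEndpoint(
  endpoint: string,
  idMap: Map<string, string>, // original node ID -> normalized node ID
  nodeIds: Set<string>,       // all normalized node IDs in the merged set
  normalizeNodeId: (id: string) => string,
): { id: string; dangling: boolean } {
  const mapped = idMap.get(endpoint);
  if (mapped) return { id: mapped, dangling: false };
  // Fallback: canonicalize the endpoint itself and see if it lands on a node.
  const canonical = normalizeNodeId(endpoint);
  if (nodeIds.has(canonical)) return { id: canonical, dangling: false };
  // Still unresolved: caller records a DroppedEdge with source/target/reason
  // instead of silently incrementing a counter.
  return { id: endpoint, dangling: true };
}
```

With this shape, the `my-project:file:src/bare.ts` edge endpoint from the example above would resolve to `file:src/bare.ts` even though the node's original ID was the bare path.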

…nd dropped edge traceability

- Add all 13 node types (including non-code) to VALID_PREFIXES and TYPE_TO_PREFIX
  to prevent valid IDs like config:tsconfig.json from being stripped
- Add fallback normalizeNodeId on edge endpoints not found in idMap, fixing
  silent relationship loss when edges use different malformed variants than nodes
- Add DroppedEdge interface with source, target, type, and reason fields so
  callers can surface exactly which edges were lost and why
- Use honest Record<string, unknown>[] return types instead of unsafe type casts
- Align SKILL.md complexity aliases with COMPLEXITY_STRING_MAP
- Add 5 new tests for non-code types, cross-variant edges, dropped edge detail,
  and validateGraph integration

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Lum1104 Lum1104 merged commit efd4c48 into Lum1104:main Apr 1, 2026
1 check passed

Development

Successfully merging this pull request may close these issues.

Parallel file-analyzer subagents can produce inconsistent node IDs and invalid complexity values

3 participants