
Parallel file-analyzer subagents can produce inconsistent node IDs and invalid complexity values #59

@fishinakleinbottle

Description


Problem

When /understand --full dispatches parallel file-analyzer subagents, there is no deterministic enforcement of node ID format or complexity enum values. The prompt specifies the correct formats, but the assembly pipeline trusts LLM output without validation — so inconsistent batches silently corrupt the final graph.

Issue 1: No runtime enforcement of node ID format

The file-analyzer prompt (skills/understand/file-analyzer-prompt.md, lines 219–227) specifies:

Node Type | Required Format              | Example
File      | file:<relative-path>         | file:src/index.ts
Function  | func:<relative-path>:<name>  | func:src/utils.ts:formatDate
Class     | class:<relative-path>:<name> | class:src/models/User.ts:User

However, the Zod schema only validates id: z.string() (packages/core/src/schema.ts, line 13) — any string passes. Neither Phase 3 (ASSEMBLE) nor the GraphBuilder (packages/core/src/analyzer/graph-builder.ts, line 84) validates ID prefix format on merged batch output.

Since subagents are LLMs writing JSON directly to batch-<N>.json files, they can produce:

  • Project-name-prefixed IDs: myproject:backend/main.py
  • Double-prefixed IDs: myproject:service:docker-compose.yml
  • Bare paths with no prefix: frontend/src/utils/constants.ts

Evidence from a 226-file project run:

// Batch 1 — correct
{ "id": "file:backend/app/api/audit.py" }

// Batch 4 — project name prefixed
{ "id": "noora-health-res-cms:backend/main.py" }

// Batch 8 — no prefix
{ "id": "frontend/src/utils/constants.ts" }

Issue 2: Complexity enum not validated during assembly

The schema defines complexity: z.enum(["simple", "moderate", "complex"]) (packages/core/src/schema.ts, line 20). The parseFileAnalysisResponse function (packages/core/src/analyzer/llm-analyzer.ts, lines 97–116) normalizes invalid values to "moderate" — but this function is only used in the programmatic GraphBuilder path.

The /understand skill's subagents write batch JSON files directly to disk. Phase 3 merges these files without passing nodes through parseFileAnalysisResponse, so invalid complexity values survive into the assembled graph:

// LLM used "low"/"high" instead of "simple"/"complex"
{ "id": "file:backend/app/api/__init__.py", "complexity": "low" }

// LLM used numeric scale
{ "id": "file:backend/app/services/eval.py", "complexity": 4 }

// LLM used free-text
{ "id": "document:CLAUDE.md", "complexity": "detailed" }

Issue 3: Cascading edge drops at assembly

Phase 3 (skills/understand/SKILL.md, line 155) removes edges whose source or target references a non-existent node ID. When Issue 1 causes a mismatch between the IDs that edges reference and the IDs that nodes were actually saved under, valid edges are silently dropped:

// Edge references correct format, but the node was saved with wrong ID format
{ "source": "file:backend/app/services/turnio_client.py", "target": "file:backend/app/core/config.py", "type": "imports" }
// Actual node ID was "noora-health-res-cms:file:backend/app/services/turnio_client.py" → mismatch → dropped
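The drop can be illustrated with a minimal sketch. This is not the actual Phase 3 code; the `dropDanglingEdges` name and the edge/node shapes are assumptions, but the filtering behavior matches what the skill describes:

```typescript
type Edge = { source: string; target: string; type: string };

// Hypothetical sketch of the Phase 3 filter: an edge survives only if
// both endpoints exist in the merged node-ID set; mismatches vanish.
function dropDanglingEdges(edges: Edge[], nodeIds: Set<string>): Edge[] {
  return edges.filter((e) => nodeIds.has(e.source) && nodeIds.has(e.target));
}

// The node was saved with a project-name prefix, so the correctly
// formatted source in the edge no longer matches anything.
const nodeIds = new Set([
  "noora-health-res-cms:file:backend/app/services/turnio_client.py",
  "file:backend/app/core/config.py",
]);
const edges: Edge[] = [
  {
    source: "file:backend/app/services/turnio_client.py",
    target: "file:backend/app/core/config.py",
    type: "imports",
  },
];
console.log(dropDanglingEdges(edges, nodeIds).length); // 0: the valid edge is gone
```

Nothing in the pipeline logs the dropped edge, which is why the layer counts shrink without any visible error.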

Issue 4: All-or-nothing dashboard validation

The dashboard calls validateGraph() on load (packages/dashboard/src/App.tsx, lines 122–130). Since this is Zod safeParse, any invalid node (e.g., complexity: "low") causes the entire graph to fail to load with an error banner. There is no partial load or per-node auto-correction.

This means if even one batch produces a non-enum complexity value and it survives to the final JSON, the dashboard won't render at all.
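The failure mode can be shown without Zod itself. `validateGraph` below is a hypothetical stand-in that mimics `safeParse`'s all-or-nothing result shape, not the dashboard's real implementation:

```typescript
type Node = { id: string; complexity: string };
type Result = { success: true; data: Node[] } | { success: false };

// Stand-in for Zod's safeParse (names and shapes assumed): validation
// either accepts the entire graph or rejects it wholesale, with no
// per-node auto-correction or partial load.
function validateGraph(nodes: Node[]): Result {
  const valid = ["simple", "moderate", "complex"];
  return nodes.every((n) => valid.includes(n.complexity))
    ? { success: true, data: nodes }
    : { success: false };
}

// One drifted value from one batch blocks the whole dashboard.
const graph: Node[] = [
  { id: "file:backend/app/api/audit.py", complexity: "moderate" },
  { id: "file:backend/app/api/__init__.py", complexity: "low" },
];
console.log(validateGraph(graph).success); // false: nothing renders
```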

Impact

On a 226-file project, we observed: nodes with 3 different ID formats across batches, non-standard complexity values ("low", "high", numeric, free-text), and layers with depleted node counts due to ID mismatches during assembly.

Root Cause

The assembly pipeline (Phase 3 in SKILL.md) trusts that all subagent batches produce consistent output. The only format enforcement is:

  1. Prompt instructions (not deterministic)
  2. Graph reviewer in Phase 6 (also LLM-based, not deterministic)
  3. Zod schema at final write/load (catches invalid values but rejects the whole graph rather than fixing them)

There is no deterministic normalization step between batch merge and final write.

Suggested Fixes

  1. Add a normalization pass in Phase 3 (ASSEMBLE): After merging batches, deterministically normalize all node IDs to type:path format (strip project-name prefixes, add missing file: prefixes) and map complexity values to the valid enum (low→simple, high→complex, numeric/other→moderate).

  2. Validate node ID prefix format in the Zod schema: Change id: z.string() to id: z.string().regex(/^(file|func|class|module|concept):/) so invalid IDs are caught at validation time rather than silently passing.

  3. Run parseFileAnalysisResponse-style normalization on batch JSON: Before merging batch files in Phase 3, pass each node through the same complexity normalization that llm-analyzer.ts already implements.
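A sketch of the ID normalization from fix 1, combined with the prefix regex from fix 2. The function name is hypothetical and the prefix list is taken from the suggested regex; the real pass would live in Phase 3:

```typescript
// Known node-type prefixes, taken from the regex suggested in fix 2.
const KNOWN_PREFIXES = ["file", "func", "class", "module", "concept"];
const ID_PREFIX = /^(file|func|class|module|concept):/;

// Hypothetical Phase 3 normalizer: strip unknown leading segments
// (e.g. a project name) and default bare paths to a "file:" prefix.
function normalizeNodeId(id: string): string {
  const parts = id.split(":");
  // Drop leading segments until a known type prefix is reached.
  while (parts.length > 1 && !KNOWN_PREFIXES.includes(parts[0])) {
    parts.shift();
  }
  if (!KNOWN_PREFIXES.includes(parts[0])) {
    // Bare path with no recognized prefix: assume a file node.
    return `file:${parts.join(":")}`;
  }
  return parts.join(":");
}

console.log(normalizeNodeId("noora-health-res-cms:backend/main.py")); // "file:backend/main.py"
console.log(normalizeNodeId("frontend/src/utils/constants.ts"));      // "file:frontend/src/utils/constants.ts"
console.log(ID_PREFIX.test(normalizeNodeId("file:backend/app/api/audit.py"))); // true
```

Running the regex after normalization means the stricter Zod schema becomes a backstop rather than a hard failure point.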
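For fix 3, a sketch of the complexity mapping. This is an assumed equivalent of what llm-analyzer.ts already does (the exact mapping is not copied here), showing the shape of a deterministic coercion onto the enum:

```typescript
type Complexity = "simple" | "moderate" | "complex";

// Hypothetical mirror of the llm-analyzer.ts normalization: coerce
// common LLM drift ("low"/"high", numbers, free text) onto the valid
// enum, defaulting anything unrecognized to "moderate".
function normalizeComplexity(value: unknown): Complexity {
  if (value === "simple" || value === "moderate" || value === "complex") {
    return value;
  }
  if (value === "low" || value === "trivial") return "simple";
  if (value === "high" || value === "very high") return "complex";
  return "moderate"; // numeric scales, free text, missing values
}

console.log(normalizeComplexity("low"));      // "simple"
console.log(normalizeComplexity(4));          // "moderate"
console.log(normalizeComplexity("detailed")); // "moderate"
```

Because this runs before the batch merge, every value reaching the final JSON is already enum-valid and the dashboard's safeParse can no longer reject the whole graph on this axis.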
