This document describes the key algorithms used in spec-gen for code analysis and specification generation.
Files are scored to prioritize high-value code for LLM context. The algorithm considers multiple factors:
TotalScore = NameScore + PathScore + StructureScore + ConnectivityScore
Where:
- NameScore (0-30): Based on file name patterns
- PathScore (0-25): Based on directory location
- StructureScore (0-25): Based on code content
- ConnectivityScore (0-20): Based on import/export relationships
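As a minimal sketch, the additive composition might look like this (the `FileScores` type, field names, and example values are illustrative, not spec-gen's actual API):

```python
from dataclasses import dataclass

@dataclass
class FileScores:
    name: int          # 0-30, from file name patterns
    path: int          # 0-25, from directory location
    structure: int     # 0-25, from code content
    connectivity: int  # 0-20, from import/export relationships

def total_score(s: FileScores) -> int:
    """Sum the four capped component scores (maximum 100)."""
    return s.name + s.path + s.structure + s.connectivity

# e.g. a schema file in src/models with several exports and importers:
print(total_score(FileScores(name=30, path=25, structure=15, connectivity=18)))  # 88
```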
| Pattern | Score |
|---|---|
| schema, model, entity | 30 |
| service, controller, handler | 28 |
| api, route, endpoint | 25 |
| store, reducer, action | 22 |
| util, helper, constant | 10 |
| test, spec, mock | 5 |
| index | 15 |
| Directory Pattern | Score |
|---|---|
| src/models, src/entities | 25 |
| src/services, src/core | 23 |
| src/api, src/routes | 20 |
| src/components | 18 |
| lib/, packages/ | 15 |
| test/, `__tests__/` | 5 |
The structure score analyzes code content for:
- Class definitions (+5 each, max 15)
- Interface/type exports (+3 each, max 12)
- Function exports (+2 each, max 10)
- Import count (a high import count suggests a core module)
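A rough sketch of this heuristic for TypeScript-style sources, using the caps listed above (the regex patterns are assumptions; a real implementation would inspect the parsed AST):

```python
import re

def structure_score(source: str) -> int:
    """+5 per class (max 15), +3 per type export (max 12), +2 per function export (max 10)."""
    classes = len(re.findall(r"^\s*(?:export\s+)?class\s+\w+", source, re.M))
    types = len(re.findall(r"^\s*export\s+(?:interface|type)\s+\w+", source, re.M))
    funcs = len(re.findall(r"^\s*export\s+(?:async\s+)?function\s+\w+", source, re.M))
    return min(classes * 5, 15) + min(types * 3, 12) + min(funcs * 2, 10)

src = """export class User {}
export interface UserDTO {}
export function createUser() {}
"""
print(structure_score(src))  # 5 + 3 + 2 = 10
```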
The dependency graph is built in four steps:
- Parse all source files for imports/exports
- Resolve import paths to actual files
- Create directed edges from importer → imported
- Handle various module systems (ESM, CommonJS, TypeScript)
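A minimal sketch of these steps that handles only ESM-style relative imports in `.ts` files (real resolution must also cover CommonJS `require()`, index files, and path aliases):

```python
import re
from pathlib import Path
from collections import defaultdict

def build_import_graph(files: dict[str, str]) -> dict[str, set[str]]:
    """files maps path -> source; returns importer -> imported edges."""
    graph = defaultdict(set)
    for path, source in files.items():
        # Capture only relative specifiers such as './db' or '../core/db'.
        for spec in re.findall(r"import\s+.*?from\s+['\"](\.[^'\"]+)['\"]", source):
            # Resolve the specifier against the importer's directory.
            target = (Path(path).parent / spec).with_suffix(".ts")
            key = target.as_posix()
            if key in files:
                graph[path].add(key)
    return graph

files = {
    "src/app.ts": "import { db } from './db';",
    "src/db.ts": "export const db = {};",
}
print(dict(build_import_graph(files)))  # {'src/app.ts': {'src/db.ts'}}
```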
Two degree metrics are derived from the graph:
- In-degree: Number of files that import this file
- Out-degree: Number of files this file imports
- High in-degree = widely used module
- High out-degree = integration/orchestration module
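Given an importer → imported edge map, both metrics fall out of a single pass (a sketch; the node names are illustrative):

```python
from collections import Counter

def degrees(graph: dict[str, set[str]]) -> tuple[Counter, Counter]:
    """Return (in_degree, out_degree) for an importer -> imported map."""
    out_deg = Counter({f: len(deps) for f, deps in graph.items()})
    in_deg = Counter(dep for deps in graph.values() for dep in deps)
    return in_deg, out_deg

# 'c' is widely used (in-degree 2); 'a' orchestrates (out-degree 2).
in_deg, out_deg = degrees({"a": {"b", "c"}, "b": {"c"}, "c": set()})
print(in_deg["c"], out_deg["a"])  # 2 2
```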
PageRank is computed iteratively:
PR(A) = (1-d) + d × Σ(PR(T) / OutDegree(T))
Where:
- d = damping factor (0.85)
- T = files that link to A

Converges after ~20 iterations.
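A direct transcription of the update rule, run over an importer → imported map (a sketch; spec-gen's actual convergence handling may differ):

```python
def pagerank(graph: dict[str, set[str]], d: float = 0.85, iters: int = 20) -> dict[str, float]:
    """graph maps importer -> imported; rank flows to imported files."""
    pr = {v: 1.0 for v in graph}
    for _ in range(iters):
        # T = files that link to v; each contributes PR(T) / OutDegree(T).
        pr = {v: (1 - d) + d * sum(pr[t] / len(graph[t])
                                   for t in graph if v in graph[t])
              for v in graph}
    return pr

# 'core' is imported by both other files, so it ranks highest:
ranks = pagerank({"app": {"core"}, "svc": {"core"}, "core": set()})
print(max(ranks, key=ranks.get))  # core
```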
Betweenness centrality measures how often a node appears on shortest paths between other nodes:
- High betweenness = bridge between modules
- Indicates architectural importance
Clustering follows a Louvain-style modularity optimization, modified for code structure:
- Initialize: Each file in its own community
- Optimize: Move files between communities to maximize modularity
- Aggregate: Merge communities and repeat
Q = (1/2m) × Σ[(Aij - kikj/2m) × δ(ci, cj)]
Where:
- Aij = edge weight between i and j
- ki, kj = degrees of nodes i and j
- m = total number of edges
- δ(ci, cj) = 1 if i and j are in the same community, 0 otherwise
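For an undirected, unweighted graph, the double sum collapses to a per-community form, Q = Σc [ec/m − (dc/2m)²], where ec counts intra-community edges and dc is the total degree of community c. A sketch under that assumption:

```python
from collections import defaultdict

def modularity(edges: list[tuple[str, str]], community: dict[str, str]) -> float:
    """edges: undirected (a, b) pairs; community: node -> community id."""
    m = len(edges)
    degree = defaultdict(int)
    intra = defaultdict(int)        # e_c: edges inside community c
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
        if community[a] == community[b]:
            intra[community[a]] += 1
    comm_degree = defaultdict(int)  # d_c: total degree of community c
    for node, k in degree.items():
        comm_degree[community[node]] += k
    return sum(intra[c] / m - (comm_degree[c] / (2 * m)) ** 2 for c in comm_degree)

# Two triangles joined by one bridge edge form two natural communities:
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")]
community = {n: ("left" if n in "abc" else "right") for n in "abcdef"}
print(round(modularity(edges, community), 3))  # 0.357
```

A real Louvain pass would repeatedly move nodes between communities to maximize the gain in Q, then aggregate and repeat.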
- Files in same directory get affinity bonus
- Shared naming patterns increase affinity
- Bidirectional imports strongly indicate same domain
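These signals might combine into a pairwise affinity term like the following (the weights are illustrative assumptions, not spec-gen's tuning):

```python
from pathlib import Path

def affinity(a: str, b: str, graph: dict[str, set[str]]) -> float:
    """Extra affinity between files a and b, given importer -> imported edges."""
    score = 0.0
    if Path(a).parent == Path(b).parent:
        score += 1.0   # same-directory bonus
    if Path(a).stem.split(".")[0] == Path(b).stem.split(".")[0]:
        score += 0.5   # shared naming pattern, e.g. user.ts / user.test.ts
    if b in graph.get(a, set()) and a in graph.get(b, set()):
        score += 2.0   # bidirectional imports strongly indicate one domain
    return score

graph = {"src/user.ts": {"src/user.test.ts"}, "src/user.test.ts": {"src/user.ts"}}
print(affinity("src/user.ts", "src/user.test.ts", graph))  # 3.5
```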
Given token limit L and files sorted by significance:
```python
def select_files(files, limit):
    selected = []
    tokens = 0
    for file in sorted(files, key=lambda f: -f.score):
        file_tokens = count_tokens(file.content)
        if tokens + file_tokens <= limit:
            selected.append(file)
            tokens += file_tokens
    return selected
```

For files that don't fit entirely:
- Keep imports/exports (essential for understanding)
- Keep class/function signatures
- Truncate implementation bodies
- Add `// ... implementation ...` markers
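A naive, line-based sketch of this truncation for Python-style sources (a real implementation would walk the AST, and would emit the `//` marker shown above for JS/TS files):

```python
def truncate_file(source: str) -> str:
    """Keep imports, class lines, and def signatures; drop implementation bodies."""
    kept = []
    for line in source.splitlines():
        s = line.strip()
        if s.startswith(("import ", "from ", "class ")):
            kept.append(line)
        elif s.startswith("def "):
            kept.append(line)
            indent = line[: len(line) - len(s)]
            kept.append(indent + "    # ... implementation ...")
    return "\n".join(kept)

src = """import os

class Loader:
    def load(self, path):
        with open(path) as f:
            return f.read()
"""
print(truncate_file(src))
```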
Quick categorization prompt:
- Project type (web app, API, library, CLI)
- Primary language/framework
- Architecture style (monolith, microservices, serverless)
Extract core data models:
- Input: High-significance model/schema files
- Output: Entity names, fields, relationships
Map business logic:
- Input: Service files + extracted entities
- Output: Operations, business rules, dependencies
Document external interfaces:
- Input: Route/controller files
- Output: Endpoints, methods, request/response schemas
Create system overview:
- Input: All previous stage outputs
- Output: Component diagram, data flow, key decisions
- Select N files NOT used in generation
- For each file:
- LLM predicts content from specs only
- Compare prediction to actual code
- Score similarity
- Same imports predicted: +20%
- Same exports predicted: +20%
- Same function names: +15%
- Same class names: +15%
- Business logic correctly described: +15%
- Data transformations accurate: +15%
Score = (StructuralMatch × 0.5) + (SemanticMatch × 0.5)
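One way to reconcile the per-signal weights with the 50/50 formula is to normalize within each group; the structural/semantic grouping here is an assumption based on the signal list above:

```python
STRUCTURAL = {"imports": 0.20, "exports": 0.20, "functions": 0.15, "classes": 0.15}
SEMANTIC = {"business_logic": 0.15, "transformations": 0.15}

def fidelity(signals: dict[str, float]) -> float:
    """signals maps each signal name to a 0-1 match ratio."""
    structural = sum(w * signals[k] for k, w in STRUCTURAL.items()) / sum(STRUCTURAL.values())
    semantic = sum(w * signals[k] for k, w in SEMANTIC.items()) / sum(SEMANTIC.values())
    return structural * 0.5 + semantic * 0.5

score = fidelity({"imports": 1.0, "exports": 1.0, "functions": 0.8,
                  "classes": 0.8, "business_logic": 0.5, "transformations": 0.5})
print(round(score, 3))  # 0.707 -> "Good" on the scale below
```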
| Score | Interpretation |
|---|---|
| ≥ 0.8 | Excellent - specs capture system well |
| 0.6-0.8 | Good - minor gaps in coverage |
| 0.4-0.6 | Fair - significant gaps |
| < 0.4 | Poor - specs need revision |
Tarjan's strongly connected components (SCC) algorithm is used to detect circular dependencies:
```python
def tarjan_scc(graph):
    index = 0
    stack = []
    lowlinks = {}
    indices = {}
    on_stack = set()
    sccs = []

    def strongconnect(v):
        nonlocal index
        indices[v] = index
        lowlinks[v] = index
        index += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph[v]:
            if w not in indices:
                strongconnect(w)
                lowlinks[v] = min(lowlinks[v], lowlinks[w])
            elif w in on_stack:
                lowlinks[v] = min(lowlinks[v], indices[w])
        if lowlinks[v] == indices[v]:
            scc = []
            while True:
                w = stack.pop()
                on_stack.remove(w)
                scc.append(w)
                if w == v:
                    break
            sccs.append(scc)

    for v in graph:
        if v not in indices:
            strongconnect(v)
    return sccs
```

SCCs containing more than one node indicate circular dependencies:
- Report in analysis artifacts
- Flag as potential architectural issue
- Include in LLM context for spec generation
| Operation | Complexity |
|---|---|
| File walking | O(n) where n = files |
| Import parsing | O(n × m) where m = avg file size |
| Graph building | O(e) where e = edges |
| PageRank | O(k × (n + e)) where k = iterations |
| Clustering | O(n × log n) average case |
| Data Structure | Size |
|---|---|
| File metadata | O(n) |
| Dependency graph | O(n + e) |
| Parsed ASTs | O(n × m) - transient |
| LLM context | Bounded by token limit |
- Lazy AST parsing: Only parse files when needed
- Parallel file reading: Use async I/O with concurrency limit
- Early filtering: Skip excluded patterns before reading
- Incremental updates: Cache analysis, only re-analyze changed files
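The parallel-reading idea can be sketched with asyncio; the concurrency limit of 32 is an illustrative choice, not spec-gen's actual setting:

```python
import asyncio

async def read_all(paths: list[str], limit: int = 32) -> dict[str, str]:
    """Read files concurrently, never holding more than `limit` open at once."""
    sem = asyncio.Semaphore(limit)

    async def read_one(path: str) -> tuple[str, str]:
        async with sem:
            def read() -> str:
                with open(path, encoding="utf-8", errors="replace") as f:
                    return f.read()
            # asyncio has no native file I/O, so offload the blocking read.
            return path, await asyncio.to_thread(read)

    return dict(await asyncio.gather(*(read_one(p) for p in paths)))

# contents = asyncio.run(read_all(["src/app.ts", "src/db.ts"]))
```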