dwarf2pdb produces an executable that converts DWARF-embedded executables into stripped executables with separate PDB debug files:
Input: DWARF-embedded executable (ELF/PE with embedded DWARF debug info)
Output: Stripped executable + separate PDB file
Enable Windows debugging tools (Visual Studio, WinDbg, Windows Performance Analyzer) to debug executables originally compiled with DWARF debug information (GCC, Clang on Linux/macOS, MinGW).
# Convert DWARF-embedded executable to stripped exe + PDB
dwarf2pdb my_program.exe my_program_stripped.exe my_program.pdb
# Result:
# - my_program_stripped.exe (no debug info, smaller size)
# - my_program.pdb (full debug information in PDB format)
This project follows strict Test Driven Development:
- Write tests first - Define expected behavior through failing tests
- Implement minimally - Write just enough code to pass tests
- Refactor - Improve code while maintaining test coverage
- Iterate - Repeat for each feature
Test Hierarchy:
- Unit tests (ut/) - Component isolation tests (node ↔ IR transformations)
- Integration tests (it/) - Format I/O with real DWARF/PDB libraries
- System tests (st/) - End-to-end executable conversion validation
All features require corresponding tests. Code coverage must be maintained or increased with every change.
Love this problem. You're basically trying to build a round-trip translator between DWARF and PDB:
- Load DWARF → create "dwarf object model"
- Convert that into a "pdb object model"
- Use a PDB writer lib to emit a real PDB file
- Later read that PDB → rebuild DWARF
You're asking:
- Can we do 1:1 mapping?
- How do we handle shared / recursive types and ID assignment?
- Is the node/tree/smart-pointer design good?
- Can we sketch a class design / UML-ish plan?
I'll walk through:
- Why 1:1 is tricky and how to solve it anyway
- How to handle IDs & shared types safely
- Suggested architecture layers
- UML-style diagram (Mermaid)
- Improvements to your current “node with parent pointer + smart_ptr tree” idea
Short answer: not literally “1 struct node ↔ 1 struct node” in all cases, because:
- DWARF scopes types lexically (a struct inside a function is IN that function DIE).
- PDB scopes types by symbol streams, not in the TPI record itself.
- PDB forward refs + full defs are two records; DWARF forward decl + full def are one logical DIE pair.
- Templates and anonymous aggregates are represented differently.
But we can do a stable, loss-minimized round trip if we introduce our own middle IR that’s rich enough to represent everything we care about, even if neither format matches it perfectly.
So: instead of “DWARF → PDB directly”, do DWARF → IR → PDB and on the way back PDB → IR → DWARF.
This IR is the place where you keep canonical type identity, ownership, and cross-refs. That IR is what you manage with smart pointers, parent pointers, etc.
You can still keep per-format node trees (like DwarfNode, PdbNode) if you want for debugging/export fidelity, but you shouldn’t try to mutate/merge those trees directly. They are format views, not truth.
This is the hardest technical point you asked about: “how inter shared type handled both dwarf and pdb alloc ids?” In other words, how shared types and ID allocation work on both the DWARF and PDB sides.
- DWARF uses DIE references. Two members in different structs can both point to the same type DIE.
- PDB uses Type Indices in the TPI stream. Same deal: multiple fields point at the same TI.
You must NOT duplicate “struct Foo” six times just because six places reference it. The pointer identity matters, especially for recursive types (like struct Node { Node* next; }).
Introduce a TypeTable in the IR layer:
- Every distinct semantic type (struct Foo, pointer-to-int, array[10] of Bar, etc.) has exactly one IR node instance.
- That node has a stable IRTypeID (your own integer or pointer identity).
- The IR tracks children (members) and references to other IR types.
Then each frontend builds that table:
- The DWARF reader walks DIEs. For each DIE that describes a type, you either:
  - look it up in a map <dwarf_die_offset → IRTypeID> and reuse, or
  - create a new IR node and assign a new IRTypeID.
- The PDB reader does the same with <pdb_type_index → IRTypeID>.
Then each backend consumes the same IR table:
- The PDB writer will assign CodeView TIs in a stable order by walking IR types and emitting records, tracking <IRTypeID → pdb_type_index>.
- The DWARF writer will assign DIE offsets / references and track <IRTypeID → dwarf_die_offset>.
That gives you consistent sharing / recursion.
Do not try to reuse DWARF offsets or PDB type indices as your “canonical ID.” Make your own IRTypeID and maintain per-format maps.
That solves:
- shared types
- forward refs
- local scoped types
- anonymous aggregates that appear multiple times under different parents
- template instantiations
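The get-or-create step on the DWARF side can be sketched roughly as follows. This is a minimal illustration, not the project's actual reader: `FakeDie`, `DwarfTypeImporter`, and `importType` are hypothetical names, and the DIE data is a hand-built map rather than output from a real DWARF library. The point it shows is that the DIE offset is recorded in the side map before recursing, so a recursive type such as `struct Node { Node* next; }` terminates and is never duplicated.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

using IRTypeID = std::uint32_t;

// Hypothetical stand-in for a parsed DWARF type DIE (not a real library type).
struct FakeDie {
    std::string name;
    std::vector<std::uint64_t> refs;  // DIE offsets this type refers to
};

struct IRType {
    IRTypeID id;
    std::string name;
    std::vector<IRTypeID> refs;  // cross-references by ID, non-owning
};

class IRTypeTable {
public:
    IRTypeID create(const std::string& name) {
        IRTypeID id = static_cast<IRTypeID>(types_.size());
        types_.push_back(std::make_unique<IRType>(IRType{id, name, {}}));
        return id;
    }
    IRType& get(IRTypeID id) { return *types_.at(id); }
    std::size_t size() const { return types_.size(); }
private:
    std::vector<std::unique_ptr<IRType>> types_;  // sole owner of every IRType
};

class DwarfTypeImporter {
public:
    DwarfTypeImporter(const std::unordered_map<std::uint64_t, FakeDie>& dies,
                      IRTypeTable& table)
        : dies_(dies), table_(table) {}

    IRTypeID importType(std::uint64_t dieOffset) {
        auto it = dieToIR_.find(dieOffset);
        if (it != dieToIR_.end()) return it->second;  // seen before: reuse
        const FakeDie& die = dies_.at(dieOffset);
        IRTypeID id = table_.create(die.name);
        dieToIR_[dieOffset] = id;  // register BEFORE recursing so cycles terminate
        for (std::uint64_t ref : die.refs) {
            IRTypeID target = importType(ref);
            table_.get(id).refs.push_back(target);
        }
        return id;
    }
private:
    const std::unordered_map<std::uint64_t, FakeDie>& dies_;
    IRTypeTable& table_;
    std::unordered_map<std::uint64_t, IRTypeID> dieToIR_;  // DIE offset -> IRTypeID
};

// struct Node { Node* next; int value; } imports as exactly three IR types.
inline void demoRecursiveImport() {
    std::unordered_map<std::uint64_t, FakeDie> dies;
    dies[0x10] = FakeDie{"Node", {0x20, 0x30}};
    dies[0x20] = FakeDie{"Node*", {0x10}};
    dies[0x30] = FakeDie{"int", {}};
    IRTypeTable table;
    DwarfTypeImporter importer(dies, table);
    IRTypeID node = importer.importType(0x10);
    assert(table.size() == 3);
    assert(importer.importType(0x10) == node);  // no duplicate on second visit
    IRTypeID ptr = table.get(node).refs[0];
    assert(table.get(ptr).refs[0] == node);     // the pointer refers back to Node
}
```

A PDB reader would follow the same shape with a type-index-keyed map in place of `dieToIR_`.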
Here’s a clean layering:
Core IR layer:
- IRType, IRField, IRFunction, IRVariable, etc.
- IRTypeTable (interns / dedups types, owns all IRType objects)
- IRScope tree (lexical scopes / namespaces / function bodies)
Handles:
- struct / class / union
- bitfield info
- arrays
- template params
- “this type is local to function F”
- access specifiers, etc.
This is where you do canonicalization and dedup.
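As a rough illustration of what "interns / dedups types" could mean for derived types, the sketch below keys each pointer or array type on its structure, so asking twice for pointer-to-int returns the same IRTypeID. The method names (`internBase`, `internPointer`, `internArray`) and the tuple key are assumptions made for this example. Named structs would likely need a richer key (for instance name plus declaring scope), which is not shown here.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

using IRTypeID = std::uint32_t;
enum class TypeKind { Base, Pointer, Array };

struct IRType {
    IRTypeID id;
    TypeKind kind;
    std::string name;     // used for Base types
    IRTypeID target;      // pointee / element type for Pointer / Array
    std::uint64_t count;  // element count for Array
};

class IRTypeTable {
public:
    IRTypeID internBase(const std::string& name) {
        return intern(TypeKind::Base, name, kNone, 0);
    }
    IRTypeID internPointer(IRTypeID pointee) {
        return intern(TypeKind::Pointer, "", pointee, 0);
    }
    IRTypeID internArray(IRTypeID elem, std::uint64_t count) {
        return intern(TypeKind::Array, "", elem, count);
    }
    const IRType& get(IRTypeID id) const { return *types_.at(id); }
    std::size_t size() const { return types_.size(); }

private:
    // Structural key: two requests with equal keys denote the same type.
    using Key = std::tuple<TypeKind, std::string, IRTypeID, std::uint64_t>;

    IRTypeID intern(TypeKind kind, const std::string& name,
                    IRTypeID target, std::uint64_t count) {
        Key key{kind, name, target, count};
        auto it = byKey_.find(key);
        if (it != byKey_.end()) return it->second;  // already interned
        IRTypeID id = static_cast<IRTypeID>(types_.size());
        types_.push_back(
            std::make_unique<IRType>(IRType{id, kind, name, target, count}));
        byKey_.emplace(std::move(key), id);
        return id;
    }

    static constexpr IRTypeID kNone = 0xFFFFFFFFu;  // "no target" marker
    std::vector<std::unique_ptr<IRType>> types_;
    std::map<Key, IRTypeID> byKey_;
};

inline void demoInterning() {
    IRTypeTable table;
    IRTypeID i = table.internBase("int");
    assert(table.internPointer(i) == table.internPointer(i));  // deduplicated
    assert(table.internArray(i, 10) != table.internArray(i, 11));
    assert(table.size() == 4);  // int, int*, int[10], int[11]
}
```

Keying on IRTypeIDs rather than names keeps the comparison cheap and makes nested derived types (pointer to array of pointer, and so on) dedup bottom-up.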
DWARF layer:
- DwarfReader → walks DWARF DIE tree → produces IR objects using IRTypeTable.
- DwarfWriter ← walks IR → emits DIEs.
- Extra: DwarfNode objects if you want to retain exact original DIE layout, attributes, offsets, CU boundaries. This is useful if you want to preserve “as close to original as possible,” but it is optional logically.
PDB layer:
- PdbReader → parses TPI, symbol streams → fills IRTypeTable, builds IRScope from lexical symbols (S_BLOCK32, etc.).
- PdbWriter ← walks IR → emits:
  - TPI stream with LF_* records,
  - symbol streams (S_GPROC32, S_LOCAL, S_UDT, etc.).
- Extra: PdbNode objects if you want a “verbatim CodeView record graph” view for debugging or round-trip fidelity.
Mapping layer:
- DwarfToIRMapper, IRToDwarfMapper
- PdbToIRMapper, IRToPdbMapper
These maintain lookup maps:
- DWARF DIE offset ↔ IRTypeID
- PDB type index ↔ IRTypeID
- IRTypeID ↔ emitted DIE offset
- IRTypeID ↔ emitted PDB type index
This is where we ensure shared types don’t explode.
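On the writer side, the IRTypeID → PDB type index map might be filled by a depth-first walk like the one below. It is a sketch under several assumptions: `TypeIndexAssigner`, `EmittedRecord`, and the string payloads are hypothetical placeholders for real LF_* records, the IRTypeID is assumed to be the position in the input vector, and the policy of letting referrers point at the struct's forward reference is just one possible choice. Non-simple type indices are numbered from 0x1000, and simple types carry a pre-assigned index below that.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

using IRTypeID = std::uint32_t;  // assumed here to be the index into the vector
enum class Kind { Simple, Pointer, Struct };

struct IRType {
    Kind kind;
    std::string name;
    std::vector<IRTypeID> refs;  // pointee, or member types for a struct
    std::uint32_t simpleTI;      // only meaningful when kind == Simple
};

struct EmittedRecord {
    std::uint32_t typeIndex;
    std::string what;            // placeholder for a real LF_* record payload
};

class TypeIndexAssigner {
public:
    explicit TypeIndexAssigner(const std::vector<IRType>& types) : types_(types) {}

    // Returns the type index that other records should use to refer to `id`.
    std::uint32_t indexFor(IRTypeID id) {
        auto it = irToTI_.find(id);
        if (it != irToTI_.end()) return it->second;
        const IRType& t = types_.at(id);
        if (t.kind == Kind::Simple) {
            irToTI_[id] = t.simpleTI;  // no record emitted for simple types
            return t.simpleTI;
        }
        if (t.kind == Kind::Struct) {
            // Emit a forward reference first so members can point back at it.
            std::uint32_t fwd = emit("fwdref " + t.name);
            irToTI_[id] = fwd;
            for (IRTypeID r : t.refs) indexFor(r);
            emit("struct " + t.name);  // full definition, after member types
            return fwd;
        }
        for (IRTypeID r : t.refs) indexFor(r);  // referenced types come first
        std::uint32_t ti = emit("pointer " + t.name);
        irToTI_[id] = ti;
        return ti;
    }
    const std::vector<EmittedRecord>& records() const { return records_; }

private:
    std::uint32_t emit(std::string what) {
        records_.push_back({next_, std::move(what)});
        return next_++;
    }
    const std::vector<IRType>& types_;
    std::map<IRTypeID, std::uint32_t> irToTI_;  // IRTypeID -> PDB type index
    std::vector<EmittedRecord> records_;
    std::uint32_t next_ = 0x1000;  // first non-simple type index
};

inline void demoAssign() {
    // 0: int (simple index 0x0074), 1: Node*, 2: struct Node { Node* next; int v; }
    std::vector<IRType> types = {
        {Kind::Simple, "int", {}, 0x0074},
        {Kind::Pointer, "Node*", {2}, 0},
        {Kind::Struct, "Node", {1, 0}, 0},
    };
    TypeIndexAssigner assigner(types);
    assert(assigner.indexFor(2) == 0x1000);  // forward reference to Node
    assert(assigner.indexFor(1) == 0x1001);  // pointer emitted after the fwdref
    assert(assigner.records().size() == 3);  // fwdref, pointer, full struct
}
```

Because the walk order depends only on the IR, running it twice over the same IR yields the same indices, which is what the round trip needs.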
classDiagram
%% ===== Core IR Layer =====
class IRTypeTable {
+getOrCreateType(astKey) : IRType*
+lookupByIRTypeID(id) : IRType*
-types : map<IRTypeID, IRType*>
-byKey : map<HashKey, IRTypeID>
}
class IRType {
<<abstract>>
+irTypeID : IRTypeID
+name : string // "Node", "anonymous$1", "MyVec<int,42>"
+linkageScope : IRScope* // where it's visible (file, func, namespace)
+flags : TypeFlags // struct/union/class/enum/typedef/pointer/array/etc.
}
class IRStructType {
+sizeBytes : uint64
+fields : vector<IRField>
+templateParams : vector<IRTemplateParam>
+isUnion : bool
+isForwardDecl : bool
}
IRStructType --|> IRType
class IRArrayType {
+elem : IRType*
+indexType : IRType* // for PDB LF_ARRAY
+dims : vector<IRArrayDim>
+totalSizeBytes : uint64
}
IRArrayType --|> IRType
class IRPointerType {
+pointee : IRType*
+qualifiers : QualFlags // const/volatile/restrict
+ptrSizeBytes : uint32
}
IRPointerType --|> IRType
class IRField {
+name : string
+type : IRType*
+byteOffset : uint64
+bitOffset : uint16 // start bit within storage unit
+bitSize : uint16 // 0 if not bitfield
+access : AccessKind // public/protected/private
+isAnonymousAggregateArm : bool
}
class IRTemplateParam {
+paramName : string
+kind : {Type,Value}
+typeArg : IRType* // if kind==Type
+valueArg : string // if kind==Value (store literal "42")
}
class IRArrayDim {
+lowerBound : int64
+count : uint64 // element count
}
class IRScope {
+parent : IRScope*
+children : vector<IRScope*>
+declaredTypes : vector<IRType*>
+declaredSymbols : vector<IRSymbol>
+kind : {CompileUnit,Namespace,Function,Block,FileStatic}
+name : string
}
class IRSymbol {
+name : string
+kind : {Variable,Parameter,Function}
+type : IRType*
+storageInfo : StorageLoc
+rangeInfo : LiveRangeInfo
}
%% ===== Format-specific views =====
class DwarfNode {
+tag : DW_TAG
+attrs : map<DW_AT, AttrValue>
+children : vector<DwarfNode*>
+parent : DwarfNode*
+cuID : int
+originalDieOffset : uint64
}
class PdbNode {
+leafKind : uint16 // LF_*, S_*
+payload : bytes
+children : vector<PdbNode*>
+parent : PdbNode*
+moduleID : int
+typeIndexOrSymOffset : uint32
}
%% ===== Readers / Writers =====
class DwarfReader {
+readObject(objFilePath) : IRScope*
-mapDieToIR : map<uint64, IRTypeID>
-typeTable : IRTypeTable*
}
class DwarfWriter {
+emit(objFilePath, rootScope:IRScope*)
-mapIRToDie : map<IRTypeID, uint64>
-typeTable : IRTypeTable*
}
class PdbReader {
+readPdb(pdbPath) : IRScope*
-mapTItoIR : map<uint32, IRTypeID>
-typeTable : IRTypeTable*
}
class PdbWriter {
+emit(pdbPath, rootScope:IRScope*)
-mapIRtoTI : map<IRTypeID, uint32>
-typeTable : IRTypeTable*
}
%% relations
IRScope "1" o--> "many" IRType : declaredTypes
IRScope "1" o--> "many" IRSymbol : declaredSymbols
IRScope "1" o--> "many" IRScope : children
IRStructType "1" o--> "many" IRField
IRStructType "1" o--> "many" IRTemplateParam
IRArrayType "1" o--> "many" IRArrayDim
DwarfReader --> IRTypeTable
DwarfWriter --> IRTypeTable
PdbReader --> IRTypeTable
PdbWriter --> IRTypeTable
DwarfReader --> DwarfNode : (optional capture original)
PdbReader --> PdbNode : (optional capture original)
DwarfWriter --> DwarfNode : (construct to serialize)
PdbWriter --> PdbNode : (construct to serialize)
Notes:
- IRTypeTable owns all IRType instances (unique_ptr inside it).
- IRScope is your lexical tree (compile unit → namespace → function → block…).
- DwarfNode / PdbNode are format ASTs, mainly for emitting or debugging fidelity.
Your current plan: a DWARF read wrapper, a DWARF write wrapper, a PDB read wrapper, and a PDB write wrapper; DWARF nodes and PDB nodes; each node is part of a tree with a parent pointer and is held by a smart pointer so it is deleted automatically.
That’s a good skeleton, but a few improvements are important:
Put an IR layer between the two formats. Why:
- You cannot safely map DWARF nodes ⇄ PDB nodes directly because of scope vs TPI differences, forward-ref vs final-def differences, etc.
- You need canonical type dedup + ID mapping. That wants to live in one place, not in both sides.
If you skip IR, you'll fight duplication, missing recursion, anonymous things, etc.
Right now you have only one generic “Node”. You actually want:
- IR graph nodes (IRType / IRScope / IRSymbol)
  - This is semantic, normalized.
  - These are referenced by other IR nodes and are stable across formats.
  - Memory-owned by IRTypeTable (types) and root IRScope (scopes/symbols).
- Format nodes (DwarfNode / PdbNode)
  - These mirror the source/target format exactly.
  - They are mostly for I/O, not long-term truth.
This split lets you do:
- DwarfReader: DwarfNode → IR
- PdbWriter: IR → PdbNode
and back.
Yes, use smart pointers, but:
- Have central ownership, not arbitrary shared cycles.
  - IRTypeTable can store all std::unique_ptr<IRType> in a vector or map.
  - IRScope can store std::unique_ptr<IRScope> children and IRSymbol by value.
- Cross-references between IR nodes should be raw pointers or IRTypeID integers, NOT shared_ptr, to avoid ref cycles (struct A has field of type B, type B has field of type A (via pointer), etc.).
  - e.g. IRField::type can be IRType* or IRTypeID.
- parent pointers in scopes can be non-owning raw pointers.
So:
- Ownership is acyclic (IRTypeTable/IRScope root own things).
- Parent pointers are just backrefs, not owning.
- No shared_ptr cycles => deterministic destruction.
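The ownership rules above could look something like this in code. It is a simplified sketch, not the proposed class set: the `add` and `addChild` helpers and the `liveTypeCount` counter are hypothetical and exist only to make the ownership visible. The table owns every type through `unique_ptr`, fields and pointees are raw non-owning pointers, and a mutually recursive pair of structs is still destroyed cleanly when the table goes away.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Counts live IRType objects so the example can show nothing leaks.
inline int& liveTypeCount() { static int n = 0; return n; }

struct IRType;

struct IRField {
    std::string name;
    IRType* type;  // non-owning cross-reference
};

struct IRType {
    explicit IRType(std::string n) : name(std::move(n)) { ++liveTypeCount(); }
    ~IRType() { --liveTypeCount(); }
    IRType(const IRType&) = delete;
    IRType& operator=(const IRType&) = delete;

    std::string name;
    std::vector<IRField> fields;
    IRType* pointee = nullptr;  // non-owning, set for pointer types
};

class IRTypeTable {
public:
    IRType* add(std::string name) {
        types_.push_back(std::make_unique<IRType>(std::move(name)));
        return types_.back().get();  // stable: the object itself never moves
    }
private:
    std::vector<std::unique_ptr<IRType>> types_;  // the only owner
};

struct IRScope {
    explicit IRScope(std::string n) : name(std::move(n)) {}
    IRScope* addChild(std::string childName) {
        auto child = std::make_unique<IRScope>(std::move(childName));
        child->parent = this;  // back-reference only
        children.push_back(std::move(child));
        return children.back().get();
    }
    std::string name;
    IRScope* parent = nullptr;                       // non-owning
    std::vector<std::unique_ptr<IRScope>> children;  // owning
};

inline void demoOwnership() {
    {
        IRTypeTable table;
        IRType* a = table.add("A");
        IRType* b = table.add("B");
        IRType* pa = table.add("A*");
        IRType* pb = table.add("B*");
        pa->pointee = a;
        pb->pointee = b;
        a->fields.push_back({"b", pb});  // struct A { B* b; }
        b->fields.push_back({"a", pa});  // struct B { A* a; }
        assert(liveTypeCount() == 4);
        assert(a->fields[0].type->pointee->fields[0].type->pointee == a);
    }
    assert(liveTypeCount() == 0);  // the reference cycle did not keep anything alive
}
```

With `shared_ptr` in `IRField::type` and `IRType::pointee`, the A/B cycle in this example would keep both structs alive after the table is gone.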
Add explicit mapping tables:
- mapDieToIR, mapIRToDie
- mapTItoIR, mapIRtoTI
These are absolutely required for round-trip stability and for recursive types.
- Build a central IR layer (IRTypeTable, IRScope, IRType subclasses).
- Both DWARF and PDB frontends fill that same IR and keep side maps.
- Both DWARF and PDB backends consume that IR and use side maps to assign per-format IDs (DIE offsets, PDB Type Indices).
- Use unique_ptr ownership in IR, plus raw backpointers/non-owning refs for relationships.
- Keep DwarfNode / PdbNode trees only as format-specific snapshots/helpers, not as the canonical truth.
This gives you:
- Deterministic memory management
- No duplication of shared types
- A place to normalize tricky constructs (bitfields, anonymous unions, template params, local-scope types)
- A clean seam to add future formats (e.g. DWARF5+Split-DWARF, or CV in PE/OBJ with incremental linking quirks)
This is a solid base to implement DWARF→PDB→DWARF without losing meaning.