Skip to content

ormastes/dwarf2pdb

Repository files navigation

dwarf2pdb - DWARF to PDB Executable Converter

Final Project Goal

dwarf2pdb produces an executable that converts DWARF-embedded executables into stripped executables with separate PDB debug files:

Input: DWARF-embedded executable (ELF/PE with embedded DWARF debug info) Output: Stripped executable + separate PDB file

Use Case

Enable Windows debugging tools (Visual Studio, WinDbg, Windows Performance Analyzer) to debug executables originally compiled with DWARF debug information (GCC, Clang on Linux/macOS, MinGW).

Example Usage

# Convert DWARF-embedded executable to stripped exe + PDB
dwarf2pdb my_program.exe my_program_stripped.exe my_program.pdb

# Result:
# - my_program_stripped.exe (no debug info, smaller size)
# - my_program.pdb (full debug information in PDB format)

Development Methodology: Test Driven Development (TDD)

This project follows strict Test Driven Development:

  1. Write tests first - Define expected behavior through failing tests
  2. Implement minimally - Write just enough code to pass tests
  3. Refactor - Improve code while maintaining test coverage
  4. Iterate - Repeat for each feature

Test Hierarchy:

  • Unit tests (ut/) - Component isolation tests (node ↔ IR transformations)
  • Integration tests (it/) - Format I/O with real DWARF/PDB libraries
  • System tests (st/) - End-to-end executable conversion validation

All features require corresponding tests. Code coverage must be maintained or increased with every change.


Architecture Discussion

Love this problem. You're basically trying to build a round-trip translator between DWARF and PDB:

  1. Load DWARF → create "dwarf object model"
  2. Convert that into a "pdb object model"
  3. Use a PDB writer lib to emit a real PDB file
  4. Later read that PDB → rebuild DWARF

You're asking:

  • Can we do 1:1 mapping?
  • How do we handle shared / recursive types and ID assignment?
  • Is the node/tree/smart-pointer design good?
  • Can we sketch a class design / UML-ish plan?

I'll walk through:

  1. Why 1:1 is tricky and how to solve it anyway
  2. How to handle IDs & shared types safely
  3. Suggested architecture layers
  4. UML-style diagram (Mermaid)
  5. Improvements to your current “node with parent pointer + smart_ptr tree” idea

1. Can we do DWARF ⇄ PDB as 1:1?

Short answer: not literally “1 struct node ↔ 1 struct node” in all cases, because:

  • DWARF scopes types lexically (a struct inside a function is IN that function DIE).
  • PDB scopes types by symbol streams, not in the TPI record itself.
  • PDB forward refs + full defs are two records; DWARF forward decl + full def are one logical DIE pair.
  • Templates and anonymous aggregates are represented differently.

But we can do a stable, loss-minimized round trip if we introduce our own middle IR that’s rich enough to represent everything we care about, even if neither format matches it perfectly.

So: instead of “DWARF → PDB directly”, do DWARF → IR → PDB and on the way back PDB → IR → DWARF.

This IR is the place where you keep canonical type identity, ownership, and cross-refs. That IR is what you manage with smart pointers, parent pointers, etc.

You can still keep per-format node trees (like DwarfNode, PdbNode) if you want for debugging/export fidelity, but you shouldn’t try to mutate/merge those trees directly. They are format views, not truth.


2. Shared / recursive types and ID allocation

This is the hardest technical point you asked: “how inter shared type handled both dwarf and pdb alloc ids?”

The problem

  • DWARF uses DIE references. Two members in different structs can both point to the same type DIE.
  • PDB uses Type Indices in the TPI stream. Same deal: multiple fields point at the same TI.

You must NOT duplicate “struct Foo” six times just because six places reference it. The pointer identity matters, especially for recursive types (like struct Node { Node* next; }).

The fix

Introduce a TypeTable in the IR layer:

  • Every distinct semantic type (struct Foo, pointer-to-int, array[10] of Bar, etc.) has exactly one IR node instance.
  • That node has a stable IRTypeID (your own integer or pointer identity).
  • The IR tracks children (members) and references to other IR types.

Then each frontend builds that table:

  • The DWARF reader walks DIEs. For each DIE that describes a type, you either:

    • look it up in a map <dwarf_die_offset → IRTypeID> and reuse, or
    • create a new IR node and assign a new IRTypeID.
  • The PDB reader does the same with <pdb_type_index → IRTypeID>.

Then each backend consumes the same IR table:

  • The PDB writer will assign CodeView TIs in a stable order by walking IR types and emitting records, tracking <IRTypeID → pdb_type_index>.
  • The DWARF writer will assign DIE offsets / references and track <IRTypeID → dwarf_die_offset>.

That gives you consistent sharing / recursion.

Summary rule:

Do not try to reuse DWARF offsets or PDB type indices as your “canonical ID.” Make your own IRTypeID and maintain per-format maps.

That solves:

  • shared types
  • forward refs
  • local scoped types
  • anonymous aggregates that appear multiple times under different parents
  • template instantiations

3. Layered architecture

Here’s a clean layering:

3.1 Core IR layer (format-agnostic)

  • IRType, IRField, IRFunction, IRVariable, etc.

  • IRTypeTable (interns / dedups types, owns all IRType objects)

  • IRScope tree (lexical scopes / namespaces / function bodies)

  • Handles:

    • struct / class / union
    • bitfield info
    • arrays
    • template params
    • “this type is local to function F”
    • access specifiers, etc.

This is where you do canonicalization and dedup.

3.2 DWARF front/back

  • DwarfReader → walks DWARF DIE tree → produces IR objects using IRTypeTable.
  • DwarfWriter ← walks IR → emits DIEs.
  • Extra: DwarfNode objects if you want to retain exact original DIE layout, attributes, offsets, CU boundaries. This is useful if you want to preserve “as close to original as possible,” but it is optional logically.

3.3 PDB front/back

  • PdbReader → parses TPI, symbol streams → fills IRTypeTable, builds IRScope from lexical symbols (S_BLOCK32, etc.).

  • PdbWriter ← walks IR → emits:

    • TPI stream with LF_* records,
    • symbol streams (S_GPROC32, S_LOCAL, S_UDT, etc.).
  • Extra: PdbNode objects if you want a “verbatim CodeView record graph” view for debugging or round-trip fidelity.

3.4 Mapping helpers

  • DwarfToIRMapper, IRToDwarfMapper
  • PdbToIRMapper, IRToPdbMapper

These maintain lookup maps:

  • DWARF DIE offset ↔ IRTypeID
  • PDB type index ↔ IRTypeID
  • IRTypeID ↔ emitted DIE offset
  • IRTypeID ↔ emitted PDB type index

This is where we ensure shared types don’t explode.


4. UML-style sketch (Mermaid)

classDiagram
    %% ===== Core IR Layer =====
    class IRTypeTable {
        +getOrCreateType(astKey) : IRType*
        +lookupByIRTypeID(id) : IRType*
        -types : map<IRTypeID, IRType*>
        -byKey : map<HashKey, IRTypeID>
    }

    class IRType {
        <<abstract>>
        +irTypeID : IRTypeID
        +name : string          // "Node", "anonymous$1", "MyVec<int,42>"
        +linkageScope : IRScope* // where it's visible (file, func, namespace)
        +flags : TypeFlags       // struct/union/class/enum/typedef/pointer/array/etc.
    }

    class IRStructType {
        +sizeBytes : uint64
        +fields : vector<IRField>
        +templateParams : vector<IRTemplateParam>
        +isUnion : bool
        +isForwardDecl : bool
    }
    IRStructType --|> IRType

    class IRArrayType {
        +elem : IRType*
        +indexType : IRType*     // for PDB LF_ARRAY
        +dims : vector<IRArrayDim>
        +totalSizeBytes : uint64
    }
    IRArrayType --|> IRType

    class IRPointerType {
        +pointee : IRType*
        +qualifiers : QualFlags  // const/volatile/restrict
        +ptrSizeBytes : uint32
    }
    IRPointerType --|> IRType

    class IRField {
        +name : string
        +type : IRType*
        +byteOffset : uint64
        +bitOffset : uint16      // start bit within storage unit
        +bitSize   : uint16      // 0 if not bitfield
        +access    : AccessKind  // public/protected/private
        +isAnonymousAggregateArm : bool
    }

    class IRTemplateParam {
        +paramName : string
        +kind : {Type,Value}
        +typeArg : IRType*       // if kind==Type
        +valueArg : string       // if kind==Value (store literal "42")
    }

    class IRArrayDim {
        +lowerBound : int64
        +count      : uint64     // element count
    }

    class IRScope {
        +parent : IRScope*
        +children : vector<IRScope*>
        +declaredTypes : vector<IRType*>
        +declaredSymbols : vector<IRSymbol>
        +kind : {CompileUnit,Namespace,Function,Block,FileStatic}
        +name : string
    }

    class IRSymbol {
        +name : string
        +kind : {Variable,Parameter,Function}
        +type : IRType*
        +storageInfo : StorageLoc
        +rangeInfo : LiveRangeInfo
    }

    %% ===== Format-specific views =====
    class DwarfNode {
        +tag : DW_TAG
        +attrs : map<DW_AT, AttrValue>
        +children : vector<DwarfNode*>
        +parent : DwarfNode*
        +cuID : int
        +originalDieOffset : uint64
    }

    class PdbNode {
        +leafKind : uint16       // LF_*, S_*
        +payload  : bytes
        +children : vector<PdbNode*>
        +parent   : PdbNode*
        +moduleID : int
        +typeIndexOrSymOffset : uint32
    }

    %% ===== Readers / Writers =====
    class DwarfReader {
        +readObject(objFilePath) : IRScope*
        -mapDieToIR : map<uint64, IRTypeID>
        -typeTable : IRTypeTable*
    }

    class DwarfWriter {
        +emit(objFilePath, rootScope:IRScope*)
        -mapIRToDie : map<IRTypeID, uint64>
        -typeTable : IRTypeTable*
    }

    class PdbReader {
        +readPdb(pdbPath) : IRScope*
        -mapTItoIR : map<uint32, IRTypeID>
        -typeTable : IRTypeTable*
    }

    class PdbWriter {
        +emit(pdbPath, rootScope:IRScope*)
        -mapIRtoTI : map<IRTypeID, uint32>
        -typeTable : IRTypeTable*
    }

    %% relations
    IRScope "1" o--> "many" IRType : declaredTypes
    IRScope "1" o--> "many" IRSymbol : declaredSymbols
    IRScope "1" o--> "many" IRScope : children
    IRStructType "1" o--> "many" IRField
    IRStructType "1" o--> "many" IRTemplateParam
    IRArrayType "1" o--> "many" IRArrayDim

    DwarfReader --> IRTypeTable
    DwarfWriter --> IRTypeTable
    PdbReader   --> IRTypeTable
    PdbWriter   --> IRTypeTable

    DwarfReader --> DwarfNode : (optional capture original)
    PdbReader   --> PdbNode   : (optional capture original)
    DwarfWriter --> DwarfNode : (construct to serialize)
    PdbWriter   --> PdbNode   : (construct to serialize)
Loading

Notes:

  • IRTypeTable owns all IRType instances (unique_ptr inside it).
  • IRScope is your lexical tree (compile unit → namespace → function → block…).
  • DwarfNode / PdbNode are format ASTs, mainly for emitting or debugging fidelity.

5. Review of your proposed design and improvements

Your idea:

dwarf read wrapper, dwarf write wrapper, pdb read wrapper, pdb write wrapper. dwarf node, pdb node. node is tree with parent pointer. node with smart pointer to delete automatically.

That’s a good skeleton, but a few improvements are important:

(A) Add a format-neutral IR layer

Why:

  • You cannot safely map DWARF nodes ⇄ PDB nodes directly because of scope vs TPI differences, forward-ref vs final-def differences, etc.
  • You need canonical type dedup + ID mapping. That wants to live in one place, not in both sides.

If you skip IR, you'll fight duplication, missing recursion, anonymous things, etc.

(B) Use 2 kinds of "node"

Right now you have only one generic “Node”. You actually want:

  1. IR graph nodes (IRType / IRScope / IRSymbol)

    • This is semantic, normalized.
    • These are referenced by other IR nodes and are stable across formats.
    • Memory-owned by IRTypeTable (types) and root IRScope (scopes/symbols).
  2. Format nodes (DwarfNode / PdbNode)

    • These mirror the source/target format exactly.
    • They are mostly for I/O, not long-term truth.

This split lets you do:

  • DwarfReader: DwarfNode → IR
  • PdbWriter: IR → PdbNode

and back.

(C) Smart pointer / lifetime

Yes, use smart pointers, but:

  • Have central ownership, not arbitrary shared cycles.

    • IRTypeTable can store all std::unique_ptr<IRType> in a vector or map.
    • IRScope can store std::unique_ptr<IRScope> children and IRSymbol by value.
  • Cross-references between IR nodes should be raw pointers or IRTypeID integers, NOT shared_ptr, to avoid ref cycles (struct A has field of type B, type B has field of type A (via pointer), etc.).

    • e.g. IRField::type can be IRType* or IRTypeID.
  • parent pointers in scopes can be non-owning raw pointers.

So:

  • Ownership is acyclic (IRTypeTable/IRScope root own things).
  • Parent pointers are just backrefs, not owning.
  • No shared_ptr cycles => deterministic destruction.

(D) ID maps

Add explicit mapping tables:

  • mapDieToIR, mapIRToDie
  • mapTItoIR, mapIRtoTI

These are absolutely required for round-trip stability and for recursive types.


6. TL;DR design

  • Build a central IR layer (IRTypeTable, IRScope, IRType subclasses).
  • Both DWARF and PDB frontends fill that same IR and keep side maps.
  • Both DWARF and PDB backends consume that IR and use side maps to assign per-format IDs (DIE offsets, PDB Type Indices).
  • Use unique_ptr ownership in IR, plus raw backpointers/non-owning refs for relationships.
  • Keep DwarfNode / PdbNode trees only as format-specific snapshots/helpers, not as the canonical truth.

This gives you:

  • Deterministic memory management
  • No duplication of shared types
  • A place to normalize tricky constructs (bitfields, anonymous unions, template params, local-scope types)
  • A clean seam to add future formats (e.g. DWARF5+Split-DWARF, or CV in PE/OBJ with incremental linking quirks)

This is a solid base to implement DWARF→PDB→DWARF without losing meaning.

About

dwarf2pdb converter

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors