Quantify the loss of functionality at each translation step #62

artemdinaburg · 2025-03-12T19:29:48Z

We lose some information at each step of High Pcode -> Clang IR -> Tower of IRs -> LLVM -> Binary

There should be some way to quantify this loss, possibly by measuring re-translation on Clang's compilation test suite.

kaoudis · 2025-05-07T19:26:50Z

Notes from sync today:

Does lifting from the binary produce reasonable p-code?
Does lifting to C produce reasonable C? (We believe so, especially given the Rellic optimization passes Akshay is currently extracting into a library.)
Does the C created from lifted p-code actually build?
Does a binary built from that C behave the same as the original binary?

Can answering the question of "do we lift correctly" ultimately be automated, maybe through differential testing between the resulting binaries?
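A minimal sketch of what differential testing between the two binaries could look like, assuming both can be driven with the same inputs and observed the same way. `binary_runner` assumes a host-runnable executable that reads stdin; for the firmware images that would have to be an emulator or a board harness, which is not shown. The two checksum functions are made-up stand-ins for "original" and "rebuilt" so the example runs on its own.

```python
import subprocess
from typing import Callable, Iterable, List, Tuple

# A runner takes input bytes and returns whatever we choose to observe.
Runner = Callable[[bytes], object]


def binary_runner(path: str, timeout: float = 5.0) -> Runner:
    """Wrap an executable: feed it stdin, observe exit code and stdout."""
    def run(data: bytes) -> object:
        proc = subprocess.run([path], input=data, capture_output=True,
                              timeout=timeout)
        return (proc.returncode, proc.stdout)
    return run


def differential_test(original: Runner, rebuilt: Runner,
                      inputs: Iterable[bytes]) -> List[Tuple[bytes, object, object]]:
    """Return every input on which the two runners disagree."""
    mismatches = []
    for data in inputs:
        expected = original(data)
        actual = rebuilt(data)
        if expected != actual:
            mismatches.append((data, expected, actual))
    return mismatches


# Hypothetical stand-ins so the sketch is runnable without real binaries.
def original_checksum(data: bytes) -> int:
    return sum(data) & 0xFF


def rebuilt_checksum(data: bytes) -> int:
    return sum(data) & 0x7F  # deliberately wrong: drops the top bit


mismatches = differential_test(original_checksum, rebuilt_checksum,
                               [b"\x01\x02", b"\xff"])
print(mismatches)
```

An empty mismatch list only says the two agree on the inputs tried, so the choice of inputs matters as much as the harness.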

kaoudis · 2025-05-07T22:22:30Z

Looks like PulseOX builds but Bloodlight doesn't, on the main branch at the pinned commits, using the firmwares/build.sh script. Going to debug this, since ideally I'd like to try lifting them both.

kaoudis · 2025-05-07T22:43:05Z

Bloodlight didn't build by default, but it did once I locally changed its build command in the script to:

docker run --rm \
-v "/home/kellykaoudis/patchestry/firmwares/repos/bloodlight-firmware":"/work/bloodlight-firmware" \
-v "/home/kellykaoudis/patchestry/firmwares/output":"/output" \
firmware-builder bash -c "git config --global --add safe.directory /work/bloodlight-firmware && \
                                             cd bloodlight-firmware && \
                                             make -C firmware/libopencm3 && \
                                             make -C firmware -j8 && \
                                             cp firmware/bloodlight-firmware.elf /output/bloodlight-firmware.elf"

I'm not a huge fan of driving a build container with a single command like this, since it's kind of messy: 3-4 commands that could be separate RUN commands in a Dockerfile are instead "loose" in a string executed by a bash shell spawned from another bash script. I'd prefer one base image and two child builds that inherit from it. I might clean that up. The current setup is annoying since it isn't one-shot and git complained about ambiguous ownership, though I might be the only one hitting that.

kaoudis · 2025-05-08T14:53:10Z

In decompiling to pcode, I observed the following errors for pulseox:

./scripts/ghidra/decompile-headless.sh --input firmwares/output/pulseox-firmware.elf --output firmwares/output/pulseox.pcode

yielded, among output that seemed correct,

ERROR DWARF data type remappings (DWARF data type definitions that changed meaning in different compile units): (DWARFImportSummary)  
ERROR   Data type -> changed to -> Data Type (DWARFImportSummary)  
ERROR   /uint -> /wchar_t (DWARFImportSummary)  
ERROR DWARF variable definitions that failed because they depended on the dynamic value of a register: 1 (DWARFImportSummary)  
ERROR DWARF variable definitions that failed because they are computed pseudo variables: 18 (DWARFImportSummary)

kaoudis · 2025-05-08T14:54:48Z

I also observed errors for bloodlight, though quite different ones:

./scripts/ghidra/decompile-headless.sh --input firmwares/output/bloodlight-firmware.elf --output firmwares/output/bloodlight.pcode

ERROR DWARF static variables with missing address info: (DWARFImportSummary)  
ERROR   [Variable symbolic name  : variable data type] (DWARFImportSummary)  
ERROR   bl_spi_dma__rx:/DWARF/_UNCATEGORIZED_/bl_spi_dma_t (DWARFImportSummary)  
ERROR   bl_spi_dma_tx:/DWARF/_UNCATEGORIZED_/bl_spi_dma_t * (DWARFImportSummary)  
ERROR   rcc_hsi_configs:/DWARF/_UNCATEGORIZED_/rcc_clock_scale[2] (DWARFImportSummary)  
ERROR   rcc_hse8mhz_configs:/DWARF/_UNCATEGORIZED_/rcc_clock_scale[1] (DWARFImportSummary)  
ERROR   bl_spi_dma__tx:/DWARF/_UNCATEGORIZED_/bl_spi_dma_t (DWARFImportSummary)  
ERROR   bl_spi_dma_rx:/DWARF/_UNCATEGORIZED_/bl_spi_dma_t * (DWARFImportSummary)  
ERROR DWARF variable definitions that failed because they are computed pseudo variables: 50 (DWARFImportSummary)  
ERROR DWARF expression failed to read: 1 (DWARFImportSummary)

and ultimately

-----------------------------------------------------
     Total Time   2 secs
-----------------------------------------------------
 (AutoAnalysisManager)  
INFO  PatchestryDecompileFunctions.java> Running in mode: all (GhidraScript)  
INFO  PatchestryDecompileFunctions.java> Error: Offset must be between 0x0 and 0xffffffff, got 0x400a856808000f3c instead! (GhidraScript)  
ghidra.program.model.address.AddressFormatException: Offset must be between 0x0 and 0xffffffff, got 0x400a856808000f3c instead!
	at ghidra.program.model.address.AbstractAddressSpace.getAddress(AbstractAddressSpace.java:224)
	at ghidra.program.model.address.GenericAddressSpace.getAddress(GenericAddressSpace.java:21)
	at ghidra.program.model.address.AbstractAddressSpace.getAddress(AbstractAddressSpace.java:199)
	at PatchestryDecompileFunctions$PcodeSerializer.convertAddressToRamSpace(PatchestryDecompileFunctions.java:849)
	at PatchestryDecompileFunctions$PcodeSerializer.getDataReferencedAsConstant(PatchestryDecompileFunctions.java:895)
	at PatchestryDecompileFunctions$PcodeSerializer.serializeInput(PatchestryDecompileFunctions.java:976)
	at PatchestryDecompileFunctions$PcodeSerializer.serializeIntrinsicCallOp(PatchestryDecompileFunctions.java:2028)
	at PatchestryDecompileFunctions$PcodeSerializer.serializeCallOtherOp(PatchestryDecompileFunctions.java:2052)
	at PatchestryDecompileFunctions$PcodeSerializer.serialize(PatchestryDecompileFunctions.java:2104)
	at PatchestryDecompileFunctions$PcodeSerializer.serialize(PatchestryDecompileFunctions.java:2239)
	at PatchestryDecompileFunctions$PcodeSerializer.serialize(PatchestryDecompileFunctions.java:2346)
	at PatchestryDecompileFunctions$PcodeSerializer.serializeFunctions(PatchestryDecompileFunctions.java:2516)
	at PatchestryDecompileFunctions$PcodeSerializer.serialize(PatchestryDecompileFunctions.java:2538)
	at PatchestryDecompileFunctions.serializeToFile(PatchestryDecompileFunctions.java:2599)
	at PatchestryDecompileFunctions.decompileAllFunctions(PatchestryDecompileFunctions.java:2626)
	at PatchestryDecompileFunctions.runHeadless(PatchestryDecompileFunctions.java:2664)
	at PatchestryDecompileFunctions.run(PatchestryDecompileFunctions.java:2724)
	at ghidra.app.script.GhidraScript.executeNormal(GhidraScript.java:405)
	at ghidra.app.script.GhidraScript.doExecute(GhidraScript.java:260)
	at ghidra.app.script.GhidraScript.execute(GhidraScript.java:238)
	at ghidra.app.util.headless.HeadlessAnalyzer.runScript(HeadlessAnalyzer.java:588)
	at ghidra.app.util.headless.HeadlessAnalyzer.runScriptsList(HeadlessAnalyzer.java:926)
	at ghidra.app.util.headless.HeadlessAnalyzer.analyzeProgram(HeadlessAnalyzer.java:1074)
	at ghidra.app.util.headless.HeadlessAnalyzer.processFileWithImport(HeadlessAnalyzer.java:1563)
	at ghidra.app.util.headless.HeadlessAnalyzer.processWithLoader(HeadlessAnalyzer.java:1745)
	at ghidra.app.util.headless.HeadlessAnalyzer.processWithImport(HeadlessAnalyzer.java:1
ERROR REPORT SCRIPT ERROR:  (HeadlessAnalyzer) ghidra.program.model.address.AddressFormatException: Offset must be between 0x0 and 0xffffffff, got 0x400a856808000f3c instead!
	at ghidra.program.model.address.AbstractAddressSpace.getAddress(AbstractAddressSpace.java:224)
	at ghidra.program.model.address.GenericAddressSpace.getAddress(GenericAddressSpace.java:21)
	at ghidra.program.model.address.AbstractAddressSpace.getAddress(AbstractAddressSpace.java:199)
	at PatchestryDecompileFunctions$PcodeSerializer.convertAddressToRamSpace(PatchestryDecompileFunctions.java:849)
	at PatchestryDecompileFunctions$PcodeSerializer.getDataReferencedAsConstant(PatchestryDecompileFunctions.java:895)
	at PatchestryDecompileFunctions$PcodeSerializer.serializeInput(PatchestryDecompileFunctions.java:976)
	at PatchestryDecompileFunctions$PcodeSerializer.serializeIntrinsicCallOp(PatchestryDecompileFunctions.java:2028)
	at PatchestryDecompileFunctions$PcodeSerializer.serializeCallOtherOp(PatchestryDecompileFunctions.java:2052)
	at PatchestryDecompileFunctions$PcodeSerializer.serialize(PatchestryDecompileFunctions.java:2104)
	at PatchestryDecompileFunctions$PcodeSerializer.serialize(PatchestryDecompileFunctions.java:2239)
	at PatchestryDecompileFunctions$PcodeSerializer.serialize(PatchestryDecompileFunctions.java:2346)
	at PatchestryDecompileFunctions$PcodeSerializer.serializeFunctions(PatchestryDecompileFunctions.java:2516)
	at PatchestryDecompileFunctions$PcodeSerializer.serialize(PatchestryDecompileFunctions.java:2538)
	at PatchestryDecompileFunctions.serializeToFile(PatchestryDecompileFunctions.java:2599)
	at PatchestryDecompileFunctions.decompileAllFunctions(PatchestryDecompileFunctions.java:2626)
	at PatchestryDecompileFunctions.runHeadless(PatchestryDecompileFunctions.java:2664)
	at PatchestryDecompileFunctions.run(PatchestryDecompileFunctions.java:2724)
	at ghidra.app.script.GhidraScript.executeNormal(GhidraScript.java:405)
	at ghidra.app.script.GhidraScript.doExecute(GhidraScript.java:260)
	at ghidra.app.script.GhidraScript.execute(GhidraScript.java:238)
	at ghidra.app.util.headless.HeadlessAnalyzer.runScript(HeadlessAnalyzer.java:588)
	at ghidra.app.util.headless.HeadlessAnalyzer.runScriptsList(HeadlessAnalyzer.java:926)
	at ghidra.app.util.headless.HeadlessAnalyzer.analyzeProgram(HeadlessAnalyzer.java:1074)
	at ghidra.app.util.headless.HeadlessAnalyzer.processFileWithImport(HeadlessAnalyzer.java:1563)
	at ghidra.app.util.headless.HeadlessAnalyzer.processWithLoader(HeadlessAnalyzer.java:1745)
	at ghidra.app.util.headless.HeadlessAnalyzer.processWithImport(HeadlessAnalyzer.java:1686)
	at ghidra.app.util.headless.HeadlessAnalyzer.processWithImport(HeadlessAnalyzer.java:1770)
	at ghidra.app.util.headless.HeadlessAnalyzer.processLocal(HeadlessAnalyzer.java:457)
	at ghidra.app.util.headless.AnalyzeHeadless.launch(AnalyzeHeadless.java:198)
	at ghidra.GhidraLauncher.launch(GhidraLauncher.java:81)
	at ghidra.Ghidra.main(Ghidra.java:54)
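For reading the exception itself, a small sketch of the range check it describes: the address space accepts offsets from 0x0 to 0xffffffff, and the value passed in is 64 bits wide. Splitting it into halves is only an aid to inspection; the trace does not say where the upper half came from, so treat any interpretation of the two halves as a guess.

```python
def fits_address_space(offset: int, bits: int = 32) -> bool:
    """True if offset is representable in an address space of the given width."""
    return 0 <= offset < (1 << bits)


offset = 0x400A856808000F3C  # value reported in the exception above

high_half = offset >> 32          # upper 32 bits
low_half = offset & 0xFFFFFFFF    # lower 32 bits

print(fits_address_space(offset))    # the full value is out of range
print(hex(high_half), hex(low_half))
print(fits_address_space(low_half))  # the lower half alone would be accepted
```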

kaoudis · 2025-05-08T15:29:57Z

Lifting may have worked for pulseox; the following command produced a .cir and a .c file where directed:

builds/default/tools/pcode-lifter/Debug/pcode-lifter --input firmwares/output/pulseox.pcode --emit-cir --output firmwares/output/pulseox-lifted --print-tu

kaoudis · 2025-05-08T15:30:35Z

Butttttt bloodlight has some problems:

builds/default/tools/pcode-lifter/Debug/pcode-lifter --input firmwares/output/bloodlight.pcode --emit-cir --output firmwares/output/bloodlight-lifted --print-tu

yields the following:

[ERROR] (/home/kellykaoudis/patchestry/tools/pcode-lifter/main.cpp:218) Failed to parse pcode JSON: [1:311296, byte=311296]: Unterminated string
Program aborted due to an unhandled Error:
[1:311296, byte=311296]: Unterminated string
[1]    229892 IOT instruction (core dumped)  builds/default/tools/pcode-lifter/Debug/pcode-lifter --input  --emit-cir
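One cheap way to look at a failure like this is to run the same file through another JSON parser and print the text around the reported position. The sketch below uses Python's standard `json` module, not the lifter's parser, so the reported offset can differ: Python points at the start of the unterminated string, while the lifter appears to report the byte where it gave up. The function name and the `context` parameter are made up for illustration.

```python
import json
from typing import Optional


def locate_json_error(text: str, context: int = 40) -> Optional[dict]:
    """Return details about the first JSON syntax error, or None if text parses."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        start = max(0, err.pos - context)
        return {
            "message": err.msg,
            "pos": err.pos,          # character offset into text
            "line": err.lineno,
            "column": err.colno,
            "excerpt": text[start:err.pos + context],
        }
    return None


# A string value that is never closed, as in the lifter's error message.
report = locate_json_error('{"name": "abc')
print(report)
```

Against the real output this would be called as `locate_json_error(open("firmwares/output/bloodlight.pcode").read())`. If the excerpt ends at the end of the file, the p-code JSON was probably cut off while being written rather than malformed in the middle.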

kaoudis · 2025-05-08T15:33:01Z

Making p-code is sort of the equivalent of making a whole-program, statically compiled output, I suppose, so I'm not really sure how to compile the C down again the way it originally was. The Patchestry proposal was very light on build-environment details. So I'll try the dumbest way possible first (all deps are there, so maybe just clang foo.c -o foo?), then see if there's prior art anywhere in the repo, and if not, see if I can recycle the original pulseox build. I wonder whether that will work, since the whole thing has already been linked once. What about dynamic dependencies and symbols that only get linked at runtime? Are these even accounted for by Patchestry generally? Can they be? Can we force people to only run on static builds? That seems kind of silly outside a sanitizer context, since dynamic dependencies are so normal in regular builds. (Sanitizers are a place where you can tell people to do everything dynamically or everything statically; it's a test/debugging thing, so it's not so weird.)

Moreover, what about a build made on a specialized build machine to which I don't have access? How could I ever reproduce that environment, or even reverse engineer what it would have been?
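Before worrying about linking at all, the "does the lifted C even build" question can be narrowed to "does it parse and type-check". A rough sketch, assuming clang is on PATH; the target flags are a guess at a bare-metal Armv6-M configuration and are not taken from the original firmware builds. `-fsyntax-only` stops before code generation, so this says nothing about whether the result would link or run.

```python
import subprocess
from typing import List, Sequence, Tuple

# Assumed flags for a bare-metal Armv6-M target; adjust to the real build.
DEFAULT_TARGET_FLAGS = ("--target=arm-none-eabi", "-march=armv6-m", "-mthumb")


def syntax_check_command(c_file: str, compiler: str = "clang",
                         target_flags: Sequence[str] = DEFAULT_TARGET_FLAGS) -> List[str]:
    """Build the argv for a parse-and-type-check-only compiler run."""
    return [compiler, *target_flags, "-fsyntax-only", c_file]


def syntax_checks(c_file: str, compiler: str = "clang",
                  target_flags: Sequence[str] = DEFAULT_TARGET_FLAGS) -> Tuple[bool, str]:
    """Run the check; return (succeeded, compiler diagnostics)."""
    proc = subprocess.run(syntax_check_command(c_file, compiler, target_flags),
                          capture_output=True, text=True)
    return proc.returncode == 0, proc.stderr
```

Usage would be something like `ok, diagnostics = syntax_checks("lifted.c")`, where `lifted.c` is a placeholder for whatever .c file the lifter wrote.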

kaoudis · 2025-05-08T15:59:14Z

For High Pcode -> Clang IR -> Tower of IRs -> LLVM -> Binary:

How accurate vs lossy is each translation step?
Did we lift correctly?
Do we have something for each function in the previous step, since we're working at the function level?
Does the recompiled C ultimately do what the original binary did?
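The "do we have something for each function in the previous step" question can be turned into a number per translation step. A minimal sketch, assuming each stage can be reduced to a set of function names; how those names get extracted from a binary, a p-code JSON file, or lifted C is not shown, and the stage contents below are invented.

```python
from typing import Dict, List, Set, Tuple


def function_retention(stages: List[Tuple[str, Set[str]]]) -> List[Dict[str, object]]:
    """For each consecutive pair of stages, report which functions survived."""
    report = []
    for (prev_name, prev_funcs), (cur_name, cur_funcs) in zip(stages, stages[1:]):
        kept = prev_funcs & cur_funcs
        report.append({
            "step": f"{prev_name} -> {cur_name}",
            "kept": len(kept),
            "lost": sorted(prev_funcs - cur_funcs),
            # Fraction of the previous stage's functions still present.
            "ratio": len(kept) / len(prev_funcs) if prev_funcs else 1.0,
        })
    return report


# Invented function names, purely for illustration.
stages = [
    ("binary", {"main", "spi_init", "adc_read", "usb_poll"}),
    ("high p-code", {"main", "spi_init", "adc_read"}),
    ("lifted C", {"main", "adc_read"}),
]

for row in function_retention(stages):
    print(row)
```

Presence is the weakest possible measure: a function can survive every step by name and still be lifted incorrectly, which is what the last question in the list above is about.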

kaoudis · 2025-05-08T16:42:21Z

Working on figuring out if there's anything "cheap" I can do about Bloodlight not extracting properly

kaoudis · 2025-05-12T14:31:47Z

I've also been taking notes on this in Slack. So far I have three problems that feed into "is the pcode we make accurate, and how accurate":

1. Neither the bloodlight nor the pulseox firmware architecture (the environment it was compiled for) is identified correctly by Ghidra by default.

For (1), as a short-term hack, I used file and readelf in the headless entrypoint script to better identify the arch in question (e.g. pulseox was decompiling as firmware for an Armv8 device, while readelf says it's an Armv6-M device with EABI; bloodlight was misidentified in a similar but more divergent way, which is why it doesn't fully extract for me). I think this impacts how accurate our pcode can be.

2. There aren't any unit tests, or any tests I can see beyond e2e test cases, for the PatchestryDecompileFunctions.java or PatchestryListFunctions.java scripts. I want test cases that tell me what the code does and that run in the same environment the code is supposed to run in, so that I know more about what even the happy path is.

For (2), I'm slowly figuring out how to get some unit tests for PatchestryDecompileFunctions wedged in as an option to the decomp scripts and container. The goal is to at least be able to tell whether we are roughly doing what we think we are doing, and maybe even to add some property-based testing if I am very lucky. Maybe there's a way to integrate this flow with LIT (which I do see runs tests of some sort right now, but apparently not on the decomp functionality). At the very least, I want to surface test-running in the same way and environment the code is run, rather than making the tests hard to find.

3. I'm learning Ghidra through this use of the headless API; I've never used this particular toolset before. What C/C++ RE I've done before has been with gdb, Radare2, and occasionally IDA.

For (3), I've been working with the docs at https://ghidra.re/ghidra_docs/api/ and with Claude to try to get some reasonable background and context.
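As an illustration of the kind of check behind (1), here is a sketch that reads the machine type straight from the ELF header instead of shelling out to file or readelf (that substitution is mine, not what the entrypoint script does). It only recovers word size, endianness, and the `e_machine` field. It cannot tell Armv6-M from a 32-bit Armv8 target, since both report ARM here; that distinction lives in the ARM build attributes that `readelf -A` prints, which this sketch does not parse.

```python
import struct

# A few e_machine values from the ELF specification.
EM_NAMES = {3: "x86", 40: "ARM", 62: "x86-64", 183: "AArch64"}


def describe_elf_header(header: bytes) -> dict:
    """Decode word size, endianness, and machine from the first 20 ELF bytes."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF header")
    bits = {1: 32, 2: 64}.get(header[4])          # EI_CLASS
    byte_order = {1: "<", 2: ">"}.get(header[5])  # EI_DATA
    if bits is None or byte_order is None:
        raise ValueError("unrecognized ELF class or data encoding")
    # e_machine is a 16-bit field at offset 18 in both ELF32 and ELF64.
    (machine,) = struct.unpack_from(byte_order + "H", header, 18)
    return {
        "bits": bits,
        "endian": "little" if byte_order == "<" else "big",
        "machine": EM_NAMES.get(machine, f"unknown ({machine})"),
    }


# Synthetic 32-bit little-endian ARM header, so the sketch runs without a file.
sample = b"\x7fELF" + bytes([1, 1, 1]) + bytes(9) + struct.pack("<HH", 2, 40)
print(describe_elf_header(sample))
```

For a real image the call would be `describe_elf_header(open(path, "rb").read(20))`.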

kaoudis · 2025-05-12T18:59:44Z

Started a branch in service of this work: https://github.com/lifting-bits/patchestry/tree/kaoudis/pcode-gen-correctness-tests
I shall probably use this just for "binary to pcode" and make a new one for... working with the existing LIT tests or something for "pcode to LLVM" before I look at "IR to C"

kaoudis · 2025-05-27T11:39:04Z

An issue discovered with p-code generation through this work is fixed in #86

kaoudis self-assigned this May 7, 2025
