Quantify the loss of functionality at each translation step #62

Open
artemdinaburg opened this issue Mar 12, 2025 · 13 comments
@artemdinaburg
Collaborator

We lose some information at each step of High Pcode -> Clang IR -> Tower of IRs -> LLVM -> Binary

There should be some way to quantify this loss, possibly by measuring re-translation on Clang's compilation test suite.

kaoudis self-assigned this May 7, 2025
@kaoudis
Collaborator

kaoudis commented May 7, 2025

Notes from sync today:

  • does lifting from the binary produce reasonable p-code?
  • does lifting to C produce reasonable C (we believe so, especially following the use of the Rellic optimization passes Akshay is currently extracting into a library)?
  • does the C created from lifted p-code actually build?
  • does a binary built from that C function the same as the original?

Can answering the question of "do we lift correctly" ultimately be automated, maybe through differential testing between the resulting binaries?
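
A minimal sketch of what that differential check could look like, assuming both the original and the rebuilt program can be run as host commands and observed through stdin, stdout, and exit code. Firmware images like the ones in this thread would need an emulator or hardware harness in front of them; that harness is not shown. The names `run` and `differential_test` are hypothetical, not part of Patchestry.

```python
import subprocess
import sys


def run(cmd, stdin_bytes, timeout=5):
    """Run one command and capture what is observable from outside."""
    proc = subprocess.run(cmd, input=stdin_bytes, capture_output=True, timeout=timeout)
    return proc.returncode, proc.stdout


def differential_test(original_cmd, rebuilt_cmd, inputs):
    """Return (input, original result, rebuilt result) for every disagreement."""
    mismatches = []
    for data in inputs:
        expected = run(original_cmd, data)
        actual = run(rebuilt_cmd, data)
        if expected != actual:
            mismatches.append((data, expected, actual))
    return mismatches


if __name__ == "__main__":
    # Stand-ins for "original binary" and "binary rebuilt from lifted C":
    # two Python one-liners, the second with a deliberate behavioral bug.
    original = [sys.executable, "-c",
                "import sys; sys.stdout.write(sys.stdin.read().upper())"]
    rebuilt = [sys.executable, "-c",
               "import sys; sys.stdout.write(sys.stdin.read().upper().replace('Z', '?'))"]
    for data, expected, actual in differential_test(original, rebuilt, [b"abc", b"xyz"]):
        print("mismatch on", data, ":", expected, "vs", actual)
```

An empty result only says the two programs agree on the inputs tried, so the quality of the answer depends on how the inputs are chosen.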

@kaoudis
Collaborator

kaoudis commented May 7, 2025

Looks like PulseOX builds but Bloodlight doesn't, on the main branch at the pinned commits using the firmwares/build.sh script. Going to debug this, since ideally I'd like to try lifting them both.

@kaoudis
Collaborator

kaoudis commented May 7, 2025

Looks like Bloodlight didn't build by default, but it did once I locally changed its build command in the script to:

docker run --rm \
-v "/home/kellykaoudis/patchestry/firmwares/repos/bloodlight-firmware":"/work/bloodlight-firmware" \
-v "/home/kellykaoudis/patchestry/firmwares/output":"/output" \
firmware-builder bash -c "git config --global --add safe.directory /work/bloodlight-firmware && \
                                             cd bloodlight-firmware && \
                                             make -C firmware/libopencm3 && \
                                             make -C firmware -j8 && \
                                             cp firmware/bloodlight-firmware.elf /output/bloodlight-firmware.elf"

I'm not a huge fan of using a build container with a single command like this since it's kinda messy (3-4 cmds that could be separate RUN commands in a Dockerfile instead of "loose" in a string executed by a bash shell created from another bash script like this), would prefer to have one base and two child builds that inherit from it. Might clean that up honestly, this is annoying since it's not oneshot and git yelled about ambiguous ownership, but might also be I'm the only one with such an issue.

@kaoudis
Collaborator

kaoudis commented May 8, 2025

In decompiling to pcode, I observed the following errors for pulseox:

./scripts/ghidra/decompile-headless.sh --input firmwares/output/pulseox-firmware.elf --output firmwares/output/pulseox.pcode

yielded, among output that seemed correct,

ERROR DWARF data type remappings (DWARF data type definitions that changed meaning in different compile units): (DWARFImportSummary)  
ERROR   Data type -> changed to -> Data Type (DWARFImportSummary)  
ERROR   /uint -> /wchar_t (DWARFImportSummary)  
ERROR DWARF variable definitions that failed because they depended on the dynamic value of a register: 1 (DWARFImportSummary)  
ERROR DWARF variable definitions that failed because they are computed pseudo variables: 18 (DWARFImportSummary)  

@kaoudis
Collaborator

kaoudis commented May 8, 2025

I also observed errors for bloodlight, but quite different ones:

./scripts/ghidra/decompile-headless.sh --input firmwares/output/bloodlight-firmware.elf --output firmwares/output/bloodlight.pcode
ERROR DWARF static variables with missing address info: (DWARFImportSummary)  
ERROR   [Variable symbolic name  : variable data type] (DWARFImportSummary)  
ERROR   bl_spi_dma__rx:/DWARF/_UNCATEGORIZED_/bl_spi_dma_t (DWARFImportSummary)  
ERROR   bl_spi_dma_tx:/DWARF/_UNCATEGORIZED_/bl_spi_dma_t * (DWARFImportSummary)  
ERROR   rcc_hsi_configs:/DWARF/_UNCATEGORIZED_/rcc_clock_scale[2] (DWARFImportSummary)  
ERROR   rcc_hse8mhz_configs:/DWARF/_UNCATEGORIZED_/rcc_clock_scale[1] (DWARFImportSummary)  
ERROR   bl_spi_dma__tx:/DWARF/_UNCATEGORIZED_/bl_spi_dma_t (DWARFImportSummary)  
ERROR   bl_spi_dma_rx:/DWARF/_UNCATEGORIZED_/bl_spi_dma_t * (DWARFImportSummary)  
ERROR DWARF variable definitions that failed because they are computed pseudo variables: 50 (DWARFImportSummary)  
ERROR DWARF expression failed to read: 1 (DWARFImportSummary)  

and ultimately

-----------------------------------------------------
     Total Time   2 secs
-----------------------------------------------------
 (AutoAnalysisManager)  
INFO  PatchestryDecompileFunctions.java> Running in mode: all (GhidraScript)  
INFO  PatchestryDecompileFunctions.java> Error: Offset must be between 0x0 and 0xffffffff, got 0x400a856808000f3c instead! (GhidraScript)  
ghidra.program.model.address.AddressFormatException: Offset must be between 0x0 and 0xffffffff, got 0x400a856808000f3c instead!
	at ghidra.program.model.address.AbstractAddressSpace.getAddress(AbstractAddressSpace.java:224)
	at ghidra.program.model.address.GenericAddressSpace.getAddress(GenericAddressSpace.java:21)
	at ghidra.program.model.address.AbstractAddressSpace.getAddress(AbstractAddressSpace.java:199)
	at PatchestryDecompileFunctions$PcodeSerializer.convertAddressToRamSpace(PatchestryDecompileFunctions.java:849)
	at PatchestryDecompileFunctions$PcodeSerializer.getDataReferencedAsConstant(PatchestryDecompileFunctions.java:895)
	at PatchestryDecompileFunctions$PcodeSerializer.serializeInput(PatchestryDecompileFunctions.java:976)
	at PatchestryDecompileFunctions$PcodeSerializer.serializeIntrinsicCallOp(PatchestryDecompileFunctions.java:2028)
	at PatchestryDecompileFunctions$PcodeSerializer.serializeCallOtherOp(PatchestryDecompileFunctions.java:2052)
	at PatchestryDecompileFunctions$PcodeSerializer.serialize(PatchestryDecompileFunctions.java:2104)
	at PatchestryDecompileFunctions$PcodeSerializer.serialize(PatchestryDecompileFunctions.java:2239)
	at PatchestryDecompileFunctions$PcodeSerializer.serialize(PatchestryDecompileFunctions.java:2346)
	at PatchestryDecompileFunctions$PcodeSerializer.serializeFunctions(PatchestryDecompileFunctions.java:2516)
	at PatchestryDecompileFunctions$PcodeSerializer.serialize(PatchestryDecompileFunctions.java:2538)
	at PatchestryDecompileFunctions.serializeToFile(PatchestryDecompileFunctions.java:2599)
	at PatchestryDecompileFunctions.decompileAllFunctions(PatchestryDecompileFunctions.java:2626)
	at PatchestryDecompileFunctions.runHeadless(PatchestryDecompileFunctions.java:2664)
	at PatchestryDecompileFunctions.run(PatchestryDecompileFunctions.java:2724)
	at ghidra.app.script.GhidraScript.executeNormal(GhidraScript.java:405)
	at ghidra.app.script.GhidraScript.doExecute(GhidraScript.java:260)
	at ghidra.app.script.GhidraScript.execute(GhidraScript.java:238)
	at ghidra.app.util.headless.HeadlessAnalyzer.runScript(HeadlessAnalyzer.java:588)
	at ghidra.app.util.headless.HeadlessAnalyzer.runScriptsList(HeadlessAnalyzer.java:926)
	at ghidra.app.util.headless.HeadlessAnalyzer.analyzeProgram(HeadlessAnalyzer.java:1074)
	at ghidra.app.util.headless.HeadlessAnalyzer.processFileWithImport(HeadlessAnalyzer.java:1563)
	at ghidra.app.util.headless.HeadlessAnalyzer.processWithLoader(HeadlessAnalyzer.java:1745)
	at ghidra.app.util.headless.HeadlessAnalyzer.processWithImport(HeadlessAnalyzer.java:1
ERROR REPORT SCRIPT ERROR:  (HeadlessAnalyzer) ghidra.program.model.address.AddressFormatException: Offset must be between 0x0 and 0xffffffff, got 0x400a856808000f3c instead!
	at ghidra.program.model.address.AbstractAddressSpace.getAddress(AbstractAddressSpace.java:224)
	at ghidra.program.model.address.GenericAddressSpace.getAddress(GenericAddressSpace.java:21)
	at ghidra.program.model.address.AbstractAddressSpace.getAddress(AbstractAddressSpace.java:199)
	at PatchestryDecompileFunctions$PcodeSerializer.convertAddressToRamSpace(PatchestryDecompileFunctions.java:849)
	at PatchestryDecompileFunctions$PcodeSerializer.getDataReferencedAsConstant(PatchestryDecompileFunctions.java:895)
	at PatchestryDecompileFunctions$PcodeSerializer.serializeInput(PatchestryDecompileFunctions.java:976)
	at PatchestryDecompileFunctions$PcodeSerializer.serializeIntrinsicCallOp(PatchestryDecompileFunctions.java:2028)
	at PatchestryDecompileFunctions$PcodeSerializer.serializeCallOtherOp(PatchestryDecompileFunctions.java:2052)
	at PatchestryDecompileFunctions$PcodeSerializer.serialize(PatchestryDecompileFunctions.java:2104)
	at PatchestryDecompileFunctions$PcodeSerializer.serialize(PatchestryDecompileFunctions.java:2239)
	at PatchestryDecompileFunctions$PcodeSerializer.serialize(PatchestryDecompileFunctions.java:2346)
	at PatchestryDecompileFunctions$PcodeSerializer.serializeFunctions(PatchestryDecompileFunctions.java:2516)
	at PatchestryDecompileFunctions$PcodeSerializer.serialize(PatchestryDecompileFunctions.java:2538)
	at PatchestryDecompileFunctions.serializeToFile(PatchestryDecompileFunctions.java:2599)
	at PatchestryDecompileFunctions.decompileAllFunctions(PatchestryDecompileFunctions.java:2626)
	at PatchestryDecompileFunctions.runHeadless(PatchestryDecompileFunctions.java:2664)
	at PatchestryDecompileFunctions.run(PatchestryDecompileFunctions.java:2724)
	at ghidra.app.script.GhidraScript.executeNormal(GhidraScript.java:405)
	at ghidra.app.script.GhidraScript.doExecute(GhidraScript.java:260)
	at ghidra.app.script.GhidraScript.execute(GhidraScript.java:238)
	at ghidra.app.util.headless.HeadlessAnalyzer.runScript(HeadlessAnalyzer.java:588)
	at ghidra.app.util.headless.HeadlessAnalyzer.runScriptsList(HeadlessAnalyzer.java:926)
	at ghidra.app.util.headless.HeadlessAnalyzer.analyzeProgram(HeadlessAnalyzer.java:1074)
	at ghidra.app.util.headless.HeadlessAnalyzer.processFileWithImport(HeadlessAnalyzer.java:1563)
	at ghidra.app.util.headless.HeadlessAnalyzer.processWithLoader(HeadlessAnalyzer.java:1745)
	at ghidra.app.util.headless.HeadlessAnalyzer.processWithImport(HeadlessAnalyzer.java:1686)
	at ghidra.app.util.headless.HeadlessAnalyzer.processWithImport(HeadlessAnalyzer.java:1770)
	at ghidra.app.util.headless.HeadlessAnalyzer.processLocal(HeadlessAnalyzer.java:457)
	at ghidra.app.util.headless.AnalyzeHeadless.launch(AnalyzeHeadless.java:198)
	at ghidra.GhidraLauncher.launch(GhidraLauncher.java:81)
	at ghidra.Ghidra.main(Ghidra.java:54)
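
To make the exception message concrete, here is a small sketch of the range check it describes, applied to the offset from the log. This is an illustration of the arithmetic only, not Ghidra's implementation; `fits_address_space` is a hypothetical helper.

```python
def fits_address_space(offset, bits=32):
    """True if offset is representable in an address space of the given width."""
    return 0 <= offset <= (1 << bits) - 1


offset = 0x400A856808000F3C  # value reported in the AddressFormatException

print(fits_address_space(offset))   # False: the value needs more than 32 bits
print(hex(offset >> 32))            # 0x400a8568 (high 32 bits)
print(hex(offset & 0xFFFFFFFF))     # 0x8000f3c  (low 32 bits)
```

Each half fits in 32 bits on its own. Whether that means a 64-bit constant is being treated as a single RAM address is not established in this thread; it is only a starting point for debugging convertAddressToRamSpace.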

@kaoudis
Collaborator

kaoudis commented May 8, 2025

Lifting seems to have worked for pulseox. The following command produced a .cir and a .c file where directed:

builds/default/tools/pcode-lifter/Debug/pcode-lifter --input firmwares/output/pulseox.pcode --emit-cir --output firmwares/output/pulseox-lifted --print-tu 

@kaoudis
Collaborator

kaoudis commented May 8, 2025

Butttttt bloodlight has some problems:

builds/default/tools/pcode-lifter/Debug/pcode-lifter --input firmwares/output/bloodlight.pcode --emit-cir --output firmwares/output/bloodlight-lifted --print-tu

yields the following:

[ERROR] (/home/kellykaoudis/patchestry/tools/pcode-lifter/main.cpp:218) Failed to parse pcode JSON: [1:311296, byte=311296]: Unterminated string
Program aborted due to an unhandled Error:
[1:311296, byte=311296]: Unterminated string
[1]    229892 IOT instruction (core dumped)  builds/default/tools/pcode-lifter/Debug/pcode-lifter --input  --emit-cir
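
One cheap way to look at a failure like this is to re-parse the same JSON with another parser and print the text around the reported position. The sketch below uses Python's standard json module; `json_error_context` is a hypothetical helper, and Python reports the position where the unterminated string starts, which may differ from the byte offset pcode-lifter prints.

```python
import json


def json_error_context(text, radius=40):
    """Return (position, surrounding snippet) for the first parse error, or None."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        start = max(0, err.pos - radius)
        return err.pos, text[start:err.pos + radius]
    return None


# A document cut off in the middle of a string value.
print(json_error_context('{"name": "abc'))
```

For the real file, the text would come from reading firmwares/output/bloodlight.pcode. Since the decompile step for bloodlight threw an exception during serialization, one plausible but unconfirmed reading is that the file was simply cut off; comparing the file size with the reported byte offset would be a quick check.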

@kaoudis
Collaborator

kaoudis commented May 8, 2025

Making p-code is sort of the equivalent of making a whole-program statically compiled output I suppose, so I'm not really... sure how to compile the C down again in the way it originally was. The Patchestry proposal was hella light on build environment related details. So I guess I'll try the dumbest way possible (all deps are there so maybe just clang foo.c -o foo?) and then will see if there's prior art in the repo anywhere and if not I'll see if I can recycle the original build from pulseox... wondering if this will work since the whole thing has been linked before. What about dynamic dependencies/symbols that only get linked at runtime??? Are these even accounted for by Patchestry generally? Can they be? Can we force people to only run on static builds? That seems kind of silly for a non-sanitizer context (sanitizers are a place where you can say to people to do everything dynamically or everything statically - it's a test/debugging thing so it's not so weird) since dynamic dependencies are so normal in regular builds

moreover, what about a build made on a specialized build machine to which I don't have access? how would or could I go about reproducing that environment ever? or even reverse engineering what it would have been?

@kaoudis
Collaborator

kaoudis commented May 8, 2025

For High Pcode -> Clang IR -> Tower of IRs -> LLVM -> Binary:

  1. How accurate vs lossy is each translation step?
  2. Did we lift correctly?
  3. Do we have something for each function in the previous step, since we're working at the function level?
  4. Does the recompiled C ultimately do what the original binary did?
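
Question 3 is the easiest of these to turn into a number. A rough sketch, assuming each stage can produce a list of function names (how those lists are extracted from the p-code JSON, the .cir file, or the binary is not shown, and `function_coverage` is a hypothetical helper):

```python
def function_coverage(before, after):
    """Compare the function names present before and after one translation step."""
    before, after = set(before), set(after)
    kept = before & after
    return {
        "kept_ratio": len(kept) / len(before) if before else 1.0,
        "dropped": sorted(before - after),
        "added": sorted(after - before),
    }


# Hypothetical function names, for illustration only.
pcode_funcs = ["main", "spi_init", "adc_read", "usb_poll"]
cir_funcs = ["main", "spi_init", "adc_read"]
print(function_coverage(pcode_funcs, cir_funcs))
```

This only says whether a function survived a step by name, not whether its behavior did, so it addresses question 3 and leaves questions 1, 2, and 4 open.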

@kaoudis
Collaborator

kaoudis commented May 8, 2025

Working on figuring out if there's anything "cheap" I can do about Bloodlight not extracting properly

@kaoudis
Collaborator

kaoudis commented May 12, 2025

Also have been taking notes on this in Slack. I so far have three problems that feed into "is the pcode we make accurate, and how accurate":

  1. Neither the bloodlight nor the pulseox firmware architecture (the environment it was compiled for) is identified correctly by Ghidra by default.

For (1) as a short term hack I used file and readelf in the headless entrypoint script to better identify the arch in question (e.g. pulseox was decompiling as firmware of an Armv8 device and readelf says it's an Armv6-M device with EABI; bloodlight had similar misidentification issues, but more divergent, which is why it doesn't fully extract for me). I think this impacts how accurate our pcode can be.

  2. There aren't any unit tests, or any tests I can see beyond e2e test cases, for the PatchestryDecompileFunctions.java or PatchestryListFunctions.java scripts. I want test cases that tell me what the code does, that run in the same environment as the code is supposed to run in, so that I know more about what even the happy path is.

For (2) I'm slowly figuring out how to get some unit tests for PatchestryDecompileFunctions wedged in as an option to the decomp scripts and container, the goal being to at least be able to tell if we roughly are doing what we think we are doing and maybe even to add some property-based testing if I am very lucky. Maybe there's a way I can integrate this flow with LIT (which I do see runs tests of some sort right now, but apparently not on the decomp functionality), but I want to surface test-running in the same way / environment the code is run, rather than making the tests hard to find, at the very least.

  3. I'm learning Ghidra through this use of the headless API; I've never used this particular toolset before. What C/C++ RE I've done before has been with gdb, Radare2, and occasionally with IDA.

For (3) I've been working with the doc https://ghidra.re/ghidra_docs/api/ and with Claude to try to get some reasonable background and context
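
As an illustration of the kind of information file and readelf read for problem (1), here is a sketch that decodes the word size, byte order, and machine type from the first 20 bytes of an ELF file. It is not the entrypoint script's logic; `elf_machine` and the `EM_NAMES` table are hypothetical, and the table covers only two e_machine values.

```python
import struct

# Subset of e_machine values from the ELF specification.
EM_NAMES = {40: "ARM (32-bit)", 183: "AArch64"}


def elf_machine(header):
    """Decode (bits, byte order, machine) from the first 20 bytes of an ELF file."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF header")
    bits = {1: 32, 2: 64}.get(header[4])        # EI_CLASS
    endian = {1: "<", 2: ">"}.get(header[5])    # EI_DATA
    if bits is None or endian is None:
        raise ValueError("unrecognized EI_CLASS or EI_DATA")
    # e_type and e_machine are the two 16-bit fields right after e_ident.
    _e_type, e_machine = struct.unpack_from(endian + "HH", header, 16)
    order = "little" if endian == "<" else "big"
    return bits, order, EM_NAMES.get(e_machine, f"unknown ({e_machine})")


# Synthetic header: 32-bit, little-endian, e_type=2 (executable), e_machine=40 (ARM).
sample = b"\x7fELF" + bytes([1, 1, 1]) + bytes(9) + struct.pack("<HH", 2, 40)
print(elf_machine(sample))
```

The header alone does not distinguish Armv6-M from other 32-bit Arm profiles. That detail lives in the .ARM.attributes section, which readelf -A prints and which this sketch does not parse.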

@kaoudis
Collaborator

kaoudis commented May 12, 2025

Started a branch in service of this work: https://github.com/lifting-bits/patchestry/tree/kaoudis/pcode-gen-correctness-tests
I shall probably use this just for "binary to pcode" and make a new one for... working with the existing LIT tests or something for "pcode to LLVM" before I look at "IR to C"

@kaoudis
Collaborator

kaoudis commented May 27, 2025

An issue discovered with p-code generation through this work is fixed in #86
