Commit 992a494

Add tutorial on graph diffing
1 parent 3af7fb3 commit 992a494

12 files changed, 272 additions and 0 deletions

tutorials/graph-only-diffing.rst

Lines changed: 208 additions & 0 deletions
@@ -0,0 +1,208 @@
Diffing Graphs only
===================

QBinDiff is able to perform graph matching, whether the graph is a Control Flow Graph (CFG),
a Call Graph (CG), or something completely unrelated.
We demonstrated this in our `introductory blog post <https://blog.quarkslab.com/qbindiff-a-modular-diffing-toolkit.html>`_
with the protein-protein interaction (PPI) networks of different species in bioinformatics.

In this tutorial, we will focus our attention on the CFG of EVM smart contracts.

Motivation
----------

The Ethereum Virtual Machine (EVM) is a RISC stack-based architecture.
It is used by Ethereum and other compatible chains to execute smart contracts,
the core programs of decentralized applications on these platforms.

Tools such as QBinDiff, BinExport, and Quokka do not natively support the EVM.
In addition, EVM bytecode has no explicit structure to identify functions:
all control flow is performed with conditional and unconditional jumps (``JUMPI``, ``JUMP``)
that read the destination address from the stack.
This design leads to particular patterns in the bytecode, for example:

* the start of the bytecode contains a dispatcher, acting like a giant switch statement,
  that is responsible for jumping into the called external/public function;
* to call internal functions and other forms of duplicated code,
  the caller pushes a return address onto the stack before executing a jump instruction.
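
To make these patterns concrete, here is a minimal sketch (not part of QBinDiff) that linearly scans EVM bytecode and lists its ``JUMP``, ``JUMPI`` and ``JUMPDEST`` instructions.
The name ``find_jumps`` is ours, for illustration only; the opcode values come from the EVM instruction set.

```python
# Opcode values from the EVM instruction set
JUMP, JUMPI, JUMPDEST = 0x56, 0x57, 0x5B
PUSH1, PUSH32 = 0x60, 0x7F
NAMES = {JUMP: "JUMP", JUMPI: "JUMPI", JUMPDEST: "JUMPDEST"}


def find_jumps(bytecode: bytes) -> list[tuple[int, str]]:
    """Return (offset, mnemonic) for each jump-related instruction."""
    found = []
    offset = 0
    while offset < len(bytecode):
        opcode = bytecode[offset]
        if opcode in NAMES:
            found.append((offset, NAMES[opcode]))
        # PUSHn is followed by n bytes of immediate data, which must be
        # skipped so that data bytes are not mistaken for opcodes
        if PUSH1 <= opcode <= PUSH32:
            offset += opcode - PUSH1 + 1
        offset += 1
    return found


# First bytes shared by the two example contracts of this tutorial
prologue = bytes.fromhex("608060405234801561001057600080fd5b50")
print(find_jumps(prologue))  # [(11, 'JUMPI'), (16, 'JUMPDEST')]
```

The ``JUMPI`` at offset 11 is preceded by a ``PUSH2 0x0010``, so here the destination (offset 16, a ``JUMPDEST``) is easy to read.
In general the destination is whatever value sits on top of the stack, which a linear scan like this one cannot resolve.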

However, several tools let you recover the CFG of a smart contract
(some even detect functions, with varying degrees of success).
From this information alone, this tutorial will guide you through diffing an example smart contract.

How to diff
-----------

In this tutorial, we will diff two versions of an example smart contract.
One allows a user's balance to go negative; the other does not.

+---------------------------------------+----------------------------------+
| ``vulnerable.sol``                    | ``fixed.sol``                    |
+=======================================+==================================+
|.. literalinclude:: res/vulnerable.sol |.. literalinclude:: res/fixed.sol |
+---------------------------------------+----------------------------------+
|.. literalinclude:: res/vulnerable.evm |.. literalinclude:: res/fixed.evm |
+---------------------------------------+----------------------------------+

To this end, we will use QBinDiff's ``DiGraphDiffer`` to compare their CFGs.

Generate the graphs
^^^^^^^^^^^^^^^^^^^

The first step is to recover the CFG from the bytecode of the smart contracts.
While not trivial, this has been solved by several tools already.
We recommend `EtherSolve <https://github.com/SeUniVr/EtherSolve/>`_ (Java)
or `vandal <https://github.com/usyd-blockchain/vandal/>`_ (Python).

.. code-block:: sh

    java -jar EtherSolve.jar -r -j vulnerable.evm

The command above generates a file named ``Analysis_<datetime>.json``.
In this file, the CFG can be found under the ``runtimeCfg`` field.
Note that edges are stored under ``runtimeCfg.successors`` (and sometimes have duplicate entries).
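
As a rough sketch of how this file can be read, the helpers below load the most recent analysis from a directory and collect its edges without duplicates.
Both function names are hypothetical, and the ``from`` and ``to`` keys are assumed to be laid out as in the conversion code of this tutorial.

```python
import json
from pathlib import Path


def load_latest_analysis(directory: str = ".") -> dict:
    """Load the most recently modified Analysis_*.json file of a directory."""
    candidates = sorted(
        Path(directory).glob("Analysis_*.json"),
        key=lambda path: path.stat().st_mtime,
    )
    if not candidates:
        raise FileNotFoundError(f"no Analysis_*.json file in {directory!r}")
    with candidates[-1].open() as file:
        return json.load(file)


def unique_edges(analysis: dict) -> set[tuple[int, int]]:
    """Collect the CFG edges as (source, target) pairs, dropping duplicates."""
    return {
        (entry["from"], target)
        for entry in analysis["runtimeCfg"]["successors"]
        for target in entry["to"]
    }
```

Using a set makes the duplicate successor entries harmless, at the cost of losing the order in which they were listed.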

QBinDiff's differ expects ``networkx.DiGraph`` objects as inputs, so we need to adapt the data a little:

.. code-block:: python

    from typing import Any

    import networkx


    def process_ethersolve(analysis: dict[str, Any]) -> networkx.DiGraph:
        cfg = analysis["runtimeCfg"]
        graph = networkx.DiGraph()

        # Keep the desired attributes and use the basic block's offset as node ID
        for n in cfg["nodes"]:
            graph.add_node(
                n["offset"],
                length=n["length"],
                type=n["type"],
                stack_balance=n["stackBalance"],
                bytecode=n["bytecodeHex"],
                opcodes=n["parsedOpcodes"],
            )

        # Add an edge for each successor of each node.
        # A DiGraph ignores repeated edges, so duplicate entries are harmless.
        for e in cfg["successors"]:
            graph.add_edges_from((e["from"], t) for t in e["to"])

        return graph


(optional) Write heuristics
^^^^^^^^^^^^^^^^^^^^^^^^^^^

When matching nodes, we can help the differ by setting an initial similarity score between each pair of nodes.
These scores are gathered in the similarity matrix, which is initialized to all ones,
meaning every node is initially believed to be similar to every node of the other graph.

If you have a heuristic for the similarity between nodes,
you can add a prepass, executed before the matching, that alters the similarity matrix.

For example, we have access to the stack balance of each basic block.
This value indicates the net number of words the basic block pushes onto or pops from the stack.
Intuitively, similar blocks should have the same stack balance.
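
To illustrate the notion, the sketch below computes the stack balance of a block from its mnemonics.
The table only covers the handful of opcodes used in the example, and EtherSolve may compute its ``stackBalance`` field differently; ``stack_balance`` is a hypothetical helper, not part of any tool mentioned here.

```python
# (words popped, words pushed) for a few EVM opcodes; a real tool covers them all
STACK_EFFECT = {
    "PUSH1": (0, 1), "PUSH2": (0, 1), "CALLVALUE": (0, 1),
    "DUP1": (1, 2), "ISZERO": (1, 1), "POP": (1, 0),
    "MSTORE": (2, 0), "JUMP": (1, 0), "JUMPI": (2, 0), "JUMPDEST": (0, 0),
}


def stack_balance(mnemonics: list[str]) -> int:
    """Net number of words a basic block leaves on (or takes from) the stack."""
    return sum(
        pushed - popped
        for popped, pushed in (STACK_EFFECT[m] for m in mnemonics)
    )


# First basic block of the example contracts (bytes 0 to 11 of the bytecode)
entry_block = ["PUSH1", "PUSH1", "MSTORE", "CALLVALUE", "DUP1", "ISZERO", "PUSH2", "JUMPI"]
print(stack_balance(entry_block))  # 1
```

The block leaves one word on the stack (the call value), hence a balance of 1.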

See the documentation of `register_prepass <../qbindiff/doc/source/api/differ.html#qbindiff.Differ.register_prepass>`_
for more information on how to create a prepass.

In this example, we first group the nodes of each graph by stack balance,
then reduce the similarity of node pairs that do not share the same stack balance.

Note that the nodes' IDs are their offsets, and do not correspond to rows or columns of the similarity matrix.
The correspondence is given by the ``primary_n2i`` and ``secondary_n2i`` mappings.

.. code-block:: python

    import qbindiff
    from qbindiff.types import SimMatrix  # assumed location of the type alias


    def prepass_stack_balance(
        sim_matrix: SimMatrix,
        primary: qbindiff.GenericGraph,
        secondary: qbindiff.GenericGraph,
        primary_n2i: dict[int, int],
        secondary_n2i: dict[int, int],
        **kwargs,
    ) -> None:
        # Group the matrix indices of the nodes by stack balance
        primary_index: dict[int, list[int]] = {}
        secondary_index: dict[int, list[int]] = {}
        for graph, n2i, index in (
            (primary, primary_n2i, primary_index),
            (secondary, secondary_n2i, secondary_index),
        ):
            for node_id in graph.nodes():
                balance = graph.nodes[node_id]["stack_balance"]
                index.setdefault(balance, []).append(n2i[node_id])

        # Reduce by 60% the similarity of node pairs whose stack balances differ
        for primary_balance, primary_indices in primary_index.items():
            for secondary_balance, secondary_indices in secondary_index.items():
                if primary_balance == secondary_balance:
                    continue
                for i in primary_indices:
                    sim_matrix[i, secondary_indices] *= 0.4
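
To see the effect of such a prepass on a small case, here is a self-contained sketch that applies the same rule to a plain nested list instead of QBinDiff's similarity matrix.
The node offsets and balances are made up, and ``penalize_mismatches`` is a hypothetical helper, not part of the QBinDiff API.

```python
def penalize_mismatches(sim, primary_balance, secondary_balance,
                        primary_n2i, secondary_n2i, factor=0.4):
    """Scale sim[i][j] by `factor` when nodes i and j differ in stack balance."""
    for p_node, p_balance in primary_balance.items():
        for s_node, s_balance in secondary_balance.items():
            if p_balance != s_balance:
                sim[primary_n2i[p_node]][secondary_n2i[s_node]] *= factor
    return sim


# Made-up graphs: node ID (block offset) -> stack balance
primary_balance = {0: 1, 16: -1}
secondary_balance = {0: 1, 13: 0, 20: -1}
# Node ID -> row (primary) or column (secondary) of the matrix
primary_n2i = {0: 0, 16: 1}
secondary_n2i = {0: 0, 13: 1, 20: 2}

sim = [[1.0] * 3 for _ in range(2)]
penalize_mismatches(sim, primary_balance, secondary_balance, primary_n2i, secondary_n2i)
print(sim)  # [[1.0, 0.4, 0.4], [0.4, 0.4, 1.0]]
```

Only the pairs with equal balances keep their initial score of 1; note how node 16 of the primary graph maps to row 1, not row 16.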

Perform the match
^^^^^^^^^^^^^^^^^

Once you have the CFGs as ``networkx.DiGraph`` objects,
and have optionally written some prepasses, computing the mapping is simple:

.. code-block:: python

    import qbindiff

    differ = qbindiff.DiGraphDiffer(
        primary_cfg,
        secondary_cfg,
        sparsity_ratio=0,
        tradeoff=0.5,
        epsilon=0.1,
    )
    differ.register_prepass(prepass_stack_balance)  # optional
    mapping = differ.compute_matching()

You can experiment with the ``tradeoff`` and ``epsilon`` values,
depending on the nature of the diffing performed.
As general guidelines:

* ``tradeoff`` gives more weight to the topology when close to 0,
  and more weight to the similarity when close to 1.
  It should be set strictly between 0 and 1.
  The better your heuristics, the higher it can be.
* ``epsilon`` controls the convergence speed.
  It should not be set to 0, and should be as close to 1 as you can afford to wait.
  For this simple example, a conservative low value is not an issue.
* you should adjust how much your prepasses affect the similarity matrix,
  depending on the quality of your heuristics.

For this example, we performed an exhaustive search over these parameters,
scoring each combination against a ground-truth matching.
In these maps, ``epsilon`` and ``tradeoff`` correspond to the above parameters,
while ``stack balance weight`` controls how much the stack balance prepass impacts the similarity matrix.

+-------------------------------------------------+--------------------------------------------------+-------------------------------------+
| .. image:: res/stack-balance-weight-epsilon.png | .. image:: res/stack-balance-weight-tradeoff.png | .. image:: res/tradeoff-epsilon.png |
+-------------------------------------------------+--------------------------------------------------+-------------------------------------+
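
A parameter search of this kind could be sketched as follows.
This is not the script used to produce the maps: the scoring metric is not stated in this tutorial, so recall against the ground truth is an assumption, and ``run_diff`` is a hypothetical callable that would wrap the ``DiGraphDiffer`` call and return the matched node pairs.

```python
from itertools import product
from typing import Callable

Pairs = set[tuple[int, int]]


def recall(matching: Pairs, ground_truth: Pairs) -> float:
    """Fraction of the ground-truth pairs found by the matching."""
    return len(matching & ground_truth) / len(ground_truth)


def grid_search(
    run_diff: Callable[[float, float], Pairs],
    ground_truth: Pairs,
    tradeoffs: list[float],
    epsilons: list[float],
) -> dict[tuple[float, float], float]:
    """Score every (tradeoff, epsilon) combination against the ground truth."""
    return {
        (tradeoff, epsilon): recall(run_diff(tradeoff, epsilon), ground_truth)
        for tradeoff, epsilon in product(tradeoffs, epsilons)
    }
```

The resulting dictionary can then be plotted as a heat map, with one axis per parameter.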

Process the result
^^^^^^^^^^^^^^^^^^

Now that you have a mapping between the nodes of the primary and secondary graphs,
you can process it however you like, for example to compute a similarity score.
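
As one possible definition of such a score, the sketch below normalizes the summed similarity of the matched pairs by the sizes of both graphs.
It assumes the mapping has first been converted to a list of ``(primary_node, secondary_node, similarity)`` tuples with similarities in [0, 1]; this representation and the ``similarity_score`` helper are ours, not QBinDiff's API.

```python
def similarity_score(
    matches: list[tuple[int, int, float]],
    primary_size: int,
    secondary_size: int,
) -> float:
    """Normalized similarity: 1.0 means every node is perfectly matched."""
    if primary_size + secondary_size == 0:
        return 1.0  # two empty graphs are considered identical
    matched_similarity = sum(similarity for _, _, similarity in matches)
    return 2 * matched_similarity / (primary_size + secondary_size)


# Two matches between a 2-node graph and a 3-node graph
print(similarity_score([(0, 0, 1.0), (16, 20, 0.5)], 2, 3))
```

Unmatched nodes lower the score because they count in the denominator but contribute nothing to the numerator.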

Here we show a visualization of the resulting diff, revealing interesting aspects of the modification:

+------------------------------------+-----------------------------+------------------------------------+
| From signed to unsigned operations | CFG rewiring                | Dispatcher update                  |
+====================================+=============================+====================================+
| .. image:: res/diff-signedness.png | .. image:: res/diff-cfg.png | .. image:: res/diff-dispatcher.png |
+------------------------------------+-----------------------------+------------------------------------+

tutorials/res/diff-cfg.png

59.4 KB

tutorials/res/diff-dispatcher.png

79.9 KB

tutorials/res/diff-signedness.png

19.5 KB

tutorials/res/fixed.evm

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
608060405234801561001057600080fd5b50600436106100365760003560
e01c806327e235e31461003b578063412664ae1461006d575b600080fd5b
61005b6100493660046100f3565b60006020819052908152604090205481
565b60405190815260200160405180910390f35b61008061007b36600461
0115565b610082565b005b33600090815260208190526040812080548392
906100a1908490610155565b90915550506001600160a01b038216600090
815260208190526040812080548392906100ce90849061016e565b909155
50505050565b80356001600160a01b03811681146100ee57600080fd5b91
9050565b60006020828403121561010557600080fd5b61010e826100d756
5b9392505050565b6000806040838503121561012857600080fd5b610131
836100d7565b946020939093013593505050565b634e487b7160e01b6000
52601160045260246000fd5b818103818111156101685761016861013f56
5b92915050565b808201808211156101685761016861013f56fea2646970
66735822122024ea090b39258ab97fbe93f3646eb22d1d2cc2aedc4b74e1
c3861562ad940d0364736f6c63430008190033

tutorials/res/fixed.sol

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
// SPDX-License-Identifier: MIT

pragma solidity 0.8.25;

contract SimpleToken {
    mapping(address => uint) public balances;

    constructor() {
        balances[msg.sender] += 1000e18;
    }

    function sendToken(address _recipient, uint _amount) public {
        balances[msg.sender] -= _amount;
        balances[_recipient] += _amount;
    }
}

tutorials/res/tradeoff-epsilon.png

39.2 KB

tutorials/res/vulnerable.evm

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
608060405234801561001057600080fd5b50600436106100365760003560
e01c80630f7428b91461003b57806327e235e314610050575b600080fd5b
61004e6100493660046100f3565b610082565b005b61007061005e366004
61011d565b60006020819052908152604090205481565b60405190815260
200160405180910390f35b33600090815260208190526040812080548392
906100a1908490610155565b90915550506001600160a01b038216600090
815260208190526040812080548392906100ce90849061017c565b909155
50505050565b80356001600160a01b03811681146100ee57600080fd5b91
9050565b6000806040838503121561010657600080fd5b61010f836100d7
565b946020939093013593505050565b60006020828403121561012f5760
0080fd5b610138826100d7565b9392505050565b634e487b7160e01b6000
52601160045260246000fd5b818103600083128015838313168383128216
17156101755761017561013f565b5092915050565b808201828112600083
128015821682158216171561019c5761019c61013f565b50509291505056
fea264697066735822122078995ea6ebd8bd5d1333cde4da1d4ade3eee62
af048e42232dbc3ed6cd682aba64736f6c63430008190033
