Type Inference on Stripped Binaries using Graph Neural Networks
Please find TYDA x86_64 binaries.
This dataset contains C and C++ binaries from Gentoo Linux, each compiled with one of four different compiler optimization flags: O0, O1, O2, and O3. Each set of binaries for a specific optimization level is associated with a .json file named C_CPP_binaries_{opt}.json, where opt represents the optimization level (O0, O1, O2, or O3). These JSON files include metadata about the binary, such as the language, whether it has inline functions or dwarf info, among other details.
Please find the rest of TYDA C binaries here. c binaries.
Make sure you install torch
, torch_geometric
, pyvex
, and elftools
.
It is recommended that you use conda
to manage your environment.
CUDA is supported during the training and testing process.
To use the provided environment.yml
, please use
conda env create -f environment.yml
This will create a conda environment called TYGR
.
You can then call
conda activate TYGR
to activate it.
It is expected that you may need to install different versions of the
above libraries (e.g. torch_geometric
).
In this case, you should manually install the dependencies.
Please find trained models in model
folder.
To test a model or train a new model, first try to generate dataset from binaries:
./TYGR datagen PATH/TO/YOUR/BINARY DATASET.pkl
Make sure the binaries being used in this stage is properly compiled with DWARF information available.
Note that one can specify a --method
flag which defaults to glow
, representing
our best model.
To try out other methods like Transformer or BiLSTM, you should feed another flag
to the datagen process to generate data suitable for the corresponding method.
If the binaries are larger than 1MB I would suggest to run in parallel mode since ANGR is slow. You should be able to specify the number of processors in src/methods/glow/datagen.py
. Be careful about the memory usage as ANGR may explode your memory.
You can combine different datasets by using the datamerge
command:
./TYGR datamerge DATASET1.pkl DATASET2.pkl -o MERGED.pkl
Then you can use scripts/split_model_pkl.py
to parse the dataset for MODEL_base and MODEL_struct.
Lastly, you can split the dataset into training, validation, and testing sets
./TYGR datasplit MERGED.pkl --train TRAIN.pkl --validation VALID.pkl --test TEST.pkl
By default, we use 80% of the data on training, 10% for validation and 10% for testing.
You can change this by feeding more command line flags.
Please consult ./TYGR datasplit --help
for more information.
You can train a new model given that you have the dataset setup. Then you should use
./TYGR train TRAIN.pkl VALID.pkl -o MODEL.model
Make sure you feed a desired --epoch
, --lr
, and --method
.
The training won't take very long for our default (best) model.
The best model, as determined by the validation set, will be stored along the way,
named MODEL.model.best
.
The final model will be stored using the provided name.
To test the trained model, simply do
./TYGR test MODEL.model TEST.pkl
where TEST.pkl
is the testing dataset and the MODEL.model
is the trained model generated
by the training process.
If --method
is specified during datagen and training phases, make sure you pass in the same
flag for testing as well.
TYDA data set still contains duplicate functions.
Please use the scripts/angr_func_hash.py
to get the list of duplicate functions, every function in the same hash are duplicate functions.
During datamerge
, skip functions in the duplicate list.