The authors of A Transformer-based Approach for Source Code Summarization shared their code and dataset. Their repo provides the original, runnable code for the Java dataset, so we can generate ASTs with Tree-Sitter.
However, the original code in the Python dataset is not runnable. One way to deal with this is to obtain runnable Python code from the raw data.
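As a rough illustration of how raw Python snippets could be screened, the sketch below keeps only the snippets that the standard `ast` module can parse. This is an assumption about what "runnable" might mean in practice, not the procedure used by this repo; `is_parseable` and `keep_parseable` are hypothetical helper names. Note that Python 2 style code (e.g. `print "x"`) will be rejected when this runs under Python 3.

```python
import ast


def is_parseable(source: str) -> bool:
    """Return True if `source` parses as code for the running Python version."""
    try:
        ast.parse(source)
    except (SyntaxError, ValueError):
        # SyntaxError: invalid code; ValueError: e.g. null bytes on some versions.
        return False
    return True


def keep_parseable(snippets):
    """Filter a list of code strings down to those that parse (hypothetical helper)."""
    return [s for s in snippets if is_parseable(s)]


samples = [
    "def add(a, b):\n    return a + b\n",  # valid
    "def broken(:\n    pass\n",            # invalid signature
]
kept = keep_parseable(samples)
print(f"kept {len(kept)} of {len(samples)} snippets")
```

Parsing only checks syntax; a snippet that parses may still fail at run time because of missing imports or undefined names.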
Download the pre-processed and raw (java_hu) dataset.
bash dataset/java_hu/download.sh
Move code/code_tokens/docstring/docstring_tokens to ~/java_hu/flatten/*.
python -m dataset.java_hu.flatten
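To give an idea of what a flatten step can look like, here is a minimal sketch that splits records into one file per attribute. It assumes the raw data is a JSON Lines file whose records carry the four attributes named above; the function name `flatten_jsonl` and the `<field>.jsonl` output naming are hypothetical and may differ from what `dataset.java_hu.flatten` actually does.

```python
import json
import os

# Attribute names taken from the step above.
FIELDS = ("code", "code_tokens", "docstring", "docstring_tokens")


def flatten_jsonl(src_path, out_dir, fields=FIELDS):
    """Write each attribute of every record to its own file, one JSON value per line.

    Returns the number of records processed. Missing attributes are written as null
    so that line i in every output file refers to the same record.
    """
    os.makedirs(out_dir, exist_ok=True)
    writers = {
        name: open(os.path.join(out_dir, f"{name}.jsonl"), "w", encoding="utf-8")
        for name in fields
    }
    count = 0
    try:
        with open(src_path, encoding="utf-8") as src:
            for line in src:
                line = line.strip()
                if not line:
                    continue  # skip blank lines
                record = json.loads(line)
                for name in fields:
                    writers[name].write(json.dumps(record.get(name)) + "\n")
                count += 1
    finally:
        for handle in writers.values():
            handle.close()
    return count
```

Keeping the per-attribute files line-aligned makes it easy to process one attribute at a time later without losing the pairing between code and docstring.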
Generate raw/bin data with multi-processing. Before generating the datasets, please make sure the config file is set correctly.
# code_tokens/docstring_tokens
python -m dataset.java_hu.preprocess
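For readers unfamiliar with the multi-processing pattern, the sketch below tokenizes a list of strings in parallel with the standard `multiprocessing.Pool`. It is only an illustration under stated assumptions: the regex tokenizer is a stand-in rather than the tokenizer this repo uses, and `tokenize`, `tokenize_all`, `workers`, and `chunksize` are hypothetical names.

```python
import re
from multiprocessing import Pool

# Stand-in tokenizer: runs of word characters, or single punctuation marks.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split one string into tokens (illustrative only)."""
    return TOKEN_RE.findall(text)


def tokenize_all(texts, workers=2, chunksize=64):
    """Tokenize many strings across worker processes, preserving input order."""
    with Pool(processes=workers) as pool:
        return pool.map(tokenize, texts, chunksize=chunksize)


if __name__ == "__main__":
    docs = [
        "int add(int a, int b) { return a + b; }",
        "Adds two numbers.",
    ]
    for tokens in tokenize_all(docs):
        print(tokens)
```

`Pool.map` returns results in the same order as the inputs, so the tokenized output stays aligned with the source lines. A larger `chunksize` reduces inter-process overhead when each item is cheap to process.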