Dataset: Java(hu)

The authors of A Transformer-based Approach for Source Code Summarization shared their code and dataset. Their repository provides the original, runnable code for the Java dataset, so we can generate ASTs with Tree-Sitter.

The original code in the Python dataset, however, is not runnable. One option for dealing with this is to recover runnable Python code from the raw data.

Step 1

Download the pre-processed and raw (java_hu) datasets.

bash dataset/java_hu/download.sh

Step 2

Move code/code_tokens/docstring/docstring_tokens to ~/java_hu/flatten/*.

python -m dataset.java_hu.flatten
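
To make the idea of flattening concrete, here is a minimal sketch of what such a step might look like. It is not the repository's implementation: it assumes each raw record is a dict carrying the four attributes named above, and the function name `flatten`, the `split` argument, and the `<split>.<attribute>` file naming are all hypothetical choices made for illustration.

```python
import json
from pathlib import Path

# The four attributes named in Step 2.
FIELDS = ("code", "code_tokens", "docstring", "docstring_tokens")


def flatten(records, out_dir, split="train"):
    """Write one file per attribute; line i of every file belongs to record i.

    Hypothetical helper: `records` is any iterable of dicts that contain
    all keys in FIELDS. Returns the number of records written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    handles = {
        field: open(out_dir / f"{split}.{field}", "w", encoding="utf-8")
        for field in FIELDS
    }
    count = 0
    try:
        for record in records:
            for field, handle in handles.items():
                # JSON-encode each value so that multi-line code stays on one line.
                handle.write(json.dumps(record[field], ensure_ascii=False) + "\n")
            count += 1
    finally:
        for handle in handles.values():
            handle.close()
    return count
```

Keeping the attributes in separate, line-aligned files means a later step can read only the attribute it needs while still matching records by line number.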

Step 3

Generate the raw/bin data with multi-processing. Before generating the datasets, please make sure the config file is set correctly.

# code_tokens/docstring_tokens
python -m dataset.java_hu.preprocess
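
The multi-processing mentioned in this step can be pictured with the sketch below. It is only an illustration under stated assumptions, not the code behind `dataset.java_hu.preprocess`: `tokenize_line` is a placeholder that splits on whitespace, and `process_lines`, `workers`, and `chunksize` are hypothetical names.

```python
from multiprocessing import Pool


def tokenize_line(line):
    # Placeholder tokenizer (assumption): a whitespace split stands in
    # for whatever tokenizer the real preprocessing uses.
    return line.split()


def process_lines(lines, workers=1, chunksize=1000):
    """Tokenize every line, optionally spreading the work over processes.

    With workers <= 1 the lines are handled serially in this process;
    otherwise a process pool maps tokenize_line over them in chunks.
    The order of the results matches the order of the input.
    """
    lines = list(lines)
    if workers <= 1:
        return [tokenize_line(line) for line in lines]
    with Pool(processes=workers) as pool:
        return pool.map(tokenize_line, lines, chunksize=chunksize)
```

When calling `process_lines` with more than one worker from a script, place the call under an `if __name__ == "__main__":` guard so that worker processes can import the module safely on platforms that spawn rather than fork.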
