A few questions about ERule dataset and repair procedure #1
Description
Hi @gsakkas, I hope you are doing well. I am not sure if you recall, but we met briefly after your talk in New Zealand last December. I am working on reproducing the results on the 15k ERule and HumanEval datasets and had a few questions about the abstract sequences used in Sections 7.1-7.4 of the paper. Any suggestions or advice you could provide would be greatly appreciated.
- Anonymized dataset availability. Would it be possible to release the data from the postprocessed PythonTutor dataset (possibly with obfuscated identifiers to preserve anonymity)? Is the only concrete source code available for evaluation the 50 programs from the HumanEval dataset in `src/human_study`, or is there another test set of source code snippets?
- Ground truth abstract repair. Is it possible to recover the ground-truth abstract fix sequence from the 15k ERule dataset? I see each row has the columns `tokns`, `tok_chgs`, `dur`, and `popular`, and `predict_eccp_classifier_partials.py` compares the classifier prediction `y_pred` with the `tok_chgs` using the labels file `erule_labels-partials-probs.json`; however, I am not quite sure how to obtain the ground-truth abstract user fix from this information. For example, if we consider:
```
Stmts_Or_Newlines is _NAME_ == _NAME_ _NEWLINE_ _NEWLINE_ _ENDMARKER_ <||> Err_Literals -> H Literals <++> InsertErr -> is <||> 1 <||> 33.0 <||> popular
```
I understand `tok_chgs` is `Err_Literals -> H Literals <++> InsertErr -> is`, which refers to `[105, 323]`, but it is not yet clear to me how the `tokns` are altered in the ground-truth fix. Does the suffix after `_ENDMARKER_` identify a unique abstract sequence fix?
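For concreteness, here is how I am currently parsing such a row, assuming `<||>` is the column separator and `<++>` separates individual edit rules within `tok_chgs` (the variable names are my own guesses at the schema, not the repo's):

```python
# Hypothetical sketch of parsing one raw ERule dataset row.
# Assumptions (mine, not confirmed by the repo): "<||>" separates columns,
# "<++>" separates the individual abstract edit rules inside tok_chgs.
raw = ("Stmts_Or_Newlines is _NAME_ == _NAME_ _NEWLINE_ _NEWLINE_ _ENDMARKER_"
       " <||> Err_Literals -> H Literals <++> InsertErr -> is"
       " <||> 1 <||> 33.0 <||> popular")

fields = [f.strip() for f in raw.split("<||>")]
tokns, tok_chgs = fields[0], fields[1]

# Split the change column into its individual abstract edit rules.
rules = [r.strip() for r in tok_chgs.split("<++>")]
print(rules)  # ['Err_Literals -> H Literals', 'InsertErr -> is']
```

Is this the intended reading of the row, or am I misinterpreting one of the fields?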
- Mapping abstract tokens back to concrete source code. I see there is a procedure that decodes the abstract token sequence back to concrete tokens, but it seems to require the original code sequence, and some corruption may be possible during post-repair decoding. For example, if a `_NAME_` or whitespace token is substituted, inserted, or deleted in the abstract token sequence, this can introduce cosmetic changes to parts of the input that are lexically identical in the abstract token sequence. Is there a way to map the token-wise edits back to the exact character subsequence in the concrete source code while preserving the original formatting?
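To illustrate the kind of mapping I have in mind (this is not the repo's decoder, just a sketch of the question): tokenize the concrete source with exact positions, so that an edit to a single token can be applied as a character-level splice that leaves all surrounding formatting untouched.

```python
# Sketch only: map a token-level edit back to an exact character span,
# using Python's tokenize module for precise (row, col) token positions.
import io
import tokenize

src = "x =  1\nif x == 1 :\n    print( x )\n"

# Absolute character offset of the start of each line.
offsets = [0]
for ln in src.splitlines(keepends=True):
    offsets.append(offsets[-1] + len(ln))

# Collect (token_string, absolute_start, absolute_end) for every token.
spans = []
for tok in tokenize.generate_tokens(io.StringIO(src).readline):
    (srow, scol), (erow, ecol) = tok.start, tok.end
    spans.append((tok.string, offsets[srow - 1] + scol, offsets[erow - 1] + ecol))

# Replace the first token "1" as a character splice; the odd spacing
# elsewhere ("x =  1", "1 :", "print( x )") is preserved exactly.
_, start, end = next(t for t in spans if t[0] == "1")
patched = src[:start] + "2" + src[end:]
print(patched)
```

Does the repo's decoding procedure do something along these lines, or is the abstract-to-concrete mapping recovered differently?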
It is also possible I am mistaken or misunderstanding an important detail. If so, any clarification would be welcome. Thank you!