
A few questions about ERule dataset and repair procedure #1

@breandan


Hi @gsakkas, I hope you are doing well. I am not sure if you recall, but we met briefly after your talk in New Zealand last December. I am working on reproducing the results on the 15k ERule and HumanEval datasets and had a few questions about the abstract sequences used in sections 7.1-7.4 of the paper. Any suggestions or advice you could provide would be greatly appreciated.

  • Anonymized dataset availability. Would it be possible to release the postprocessed PythonTutor dataset (possibly with obfuscated identifiers to preserve anonymity)? Also, is the only concrete source code available for evaluation the 50 programs from the HumanEval dataset in src/human_study, or is there another test set of source code snippets?
  • Ground truth abstract repair. Is it possible to recover the ground truth abstract fix sequence from the 15k ERule dataset? I see each row has the columns tokns, tok_chgs, dur, and popular, and that predict_eccp_classifier_partials.py compares the classifier prediction y_pred with tok_chgs using the labels file erule_labels-partials-probs.json. However, I am not quite sure how to obtain the ground truth abstract user fix from this information. For example, if we consider:
Stmts_Or_Newlines is _NAME_ == _NAME_ _NEWLINE_ _NEWLINE_ _ENDMARKER_ <||> Err_Literals -> H Literals <++> InsertErr -> is <||> 1 <||> 33.0 <||> popular

I understand tok_chgs is Err_Literals -> H Literals <++> InsertErr -> is, which refers to [105, 323], but it is not yet clear to me how the tokns are altered in the ground truth fix. Does the suffix after _ENDMARKER_ identify a unique abstract sequence fix?
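For reference, here is how I am currently splitting a row. The field order and the <||> / <++> separators are only my inference from the example above, so please correct me if the actual format differs:

```python
# My current guess at the row format: fields joined by "<||>", with
# tok_chgs itself being a list of erules joined by "<++>".
ROW = ("Stmts_Or_Newlines is _NAME_ == _NAME_ _NEWLINE_ _NEWLINE_ _ENDMARKER_"
       " <||> Err_Literals -> H Literals <++> InsertErr -> is"
       " <||> 1 <||> 33.0 <||> popular")

def parse_row(row):
    """Split a dataset row into tokns, tok_chgs, and the trailing fields."""
    fields = [f.strip() for f in row.split("<||>")]
    tokns, tok_chgs_raw = fields[0], fields[1]
    tok_chgs = [c.strip() for c in tok_chgs_raw.split("<++>")]
    return {"tokns": tokns.split(), "tok_chgs": tok_chgs, "rest": fields[2:]}

parsed = parse_row(ROW)
print(parsed["tok_chgs"])  # -> ['Err_Literals -> H Literals', 'InsertErr -> is']
```

This recovers the two erules, but as far as I can tell nothing in the row pins down where in tokns they apply.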

  • Mapping abstract tokens back to concrete source code. I see there is a procedure that decodes the abstract token sequence back to concrete tokens, but it seems to require the original code sequence, and some corruption may be possible during post-repair decoding. For example, if a _NAME_ or whitespace token is substituted, inserted, or deleted in the abstract token sequence, this can introduce cosmetic changes to parts of the input that are lexically identical in the abstract token sequence. Is there a way to map the tokenwise edits back to the exact character subsequence in the concrete source code while preserving the original formatting?
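To make that last question concrete, here is the kind of character-preserving splice I had in mind, using Python's tokenize module (which reports exact (row, col) spans for each token) so that an edit to one token leaves all other formatting untouched. This is only my own sketch, not the repo's decoding procedure:

```python
import io
import tokenize

def token_spans(src):
    """Return (string, start_offset, end_offset) for each token in src."""
    # Cumulative character offset of the start of each source line.
    line_starts = [0]
    for line in src.splitlines(keepends=True):
        line_starts.append(line_starts[-1] + len(line))
    spans = []
    for tok in tokenize.generate_tokens(io.StringIO(src).readline):
        (srow, scol), (erow, ecol) = tok.start, tok.end
        start = line_starts[srow - 1] + scol
        end = line_starts[erow - 1] + ecol
        spans.append((tok.string, start, end))
    return spans

def replace_token(src, index, new_text):
    """Splice new_text over the index-th token, preserving all other text."""
    _, start, end = token_spans(src)[index]
    return src[:start] + new_text + src[end:]

src = "x  ==  y\n"
print(replace_token(src, 1, "is"))  # -> "x  is  y\n"
```

Note the double spaces around the operator survive the edit. My worry is whether the abstract sequence retains enough positional information to drive splices like this, or whether the decoder has to re-render whitespace.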

It is also possible I am mistaken or misunderstanding an important detail. If so, any clarification would be welcome. Thank you!

cc: @jin-guo @XujieSi
