
A few questions about ERule dataset and repair procedure #1

@breandan


Hi @gsakkas, I hope you are doing well. I am not sure if you recall, but we met briefly after your talk in New Zealand last December. I am working on reproducing the results on the 15k ERule and HumanEval datasets and had a few questions about the abstract sequences used in sections 7.1-7.4 of the paper. Any suggestions or advice you could provide would be greatly appreciated.

  • Anonymized dataset availability. Would it be possible to release the postprocessed PythonTutor dataset (possibly with obfuscated identifiers to preserve anonymity)? Also, is the only concrete source code available for evaluation the 50 programs from the HumanEval dataset in src/human_study, or is there another test set of source code snippets?
  • Ground truth abstract repair. Is it possible to recover the ground truth abstract fix sequence from the 15k ERule dataset? I see each row has the columns tokns, tok_chgs, dur, and popular, and that predict_eccp_classifier_partials.py compares the classifier prediction y_pred with tok_chgs using the labels file erule_labels-partials-probs.json. However, I am not quite sure how to obtain the ground truth abstract user fix from this information. For example, if we consider:
Stmts_Or_Newlines is _NAME_ == _NAME_ _NEWLINE_ _NEWLINE_ _ENDMARKER_ <||> Err_Literals -> H Literals <++> InsertErr -> is <||> 1 <||> 33.0 <||> popular

I understand tok_chgs is Err_Literals -> H Literals <++> InsertErr -> is, which refers to [105, 323], but it is not yet clear to me how the tokns are altered in the ground truth fix. Does the suffix after _ENDMARKER_ identify a unique abstract sequence fix?
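For reference, here is how I am currently splitting a row. The field order and the <||> / <++> separators are only my inference from the example above, so please correct me if the actual format differs:

```python
# My current guess at the row format: fields joined by "<||>", with
# tok_chgs itself being a list of erules joined by "<++>".
ROW = ("Stmts_Or_Newlines is _NAME_ == _NAME_ _NEWLINE_ _NEWLINE_ _ENDMARKER_"
       " <||> Err_Literals -> H Literals <++> InsertErr -> is"
       " <||> 1 <||> 33.0 <||> popular")

def parse_row(row):
    """Split a dataset row into tokns, tok_chgs, and the trailing fields."""
    fields = [f.strip() for f in row.split("<||>")]
    tokns, tok_chgs_raw = fields[0], fields[1]
    tok_chgs = [c.strip() for c in tok_chgs_raw.split("<++>")]
    return {"tokns": tokns.split(), "tok_chgs": tok_chgs, "rest": fields[2:]}

parsed = parse_row(ROW)
print(parsed["tok_chgs"])  # -> ['Err_Literals -> H Literals', 'InsertErr -> is']
```

This recovers the two erules, but as far as I can tell nothing in the row pins down where in tokns they apply.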

  • Mapping abstract tokens back to concrete source code. I see there is a procedure that decodes the abstract token sequence back to concrete tokens, but it seems to require the original code sequence, and some corruption may be possible during post-repair decoding. For example, if a _NAME_ or whitespace token is substituted, inserted, or deleted in the abstract token sequence, this can introduce cosmetic changes to parts of the input that are lexically identical in the abstract token sequence. Is there a way to map the tokenwise edits back to the exact character subsequence in the concrete source code while preserving the original formatting?
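To make that last question concrete, here is the kind of character-preserving splice I had in mind, using Python's tokenize module (which reports exact (row, col) spans for each token) so that an edit to one token leaves all other formatting untouched. This is only my own sketch, not the repo's decoding procedure:

```python
import io
import tokenize

def token_spans(src):
    """Return (string, start_offset, end_offset) for each token in src."""
    # Cumulative character offset of the start of each source line.
    line_starts = [0]
    for line in src.splitlines(keepends=True):
        line_starts.append(line_starts[-1] + len(line))
    spans = []
    for tok in tokenize.generate_tokens(io.StringIO(src).readline):
        (srow, scol), (erow, ecol) = tok.start, tok.end
        start = line_starts[srow - 1] + scol
        end = line_starts[erow - 1] + ecol
        spans.append((tok.string, start, end))
    return spans

def replace_token(src, index, new_text):
    """Splice new_text over the index-th token, preserving all other text."""
    _, start, end = token_spans(src)[index]
    return src[:start] + new_text + src[end:]

src = "x  ==  y\n"
print(replace_token(src, 1, "is"))  # -> "x  is  y\n"
```

Note the double spaces around the operator survive the edit. My worry is whether the abstract sequence retains enough positional information to drive splices like this, or whether the decoder has to re-render whitespace.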

It is also possible I am mistaken or misunderstanding an important detail. If so, any clarification would be welcome. Thank you!

cc: @jin-guo @XujieSi
