Could you release more details about the data cleaning? #7

Open
Casi11as opened this issue Nov 3, 2023 · 7 comments
Labels
research question Research questions.

Comments


Casi11as commented Nov 3, 2023

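The first data-cleaning step is currently the same as StarCoder's. I would like to learn how the later steps filter out low-quality code, code with syntax errors, and code with poor readability.

Thanks!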

guoday (Collaborator) commented Nov 4, 2023

A technical report will be released later.

soloice added the "research question" label Nov 4, 2023
Casi11as (Author) commented Nov 6, 2023

> A technical report will be released later.

OK, thanks a lot. I will keep following this.


i-love-doufunao commented Nov 9, 2023

We are also closely following how to preprocess code datasets, especially how to handle the dependencies among code files.

@Rosacess

  • Step 2: Parsing the dependencies of files within the same repository to rearrange the file positions based on their dependencies.
  • Step 3: Concatenating dependent files to form a single example and employ repo-level minhash for deduplication.
    Looking forward to more details on these two parts.
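
Until the technical report is out, the two quoted steps can be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's actual pipeline: the toy `REPO` contents, the regex-based import detection, the word-level shingles, and the 64-hash signature are all hypothetical choices made for this example.

```python
import hashlib
import re
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical toy repository (file path -> source text); not project data.
REPO = {
    "app.py": "import utils\nimport models\nprint(utils.helper())\n",
    "models.py": "import utils\nclass Model: pass\n",
    "utils.py": "def helper():\n    return 1\n",
}

# Assumption: dependencies are found with a simple regex over import lines.
IMPORT_RE = re.compile(r"^\s*(?:import|from)\s+(\w+)", re.MULTILINE)


def order_by_dependency(files):
    """Return file paths so that each file comes after the files it imports."""
    graph = {}
    for path, text in files.items():
        deps = {name + ".py" for name in IMPORT_RE.findall(text)}
        # Keep only dependencies that live in the same repository.
        graph[path] = sorted(d for d in deps if d in files and d != path)
    # Raises graphlib.CycleError on circular imports; a real pipeline
    # would need some policy for breaking cycles.
    return list(TopologicalSorter(graph).static_order())


def concat_repo(files):
    """Concatenate the files of one repository in dependency order."""
    return "\n".join(files[path] for path in order_by_dependency(files))


def shingles(text, k=5):
    """Set of k-word shingles of the text (a single shingle for short text)."""
    tokens = text.split()
    return {" ".join(tokens[i:i + k]) for i in range(max(1, len(tokens) - k + 1))}


def minhash_signature(text, num_hashes=64):
    """MinHash signature: for each salted hash, the minimum over all shingles."""
    shingle_set = shingles(text)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in shingle_set
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


example = concat_repo(REPO)             # one training example per repository
signature = minhash_signature(example)  # repo-level signature for dedup
```

Two repositories whose signatures give an `estimated_jaccard` above some chosen threshold would be treated as near-duplicates. The threshold, and whether LSH banding is used to avoid comparing all pairs, are not stated in this thread.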

@wyjksyjs

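> A technical report will be released later.

Will the technical report cover how the SFT data was constructed, and will the SFT data be open-sourced? Also, when will the technical report be released? Really looking forward to it 👍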

@i-love-doufunao

Is there any update on this?


ali8zake commented Dec 2, 2023

Bump.


7 participants