Add NetEaseCrowd dataset #101

Merged

Conversation

shenxiangzhuang
Contributor

Checklist

  • I have read the CONTRIBUTING document
  • I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en
  • My change requires a change to the documentation
  • I have updated the documentation accordingly
  • I have added tests to cover my changes
  • All new and existing tests passed

Dataset info

Adding our open-source dataset, NetEaseCrowd (https://github.com/fuxiAIlab/NetEaseCrowd-Dataset).

NetEaseCrowd is a large-scale dataset for long-term and online crowdsourcing truth inference, which contains about 2,400 workers, 1,000,000 tasks, and 6,000,000 annotations collected over 6 months. We believe that this dataset could be an invaluable asset to the Toloka/crowd-kit community by providing a new benchmark for crowdsourcing-related research and development.

@codecov-commenter

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.96%. Comparing base (07c4240) to head (08440a2).
Report is 34 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #101      +/-   ##
==========================================
+ Coverage   92.80%   92.96%   +0.15%     
==========================================
  Files          47       47              
  Lines        2070     2216     +146     
==========================================
+ Hits         1921     2060     +139     
- Misses        149      156       +7     
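The percentages in the table can be reproduced from the hit and line counts. The sketch below assumes that coverage is simply hits divided by lines; the helper name `coverage_pct` is made up for illustration and is not part of Codecov. The reported +0.15% seems to come from the unrounded percentages (a difference of about 0.158 points) rather than from the rounded ones, though how Codecov rounds the figure is an assumption here.

```python
def coverage_pct(hits: int, lines: int) -> float:
    """Return line coverage as a percentage (hits / lines * 100)."""
    return 100.0 * hits / lines


# Counts copied from the coverage diff above.
base = coverage_pct(hits=1921, lines=2070)  # main
head = coverage_pct(hits=2060, lines=2216)  # this pull request

# Misses are the lines that were not hit.
base_misses = 2070 - 1921
head_misses = 2216 - 2060

print(f"base={base:.2f}% head={head:.2f}% diff={head - base:+.3f} points")
print(f"misses: base={base_misses} head={head_misses}")
```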


Contributor

@pilot7747 pilot7747 left a comment

Hi @shenxiangzhuang! Thank you for contributing this dataset. Lgtm

@shenxiangzhuang
Contributor Author

Besides the CI tests, I also tried categorical aggregation on this dataset, and it works well:

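from crowdkit.aggregation import DawidSkene
from crowdkit.datasets import load_dataset

# Load the crowd annotations and the ground-truth labels
df, gt = load_dataset('netease_crowd')

# Run Dawid-Skene with 10 iterations
ds = DawidSkene(n_iter=10)
result = ds.fit_predict(df)

# Number of tasks that received an aggregated label
print(len(result))
# 999799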
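A natural follow-up check is to compare the aggregated labels with the ground truth. The sketch below is a minimal, dependency-free illustration and was not part of the original test: the helper `accuracy` and the toy values are hypothetical, and it assumes the ground truth may cover only a subset of the tasks.

```python
from typing import Hashable, Mapping


def accuracy(pred: Mapping[Hashable, Hashable],
             truth: Mapping[Hashable, Hashable]) -> float:
    """Share of tasks with a known true label whose predicted label matches it."""
    # Only tasks that appear in both mappings can be scored.
    shared = [task for task in pred if task in truth]
    if not shared:
        raise ValueError("no tasks in common between predictions and ground truth")
    correct = sum(pred[task] == truth[task] for task in shared)
    return correct / len(shared)


# Toy stand-ins for the aggregated labels and the ground truth (made-up values).
toy_pred = {"t1": "cat", "t2": "dog", "t3": "cat"}
toy_truth = {"t1": "cat", "t2": "cat"}  # t3 has no ground-truth label

print(accuracy(toy_pred, toy_truth))  # 0.5
```

With the real data, a call such as `accuracy(result.to_dict(), gt.to_dict())` should work, assuming `result` and `gt` are both pandas Series indexed by task.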

Collaborator

@dustalov dustalov left a comment

Thank you for a very well-done PR! I noticed a small imperfection in the dataset metadata. Could you please check my suggestion?

Co-authored-by: Dmitry Ustalov <[email protected]>
@shenxiangzhuang
Contributor Author

> Thank you for a very well-done PR! I noticed a small imperfection in the dataset metadata. Could you please check my suggestion?

Thanks a lot for your careful review!

@dustalov dustalov merged commit ec05dcc into Toloka:main Mar 12, 2024
5 checks passed
@dustalov
Collaborator

Great job, thank you again!


4 participants