organizer_guidance
This guidance is intended for people involved in organizing the hackathon. It aims to document all of the workflows used in the hackathon.
When a participant wants to add a new dataset as part of the hackathon, they will use the 'add dataset' issue template.
When an issue has been created via this form, it will have the `candidate-dataset` label. At this point, one of the organizers should review the dataset to decide whether it is suitable for inclusion in the hackathon/Datasets Hub.
To get an overview of issues with the `candidate-dataset` label, you can use the following filter: https://github.com/bigscience-workshop/lam/issues?q=label%3Acandidate-dataset+is%3Aopen+sort%3Acreated-asc. This will show you all issues with the `candidate-dataset` label, oldest first.
This is a basic checklist of things to consider when approving a `candidate-dataset`. If in doubt, ping other organizers and use the GitHub issue for the proposed dataset to ask questions or seek clarification from the proposer of the dataset.
- The dataset is not already available via the Hub (a quick programmatic check is sketched after this list)
- The dataset is not already on the project board
- The dataset has an open license that allows it to be shared (if unsure ping one of the other organizers)
- The dataset has some relation to LAM
- The dataset fits into one of the categories of data we're looking for
- The complexity of adding the dataset won't be too extreme (for some important/high-impact datasets, we may justify the extra work required)
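To complement searching the Hub in the browser for the first item above, you can also check programmatically. This is a minimal sketch assuming the `huggingface_hub` client library is installed; the search term is a placeholder.

```python
from huggingface_hub import list_datasets

# Search the Hub for datasets matching a keyword taken from the proposal.
# "chronicling america" is a placeholder search term.
for ds in list_datasets(search="chronicling america"):
    print(ds.id)
```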
When a dataset is added through the dataset suggestion issue template, it is first assigned the `candidate-dataset` label.
Once you are satisfied that a `candidate-dataset` is suitable for inclusion, you can move it to the project board by removing the `candidate-dataset` label and assigning the `dataset` label. This will add it to the project board for datasets.
Participants of the hackathon who want to work on a particular dataset will use the `#self-assign` command to 'claim' that dataset. We will encourage people to work on only one dataset at a time so we don't end up with too many unfinished datasets that are not available for others to work on.
Participants can ask for help by commenting `#help-wanted` on their issue. You can use this filter to find issues asking for help, sorted by oldest first.
When a participant has finished the work of making a dataset available via the Hub, they will use the `#ask-for-review` command to ask for feedback. You can get an overview of issues seeking review, sorted by oldest first, using this filter.
If you are available to review a dataset you should assign yourself to it so other organizers know that you are reviewing it. You can of course ping other organizers to get additional input when reviewing a dataset.
There are a few main things we want to check when reviewing a dataset. This list may expand as we gain more experience with potential issues during the hackathon.
- The name of the dataset is descriptive
- The name of the dataset reflects existing names for datasets (to help with SEO for datasets and to make links clearer)
- You are able to load (some of) the data (a quick check is sketched after this list)
  - Can you load some of the data using `load_dataset('biglam/NAMEOFDATASET')`?
  - If the dataset is very large, you may load just a portion of the data
- The datasets viewer displays the dataset preview successfully
Note that the preview will only work if the dataset is streamable, or if the repo contains a dataset_infos.json file that includes the sizes of each split (and then only for splits smaller than 100 MB). In the future the preview will work for all datasets regardless of their size and whether they are streamable, but this is not yet supported. You should therefore not take a failing preview as an indication that the dataset has an issue; however, if the preview works, that is a good sign that the script loads data successfully.
- The features for a dataset make sense. Sometimes this won't be straightforward but we want to try and catch obvious things that could be improved.
- Labels are stored as `ClassLabel`
  - Labels have useful names encoded whenever possible, i.e. prefer `positive`, `negative` to `LABEL_0`, `LABEL_1`
- Column names are descriptive/easily understood
  - If columns are not easily understood, make sure they are documented in the dataset card.
- Check that all of the data appears to be available (comparing against a source if this is easily done)
- Check that data appears consistent (i.e. there aren't stray file types you wouldn't expect)
- Whilst not necessary, if there are many files or large files then compressing them with "gzip", "bz2" or "xz" might make sense (a small sketch follows this list).
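As a concrete illustration of the loading and feature checks above, the sketch below loads a small slice of a dataset and inspects its features. The repo name is the placeholder from the checklist, and the `label` column name is an assumption used only for illustration.

```python
from datasets import ClassLabel, load_dataset

# Load only the first 100 rows of the train split so the check stays cheap
# even for large datasets (streaming=True is another option).
ds = load_dataset("biglam/NAMEOFDATASET", split="train[:100]")

# Label columns should ideally be ClassLabel with meaningful names,
# e.g. "positive"/"negative" rather than "LABEL_0"/"LABEL_1".
print(ds.features)

# "label" is a hypothetical column name used for illustration.
if "label" in ds.features and isinstance(ds.features["label"], ClassLabel):
    print(ds.features["label"].names)
```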
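If compressing large files does make sense, the standard library is enough; the file name below is a placeholder.

```python
import gzip
import shutil

# Compress a large CSV with gzip before adding it to the dataset repo.
# "data.csv" is a placeholder file name.
with open("data.csv", "rb") as src, gzip.open("data.csv.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)
```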
The things to look out for in datasets using a dataset loading script include many of the same considerations as above. For simple scripts which load from a CSV file, directory, etc. the main things to check are:
- the script doesn't load too much into memory at once (if avoidable), i.e. prefer a script that reads a CSV line by line rather than loading the whole CSV into memory (sketched after this list)
- the features make sense
- data isn't discarded without reason. Discarding data may sometimes be necessary, but it should be justified, for example, why a column isn't included in the dataset.
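To illustrate the memory point, a loading script's `_generate_examples` method can iterate over a CSV row by row rather than reading the whole file into memory. The sketch below assumes a simple CSV with hypothetical `text` and `label` columns and a local placeholder file path; a real script would normally obtain the file via `dl_manager`.

```python
import csv

import datasets


class ExampleDataset(datasets.GeneratorBasedBuilder):
    """Sketch of a loading script that reads a CSV one row at a time."""

    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features(
                {
                    "text": datasets.Value("string"),
                    "label": datasets.ClassLabel(names=["negative", "positive"]),
                }
            )
        )

    def _split_generators(self, dl_manager):
        # "data.csv" is a placeholder; usually the file would come from
        # dl_manager.download_and_extract(...).
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"filepath": "data.csv"},
            )
        ]

    def _generate_examples(self, filepath):
        # Read one row at a time instead of loading the whole CSV into memory.
        with open(filepath, newline="", encoding="utf-8") as f:
            for idx, row in enumerate(csv.DictReader(f)):
                yield idx, {"text": row["text"], "label": row["label"]}
```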
For more complex datasets the main things we will want to review include:
- the configurations (if any) make sense
- if the script uses an API, try to ensure that the script isn't making repeated requests where this can be avoided
- scripts dealing with APIs may also need additional error handling compared to scripts that load 'static' data (a retry sketch follows this list).
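For scripts that call an API, a common way to avoid hammering the endpoint with unguarded, repeated requests is to reuse a session with retries and backoff. A minimal sketch using `requests`, with a placeholder URL:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Reuse one session so connections are pooled rather than re-opened per request.
session = requests.Session()
retries = Retry(
    total=5,  # retry up to five times
    backoff_factor=1,  # exponential backoff between attempts
    status_forcelist=[429, 500, 502, 503, 504],  # rate limits / server errors
)
session.mount("https://", HTTPAdapter(max_retries=retries))

# "https://example.org/api/records" is a placeholder endpoint.
response = session.get("https://example.org/api/records", timeout=30)
response.raise_for_status()
records = response.json()
```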
This list isn't exhaustive. It's suggested that for more complex datasets a few people (or a `datasets` maintainer) are involved in the review.
Once a dataset has been reviewed and is complete, we can close the issue. This will mark the dataset as `done` on the project board. At the same time, you should remove your assignment from that issue so it is clearer who worked on adding that dataset.