
Guidance for organizers of the hackathon

This guidance is intended for people involved in organizing the hackathon. It aims to document all of the workflows used in the hackathon.

Reviewing candidate datasets

When a participant wants to add a new dataset as part of the hackathon, they will use the add dataset issue template.

When an issue has been created via this form, it will have the candidate-dataset label. At this point, one of the organizers should review the dataset to decide whether it is suitable for inclusion in the hackathon/Datasets Hub.

To get an overview of issues with the candidate-dataset label, you can use the following filter: https://github.com/bigscience-workshop/lam/issues?q=label%3Acandidate-dataset+is%3Aopen+sort%3Acreated-asc. This will show all open issues with the candidate-dataset label, oldest first.

Checklist for candidate datasets

This is a basic checklist of things to consider when approving a candidate-dataset. If in doubt, ping other organizers and use the GitHub issue for the proposed dataset to ask questions/seek clarity from the proposer of the dataset.

  • The dataset is not already available via the Hub
  • The dataset is not already on the project board
  • The dataset has an open license that allows it to be shared (if unsure ping one of the other organizers)
  • The dataset has some relation to LAM
  • The dataset fits into one of the categories of data we're looking for
  • The complexity of adding the dataset won't be too extreme (for some important/high-impact datasets, we may justify the extra work required)

Moving candidate datasets to the project board

When a dataset is added through the dataset suggestion issue template, it is first assigned the candidate-dataset label.

Once you are satisfied that a candidate-dataset is suitable for inclusion you can move it to the project board by removing the candidate-dataset label and assigning the dataset label. This will add it to the project board for datasets.

Assigning datasets to work on

Participants of the hackathon who want to work on a particular dataset will use the #self-assign command to 'claim' that dataset. We will encourage people to work on only one dataset at a time so we don't end up with too many unfinished datasets that are not available for others to work on.

Asking for help

Participants can ask for help by commenting #help-wanted on their issue. You can use this filter to find issues asking for help sorted by oldest first.

Reviewing uploaded datasets

When a participant has finished the work of making a dataset available via the Hub, they will use the #ask-for-review command to ask for feedback. You can get an overview of issues seeking review, sorted by oldest first, using this filter.

If you are available to review a dataset you should assign yourself to it so other organizers know that you are reviewing it. You can of course ping other organizers to get additional input when reviewing a dataset.

There are a few main things we want to check when reviewing a dataset. This list may expand as we gain more experience with potential issues during the hackathon.

General things to check

  • The name of the dataset is descriptive
  • The name of the dataset reflects any existing, established name for that dataset (to help with SEO and to make links clearer)
  • You are able to load (some) of the data
    • Can you load some of the data using load_dataset('biglam/NAMEOFDATASET')? (see the sketch after this list)
    • If the dataset is very large, you can load just a portion of the data (for example by streaming)
  • The datasets viewer displays the dataset preview successfully

Note: the preview will only work if the dataset is streamable, or if the repo contains a dataset_infos.json file that includes the sizes of each split (and then only for splits smaller than 100 MB). In the future the preview will work for all datasets regardless of their size and whether they are streamable, but this is not yet supported. You should therefore not take a failing preview as an indication that the dataset has an issue; however, if the preview works, that is a good sign that the script loads data successfully.

  • The features for a dataset make sense. Sometimes this won't be straightforward but we want to try and catch obvious things that could be improved.
    • Labels are stored as ClassLabel
    • Labels have useful names encoded whenever possible, i.e. prefer positive and negative to LABEL_0 and LABEL_1
  • Column names are descriptive/easily understood
    • If columns are not easily understood, make sure they are documented in the dataset card.
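
As a quick spot check, something like the following sketch can be used to confirm that (some of) the data loads and that label columns use ClassLabel with meaningful names. This is only an illustration: 'biglam/NAMEOFDATASET' is the placeholder used above, and streaming is just one way to avoid downloading a very large dataset in full.

```python
# A minimal reviewer spot check (illustrative only).
# 'biglam/NAMEOFDATASET' is a placeholder, not a real repository.
from datasets import ClassLabel, load_dataset

# Streaming avoids downloading very large datasets in full.
ds = load_dataset("biglam/NAMEOFDATASET", split="train", streaming=True)

# Look at a handful of examples to confirm the data loads and looks sensible.
for i, example in enumerate(ds):
    print(example)
    if i >= 4:
        break

# Check that label columns use ClassLabel with useful names
# (e.g. "positive"/"negative" rather than "LABEL_0"/"LABEL_1").
if ds.features is not None:
    for name, feature in ds.features.items():
        if isinstance(feature, ClassLabel):
            print(name, feature.names)
```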

For datasets that have uploaded the underlying data directly

  • Check that all of the data appears to be available (comparing against a source if this is easily done)
  • Check that data appears consistent (i.e. there aren't stray file types you wouldn't expect)
  • While not required, if there are many files or very large files, compressing them (e.g. with "gzip", "bz2" or "xz") may make sense; see the sketch below.
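
If compression is appropriate, one simple approach is to gzip files before upload; the datasets library can generally read gzip-compressed CSV/JSON files directly. The file name below is a hypothetical example, not part of the hackathon workflow.

```python
# A minimal sketch of compressing a large file before upload.
# "large_dataset.csv" is a hypothetical file name.
import gzip
import shutil

with open("large_dataset.csv", "rb") as src, gzip.open("large_dataset.csv.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)
```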

For datasets using dataset loading scripts

The things to look out for in datasets using a dataset loading script include many of the same considerations as above. For simple scripts which load from a CSV file, directory, etc. the main things to check are:

  • the script doesn't load too much into memory at once (if avoidable), i.e. prefer a script that reads a CSV line by line rather than loading the whole file into memory (see the sketch after this list)
  • the features make sense
  • data isn't discarded without reason. Discarding data may sometimes be necessary, but it should be justified why, for example, a column isn't included in the dataset.
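
As a rough illustration of the memory point above, a loading script's _generate_examples method can yield rows from a CSV one at a time rather than reading the whole file into memory. The column names here ("text", "label") are hypothetical; a real script would follow the dataset's actual schema.

```python
# A minimal sketch of a _generate_examples method (part of a dataset builder class)
# that reads a CSV row by row. Column names ("text", "label") are hypothetical.
import csv

def _generate_examples(self, filepath):
    with open(filepath, encoding="utf-8", newline="") as f:
        reader = csv.DictReader(f)
        for idx, row in enumerate(reader):
            yield idx, {
                "text": row["text"],
                "label": row["label"],
            }
```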

For more complex datasets the main things we will want to review include:

  • the configurations (if any) make sense
  • if the script uses an API, try to ensure that the script isn't making repeated requests where this can be avoided
  • scripts dealing with APIs may also need additional error handling compared to scripts that load 'static' data (see the sketch below).
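
For scripts that call an API, one simple pattern to look for is a small retry helper with backoff, so transient failures don't crash the whole build and the API isn't hit with immediate repeat requests. The helper below is a hypothetical sketch, not part of any particular dataset script; the URL and parameters are placeholders.

```python
# A hypothetical sketch of retrying API requests with backoff in a loading script.
import time

import requests

def fetch_with_retries(url, params=None, retries=3, backoff=2.0):
    """Fetch JSON from an API, retrying transient failures with a growing delay."""
    for attempt in range(retries):
        try:
            response = requests.get(url, params=params, timeout=30)
            response.raise_for_status()
            return response.json()
        except requests.RequestException:
            if attempt == retries - 1:
                raise
            time.sleep(backoff * (attempt + 1))
```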

This list isn't exhaustive. For more complex datasets, it's suggested that a few people (or a datasets maintainer) be involved in the review.

Marking as done

Once a dataset has been reviewed and is complete we can close the issue. This will mark the dataset as done on the project board. At the same time, you should remove your assignment from that issue so it is clearer who worked on adding that dataset.