Skip to content

Latest commit

 

History

History
24 lines (16 loc) · 1.85 KB

README.md

File metadata and controls

24 lines (16 loc) · 1.85 KB

ProtoQA Dataset

This repository contains the dataset for ProtoQA ("Family Feud"). See the paper for details on dataset creation.

Data Files:

Each line is a json dictionary, in which:

  • question contains the question (in original and a normalized form)
  • answers (where available) contains:
    • raw original answers provided by survey respondents (when available) with their counts
    • clusters which include the score for each cluster and the strings included in that cluster

For a full description of the data format, see DATAFORMAT.md.

File organization:

  • data/train/train.jsonl: 8781 instances for training or fine-tuning scraped from Family Feud fan sites (see paper). Scraped data has answer clusters with sizes, but only has a single string per cluster (corresponding to the original cluster name.
  • data/dev/dev.scraped.jsonl: 979 instances sampled from the same Family Feud data, for use in model validation and development.
  • data/dev/dev.crowdsourced.jsonl: 51 questions collected with exhaustive answer collection and manual clustering, matching the details of the eval test set (roughly 100 human answers per question).
  • data/test/test.questions.jsonl 102 questions for evaluation. (Note that the test set contains questions only.)

Notes:

This repository contains a data statement (based on Datasheets for Datasets (Gebru et al. 2020) and earlier NLP-specific work (Bender and Friedman 2018)) to provide transparency in data use and encourage others to do so. This is a preliminary version of the statement; please post issues in the repository or contact the authors if you have questions regarding the data details or suggestions regarding the dataset use.