FAERY

FAERY is a test collection for fine-grained dataset discovery, the task of answering an input with a ranked list of candidate datasets together with the fields of each candidate dataset that are relevant to the input. We provide experiments on both dataset discovery and explanation. For details about this test collection, please refer to our paper.

Datasets

We reused the 46,615 datasets collected from NTCIR. The "datasets.json" file (available at Zenodo) provides the id, title, description, tags, author, and summary of each dataset in JSON format.

{ 
  "id": "0000de36-24e5-42c1-959d-2772a3c747e7", 
  "title": "Montezuma National Wildlife Refuge: January - April, 1943", 
  "description": "This narrative report for Montezuma National Wildlife Refuge outlines Refuge accomplishments from January through April of 1943. ...", 
  "tags": ["annual-narrative", "behavior", "populations"], 
  "author": "Fish and Wildlife Service", 
  "summary": "Almost continuous rains during April brought flood conditions to the Clyde River as well as to the refuge storage pool. Cayuga Lake is at its highest level in about ton years. ..."
}
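To look up the metadata of a dataset by its ID, the records can be indexed in memory. The sketch below is only illustrative: it assumes the file holds either a JSON array or one JSON object per line (the exact layout of "datasets.json" is not stated here), and the function name `parse_datasets` and the two sample records are hypothetical.

```python
import json

def parse_datasets(text):
    """Return {id: record}. Accepts a JSON array or one JSON object per line.

    Both layouts are assumptions; check the downloaded file before relying on this.
    """
    try:
        records = json.loads(text)
        if isinstance(records, dict):  # a single object rather than an array
            records = [records]
    except json.JSONDecodeError:
        # Fall back to JSON Lines: one object per non-empty line.
        records = [json.loads(line) for line in text.splitlines() if line.strip()]
    return {record["id"]: record for record in records}

# Hypothetical records with the six keys listed above.
sample_array = json.dumps([
    {"id": "a", "title": "T1", "description": "D1", "tags": ["x"], "author": "A1", "summary": "S1"},
    {"id": "b", "title": "T2", "description": "D2", "tags": [], "author": "A2", "summary": "S2"},
])
sample_lines = "\n".join(json.dumps(r) for r in json.loads(sample_array))

by_id_array = parse_datasets(sample_array)
by_id_lines = parse_datasets(sample_lines)
```

For the real file, the text would be read with something like `open("datasets.json", encoding="utf-8").read()` before calling the function.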

Keyword Queries

The "./Data/queries.tsv" file provides 3,979 keyword queries. Each row represents a query with two "\t"-separated columns: query_id and query_text. The queries fall into two groups: generated queries, created from the metadata of datasets, and NTCIR queries, imported from the English part of NTCIR. IDs starting with "GEN_" denote generated queries used in LLM annotations, IDs starting with "NTCIR_1" denote NTCIR queries used in LLM annotations, and IDs starting with "NTCIR_2" denote NTCIR queries used in human annotations.
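The ID prefixes make it easy to separate the query groups when loading the file. The following is a minimal sketch, assuming the file has no header row; the function names and the two sample rows are hypothetical and only stand in for the real file.

```python
import csv
import io
from collections import Counter

def load_queries(lines):
    """Parse tab-separated rows of (query_id, query_text) into a dict."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {row[0]: row[1] for row in reader if len(row) >= 2}

def query_group(query_id):
    """Map a query ID to its group using the ID prefixes described above."""
    if query_id.startswith("GEN_"):
        return "generated (LLM-annotated)"
    if query_id.startswith("NTCIR_1"):
        return "NTCIR (LLM-annotated)"
    if query_id.startswith("NTCIR_2"):
        return "NTCIR (human-annotated)"
    return "unknown"

# Hypothetical rows standing in for ./Data/queries.tsv.
sample = io.StringIO("GEN_1\twildlife refuge report\nNTCIR_200000\tflood data\n")
queries = load_queries(sample)
counts = Counter(query_group(qid) for qid in queries)
```

With the real file, `load_queries` can be given an open file handle, for example `open("./Data/queries.tsv", encoding="utf-8", newline="")`.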

Qrels

The "./Data/human_annotated_qrels.json" file contains 7,415 qrels, and the "./Data/llm_annotated_qrels.json" file contains 122,585 qrels. Each JSON object has eight keys: query_id, target_dataset_id, candidate_dataset_id, qdpair_id (the ID of the query-target dataset pair), qrel (relevance of a candidate dataset to a query, 0: irrelevant; 1: partially relevant; 2: highly relevant), query_explanation, drel (relevance of a candidate dataset to a target dataset, 0: irrelevant; 1: partially relevant; 2: highly relevant), and dataset_explanation. The query_explanation and dataset_explanation are both binary lists of length 5, with one 0/1 entry per field of the candidate dataset in the order [title, description, tags, author, summary].

{
    "query_id": "NTCIR_200000", 
    "target_dataset_id": "002ece58-9603-43f1-8e2e-54e3d9649e84", 
    "candidate_dataset_id": "99e3b6a2-d097-463f-b6e1-3caceff300c9", 
    "qdpair_id": "1", 
    "qrel": 1, 
    "query_explanation": [1, 1, 1, 0, 0], 
    "drel": 2, 
    "dataset_explanation": [1, 1, 1, 1, 1]
}
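The explanation vectors can be turned back into field names by pairing each position with the field order given above. This is a small sketch that reads a 1 as "this field is relevant", which is our interpretation of the binary lists; the helper name `relevant_fields` is hypothetical.

```python
FIELDS = ["title", "description", "tags", "author", "summary"]

def relevant_fields(explanation):
    """Return the names of the fields flagged with 1 in a length-5 explanation."""
    if len(explanation) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} entries, got {len(explanation)}")
    return [field for field, flag in zip(FIELDS, explanation) if flag == 1]

# The example qrel shown above.
qrel = {
    "query_id": "NTCIR_200000",
    "target_dataset_id": "002ece58-9603-43f1-8e2e-54e3d9649e84",
    "candidate_dataset_id": "99e3b6a2-d097-463f-b6e1-3caceff300c9",
    "qdpair_id": "1",
    "qrel": 1,
    "query_explanation": [1, 1, 1, 0, 0],
    "drel": 2,
    "dataset_explanation": [1, 1, 1, 1, 1],
}

query_fields = relevant_fields(qrel["query_explanation"])
dataset_fields = relevant_fields(qrel["dataset_explanation"])
```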

Splits for Training, Validation, and Test Sets

To ensure that evaluation results are comparable, one should use the train-validation-test splits that we provide. We provide two ways of splitting the data into training, validation, and test sets. The "./Data/Splits/5-Fold_split" folder contains five sub-folders. Each sub-folder provides three qrel files for training, validation, and test sets, respectively. The "./Data/Splits/Annotators_split" folder contains three qrel files for training, validation, and test sets, respectively.

Baselines for Discovery

We have evaluated two sparse retrieval models, (1) TF-IDF based cosine similarity and (2) BM25, and five dense retrieval models: (3) BGE, (4) GTE, (5) Contextualized Late Interaction over BERT (ColBERTv2), (6) coCondenser, and (7) Dense Passage Retrieval (DPR). For reranking, we have evaluated four models: (1) Stella, (2) SFR-Embedding-Mistral, (3) GLM-4-Long, and (4) GLM-4-Air.

The details of the experiments are given in the corresponding section of our paper.

The "./Baselines" folder provides the results of each baseline method, where each JSON object is formatted as: {qdpair_id: {dataset_id: score, ...}, ...}.
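A result file in this format can be scored against the qrels by ranking the candidates of each qdpair_id by score. The sketch below computes NDCG@k with exponential gain as one common choice; the metrics and gain function actually used in our paper may differ, and the function names and toy IDs (`d1`, `d2`, `d3`) are hypothetical.

```python
import math
from collections import defaultdict

def build_gold(qrels, key="qrel"):
    """Group graded labels as {qdpair_id: {candidate_dataset_id: grade}}.

    Use key="qrel" for query relevance or key="drel" for dataset relevance.
    """
    gold = defaultdict(dict)
    for record in qrels:
        gold[record["qdpair_id"]][record["candidate_dataset_id"]] = record[key]
    return gold

def ndcg_at_k(scores, gold, k=10):
    """NDCG@k for one qdpair; unjudged candidates count as grade 0."""
    ranked = sorted(scores, key=scores.get, reverse=True)[:k]
    dcg = sum((2 ** gold.get(d, 0) - 1) / math.log2(i + 2) for i, d in enumerate(ranked))
    ideal = sorted(gold.values(), reverse=True)[:k]
    idcg = sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Toy qrels and two toy runs in the {qdpair_id: {dataset_id: score}} format.
qrels = [
    {"qdpair_id": "1", "candidate_dataset_id": "d1", "qrel": 2},
    {"qdpair_id": "1", "candidate_dataset_id": "d2", "qrel": 0},
    {"qdpair_id": "1", "candidate_dataset_id": "d3", "qrel": 1},
]
gold = build_gold(qrels)
good_run = {"1": {"d1": 0.9, "d3": 0.5, "d2": 0.1}}
bad_run = {"1": {"d2": 0.9, "d1": 0.5, "d3": 0.1}}

good_ndcg = ndcg_at_k(good_run["1"], gold["1"])
bad_ndcg = ndcg_at_k(bad_run["1"], gold["1"])
```

Averaging `ndcg_at_k` over all qdpair_id values in a test split gives a single score per baseline.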

Baselines for Explanation

We employed post-hoc explanation methods to identify which fields of the candidate dataset are relevant to the query or target dataset. We have evaluated four explainers using F1-score: (1) feature ablation explainer, (2) LIME, (3) SHAP, and (4) LLM. The first three need to be combined with the retrieval models.
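To illustrate how a post-hoc explainer is combined with a retrieval model, the sketch below shows the general idea of feature ablation: remove one field at a time, re-score the candidate, and flag the fields whose removal lowers the score. It is not the implementation in ./Code. The token-overlap scorer is a toy stand-in for a real retrieval model, the "any drop in score" rule is an assumption, and the sample record is hypothetical.

```python
FIELDS = ["title", "description", "tags", "author", "summary"]

def field_text(value):
    """Flatten a field value (string or list of strings) into plain text."""
    return " ".join(value) if isinstance(value, list) else str(value)

def overlap_score(query, dataset):
    """Toy relevance score: fraction of query tokens found in the dataset text."""
    q_tokens = set(query.lower().split())
    text = " ".join(field_text(dataset[f]) for f in FIELDS if f in dataset)
    d_tokens = set(text.lower().split())
    return len(q_tokens & d_tokens) / len(q_tokens) if q_tokens else 0.0

def ablation_explanation(query, dataset, score_fn):
    """Return a 0/1 list over FIELDS: 1 if dropping the field lowers the score."""
    full_score = score_fn(query, dataset)
    explanation = []
    for field in FIELDS:
        ablated = {k: v for k, v in dataset.items() if k != field}
        explanation.append(1 if score_fn(query, ablated) < full_score else 0)
    return explanation

# Hypothetical candidate record and query.
candidate = {
    "title": "Montezuma National Wildlife Refuge",
    "description": "narrative report",
    "tags": ["behavior", "populations"],
    "author": "Fish and Wildlife Service",
    "summary": "flood conditions",
}
explanation = ablation_explanation("refuge flood", candidate, overlap_score)
```

Here only the title and summary contain query tokens, so only those two positions are flagged. Swapping `overlap_score` for a function that wraps a sparse or dense retrieval model keeps the rest of the procedure unchanged.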

The "./Baselines" folder provides the results of each explainer, where each JSON object is formatted as: {qdpair_id: {dataset_id: {explanation_type: [0,1,1,0,0], ...}, ...}, ...}.
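A predicted explanation can be compared with the annotated query_explanation or dataset_explanation field by field. The following sketch computes F1 for a single pair of binary lists; how scores are averaged over pairs in our paper is not specified here, and returning 0.0 when there are no true positives is a convention chosen for this example.

```python
def explanation_f1(predicted, gold):
    """F1 between two equal-length 0/1 lists, treating 1 as the positive class."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    tp = sum(1 for p, g in zip(predicted, gold) if p == 1 and g == 1)
    fp = sum(1 for p, g in zip(predicted, gold) if p == 1 and g == 0)
    fn = sum(1 for p, g in zip(predicted, gold) if p == 0 and g == 1)
    if tp == 0:
        return 0.0  # also covers the case where both lists are all zeros
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Predicted fields [description, tags] against gold fields [title, description, tags].
score = explanation_f1([0, 1, 1, 0, 0], [1, 1, 1, 0, 0])
```

In this example precision is 1 and recall is 2/3, so the F1-score is 0.8.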

For specific experimental details and data, please refer to our paper.

Source Code

All source code of our implementation is provided in ./Code.

Dependencies

  • Python 3.9
  • rank-bm25
  • scikit-learn
  • sentence-transformers
  • faiss-gpu
  • ragatouille
  • tevatron
  • torch
  • shap
  • lime
  • zhipuai

Sparse Retrieval Models

See the code in ./Code/Retrieval/sparse.py for details.

Dense Retrieval Models

Unsupervised Models

See the code in ./Code/Retrieval/unsupervised_dense.py for details.

Supervised Models

Explanation Methods

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.