MS MARCO v2.1 and v2.1 segmented for TREC 2024 RAG #267

**Dataset Information:**

It would be great to have the document corpus used in [TREC RAG 2024](https://trec-rag.github.io/annoucements/2024-corpus-finalization/), along with its segmented counterpart, integrated into ir_datasets. Judging from the description on the web page, adding it should be straightforward. Random access to documents should also be very efficient, because the file and byte offset are already encoded in the document identifiers.
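To illustrate how a document identifier could be turned into a file and byte offset, here is a minimal sketch. It assumes identifiers have the shape `<prefix>_<file number>_<byte offset>`, for example `msmarco_v2.1_doc_29_677149`; that exact layout, and the names `DocLocation` and `parse_doc_id`, are assumptions for illustration and not part of ir_datasets.

```python
from typing import NamedTuple


class DocLocation(NamedTuple):
    """Where a document is assumed to live: which file, and at which byte."""
    file_no: str
    offset: int


def parse_doc_id(doc_id: str) -> DocLocation:
    # Assumption: the last two underscore-separated fields are the file
    # number and the byte offset. Split from the right because the prefix
    # itself contains underscores.
    _prefix, file_no, offset = doc_id.rsplit("_", 2)
    return DocLocation(file_no, int(offset))


location = parse_doc_id("msmarco_v2.1_doc_29_677149")
```

The file number is kept as a string so that any zero padding is preserved when building a file name from it.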

My only question: the document identifiers contain the offset in the file where a document starts, but not where it ends. Is there already functionality that seeks to the start offset and reads the JSON entry up to its closing bracket? If not, I could add this as well, together with unit tests.
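A rough sketch of what such a seek-and-read helper could look like is below. It assumes the corpus files are uncompressed and store one JSON object per line, so reading up to the next newline yields exactly one entry; if the files are not laid out that way, the end of the entry would have to be found differently. The function name `read_json_at` is hypothetical, and the demo writes a small temporary file rather than touching the real corpus.

```python
import json
import os
import tempfile


def read_json_at(path, offset):
    """Seek to a byte offset and parse the JSON object that starts there.

    Assumption: each JSON object occupies exactly one line, so the entry
    ends at the next newline.
    """
    with open(path, "rb") as f:
        f.seek(offset)
        return json.loads(f.readline())


# Demo on a tiny stand-in file, recording where each entry starts.
docs = [
    {"docid": "a", "body": "first {document}"},
    {"docid": "b", "body": "second"},
]
offsets = []
with tempfile.NamedTemporaryFile("wb", suffix=".jsonl", delete=False) as tmp:
    for doc in docs:
        offsets.append(tmp.tell())
        tmp.write(json.dumps(doc).encode("utf-8") + b"\n")
    demo_path = tmp.name

second = read_json_at(demo_path, offsets[1])
os.remove(demo_path)
```

Opening the file in binary mode matters here: the offsets are byte positions, so seeking in text mode could land in the wrong place when documents contain multi-byte characters.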

**Links to Resources:**

- The dataset description: https://trec-rag.github.io/annoucements/2024-corpus-finalization/
- The document dataset (28GB): https://msmarco.z22.web.core.windows.net/msmarcoranking/msmarco_v2.1_doc.tar
- The segmented document dataset (25GB): https://msmarco.z22.web.core.windows.net/msmarcoranking/msmarco_v2.1_doc_segmented.tar

**Dataset ID(s) & supported entities:**

 - `msmarco-document-v2.1`: for the original documents
 - `msmarco-document-v2.1/segmented`: for the segmented documents

**Checklist**

Mark each task once completed. All should be checked prior to merging a new dataset.
  
 - [x] Dataset definition (in `ir_datasets/datasets/[topid].py`)
 - [x] Tests (in `tests/integration/[topid].py`)
 - [x] Metadata generated (using `ir_datasets generate_metadata` command, should appear in `ir_datasets/etc/metadata.json`)
 - [ ] Documentation (in `ir_datasets/etc/[topid].yaml`)
   - [ ] Documentation generated in https://github.com/seanmacavaney/ir-datasets.com/
 - [x] Downloadable content (in `ir_datasets/etc/downloads.json`)
   - [ ] Download verification action (in `.github/workflows/verify_downloads.yml`). Only one needed per `topid`.
   - [x] Any small public files from NIST (or other potentially troublesome files) mirrored in https://github.com/seanmacavaney/irds-mirror/. Mirrored status properly reflected in `downloads.json`.
  
**Additional comments/concerns/ideas/etc.**

