This repository was archived by the owner on Oct 31, 2023. It is now read-only.

CC-100 in statmt version is different from paper #48


Description

@nbqu

Hi, first of all, thank you for your great work on multilingual NLP.
I'm trying to replicate XLM-R in my own research, and I found that the corpus hosted on statmt is very different from the description in the XLM-R paper.
For example, for Esperanto the paper reports 157M tokens, but the statmt version actually contains about 290M tokens.
I tokenized with both sentencepiece + fairseq-preprocess and the transformers tokenizer (xlm-roberta-base) to double-check.

I would guess the content of the two corpora is similar (I know CC-100 was built by web scraping) since they have a similar file size (0.9 GiB), but what makes the token counts so different?
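For reference, the double-check above can be sketched as a small token-counting script. This is a minimal sketch, not the exact commands I ran: the `count_tokens` helper is generic, and the corpus path `eo.txt` plus the `transformers` usage in the comment are assumptions.

```python
def count_tokens(lines, tokenize):
    """Sum the number of tokens `tokenize` produces over all lines of a corpus."""
    return sum(len(tokenize(line)) for line in lines)

# With HuggingFace transformers installed, the xlm-roberta-base check would be:
#   from transformers import AutoTokenizer
#   tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
#   with open("eo.txt", encoding="utf-8") as f:   # hypothetical corpus path
#       print(count_tokens(f, lambda s: tok.tokenize(s.strip())))

# Quick sanity check of the helper with a trivial whitespace tokenizer:
print(count_tokens(["la hundo kuras", "bona tago"], str.split))
```

The same helper works with a raw `sentencepiece` model (`sp.encode(line, out_type=str)`), which is how the fairseq-preprocess count was cross-checked.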
