This repository was archived by the owner on Oct 31, 2023. It is now read-only.

CC-100 in statmt version is different from paper #48


Description

@nbqu

Hi, first of all, thank you for your great work on multilingual NLP.
I'm trying to replicate XLM-R in my own research, and I found that the corpus hosted on statmt is very different from the description in the XLM-R paper.
For example, for Esperanto the paper reports 157M tokens, but the statmt version actually contains about 290M tokens.
I tokenized with both sentencepiece + fairseq-preprocess and the transformers tokenizer (xlm-roberta-base) to double-check.

I would guess the content of the two corpora is similar (I know CC-100 was built by web scraping) since they have a similar file size (0.9 GiB), but what makes the token counts so different?
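For reference, the double-check above can be sketched as a small token-counting script. This is a minimal sketch, not the exact commands I ran: the `count_tokens` helper is generic, and the corpus path `eo.txt` plus the `transformers` usage in the comment are assumptions.

```python
def count_tokens(lines, tokenize):
    """Sum the number of tokens `tokenize` produces over all lines of a corpus."""
    return sum(len(tokenize(line)) for line in lines)

# With HuggingFace transformers installed, the xlm-roberta-base check would be:
#   from transformers import AutoTokenizer
#   tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
#   with open("eo.txt", encoding="utf-8") as f:   # hypothetical corpus path
#       print(count_tokens(f, lambda s: tok.tokenize(s.strip())))

# Quick sanity check of the helper with a trivial whitespace tokenizer:
print(count_tokens(["la hundo kuras", "bona tago"], str.split))
```

The same helper works with a raw `sentencepiece` model (`sp.encode(line, out_type=str)`), which is how the fairseq-preprocess count was cross-checked.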
