Have some en-ha and en-ig parallel data from Gourmet and ParaCrawl #129
Hi! In a collaboration between https://gourmet-project.eu/ and https://paracrawl.eu/, we have some parallel corpora. It's so new that we haven't linked to it from the website yet.
The raw data comes from Internet Archive WIDE0006, Internet Archive WIDE00015, and our own crawl. Our own crawl targeted sites in CommonCrawl that had enough content in at least two EU languages, but we then crawled each whole domain.
Text:
https://s3.amazonaws.com/web-language-models/paracrawl/bonus/en-ha.txt.gz
https://s3.amazonaws.com/web-language-models/paracrawl/bonus/en-ig.txt.gz
The same in TMX:
https://s3.amazonaws.com/web-language-models/paracrawl/bonus/en-ha.tmx.gz
https://s3.amazonaws.com/web-language-models/paracrawl/bonus/en-ig.tmx.gz
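For a quick look at the released text, something along these lines works. This is a minimal sketch assuming the usual ParaCrawl plain-text release layout of one aligned pair per line, English and the target language separated by a tab; the column order and the absence of extra score columns are assumptions, so check a few raw lines first:

```python
import gzip

# Minimal sketch: print the first few sentence pairs from the en-ha release.
# Assumes one tab-separated pair per line, English in the first column.
with gzip.open("en-ha.txt.gz", "rt", encoding="utf-8") as f:
    for i, line in enumerate(f):
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            print(f"EN: {fields[0]}")
            print(f"HA: {fields[1]}")
            print()
        if i >= 4:  # stop after five lines
            break
```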
Comments

Thanks @kpu! Will add it to our dataset list. Have noticed it's largely religious - I'd imagine it boils down to being JW300 + the Quran - do you have any sense of what else might have ended up in there?

By the way, if you want the really noisy stuff before cleaning:
https://s3.amazonaws.com/web-language-models/paracrawl/bonus/en-ha.classified.gz
https://s3.amazonaws.com/web-language-models/paracrawl/bonus/en-ig.classified.gz
Taking a skim over the top sites: gospelgo.com is just Bible quotes. We seem to have picked up a lot of MT output from the transposh.org plugin: newsrule.com, e-activo.org, transposh.org, and datemypet.com. I should feed back a detector for transposh and throw all of that out. Domains in en-ha:

And ig:

That's super cool, thanks for sharing @kpu! What's the license for this data?
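The "skim over the top sites" mentioned above could be reproduced roughly as follows. The column layout of the .classified.gz dumps is an assumption here (source URLs in the first two tab-separated fields); inspect a few lines of the real file before trusting the counts:

```python
import gzip
from collections import Counter
from urllib.parse import urlparse

# Rough sketch: tally domains seen in the noisy dump and print the top 20.
# Assumption: each line is tab-separated with URLs in the first two fields.
counts = Counter()
with gzip.open("en-ha.classified.gz", "rt", encoding="utf-8", errors="replace") as f:
    for line in f:
        for field in line.split("\t")[:2]:
            if field.startswith("http"):
                counts[urlparse(field).netloc] += 1

for domain, n in counts.most_common(20):
    print(f"{n}\t{domain}")
```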