Have some en-ha and en-ig parallel data from Gourmet and ParaCrawl #129

Open
kpu opened this issue Oct 16, 2020 · 4 comments

@kpu
Contributor

kpu commented Oct 16, 2020

Hi! In a collaboration between https://gourmet-project.eu/ and https://paracrawl.eu/, we have some parallel corpora. They're so new we haven't linked to them from the website yet.

The raw data comes from Internet Archive WIDE0006, Internet Archive WIDE00015, and our own crawl. Our own crawl targeted sites in CommonCrawl that had enough content in at least two EU languages, but we then crawled each whole domain.

Text:
https://s3.amazonaws.com/web-language-models/paracrawl/bonus/en-ha.txt.gz
https://s3.amazonaws.com/web-language-models/paracrawl/bonus/en-ig.txt.gz

The same in TMX:
https://s3.amazonaws.com/web-language-models/paracrawl/bonus/en-ha.tmx.gz
https://s3.amazonaws.com/web-language-models/paracrawl/bonus/en-ig.tmx.gz
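
For anyone who wants to poke at the text files after downloading them, here is a minimal sketch of a reader. It assumes one sentence pair per line with the English side first and a tab between the two sides; that layout is not stated in this thread, so check a few lines before relying on it. The function name `read_pairs` is made up for illustration.

```python
import gzip
from typing import Iterator, Tuple


def read_pairs(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (english, other) sentence pairs from a gzipped text file.

    Assumption (not confirmed in the thread): one pair per line,
    English first, fields separated by a tab. Lines with fewer than
    two fields are skipped; extra fields are ignored.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2:
                yield fields[0], fields[1]
```

Usage would look like `for en, ha in read_pairs("en-ha.txt.gz"): ...` on a locally downloaded copy.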

@jaderabbit
Member

Thanks @kpu! We'll add it to our dataset list.

I've noticed it's largely religious. I'd imagine it boils down to JW300 plus the Quran. Do you have any sense of what else might have ended up in there?

A quick wordcloud: [image]

@kpu
Contributor Author

kpu commented Oct 16, 2020

By the way, if you want the really noisy stuff before cleaning:
https://s3.amazonaws.com/web-language-models/paracrawl/bonus/en-ha.classified.gz
https://s3.amazonaws.com/web-language-models/paracrawl/bonus/en-ig.classified.gz

Taking a skim over the top sites:

gospelgo.com: just Bible quotes
islamhouse.com: religious, but not necessarily Quran quotes
builttobrag.com: religious, but not necessarily Bible quotes
jw.org: you hopefully already have this
www.grace-and-truth.net and www.waters-of-life.net: sound pretty religious

We seem to have picked up a lot of MT output from the transposh.org plugin: newsrule.com, e-activo.org, transposh.org, and datemypet.com. I should feed a Transposh detector back into the pipeline and throw all of that out.

Domains in en-ha:

  49062 gospelgo.com
  34275 islamhouse.com
  13150 builttobrag.com
   4332 www.grace-and-truth.net
   3428 www.waters-of-life.net
    974 wol.jw.org
    806 www.jw.org
    562 www.datemypet.com
    520 simonlawton.com
    439 wap.divinerevelations.info
    439 divinerevelations.info
    372 www.mcreveil.org
    353 www.gotquestions.org
    348 www.bitbybitbook.com
    328 therefugeecenter.org
    240 ukraine.admission.center
    204 www.wor.org
    192 gotquestions.org
    166 bellyfatlossreview.com
    164 www.gsm-dinleme.com
    129 jesusforafrica.net
    116 newsandfeaturesonindonesia.blogspot.com
    104 www.healthworksnewcastle.org.uk
     94 gospel.net
     93 macgateway.com
     88 lightministries.com
     84 newsrule.com
     78 www.mystorywithgod.com
     78 englishteacherfred.com
     76 www.faithfulwordbaptist.org
     60 shroudofturinnews.com
     50 www.dw.com
     43 alwilayahnews.net
     42 www.sayadi-al-nas.com
     42 www.sayadi-al-nas.ae
     42 sayadi-al-nas.ae
     34 dateandtime.info
     33 pastorpaulvbsblog.blogspot.in
     28 www.iroy.in
     25 ldsecrets.com
     25 kelleysview.com
     25 jesusministriesexhortations.blogspot.de
     22 nuriddeen.blogspot.it
     21 sayadi-al-nas.com
     19 www.juliettebaysham.co.uk
     18 www.morehacks.net
     17 davidstoutconsulting.com
     16 www.xyxx.com.au
     14 nuriddeen.blogspot.co.uk
     14 islamland.com
     14 advancedlawofattractiontraininginstitute.com
     13 nuriddeen.blogspot.co.il
     12 mt4indicators.com
     12 harkarmusulunci.org
     12 crushtheciaexam.com
     10 www.alhassanain.com
     10 www.agrosoftltd.com
     10 tambayadaamsa.blogspot.com
     10 icine.org
      9 archbishop-cranmer.blogspot.ie
      8 hinhdep.com.vn
      7 mormanity.blogspot.co.il
      7 kingswaybibleschool.co.za
      5 www.4laws.com
      5 sunnibook.blogspot.co.uk
      5 rasululaazam.org
      5 harunaabubakarshika.blogspot.com.ng
      5 anhducblogs.blogspot.fr
      5 abandonware.com
      4 www.caseguru.com
      4 www.bbc.com
      4 languagesoftheworld.co.uk
      4 kratomonline.org
      4 inamafita.blogspot.de
      4 global.bfsu.edu.cn
      4 fasahar-intanet.blogspot.com.ng
      4 alhassanain.com
      3 www.sathyaananda.it
      3 taskarkanywood.blogspot.com
      3 nuriddeen.blogspot.com
      3 ministryhouse.org
      3 manessmorrison2.blogspot.com
      3 jerusalemgraffiti.com
      3 israelect.com
      3 ismamedicalcareinitiatives.blogspot.com
      3 fasahar-intanet.blogspot.in
      3 fasahar-intanet.blogspot.co.uk
      3 ericmansfield.blogspot.de
      3 eastsidebaptistkm.org
      3 del-lords.com
      3 codingbytodesign.net
      3 azrefs.org
      2 www.marysrosaries.com
      2 www.havenproject-hull.org.uk
      2 www.equalityontrial.com
      2 thelatterdays.blogspot.nl
      2 sureofheaven.blogspot.com
      2 standrews-saskatoon.net
      2 sherilynshines.blogspot.ca
      2 ramadan-1428.blogspot.com
      2 ragmopandgoose.com
      2 munbarin-musulunci.blogspot.it
      2 meridianflights.com
      2 knittedbygodsplan.blogspot.com.br
      2 kimiyyah.blogspot.in
      2 hdfree.se
      2 hau.timegenie.com
      2 hausa.irib.ir
      2 hanaonline.co.uk
      2 halofanon.wikia.com
      2 freethoughtblogs.com
      2 espvisuals.blogspot.hk
      2 duniyarcomputer.com
      2 dickinsonadventures.com
      2 coastalresearch.org
      2 clevelandpriest.blogspot.com
      2 blbooks.blogspot.ch
      2 bahaushensabonkarni.blogspot.com
      2 agajingi1.blogspot.com
      1 www.yardsalebloodbath.com
      1 www.timegenie.com
      1 www.pillartopost.org
      1 www.hurog.com
      1 www.ewtn.com
      1 whitefieldsprayer.blogspot.jp
      1 trustyourlife.blogspot.com
      1 studentofmotherhood.blogspot.ca
      1 streema.com
      1 stmichael-delaware-oca.org
      1 sexualobjectification.blogspot.com.es
      1 royaparsay.blogspot.ca
      1 raisethethunderbeam.blogspot.fr
      1 quoradimonds.blogspot.in
      1 perfumedkisses2.blogspot.it
      1 news.bbc.co.uk
      1 mystical-politics.blogspot.de
      1 members.tripod.com
      1 inamafita.blogspot.co.uk
      1 huboutourvillegenealogy.com
      1 harunaabubakarshika.blogspot.it
      1 gidandabino.blogspot.de
      1 forumresor.se
      1 en.a9.com.tr
      1 ctvc.se
      1 cielodrive.com
      1 carmanlicciardello.blogspot.ca
      1 battleshippretension.com
      1 basicchristian.info
      1 atelim.com
      1 amen.dk
      1 aliyahbyaccident.blogspot.co.id
      1 abuashar.blogspot.com

Domains in en-ig:

  19338 gospelgo.com
  17542 builttobrag.com
   9178 www.e-activo.org
   5530 newsrule.com
   4536 www.datemypet.com
   4388 transposh.org
   3000 www.waters-of-life.net
   2964 www.gfesport.com
   2842 ig.usa-casino-online.com
   2778 mt4indicators.com
   2190 trenboloneresults.com
   2114 www.bitbybitbook.com
   1960 www.healthworksnewcastle.org.uk
   1894 spyera.com
   1796 jobdescriptionsample.org
   1795 www.jw.org
   1748 crushtheciaexam.com
   1696 www.morehacks.net
   1650 www.the-tailoress.com
   1614 usa-casino-online.com
   1366 mobhax.com
   1346 ispyoo.com
   1276 fr.glosbe.com
   1151 wol.jw.org
   1121 bellyfatlossreview.com
   1104 newsandfeaturesonindonesia.blogspot.com
   1062 meridianflights.com
   1039 www.iroy.in
    929 autocarandinsurance.com
    914 tortlay.com
    856 dogma.swiftspirit.co.za
    852 www.parisdakar.it
    793 glosbe.com
    722 www.wayscan.com
    718 kelleysview.com
    623 thetopbestdeals365.co.uk
    622 simonlawton.com
    554 www.faithfulwordbaptist.org
    470 www.neu-presse.de
    464 it.glosbe.com
    414 www.hkmelamine.com
    393 3x247.com
    310 ms.glosbe.com
    306 hu.glosbe.com
    271 nootropicsreview.org
    262 www.gsm-dinleme.com
    256 powershell-guru.com
    250 www.chrysangifts.com
    248 softplug.com
    245 beththompsonmarketing.com
    228 www.promoearte.it
    227 datarecoverycompany.net
    218 englishteacherfred.com
    206 machannkay.com
    156 golftipreview.com
    156 es.glosbe.com
    154 hinhdep.com.vn
    151 nl.glosbe.com
    146 examprepbooks.com
    145 westbrookhousing.org
    140 www.atlantacleaningexperts.com
    138 de.glosbe.com
    136 abacre.com
    130 www.bestwaytowhitenteethguide.org
    126 vitalizedwater.net
    124 www.gotquestions.org
    114 advancedlawofattractiontraininginstitute.com
    100 cheers4health.com
     99 ur.glosbe.com
     96 www.mystorywithgod.com
     95 vitalizerplus.net
     94 gospel.net
     90 graphicsecurity.com
     88 ltool.net
     86 gotquestions.org
     84 www.abacre.com
     80 educationbro.com
     76 shroudofturinnews.com
     74 vitalizerplusmineralbasket.com
     73 en.glosbe.com
     72 thegarrisoncenter.org
     68 www.ltool.net
     62 www.hanskottke.de
     54 www.pennyauctionwizards.com
     54 installmobilespy.com
     54 celltechnutrition.com
     52 www.zhitov.ru
     52 www.realdevil.info
     52 sw.glosbe.com
     50 codingbytodesign.net
     50 amara.org
     42 landscapersgreenvillesc.com
     42 dancinginmyheels.com
     40 monsoonbiz.com
     38 www.english-video.net
     38 crushthecfpexam.com
     34 www.unicode.org
     34 www.lovemediasoft.com
     34 ewenchiabook.com
     33 www.kfflooring.com
     33 independentflorida.com
     33 growfunnel.com
     32 michaelhidalgo.net
     32 id.glosbe.com
     32 icine.org
     32 abacre.net
     31 vitalizerplus.me
     30 kairosplanet.web.tr
     28 ilanguages.org
     28 freshgamehacks.com
     26 learn101.org
     25 vitalizerplus.biz
     25 emergencywaterdamagecleanup.com
     24 www.lotteryextreme.com
     24 tl.glosbe.com
     24 ka.glosbe.com
     22 sanjoserealestatelosgatoshomes.com
     20 www.uk-business-plans.co.uk
     20 www.jobwhip.com
     20 vitalizerplus.info
     20 telefonnummervon.com
     20 kratompowders.org
     20 hcspeech.com
     19 hanaonline.co.uk
     18 www.promolux.com
     16 www.expertmortgage.biz
     16 waelbadawy.com
     16 tv-online.in
     16 statenislandpaintexperts.com
     16 rentonroofingcontractor.com
     15 igbounionfreiburg.de
     14 vitalizer-plus.vitalizerplus.net
     14 kratomonline.org
     13 yesmoneyyes.com
     12 obianyanwu.com
     10 www.districtcolumbia.com
     10 www.caseguru.com
     10 sde.tw
     10 mobilespytrial.com
     10 gayhub.ru
      9 www.bluesummary.com
      9 tv-online.im
      9 timeinnkwalini.onestophoteldeals.com
      9 timeinchillon.onestophoteldeals.com
      9 spintaxplrarticles.com
      8 support.mozilla.org
      8 ldspianohymns.com
      8 ketosisfatlossdiet.com
      8 jesusforafrica.net
      6 painafterrootcanalguide.net
      6 horrorwits.com
      6 hazzy.harrysoft.co.uk
      6 genuinelyabsurd.com
      6 avibase.bsc-eoc.org
      5 avilestoro.de
      5 alexandersalazarfineart.com
      4 www.havenproject-hull.org.uk
      4 www.euro2016-tickets.com
      4 vanipedia.org
      4 textclips.it
      4 recetadecupcakes.com
      4 painreliefpainpatches.com
      4 mapcarta.com
      4 is.usa-casino-online.com
      4 ig.wikipedia.org
      4 igbounionfreiburg.com
      4 gloria.tv
      4 en.wikipedia.org
      4 busindia.com
      4 7figureautomation.com
      3 yanthor.com
      3 www.timegenie.com
      3 www.lds.org
      3 www.javierartiles.com
      3 www.illustrators-online.net
      3 www.enterthehealingschool.org
      3 www.dlsoftware.net
      3 www.blackpeoplebusiness.com
      3 ibo.timegenie.com
      3 enterthehealingschool.org
      3 dragoncityhackandcheats.xyz
      2 www.womenpriests.org
      2 www.glbtguide.com
      2 www.econofrost.com
      2 www.carpepotentia.com
      2 www.bestdraincleaners.com
      2 www.answershack.com
      2 www.24faster.com
      2 qlranks.com
      2 nanoinformatica.com
      2 mobile.cardiffcityrumours.co.uk
      2 immolucky.com
      2 fighterstalk.com
      2 fighters-quest.com
      2 fatcutters.com
      2 dominicweb.eu
      2 crushthegretest.com
      2 animalcoloring.blogspot.com
      1 www.repetitivestrain.org
      1 www.auspisoft.com
      1 www.albionchoir.org.uk
      1 vitalizerplusmineralcube.com
      1 psychologytomorrowmagazine.com
      1 julie-compton.com
      1 imonews24.com
      1 guntherspaps.blogspot.ca
      1 dotted-carrier-798.appspot.com
      1 autolicenseplate.com
      1 angrydr.blogspot.sg
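
Per-domain counts like the two listings above can be reproduced from any list of source URLs. The sketch below is one way to do it, not the script actually used here; the function name `count_domains` is hypothetical, and it assumes every URL carries a scheme such as `http://`.

```python
from collections import Counter
from typing import Iterable, List, Tuple
from urllib.parse import urlsplit


def count_domains(urls: Iterable[str]) -> List[Tuple[str, int]]:
    """Count sentence-pair source URLs per hostname, most frequent first.

    Strings without a scheme (for example "jw.org/ha") have no parsed
    hostname and are skipped. "www.jw.org" and "wol.jw.org" are kept
    as separate entries, matching the listings in this thread.
    """
    counts: Counter = Counter()
    for url in urls:
        host = urlsplit(url).hostname
        if host:
            counts[host] += 1
    # Sort by descending count, then alphabetically for stable output.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

Printing each entry with `f"{n:7d} {host}"` gives the same right-aligned layout as the listings.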

@juliakreutzer
Collaborator

That's super cool, thanks for sharing @kpu! What's the license for this data?

@kpu
Contributor Author

kpu commented Oct 20, 2020

License is the usual one on paracrawl.eu.

So much of this is machine translated, though. The translations are most likely watermarked by Google, but to my knowledge Google has not documented the hash function.

Some translation plugins leave calling cards in the HTML:

We're going to go back to the original HTML and throw out pages with machine translation indicators like these.
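
A page-level check along these lines could be as simple as a substring scan over the raw HTML. The sketch below is only an illustration: the marker list is hypothetical, since the actual strings each plugin leaves behind are not given in this thread, and `find_mt_markers` is a made-up name. The single example marker assumes the plugin's name shows up somewhere in the markup.

```python
from typing import Iterable, List


def find_mt_markers(html: str, markers: Iterable[str]) -> List[str]:
    """Return the markers that occur in the page, ignoring case."""
    lowered = html.lower()
    return [marker for marker in markers if marker.lower() in lowered]


# Hypothetical marker list. The real strings would need to be collected
# by inspecting pages known to come from each translation plugin.
EXAMPLE_MARKERS = ["transposh"]
```

A page would be discarded when `find_mt_markers(page_html, EXAMPLE_MARKERS)` returns a non-empty list. A plain substring test will also flag pages that merely mention a plugin, so real markers should be as specific as possible.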

After that, I'd really appreciate help from the community in identifying domains with obvious MT output (which should be easier for low-resource languages!) so we can ban them and release a cleaner corpus.
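
Once a ban list exists, applying it is straightforward. Here is a minimal sketch, assuming each sentence pair is available together with its source URL as a `(url, english, other)` tuple; that record shape and the name `drop_banned` are assumptions for illustration. The example set holds the four domains named earlier in this thread as carrying Transposh output.

```python
from typing import Iterable, Iterator, Set, Tuple
from urllib.parse import urlsplit

# Assumed record shape: (source URL, English text, other-language text).
Record = Tuple[str, str, str]


def drop_banned(records: Iterable[Record], banned: Set[str]) -> Iterator[Record]:
    """Yield only the records whose source domain is not banned."""
    for record in records:
        host = urlsplit(record[0]).hostname or ""
        # Treat "www.example.com" and "example.com" as the same site.
        if host.startswith("www."):
            host = host[4:]
        if host not in banned:
            yield record


# Domains identified in this thread as carrying Transposh MT output.
BANNED = {"newsrule.com", "e-activo.org", "transposh.org", "datemypet.com"}
```

Only a leading `www.` is stripped, so other subdomains would have to be listed explicitly.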
