
Notes on the wikipedia miner dump processing on Hadoop

Daan Odijk edited this page Mar 12, 2014 · 5 revisions

Following the instructions at http://wikipedia-miner.cms.waikato.ac.nz/wiki/Wiki.jsp?page=Extracting%20CSV%20Summaries

  • download XXwiki-[DUMPDATE]-pages-articles.xml.bz2 from http://dumps.wikimedia.org/ (e.g., http://dumps.wikimedia.org/nlwiki/20130318/nlwiki-20130318-pages-articles.xml.bz2) and decompress it.
    • $ wget http://dumps.wikimedia.org/nlwiki/20130318/nlwiki-20130318-pages-articles.xml.bz2
    • $ bunzip2 nlwiki-20130318-pages-articles.xml.bz2
  • create a directory on HDFS and upload the decompressed dump file
    • $ hadoop dfs -mkdir /user/emeij/wikipedia/nl/20130318
    • $ hadoop dfs -put nlwiki-20130318-pages-articles.xml /user/emeij/wikipedia/nl/20130318/
  • modify languages.xml (see http://wikipedia-miner.cms.waikato.ac.nz/wiki/Wiki.jsp?page=Language%20dependent%20configuration) and upload it to HDFS.
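The edited languages.xml has to end up at the HDFS path later passed to the DumpExtractor (wikipedia-miner/languages.xml in the commands below). A minimal sketch, assuming the file was edited locally in the current directory:

```shell
# Create the shared config directory on HDFS (the wikipedia-miner/ path
# matches the second argument of the extraction commands below)
hadoop dfs -mkdir wikipedia-miner
# Upload the edited language configuration
hadoop dfs -put languages.xml wikipedia-miner/
```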
  • find a sentence detection model for the dump's language and upload it to HDFS
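The extraction job expects an OpenNLP sentence-detection model (nl-sent.bin in the commands below). A sketch, assuming the pre-trained OpenNLP 1.5 models are used (the download URL is an assumption; pick the model matching your dump's language):

```shell
# Download a pre-trained OpenNLP sentence model for Dutch
# (URL assumed; substitute de-sent.bin / en-sent.bin for other languages)
wget http://opennlp.sourceforge.net/models-1.5/nl-sent.bin
# Upload it next to languages.xml on HDFS
hadoop dfs -put nl-sent.bin wikipedia-miner/
```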
  • run the DumpExtractor
    • $ hadoop dfs -mkdir wikipedia/nl/20130318-out
    • $ hadoop jar wikipedia-miner-1.2.0/build/jar/wikipedia-miner-hadoop.jar org.wikipedia.miner.extraction.DumpExtractor wikipedia/nl/20130318/nlwiki-20130318-pages-articles.xml wikipedia-miner/languages.xml nl wikipedia-miner/nl-sent.bin wikipedia/nl/20130318-out
    • German:
      • $ hadoop dfs -mkdir wikipedia/de/20130309-out
      • $ hadoop jar /zookst13_local_store/emeij/wikipedia-miner/build/jar/wikipedia-miner-hadoop.jar org.wikipedia.miner.extraction.DumpExtractor wikipedia/de/20130309/dewiki-20130309-pages-articles.xml wikipedia-miner/languages.xml de wikipedia-miner/de-sent.bin wikipedia/de/20130309-out
    • English:
      • $ hadoop dfs -mkdir wikipedia/en/20130304-out
      • $ hadoop jar /zookst13_local_store/emeij/wikipedia-miner/build/jar/wikipedia-miner-hadoop.jar org.wikipedia.miner.extraction.DumpExtractor wikipedia/en/20130304/enwiki-20130304-pages-articles.xml wikipedia-miner/languages.xml en wikipedia-miner/en-sent.bin wikipedia/en/20130304-out
  • get the output
    • $ hadoop dfs -get wikipedia/nl/20130318-out/final/* ./
  • create a copy of wikipedia.xml and update it (see http://wikipedia-miner.cms.waikato.ac.nz/wiki/Wiki.jsp?page=Installing%20the%20Java%20API)
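A sketch of this step, assuming the wikipedia-miner checkout ships a configs/wikipedia-template.xml (the file name and config locations are assumptions; see the linked page for the authoritative details):

```shell
# Copy the template and edit the copy: set the language code (e.g. "nl")
# and point the data directory at the CSV summaries fetched from HDFS above
cp configs/wikipedia-template.xml configs/nl.xml
```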
  • edit build.properties (see http://wikipedia-miner.cms.waikato.ac.nz/wiki/Wiki.jsp?page=Installing%20the%20Java%20API) and create the Berkeley DB
    • $ ant build-database
  • update hub.xml and restart web service
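A sketch of the restart, assuming the web service runs in Tomcat (paths are assumptions; adjust to your installation):

```shell
# Restart the servlet container so the updated hub.xml is picked up
$CATALINA_HOME/bin/shutdown.sh
$CATALINA_HOME/bin/startup.sh
```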