Notes on the wikipedia miner dump processing on Hadoop
Daan Odijk edited this page Mar 12, 2014 · 5 revisions
These steps follow http://wikipedia-miner.cms.waikato.ac.nz/wiki/Wiki.jsp?page=Extracting%20CSV%20Summaries
- download XXwiki-[DUMPDATE]-pages-articles.xml.bz2 from http://dumps.wikimedia.org/ (e.g., http://dumps.wikimedia.org/nlwiki/20130318/nlwiki-20130318-pages-articles.xml.bz2) and decompress it.
$ wget http://dumps.wikimedia.org/nlwiki/20130318/nlwiki-20130318-pages-articles.xml.bz2
$ bunzip2 nlwiki-20130318-pages-articles.xml.bz2
- create a directory on HDFS and upload the decompressed dump file
$ hadoop dfs -mkdir /user/emeij/wikipedia/nl/20130318
$ hadoop dfs -put nlwiki-20130318-pages-articles.xml /user/emeij/wikipedia/nl/20130318/
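The download and upload steps above differ between languages only in the language code and the dump date. A minimal sketch that derives the commands from those two values follows; the function names are hypothetical, and the HDFS layout under /user/emeij is simply copied from the commands above. It only prints the commands and does not run them.

```shell
#!/bin/sh
# Hypothetical helpers; these names are illustrative and not part of Wikipedia Miner.

# URL of the pages-articles dump for a language code ($1) and dump date ($2).
dump_url() {
  printf 'http://dumps.wikimedia.org/%swiki/%s/%swiki-%s-pages-articles.xml.bz2\n' "$1" "$2" "$1" "$2"
}

# HDFS directory for the decompressed dump (assumption: same layout as above).
hdfs_dir() {
  printf '/user/emeij/wikipedia/%s/%s\n' "$1" "$2"
}

# Print the download, decompress and upload commands without executing them.
fetch_plan() {
  lang=$1; date=$2
  xml="${lang}wiki-${date}-pages-articles.xml"
  echo "wget $(dump_url "$lang" "$date")"
  echo "bunzip2 ${xml}.bz2"
  echo "hadoop dfs -mkdir $(hdfs_dir "$lang" "$date")"
  echo "hadoop dfs -put ${xml} $(hdfs_dir "$lang" "$date")/"
}

fetch_plan nl 20130318
```

Piping the output to `sh` would execute the plan; printing it first makes it easy to check the paths before anything touches HDFS.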
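- modify languages.xml for your language (see http://wikipedia-miner.cms.waikato.ac.nz/wiki/Wiki.jsp?page=Language%20dependent%20configuration) and upload it to HDFS.
- find a sentence detection model for your language (the commands below use nl-sent.bin, de-sent.bin and en-sent.bin) and upload it to HDFS
- run the extraction (Dutch):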
$ hadoop dfs -mkdir wikipedia/nl/20130318-out
$ hadoop jar wikipedia-miner-1.2.0/build/jar/wikipedia-miner-hadoop.jar org.wikipedia.miner.extraction.DumpExtractor wikipedia/nl/20130318/nlwiki-20130318-pages-articles.xml wikipedia-miner/languages.xml nl wikipedia-miner/nl-sent.bin wikipedia/nl/20130318-out
- German:
$ hadoop dfs -mkdir wikipedia/de/20130309-out
$ hadoop jar /zookst13_local_store/emeij/wikipedia-miner/build/jar/wikipedia-miner-hadoop.jar org.wikipedia.miner.extraction.DumpExtractor wikipedia/de/20130309/dewiki-20130309-pages-articles.xml wikipedia-miner/languages.xml de wikipedia-miner/de-sent.bin wikipedia/de/20130309-out
- English:
$ hadoop dfs -mkdir wikipedia/en/20130304-out
$ hadoop jar /zookst13_local_store/emeij/wikipedia-miner/build/jar/wikipedia-miner-hadoop.jar org.wikipedia.miner.extraction.DumpExtractor wikipedia/en/20130304/enwiki-20130304-pages-articles.xml wikipedia-miner/languages.xml en wikipedia-miner/en-sent.bin wikipedia/en/20130304-out
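The per-language runs follow one pattern: dump path, languages.xml, language code, sentence model, output directory. A small sketch that generates the command for any language is shown here, assuming the HDFS layout of the German and English commands; `extract_cmd` and the `JAR` variable are hypothetical names, and the default jar path is an assumption to adjust for your build.

```shell
#!/bin/sh
# Hypothetical wrapper; JAR and the HDFS layout are assumptions taken from
# the commands in these notes.
JAR=${JAR:-wikipedia-miner-1.2.0/build/jar/wikipedia-miner-hadoop.jar}

# Print the DumpExtractor invocation for a language code ($1) and dump date ($2).
# Argument order: dump, languages.xml, language code, sentence model, output dir.
extract_cmd() {
  lang=$1; date=$2
  base="wikipedia/${lang}/${date}"
  printf 'hadoop jar %s org.wikipedia.miner.extraction.DumpExtractor %s/%swiki-%s-pages-articles.xml wikipedia-miner/languages.xml %s wikipedia-miner/%s-sent.bin %s-out\n' \
    "$JAR" "$base" "$lang" "$date" "$lang" "$lang" "$base"
}

extract_cmd de 20130309
```

The function only prints the command, so it can be inspected first and then run with `extract_cmd de 20130309 | sh`.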
- copy the output from HDFS to the local working directory
$ hadoop dfs -get wikipedia/nl/20130318-out/final/* ./
- create a copy of wikipedia.xml and update it for the new dump (see http://wikipedia-miner.cms.waikato.ac.nz/wiki/Wiki.jsp?page=Installing%20the%20Java%20API)
- edit build.properties (same page) and create the Berkeley DB database
$ ant build-database
- update hub.xml and restart the web service