WikiSpa

This is a simple wrapper around the dbpedia-extraction framework, mainly to make sure each execution is independent. The project focuses on executing Wikipedia queries locally or in a Spark cluster.

The example below prints every Wikipedia page ID together with its categories, with a TAB separating the page ID from the category list.

object CategoryPerPage extends ElectricJob[WikiFileAndSerialization] with WikiAccess with FileAccess {

  override def execute(argument: WikiFileAndSerialization)(implicit ec: ElectricContext) = {
    // Extract (pageId, categories) for each page; pages where extraction
    // fails get a placeholder (0, empty list) that the filter drops.
    val categoriesPerPage = wikiPages(argument.wikiFile, argument.serializationType)
      .map(f => Categories.extractByPage(f).getOrElse((0L, List.empty[String])))
      .filter(f => f._1 != 0 && f._2.nonEmpty)
      // One line per page: pageId, TAB, categories joined with \u0001.
      .map(f => f._1 + "\t" + f._2.mkString("\u0001"))

    writeFile(categoriesPerPage, argument.output)
  }
}
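To make the map / filter / map chain in the job easier to follow, here is a minimal sketch of the same steps on a plain Scala `Seq`, with no Spark or WikiSpa dependency. The names `CategoryLineSketch`, `Extracted`, `Separator` and `toLines` are hypothetical and exist only for illustration; `Extracted` merely stands in for whatever `Categories.extractByPage` returns, which is assumed here to be an `Option[(Long, List[String])]`.

```scala
object CategoryLineSketch {
  // Assumed shape of one extraction result: Some((pageId, categories))
  // when a page could be processed, None otherwise.
  type Extracted = Option[(Long, List[String])]

  // Same separator the job uses between categories.
  val Separator = "\u0001"

  // Mirrors the job's chain: replace failures with a placeholder,
  // drop placeholders and pages without categories, then format lines.
  def toLines(pages: Seq[Extracted]): Seq[String] =
    pages
      .map(_.getOrElse((0L, List.empty[String])))
      .filter { case (id, cats) => id != 0 && cats.nonEmpty }
      .map { case (id, cats) => id.toString + "\t" + cats.mkString(Separator) }
}
```

Calling `CategoryLineSketch.toLines(Seq(Some((334L, List("Time scales"))), None, Some((12L, Nil))))` would keep only the first entry, because the failed extraction and the page without categories are both filtered out.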

An example output is given below.

290     ISO basic Latin letters,Vowel letters
334     Time scales
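For downstream use, an output line can be split back into a page ID and a category list. The sketch below is one possible way to do that, not part of WikiSpa: `CategoryLineParser` and `parse` are hypothetical names, and the category separator is a parameter because the job joins categories with `\u0001` while the sample above is displayed with commas.

```scala
import java.util.regex.Pattern
import scala.util.Try

object CategoryLineParser {
  // Splits one line of the form "<pageId>\t<categories>" into its parts.
  // `separator` defaults to the \u0001 character used by the job (assumption:
  // the file was written by that job and not post-processed).
  def parse(line: String, separator: String = "\u0001"): Option[(Long, List[String])] =
    line.split("\t", 2) match {
      case Array(id, cats) =>
        // Returns None when the first field is not a valid number.
        Try(id.trim.toLong).toOption
          .map(pageId => (pageId, cats.split(Pattern.quote(separator)).toList))
      case _ => None // no TAB found in the line
    }
}
```

For a line shown with commas, the separator can be passed explicitly, for example `CategoryLineParser.parse(line, ",")`.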

On a laptop (16 GB, OS X), the code processes the latest Wikipedia dump (enwiki-20151002-pages-articles-multistream.xml) in less than three hours. For the rich and the impatient, the code can also be deployed and executed in a Hadoop cluster.

Repository available at OSS releases
