Prior Probabilities and Index Generation

Dimitar Jovanov edited this page Apr 4, 2018 · 9 revisions

Prior probabilities are essential for any entity linking (EL) system, as is the index of surface forms against which the mentions are matched. Both are generated by a Maven project, which can be built and modified like any other Java project.

Contents

The Java project is located under the prior_probabilities_and_index_generation folder. This folder contains 4 Java classes: EntityInOutRelationCounter.java, PriorProbabilityCalculation.java, LuceneIndexCreation.java and Helpers.java.

  1. Helpers.java - contains helper methods used by the other classes.
  2. EntityInOutRelationCounter.java - counts the incoming and outgoing edges by iterating through all the Wikidata items and going through their relations and properties. (Output: vertex-degree-hash-map.ser)
  3. PriorProbabilityCalculation.java - uses the number of Wikidata sitelinks as the second part of the prior probability. (Output: prior-probability-hash-map.ser)
  4. LuceneIndexCreation.java - generates the index of surface forms. This step relies on the output of the previous two classes, because the prior probabilities are stored in the index alongside the surface forms.
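
The edge counting described for EntityInOutRelationCounter.java can be illustrated with a minimal sketch. It is not the project's implementation: the class name `DegreeCountSketch`, the method `countDegrees`, and the input format (a map from an item ID to the IDs it points to) are assumptions made for illustration, and the sketch assumes that incoming and outgoing edges are summed into one degree per item.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the project's EntityInOutRelationCounter.
class DegreeCountSketch {

    /** Returns, per item ID, the number of incoming plus outgoing edges. */
    static Map<String, Integer> countDegrees(Map<String, List<String>> outgoing) {
        Map<String, Integer> degree = new HashMap<>();
        for (Map.Entry<String, List<String>> entry : outgoing.entrySet()) {
            for (String target : entry.getValue()) {
                degree.merge(entry.getKey(), 1, Integer::sum); // outgoing edge of the source item
                degree.merge(target, 1, Integer::sum);         // incoming edge of the target item
            }
        }
        return degree;
    }

    public static void main(String[] args) {
        Map<String, List<String>> graph = new HashMap<>();
        graph.put("Q1", java.util.Arrays.asList("Q2", "Q3")); // illustrative IDs only
        graph.put("Q2", java.util.Arrays.asList("Q3"));
        System.out.println(countDegrees(graph));
    }
}
```

In the real project the edges come from the relations and properties of the items in the Wikidata dump rather than from a hand-built map.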

Connect the dump file, same as in the Analysis Wiki

The dump file 20170717.json.gz, which is explained in greater detail in the Analysis wiki, should be extracted in the Java project under the path prior_probabilities_and_index_generation\dumpfiles\wikidatawiki.

Execution of Prior Probability Generation

EntityInOutRelationCounter.java and PriorProbabilityCalculation.java can be run on a local machine because their RAM requirements are small. They write the result files mentioned above to prior_probabilities_and_index_generation\results.
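
To inspect a result file, it may help to see how a serialized map is written and read back. The sketch below assumes the .ser files are `java.util.HashMap` objects written with standard Java serialization, and that keys are item IDs and values are doubles. Both are assumptions suggested by the file names, not confirmed by the project. The class name `SerializedMapSketch` is hypothetical.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.HashMap;

// Hypothetical sketch: round trip of a HashMap through Java serialization.
class SerializedMapSketch {

    /** Writes the map to the given file using standard Java serialization. */
    static void save(HashMap<String, Double> map, File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(map);
        }
    }

    /** Reads a map previously written by save(). */
    @SuppressWarnings("unchecked")
    static HashMap<String, Double> load(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return (HashMap<String, Double>) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        File file = File.createTempFile("prior-probability-hash-map", ".ser");
        file.deleteOnExit();
        HashMap<String, Double> priors = new HashMap<>();
        priors.put("Q42", 0.75); // illustrative value, not real data
        save(priors, file);
        System.out.println(load(file));
    }
}
```

If the actual key or value types differ, only the generic type parameters need to change.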

Multi Language Support

Before generating the index of surface forms, decide which language you want and modify the LuceneIndexCreation.java file accordingly. This file contains the LANGUAGE variable, which holds the Wikipedia code of the language (e.g. en for English, de for German, fr for French).
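
As a rough illustration of how such a language code might be used, the sketch below picks the label for one language from a map of labels keyed by language code. The class `LanguageSelectionSketch`, the method `labelFor`, and the declaration of `LANGUAGE` as a static String constant are assumptions. The actual declaration in LuceneIndexCreation.java may look different.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not taken from LuceneIndexCreation.java.
class LanguageSelectionSketch {

    // Assumed form of the LANGUAGE variable; change "en" to e.g. "de" or "fr".
    static final String LANGUAGE = "en";

    /** Returns the label for the given language code, or null if there is none. */
    static String labelFor(Map<String, String> labelsByLanguage, String language) {
        return labelsByLanguage.get(language);
    }

    public static void main(String[] args) {
        Map<String, String> labels = new HashMap<>();
        labels.put("en", "Germany");     // illustrative labels only
        labels.put("de", "Deutschland");
        labels.put("fr", "Allemagne");
        System.out.println(labelFor(labels, LANGUAGE));
    }
}
```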

Generation of Index of Surface Forms

To run LuceneIndexCreation.java, follow these steps:

  1. (Online) Log into your server and create a directory (e.g. lucene). Inside it, create a sub-folder (e.g. results). Transfer prior-probability-hash-map.ser and vertex-degree-hash-map.ser to this sub-folder.
  2. (Locally) Navigate to \prior_probabilities_and_index_generation
  3. (Locally) Run mvn clean install. (On a console with Maven preinstalled)
  4. (Online) Transfer the locally generated file \prior_probabilities_and_index_generation\target\prior_probabilities_and_index_generation-0.0.1-SNAPSHOT.jar to the lucene directory on your server. (e.g. using the pscp command)
  5. (Online) Run the jar file.
    1. (Online) If you don't know how to run the file, you can use our bash script that does it for you. Transfer the start_index_creation.sh script to the server. This script can be found in the prior_probabilities_and_index_generation\index_generation_script folder and should be copied to the lucene folder.
    2. (Online) Run sh start_index_creation.sh to begin the process of index generation. (This script can be modified to the needs of the user. It is a simple bash script that executes commands for directory cleaning and creation.)
  6. (Online) Done! The index will be generated under /lucene/results/wikidatawiki-20170717/lucene.