Prior Probabilities and Index Generation
Prior probabilities are essential for any entity linking (EL) system, as is the index of surface forms against which the mentions are matched. Both are generated by a Java project that is built with Maven and can be modified like any other Java project.
The Java project is located under the prior_probabilities_and_index_generation folder. This folder contains 4 Java classes: EntityInOutRelationCounter.java, PriorProbabilityCalculation.java, LuceneIndexCreation.java and Helpers.java.
- Helpers.java - contains helpful methods used by the other classes.
- EntityInOutRelationCounter.java - counts the incoming and outgoing edges by iterating through all the Wikidata items and going through their relations and properties. (Output: vertex-degree-hash-map.ser)
- PriorProbabilityCalculation.java - uses the number of Wikidata sitelinks as the second part of the prior probability. (Output: prior-probability-hash-map.ser)
- LuceneIndexCreation.java - generates the index of surface forms. This step relies on the previous two steps, because the prior probabilities are stored in the index alongside the surface forms.
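To make the counting step more concrete, here is a minimal sketch of how incoming and outgoing edges could be tallied per entity and written to a .ser file with standard Java serialization. It is not the project's EntityInOutRelationCounter.java: the class name VertexDegreeSketch, the in-memory edge list, the int[2] value layout and the temporary file name are all assumptions made for illustration, and the real class reads the Wikidata dump instead.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.List;

// Hypothetical sketch, not the project's EntityInOutRelationCounter.
class VertexDegreeSketch {

    // Counts edges per entity id: index 0 = incoming, index 1 = outgoing.
    // Each edge is a two-element array {sourceId, targetId}.
    static HashMap<String, int[]> countDegrees(List<String[]> edges) {
        HashMap<String, int[]> degrees = new HashMap<>();
        for (String[] edge : edges) {
            String source = edge[0];
            String target = edge[1];
            degrees.computeIfAbsent(source, k -> new int[2])[1]++; // outgoing
            degrees.computeIfAbsent(target, k -> new int[2])[0]++; // incoming
        }
        return degrees;
    }

    // Serializes the map to a file, in the spirit of the .ser outputs.
    static void save(HashMap<String, int[]> map, Path file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(map);
        }
    }

    // Reads a map back from a file written by save().
    @SuppressWarnings("unchecked")
    static HashMap<String, int[]> load(Path file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return (HashMap<String, int[]>) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample edges; the ids are placeholders, not real relations.
        List<String[]> edges = List.of(
            new String[] {"Q1", "Q2"},
            new String[] {"Q1", "Q3"},
            new String[] {"Q2", "Q3"});

        HashMap<String, int[]> degrees = countDegrees(edges);

        // Round-trip through a temporary file to show the save/load cycle.
        Path file = Files.createTempFile("vertex-degree-sketch", ".ser");
        save(degrees, file);
        HashMap<String, int[]> restored = load(file);
        int[] q2 = restored.get("Q2");
        System.out.println("Q2 in=" + q2[0] + " out=" + q2[1]);
        Files.delete(file);
    }
}
```

A later step, such as the index generation, could call a method like load() to read the map back before using the counts.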
The dump file 20170717.json.gz, which is explained in greater detail under the Analysis wiki, should be extracted in the Java Project under the path prior_probabilities_and_index_generation\dumpfiles\wikidatawiki.
EntityInOutRelationCounter.java and PriorProbabilityCalculation.java can be run on a local machine because their RAM requirements are small. They write the result files mentioned above to prior_probabilities_and_index_generation\results.
Before generating the index of surface forms, decide on the language you want and set it in the LuceneIndexCreation.java file. This file contains the LANGUAGE variable, which holds the Wikipedia language code (e.g. en for English, de for German, fr for French).
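As a rough illustration of what a language setting like this typically controls, the sketch below picks the label and aliases of one item for a single language code. Wikidata keys labels and aliases by language code, but the class LanguageFilterSketch, the method surfaceForms and the map layout are hypothetical and are not taken from LuceneIndexCreation.java.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the project's LuceneIndexCreation.
class LanguageFilterSketch {

    // Assumed to mirror the LANGUAGE variable: a Wikipedia language code.
    static final String LANGUAGE = "en";

    // Returns the label followed by the aliases for the given language.
    // Entries in other languages are ignored.
    static List<String> surfaceForms(Map<String, String> labels,
                                     Map<String, List<String>> aliases,
                                     String language) {
        List<String> forms = new ArrayList<>();
        String label = labels.get(language);
        if (label != null) {
            forms.add(label);
        }
        forms.addAll(aliases.getOrDefault(language, List.of()));
        return forms;
    }

    public static void main(String[] args) {
        // Sample values for one item, for illustration only.
        Map<String, String> labels = Map.of("en", "Germany", "de", "Deutschland");
        Map<String, List<String>> aliases = Map.of(
            "en", List.of("Federal Republic of Germany"),
            "de", List.of("Bundesrepublik Deutschland"));

        System.out.println(surfaceForms(labels, aliases, LANGUAGE));
    }
}
```

Under this assumption, changing LANGUAGE to de would select the German label and aliases instead, so the index would need to be regenerated for each language.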
To run LuceneIndexCreation.java, follow these steps:
- (Online) Log into your server and create a directory (e.g. lucene) with a sub-folder (e.g. results). Transfer the prior-probability-hash-map.ser and the vertex-degree-hash-map.ser to this sub-folder.
- (Locally) Navigate to \prior_probabilities_and_index_generation
- (Locally) Run mvn clean install. (On a console with Maven preinstalled)
- (Online) Transfer the locally generated file \prior_probabilities_and_index_generation\target\prior_probabilities_and_index_generation-0.0.1-SNAPSHOT.jar to the directory you created on your server in the first step (e.g. lucene), for example using the pscp command.
- (Online) Run the jar file.
- (Online) If you don't know how to run the file, you can use our bash script that does it for you. The start_index_creation.sh script can be found in the prior_probabilities_and_index_generation\index_generation_script folder. Transfer it to the lucene folder on the server.
- (Online) Run sh start_index_creation.sh to begin the index generation. (This script can be modified to the needs of the user. It is a simple bash script that executes commands for directory cleaning and creation.)
- (Online) Done! The index will be generated under /lucene/results/wikidatawiki-20170717/lucene.
WELT - Dimitar Jovanov 2018