Asynchronous Training of Word Embeddings for large text corpora

How to use

The code directory contains the script execute_epochs.sh, which handles or sets the following:

calling hadoop streaming
number of VCOREs per reducer
number of reducers
name of the job
number of epochs
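As a rough illustration of how these settings fit together, the sketch below builds one Hadoop Streaming command line per epoch. It is not the content of execute_epochs.sh. The function names, the streaming jar location, the input and output paths and the job name are hypothetical placeholders. The `-D` properties and streaming options used are standard Hadoop ones.

```python
import shlex


def build_streaming_command(epoch, reducers, vcores, job_name,
                            streaming_jar, input_path, output_base):
    """Return the argument list for one epoch (hypothetical layout)."""
    return [
        "hadoop", "jar", streaming_jar,
        # Generic -D options have to come before the streaming options.
        "-D", f"mapreduce.job.name={job_name}_epoch{epoch}",
        "-D", f"mapreduce.job.reduces={reducers}",
        "-D", f"mapreduce.reduce.cpu.vcores={vcores}",
        "-files", "mapper.py,reducer.py",
        "-mapper", "mapper.py",
        "-reducer", "reducer.py",
        "-partitioner", "KeyPartitioner",
        "-input", input_path,
        # One output directory per epoch (an assumption of this sketch).
        "-output", f"{output_base}/epoch_{epoch}",
    ]


def build_all_epochs(epochs, **settings):
    """Build the commands for epochs 0 .. epochs-1 without running them."""
    return [build_streaming_command(e, **settings) for e in range(epochs)]


if __name__ == "__main__":
    commands = build_all_epochs(
        2,
        reducers=20,
        vcores=2,
        job_name="embeddings",                 # placeholder
        streaming_jar="hadoop-streaming.jar",  # placeholder, site specific
        input_path="/path/to/corpus",          # placeholder
        output_base="/path/to/output",         # placeholder
    )
    for cmd in commands:
        print(shlex.join(cmd))
```

The sketch only prints the commands. Running them would additionally require that the compiled KeyPartitioner class is visible to Hadoop.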

Additionally, the files mapper.py and reducer.py need to be modified.

In mapper.py, one has to set the number of reducers/sampling rate and, if required, the subsampling of the actual data.
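To make these settings concrete, here is a minimal sketch of a streaming-style mapper. It is an assumption about the general shape of such a mapper, not the repository's mapper.py. The names `NUM_REDUCERS`, `SAMPLING_RATE` and `map_lines` are hypothetical, and treating the key as a randomly chosen reducer index is a guess made for illustration.

```python
import random

NUM_REDUCERS = 20     # hypothetical; should match the job's reducer count
SAMPLING_RATE = 1.0   # hypothetical; fraction of input lines that is kept


def map_lines(lines, num_reducers=NUM_REDUCERS,
              sampling_rate=SAMPLING_RATE, rng=random):
    """Yield 'key<TAB>line' records for the lines that survive subsampling."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        if rng.random() >= sampling_rate:
            continue  # subsampling: drop this line
        # Assumed scheme: the key is the index of a randomly chosen reducer.
        key = rng.randrange(num_reducers)
        yield f"{key}\t{line}"
```

In a streaming job, one would iterate over `map_lines(sys.stdin)` and print each record. Passing a seeded `random.Random` instance as `rng` makes a run reproducible.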

In reducer.py, the active Hadoop namenode has to be set, as well as the path for the models. That path needs to exist (and be world-writable). The correct path to the vocabulary (in the form of a pickled Python dict) also needs to be given and, depending on the name, also added in execute_epochs.sh.
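The vocabulary file can be produced and checked with the standard pickle module. The sketch below assumes the dict maps each word to an integer index. The README does not state the actual layout, so that mapping and the helper names `build_vocab`, `save_vocab` and `load_vocab` are hypothetical.

```python
import os
import pickle
import tempfile


def build_vocab(tokens):
    """Map each distinct token to an integer index (assumed layout)."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def save_vocab(vocab, path):
    """Write the vocabulary dict as a pickle file."""
    with open(path, "wb") as fh:
        pickle.dump(vocab, fh)


def load_vocab(path):
    """Load the vocabulary and check that it really is a dict."""
    with open(path, "rb") as fh:
        vocab = pickle.load(fh)
    if not isinstance(vocab, dict):
        raise TypeError("expected a pickled dict")
    return vocab


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "vocab.pkl")  # placeholder file name
        save_vocab(build_vocab("the cat saw the dog".split()), path)
        print(load_vocab(path))
```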

Before running execute_epochs.sh, the KeyPartitioner has to be compiled to bytecode:

export CLASSPATH=$(/usr/bin/hadoop classpath)
javac KeyPartitioner.java

The extra partitioner is needed because the default partitioner sends each record to the reducer given by the hash of its key modulo the number of reducers. Depending on the configured number of reducers, this produces collisions and often leaves some reducers with no data at all.
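The effect can be reproduced with a small simulation. The sketch below assumes the keys are the strings "0" to "N-1" and uses a 31-based byte hash in the style of Java string hashing as a stand-in for the default partitioner. The `direct_partition` function is only an assumption about what a custom partitioner might do (use the numeric key itself as the reducer index); it is not taken from KeyPartitioner.java.

```python
from collections import Counter


def java_style_hash(key: bytes) -> int:
    """31-based rolling hash over the key bytes, kept to 32 bits."""
    h = 1
    for b in key:
        signed = b - 256 if b > 127 else b  # Java bytes are signed
        h = (31 * h + signed) & 0xFFFFFFFF
    return h


def hash_partition(key: str, num_reducers: int) -> int:
    """Hash of the key modulo the number of reducers."""
    return (java_style_hash(key.encode("utf-8")) & 0x7FFFFFFF) % num_reducers


def direct_partition(key: str, num_reducers: int) -> int:
    """Assumed alternative: interpret the key as the reducer index."""
    return int(key) % num_reducers


def load_per_reducer(partition, num_reducers):
    """Count how many of the keys '0'..'N-1' land on each reducer."""
    keys = [str(i) for i in range(num_reducers)]
    counts = Counter(partition(k, num_reducers) for k in keys)
    return [counts.get(r, 0) for r in range(num_reducers)]


if __name__ == "__main__":
    n = 20
    print("hashed:", load_per_reducer(hash_partition, n))
    print("direct:", load_per_reducer(direct_partition, n))
```

With 20 reducers, the hashed assignment in this simulation leaves at least one reducer without a key and gives another reducer two, while the direct assignment gives every reducer exactly one key.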
