w2v_models

A word2vec CBoW language model (Mikolov et al. 2013a, 2013b) trained on the first 999 files (21 GB of raw text) of Webcorpus 2.0 (Nemeskey 2020), a Hungarian corpus containing the normalized version of the original texts (ca. 170M sentences). The model was trained with the Gensim Python package (Řehůřek and Sojka, 2011).

Model parameters:

  • dimension: 300
  • window size: 6
  • minimum frequency: 3
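
For reference, a training run with these parameters might look like the following Gensim (4.x) sketch. This is not the actual training script: the corpus path, worker count, and output filename are placeholders.

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import PathLineSentences

# Stream sentences from the corpus directory (one whitespace-tokenized
# sentence per line); the path below is a placeholder.
sentences = PathLineSentences("webcorpus2/normalized/")

model = Word2Vec(
    sentences=sentences,
    vector_size=300,  # dimension
    window=6,         # window size
    min_count=3,      # minimum frequency
    sg=0,             # 0 selects the CBoW architecture
    workers=8,        # assumed; not specified in the original
)
model.save("w2v_webcorpus2.model")  # placeholder filename
```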

Since Hungarian is a highly inflectional language and we trained the embeddings on raw text, this is not a pure bag-of-words model, as the abbreviation CBoW would suggest. Our choice of input data was based on the assumption that morphosyntactic information may contribute to the characterization of meanings. Roughly 8.5M word forms were assigned embeddings as a result of training.

The trained language model can be downloaded here (18 GB).
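
Once downloaded, the model can be loaded and queried with Gensim. A minimal sketch, assuming the archive unpacks to a standard Gensim `Word2Vec` save file (the filename and query word are illustrative, not taken from the repository):

```python
from gensim.models import Word2Vec

# Placeholder filename; adjust to the unpacked model file.
model = Word2Vec.load("w2v_webcorpus2.model")

# Nearest neighbours of an inflected Hungarian word form ("házban",
# 'in the house'). Because the model was trained on raw, unlemmatized
# text, other inflected forms of the same lemma typically rank high.
print(model.wv.most_similar("házban", topn=5))
```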

References

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013b. Distributed Representations of Words and Phrases and their Compositionality. Advances in Neural Information Processing Systems 26.

Dávid Márk Nemeskey. 2020. Natural Language Processing Methods for Language Modeling. Ph.D. thesis, Eötvös Loránd University.

Radim Řehůřek and Petr Sojka. 2011. Gensim: Python Framework for Vector Space Modelling. NLP Centre, Faculty of Informatics, Masaryk University, Brno, Czech Republic, 3(2).
