A word2vec CBoW language model (Mikolov et al. 2013a, 2013b) trained with the Gensim Python package (Řehůřek and Sojka, 2011) on the first 999 files (21 GB of raw text, approx. 170M sentences) of Webcorpus 2.0 (Nemeskey 2020), a Hungarian corpus containing the normalized version of the original texts.
Model parameters:
- dimension: 300
- window size: 6
- minimum frequency: 3
Since Hungarian is a highly inflectional language and we trained the embeddings on raw text, this is not a pure bag-of-words model in the sense the abbreviation CBoW would suggest: each inflected word form receives its own vector. Our choice of input data rests on the presupposition that morphosyntactic information may contribute to the characterization of meanings. Roughly 8.5M word forms were assigned embeddings as a result of training.
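For concreteness, below is a minimal Gensim training sketch consistent with the parameters above. It assumes the Gensim 4.x API; the file paths, the sentence iterator, and the output name are illustrative placeholders, not the original training script.

```python
from gensim.models import Word2Vec

# Hypothetical corpus iterator, assuming one normalized sentence per
# line; whitespace tokenization is enough for a sketch on raw text.
class SentenceStream:
    def __init__(self, paths):
        self.paths = paths

    def __iter__(self):
        for path in self.paths:
            with open(path, encoding="utf-8") as fh:
                for line in fh:
                    yield line.split()

corpus = SentenceStream(["webcorpus2/file_0000.txt"])  # placeholder paths

model = Word2Vec(
    sentences=corpus,
    vector_size=300,  # embedding dimension
    window=6,         # context window size
    min_count=3,      # minimum word-form frequency
    sg=0,             # sg=0 selects the CBoW architecture
    workers=4,
)
model.save("hu_webcorpus2_cbow.model")  # hypothetical output name
```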
The language model can be downloaded from here (18 GB).
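Once downloaded, the model can be loaded and queried with Gensim. A usage sketch, assuming the file was saved with Gensim's native save() (the file name and query word are placeholders):

```python
from gensim.models import Word2Vec

model = Word2Vec.load("hu_webcorpus2_cbow.model")  # placeholder file name

# Because training ran on raw inflected text, queries work on surface
# word forms, e.g. the Hungarian noun "kutya" ("dog").
print(model.wv.most_similar("kutya", topn=5))
```

If the download is instead in the plain word2vec text or binary format, gensim.models.KeyedVectors.load_word2vec_format is the appropriate loader.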
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013b. Distributed Representations of Words and Phrases and their Compositionality. Advances in Neural Information Processing Systems 26.
Dávid Márk Nemeskey. 2020. Natural Language Processing Methods for Language Modeling. Ph.D. thesis, Eötvös Loránd University.
Radim Řehůřek and Petr Sojka. 2011. Gensim: Python framework for vector space modelling. NLP Centre, Faculty of Informatics, Masaryk University, Brno, Czech Republic, 3(2).