w2v_models

A word2vec CBoW language model (Mikolov et al. 2013a, 2013b) trained on the first 999 files (21 GB of raw text) of Webcorpus 2.0 (Nemeskey 2020), a Hungarian corpus containing the normalized version of the original texts (ca. 170M sentences). The model was trained with the Gensim Python package (Řehůřek and Sojka, 2011).

Model parameters:

  • dimension: 300
  • window size: 6
  • minimum frequency: 3
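
For reference, a training run with these parameters might look like the following Gensim (4.x) sketch. This is not the actual training script: the corpus path, worker count, and output filename are placeholders.

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import PathLineSentences

# Stream sentences from the corpus directory (one whitespace-tokenized
# sentence per line); the path below is a placeholder.
sentences = PathLineSentences("webcorpus2/normalized/")

model = Word2Vec(
    sentences=sentences,
    vector_size=300,  # dimension
    window=6,         # window size
    min_count=3,      # minimum frequency
    sg=0,             # 0 selects the CBoW architecture
    workers=8,        # assumed; not specified in the original
)
model.save("w2v_webcorpus2.model")  # placeholder filename
```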

Since Hungarian is a highly inflectional language and we trained the embeddings on raw text, this is not a pure bag-of-words model, as the abbreviation CBoW would suggest. Our choice of input data was based on the assumption that morphosyntactic information may contribute to the characterization of meanings. Roughly 8.5M word forms were assigned embeddings as a result of training.

The trained language model can be downloaded here (18 GB).
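
Once downloaded, the model can be loaded and queried with Gensim. A minimal sketch, assuming the archive unpacks to a standard Gensim `Word2Vec` save file (the filename and query word are illustrative, not taken from the repository):

```python
from gensim.models import Word2Vec

# Placeholder filename; adjust to the unpacked model file.
model = Word2Vec.load("w2v_webcorpus2.model")

# Nearest neighbours of an inflected Hungarian word form ("házban",
# 'in the house'). Because the model was trained on raw, unlemmatized
# text, other inflected forms of the same lemma typically rank high.
print(model.wv.most_similar("házban", topn=5))
```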

References

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013b. Distributed Representations of Words and Phrases and their Compositionality. Advances in Neural Information Processing Systems 26.

Dávid Márk Nemeskey. 2020. Natural Language Processing Methods for Language Modeling. Ph.D. thesis, Eötvös Loránd University.

Radim Řehůřek and Petr Sojka. 2011. Gensim: Python Framework for Vector Space Modelling. NLP Centre, Faculty of Informatics, Masaryk University, Brno, Czech Republic, 3(2).
