Merge branch 'release-3.7.0'
menshikh-iv committed Jan 18, 2019
2 parents 355ecc6 + 7d84b7e commit 42e47a3
Showing 187 changed files with 57,324 additions and 16,406 deletions.
2 changes: 1 addition & 1 deletion .circleci/config.yml
@@ -30,7 +30,7 @@ jobs:
name: Build documentation
command: |
source venv/bin/activate
-tox -e docs -vv
+tox -e compile,docs -vv
- store_artifacts:
path: docs/src/_build
2 changes: 2 additions & 0 deletions .gitignore
@@ -7,6 +7,8 @@
*.o
*.so
*.pyc
+*.pyo
+*.pyd

# Packages #
############
13 changes: 12 additions & 1 deletion .travis.yml
@@ -13,7 +13,10 @@ language: python
matrix:
include:
- python: '2.7'
-env: TOXENV="flake8"
+env: TOXENV="flake8,flake8-docs"
+
+- python: '3.6'
+env: TOXENV="flake8,flake8-docs"

- python: '2.7'
env: TOXENV="py27-linux"
@@ -24,5 +27,13 @@ matrix:
- python: '3.6'
env: TOXENV="py36-linux"

+- python: '3.7'
+env:
+- TOXENV="py37-linux"
+- BOTO_CONFIG="/dev/null"
+dist: xenial
+sudo: true


install: pip install tox
script: tox -vv
222 changes: 217 additions & 5 deletions CHANGELOG.md

Large diffs are not rendered by default.

19 changes: 19 additions & 0 deletions MANIFEST.in
@@ -4,17 +4,36 @@ include CHANGELOG.md
include COPYING
include COPYING.LESSER
include ez_setup.py

include gensim/models/voidptr.h
include gensim/models/fast_line_sentence.h

include gensim/models/word2vec_inner.c
include gensim/models/word2vec_inner.pyx
include gensim/models/word2vec_inner.pxd
include gensim/models/word2vec_corpusfile.cpp
include gensim/models/word2vec_corpusfile.pyx
include gensim/models/word2vec_corpusfile.pxd

include gensim/models/doc2vec_inner.c
include gensim/models/doc2vec_inner.pyx
include gensim/models/doc2vec_inner.pxd
include gensim/models/doc2vec_corpusfile.cpp
include gensim/models/doc2vec_corpusfile.pyx

include gensim/models/fasttext_inner.c
include gensim/models/fasttext_inner.pyx
include gensim/models/fasttext_inner.pxd
include gensim/models/fasttext_corpusfile.cpp
include gensim/models/fasttext_corpusfile.pyx

include gensim/models/_utils_any2vec.c
include gensim/models/_utils_any2vec.pyx
include gensim/corpora/_mmreader.c
include gensim/corpora/_mmreader.pyx
include gensim/_matutils.c
include gensim/_matutils.pyx

include gensim/models/nmf_pgd.c
include gensim/models/nmf_pgd.pyx

40 changes: 17 additions & 23 deletions README.md
@@ -119,29 +119,23 @@ Documentation
Adopters
--------



| Name | Logo | URL | Description |
|----------------------------------------|--------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| RaRe Technologies | ![rare](docs/src/readme_images/rare.png) | [rare-technologies.com](http://rare-technologies.com) | Machine learning & NLP consulting and training. Creators and maintainers of Gensim. |
| Mindseye | ![mindseye](docs/src/readme_images/mindseye.png) | [mindseye.com](http://www.mindseyesolutions.com/) | Similarities in legal documents |
| Talentpair | ![talent-pair](docs/src/readme_images/talent-pair.png) | [talentpair.com](http://talentpair.com) | Data science driving high-touch recruiting |
| Tailwind | ![tailwind](docs/src/readme_images/tailwind.png)| [Tailwindapp.com](https://www.tailwindapp.com/)| Post interesting and relevant content to Pinterest |
| Issuu | ![issuu](docs/src/readme_images/issuu.png) | [Issuu.com](https://issuu.com/)| Gensim’s LDA module lies at the very core of the analysis we perform on each uploaded publication to figure out what it’s all about.
| Sports Authority | ![sports-authority](docs/src/readme_images/sports-authority.png) | [sportsauthority.com](https://en.wikipedia.org/wiki/Sports_Authority)| Text mining of customer surveys and social media sources |
| Search Metrics | ![search-metrics](docs/src/readme_images/search-metrics.png) | [searchmetrics.com](http://www.searchmetrics.com/)| Gensim word2vec used for entity disambiguation in Search Engine Optimisation
| Cisco Security | ![cisco](docs/src/readme_images/cisco.png) | [cisco.com](http://www.cisco.com/c/en/us/products/security/index.html)| Large-scale fraud detection
| 12K Research | ![12k](docs/src/readme_images/12k.png)| [12k.co](https://12k.co/)| Document similarity analysis on media articles
| National Institutes of Health | ![nih](docs/src/readme_images/nih.png) | [github/NIHOPA](https://github.com/NIHOPA/pipeline_word2vec)| Processing grants and publications with word2vec
| Codeq LLC | ![codeq](docs/src/readme_images/codeq.png) | [codeq.com](https://codeq.com)| Document classification with word2vec
| Mass Cognition | ![mass-cognition](docs/src/readme_images/mass-cognition.png) | [masscognition.com](http://www.masscognition.com/) | Topic analysis service for consumer text data and general text data |
| Stillwater Supercomputing | ![stillwater](docs/src/readme_images/stillwater.png) | [stillwater-sc.com](http://www.stillwater-sc.com/) | Document comprehension and association with word2vec |
| Channel 4 | ![channel4](docs/src/readme_images/channel4.png) | [channel4.com](http://www.channel4.com/) | Recommendation engine |
| Amazon | ![amazon](docs/src/readme_images/amazon.png) | [amazon.com](http://www.amazon.com/) | Document similarity|
| SiteGround Hosting | ![siteground](docs/src/readme_images/siteground.png) | [siteground.com](https://www.siteground.com/) | An ensemble search engine which uses different embeddings models and similarities, including word2vec, WMD, and LDA. |
| Juju | ![juju](docs/src/readme_images/juju.png) | [www.juju.com](http://www.juju.com/) | Provide non-obvious related job suggestions. |
| NLPub | ![nlpub](docs/src/readme_images/nlpub.png) | [nlpub.org](https://nlpub.org/) | Distributional semantic models including word2vec. |
|Capital One | ![capitalone](docs/src/readme_images/capitalone.png) | [www.capitalone.com](https://www.capitalone.com/) | Topic modeling for customer complaints exploration. |
| Company | Logo | Industry | Use of Gensim |
|---------|------|----------|---------------|
| [RARE Technologies](http://rare-technologies.com) | ![rare](docs/src/readme_images/rare.png) | ML & NLP consulting | Creators of Gensim – this is us! |
| [Amazon](http://www.amazon.com/) | ![amazon](docs/src/readme_images/amazon.png) | Retail | Document similarity. |
| [National Institutes of Health](https://github.com/NIHOPA/pipeline_word2vec) | ![nih](docs/src/readme_images/nih.png) | Health | Processing grants and publications with word2vec. |
| [Cisco Security](http://www.cisco.com/c/en/us/products/security/index.html) | ![cisco](docs/src/readme_images/cisco.png) | Security | Large-scale fraud detection. |
| [Mindseye](http://www.mindseyesolutions.com/) | ![mindseye](docs/src/readme_images/mindseye.png) | Legal | Similarities in legal documents. |
| [Channel 4](http://www.channel4.com/) | ![channel4](docs/src/readme_images/channel4.png) | Media | Recommendation engine. |
| [Talentpair](http://talentpair.com) | ![talent-pair](docs/src/readme_images/talent-pair.png) | HR | Candidate matching in high-touch recruiting. |
| [Juju](http://www.juju.com/) | ![juju](docs/src/readme_images/juju.png) | HR | Provide non-obvious related job suggestions. |
| [Tailwind](https://www.tailwindapp.com/) | ![tailwind](docs/src/readme_images/tailwind.png) | Media | Post interesting and relevant content to Pinterest. |
| [Issuu](https://issuu.com/) | ![issuu](docs/src/readme_images/issuu.png) | Media | Gensim's LDA module lies at the very core of the analysis we perform on each uploaded publication to figure out what it's all about. |
| [Search Metrics](http://www.searchmetrics.com/) | ![search-metrics](docs/src/readme_images/search-metrics.png) | Content Marketing | Gensim word2vec used for entity disambiguation in Search Engine Optimisation. |
| [12K Research](https://12k.co/) | ![12k](docs/src/readme_images/12k.png)| Media | Document similarity analysis on media articles. |
| [Stillwater Supercomputing](http://www.stillwater-sc.com/) | ![stillwater](docs/src/readme_images/stillwater.png) | Hardware | Document comprehension and association with word2vec. |
| [SiteGround](https://www.siteground.com/) | ![siteground](docs/src/readme_images/siteground.png) | Web hosting | An ensemble search engine which uses different embeddings models and similarities, including word2vec, WMD, and LDA. |
| [Capital One](https://www.capitalone.com/) | ![capitalone](docs/src/readme_images/capitalone.png) | Finance | Topic modeling for customer complaints exploration. |

-------

5 changes: 5 additions & 0 deletions appveyor.yml
@@ -28,6 +28,11 @@ environment:
PYTHON_ARCH: "64"
TOXENV: "py36-win"

+- PYTHON: "C:\\Python37-x64"
+PYTHON_VERSION: "3.7.0"
+PYTHON_ARCH: "64"
+TOXENV: "py37-win"

init:
- "ECHO %PYTHON% %PYTHON_VERSION% %PYTHON_ARCH%"
- "ECHO \"%APPVEYOR_SCHEDULED_BUILD%\""
152 changes: 152 additions & 0 deletions docs/fasttext-notes.md
@@ -0,0 +1,152 @@
FastText Notes
==============

The implementation is split across several submodules:

- models.fasttext
- models.keyedvectors (includes FastText-specific code, which is a poor separation of concerns)
- models.word2vec (superclasses)
- models.base_any2vec (superclasses)

The implementation consists of several key classes:

1. models.fasttext.FastTextVocab: the vocabulary
2. models.keyedvectors.FastTextKeyedVectors: the vectors
3. models.fasttext.FastTextTrainables: the underlying neural network
4. models.fasttext.FastText: ties everything together
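
As a quick sketch of where these pieces surface on a trained model (a toy corpus; attribute names as in the gensim 3.7-era API):

```python
from gensim.models import FastText

sentences = [["hello", "world"], ["machine", "learning", "rocks"]]
model = FastText(sentences, size=10, min_count=1, min_n=3, max_n=6)

print(type(model))             # models.fasttext.FastText
print(type(model.wv))          # models.keyedvectors.FastTextKeyedVectors
print(type(model.vocabulary))  # models.fasttext.FastTextVocab
print(type(model.trainables))  # models.fasttext.FastTextTrainables
```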

FastTextVocab
-------------

Seems to be an entirely redundant class.
Inherits from models.word2vec.Word2VecVocab, adding no new functionality.

FastTextKeyedVectors
--------------------

Inheritance hierarchy:

1. FastTextKeyedVectors
2. WordEmbeddingsKeyedVectors. Implements word similarity, e.g. cosine similarity and WMD.
3. BaseKeyedVectors (abstract base class)
4. utils.SaveLoad
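
For instance, the similarity functionality from WordEmbeddingsKeyedVectors is available directly on model.wv (continuing the toy model above; wmdistance additionally needs the pyemd package installed):

```python
print(model.wv.similarity("machine", "learning"))  # cosine similarity
print(model.wv.most_similar("learning", topn=2))   # nearest neighbours
# Word Mover's Distance between two token lists (requires pyemd):
print(model.wv.wmdistance(["machine", "learning"], ["hello", "world"]))
```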

There are many attributes.

Inherited from BaseKeyedVectors:

- vectors: a 2D numpy array. Flexible number of rows (0 by default). Number of columns equals vector dimensionality.
- vocab: a dictionary. Keys are words. Values are Vocab instances: these are essentially namedtuples that contain an index and a count. The former is the position of the term in the entire vocab; the latter is the number of times the term occurs in the corpus.
- vector_size (dimensionality)
- index2entity

Inherited from WordEmbeddingsKeyedVectors:

- vectors_norm
- index2word
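
Concretely, for the toy model above (shapes and counts depend on the corpus):

```python
print(model.wv.vectors.shape)     # (vocab_size, vector_size), here (5, 10)
entry = model.wv.vocab["machine"]
print(entry.index, entry.count)   # position in the vocab, corpus frequency
print(model.wv.index2word[entry.index])  # back to the word: 'machine'
```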

Added by FastTextKeyedVectors:

- vectors_vocab: 2D array. Rows are vectors. Columns correspond to vector dimensions. Initialized in FastTextTrainables.init_ngrams_weights. Reset in reset_ngrams_weights. Referred to as syn0_vocab in fasttext_inner.pyx. These are vectors for every word in the vocabulary.
- vectors_vocab_norm: looks unused, see _clear_post_train method.
- vectors_ngrams: 2D array. Each row is a bucket. Columns correspond to vector dimensions. Initialized in init_ngrams_weights function. Initialized in _load_vectors method when reading from native FB binary. Modified in reset_ngrams_weights method. This is the first matrix loaded from the native binary files.
- vectors_ngrams_norm: looks unused, see _clear_post_train method.
- buckets_word: A hashmap. Keyed by the index of a term in the vocab. Each value is an array, where each element is an integer that corresponds to a bucket. Initialized in init_ngrams_weights function
- hash2index: A hashmap. Keys are hashes of ngrams (modulo the number of buckets). Values appear to be compact indices into vectors_ngrams, so only buckets actually seen get a row. Initialized in init_ngrams_weights function.
- min_n: minimum ngram length
- max_n: maximum ngram length
- num_ngram_vectors: initialized in the init_ngrams_weights function

The init_ngrams_weights method looks like an internal method of FastTextTrainables.
It gets called as part of the prepare_weights method, which is effectively part of the FastText model constructor.

The above attributes are initialized to None in the FastTextKeyedVectors class constructor.
Unfortunately, their real initialization happens in an entirely different module, models.fasttext - another indication of poor separation of concerns.

Some questions:

- What is the x_lockf stuff? Why is it used only by the fast C implementation?
- How are vectors_vocab and vectors_ngrams different?

vectors_vocab contains vectors for entire vocabulary.
vectors_ngrams contains vectors for each _bucket_.
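
A minimal sketch of how the two matrices combine into a word vector. The bucket function here is a stand-in: gensim's real ngram hashing lives in the compiled _utils_any2vec module and Python's built-in hash() will not reproduce its buckets.

```python
import numpy as np

def ft_ngrams(word, min_n, max_n):
    # Character ngrams of '<word>', as in FastText.
    w = "<%s>" % word
    return [w[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(w) - n + 1)]

def compose_vector(word, wv, bucket_fn=lambda ng: hash(ng) % 2000000):
    # bucket_fn is a placeholder for gensim's FastText-compatible hash.
    parts = []
    if word in wv.vocab:  # in-vocab words start from their whole-word vector
        parts.append(wv.vectors_vocab[wv.vocab[word].index])
    for ngram in ft_ngrams(word, wv.min_n, wv.max_n):
        bucket = bucket_fn(ngram)
        if bucket in wv.hash2index:  # remap raw bucket to a compact row
            parts.append(wv.vectors_ngrams[wv.hash2index[bucket]])
    return np.mean(parts, axis=0)  # the composed word vector
```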


FastTextTrainables
------------------

[Link](https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.FastTextTrainables)

This is a neural network that learns the vectors for the FastText embedding.
Mostly inherits from its [Word2Vec parent](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2VecTrainables).
Adds logic for calculating and maintaining ngram weights.

Key attributes:

- hashfxn: function for randomly initializing weights. Defaults to the built-in hash()
- layer1_size: The size of the inner layer of the NN. Equal to the vector dimensionality. Set in the Word2VecTrainables constructor.
- seed: The random generator seed used in reset_weights and update_weights
- syn1: The inner layer of the NN. Each row corresponds to a term in the vocabulary. Columns correspond to weights of the inner layer. There are layer1_size such weights. Set in the reset_weights and update_weights methods, only if hierarchical softmax is used.
- syn1neg: Similar to syn1, but only set if negative sampling is used.
- vectors_lockf: A one-dimensional array with one element for each term in the vocab. Set in reset_weights to an array of ones.
- vectors_vocab_lockf: Similar to vectors_lockf: initialized to ones(len(model.trainables.vectors), dtype=REAL)
- vectors_ngrams_lockf: initialized to ones((self.bucket, wv.vector_size), dtype=REAL)

The lockf stuff looks like it gets used by the fast C implementation.
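
A rough sketch of what the lockf arrays seem to do inside those C routines: each entry scales the gradient update for one row, so 1.0 trains normally and 0.0 freezes that vector (illustrative only; the real update is in the Cython code):

```python
import numpy as np

def locked_sgd_step(vectors, lockf, row, grad, alpha):
    # lockf[row] acts as a per-vector learning-rate multiplier.
    vectors[row] += lockf[row] * alpha * grad

vectors = np.zeros((3, 4), dtype=np.float32)
lockf = np.ones(3, dtype=np.float32)
lockf[1] = 0.0  # freeze the second vector
locked_sgd_step(vectors, lockf, 0, np.ones(4, dtype=np.float32), 0.025)
locked_sgd_step(vectors, lockf, 1, np.ones(4, dtype=np.float32), 0.025)
print(vectors[0])  # updated by 0.025 in every dimension
print(vectors[1])  # unchanged: its lockf entry is 0
```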

The inheritance hierarchy here is:

1. FastTextTrainables
2. Word2VecTrainables
3. utils.SaveLoad

FastText
--------

Inheritance hierarchy:

1. FastText
2. BaseWordEmbeddingsModel: vocabulary management plus a ton of deprecated attrs
3. BaseAny2VecModel: logging and training functionality
4. utils.SaveLoad: for loading and saving

Lots of attributes (many inherited from superclasses).

From BaseAny2VecModel:

- workers
- vector_size
- epochs
- callbacks
- batch_words
- kv
- vocabulary
- trainables

From BaseWordEmbeddingsModel:

- alpha
- min_alpha
- min_alpha_yet_reached
- window
- random
- hs
- negative
- ns_exponent
- cbow_mean
- compute_loss
- running_training_loss
- corpus_count
- corpus_total_words
- neg_labels

FastText attributes:

- wv: FastTextKeyedVectors. Used instead of .kv

Logging
-------

The logging seems to be inheritance-based.
It may be better to refactor this using aggregation instead of inheritance in the future.
The benefits would be leaner classes with fewer responsibilities and better separation of concerns.
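
A hedged sketch of the aggregation alternative (hypothetical class names, not gensim API):

```python
class TrainingLogger:
    """Owns progress-reporting state; injected rather than inherited."""
    def log_epoch(self, epoch, loss):
        print("epoch %d: loss %.4f" % (epoch, loss))

class Trainer:
    def __init__(self, logger=None):
        self.logger = logger or TrainingLogger()  # aggregation, not a superclass

    def train_epoch(self, epoch):
        loss = 0.0  # ...real training would compute this...
        self.logger.log_epoch(epoch, loss)
```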
2 changes: 1 addition & 1 deletion docs/notebooks/FastText_Tutorial.ipynb
@@ -134,7 +134,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Hyperparameters for training the model follow the same pattern as Word2Vec. FastText supports the folllowing parameters from the original word2vec - \n",
"Hyperparameters for training the model follow the same pattern as Word2Vec. FastText supports the following parameters from the original word2vec - \n",
" - model: Training architecture. Allowed values: `cbow`, `skipgram` (Default `cbow`)\n",
" - size: Size of embeddings to be learnt (Default 100)\n",
" - alpha: Initial learning rate (Default 0.025)\n",
2 changes: 1 addition & 1 deletion docs/notebooks/Poincare Evaluation.ipynb
@@ -1706,7 +1706,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"1. The model can be investigated further to understand why it doesn't produce results as good as the paper. It is possible that this might be due to training details not present in the paper, or due to us incorrectly interpreting some ambiguous parts of the paper. We have not been able to clarify all such ambiguitities in communication with the authors.\n",
"1. The model can be investigated further to understand why it doesn't produce results as good as the paper. It is possible that this might be due to training details not present in the paper, or due to us incorrectly interpreting some ambiguous parts of the paper. We have not been able to clarify all such ambiguities in communication with the authors.\n",
"2. Optimizing the training process further - with a model size of 50 dimensions and a dataset with ~700k relations and ~80k nodes, the Gensim implementation takes around 45 seconds to complete an epoch (~15k relations per second), whereas the open source C++ implementation takes around 1/6th the time (~95k relations per second).\n",
"3. Implementing the variant of the model mentioned in the paper for symmetric graphs and evaluating on the scientific collaboration datasets described earlier in the report."
]
