Merge branch 'release-3.7.0'
menshikh-iv committed Jan 18, 2019
2 parents 355ecc6 + 7d84b7e commit 42e47a3
Showing 187 changed files with 57,324 additions and 16,406 deletions.
2 changes: 1 addition & 1 deletion .circleci/config.yml
@@ -30,7 +30,7 @@ jobs:
name: Build documentation
command: |
source venv/bin/activate
-tox -e docs -vv
+tox -e compile,docs -vv
- store_artifacts:
path: docs/src/_build
2 changes: 2 additions & 0 deletions .gitignore
@@ -7,6 +7,8 @@
*.o
*.so
*.pyc
+*.pyo
+*.pyd

# Packages #
############
13 changes: 12 additions & 1 deletion .travis.yml
@@ -13,7 +13,10 @@ language: python
matrix:
include:
- python: '2.7'
-env: TOXENV="flake8"
+env: TOXENV="flake8,flake8-docs"
+
+- python: '3.6'
+env: TOXENV="flake8,flake8-docs"

- python: '2.7'
env: TOXENV="py27-linux"
@@ -24,5 +27,13 @@ matrix:
- python: '3.6'
env: TOXENV="py36-linux"

+- python: '3.7'
+env:
+- TOXENV="py37-linux"
+- BOTO_CONFIG="/dev/null"
+dist: xenial
+sudo: true


install: pip install tox
script: tox -vv
222 changes: 217 additions & 5 deletions CHANGELOG.md

Large diffs are not rendered by default.

19 changes: 19 additions & 0 deletions MANIFEST.in
@@ -4,17 +4,36 @@ include CHANGELOG.md
include COPYING
include COPYING.LESSER
include ez_setup.py

include gensim/models/voidptr.h
include gensim/models/fast_line_sentence.h

include gensim/models/word2vec_inner.c
include gensim/models/word2vec_inner.pyx
include gensim/models/word2vec_inner.pxd
include gensim/models/word2vec_corpusfile.cpp
include gensim/models/word2vec_corpusfile.pyx
include gensim/models/word2vec_corpusfile.pxd

include gensim/models/doc2vec_inner.c
include gensim/models/doc2vec_inner.pyx
include gensim/models/doc2vec_inner.pxd
include gensim/models/doc2vec_corpusfile.cpp
include gensim/models/doc2vec_corpusfile.pyx

include gensim/models/fasttext_inner.c
include gensim/models/fasttext_inner.pyx
include gensim/models/fasttext_inner.pxd
include gensim/models/fasttext_corpusfile.cpp
include gensim/models/fasttext_corpusfile.pyx

include gensim/models/_utils_any2vec.c
include gensim/models/_utils_any2vec.pyx
include gensim/corpora/_mmreader.c
include gensim/corpora/_mmreader.pyx
include gensim/_matutils.c
include gensim/_matutils.pyx

include gensim/models/nmf_pgd.c
include gensim/models/nmf_pgd.pyx

40 changes: 17 additions & 23 deletions README.md
@@ -119,29 +119,23 @@ Documentation
Adopters
--------



| Name | Logo | URL | Description |
|----------------------------------------|--------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| RaRe Technologies | ![rare](docs/src/readme_images/rare.png) | [rare-technologies.com](http://rare-technologies.com) | Machine learning & NLP consulting and training. Creators and maintainers of Gensim. |
| Mindseye | ![mindseye](docs/src/readme_images/mindseye.png) | [mindseye.com](http://www.mindseyesolutions.com/) | Similarities in legal documents |
| Talentpair | ![talent-pair](docs/src/readme_images/talent-pair.png) | [talentpair.com](http://talentpair.com) | Data science driving high-touch recruiting |
| Tailwind | ![tailwind](docs/src/readme_images/tailwind.png)| [Tailwindapp.com](https://www.tailwindapp.com/)| Post interesting and relevant content to Pinterest |
| Issuu | ![issuu](docs/src/readme_images/issuu.png) | [Issuu.com](https://issuu.com/)| Gensim’s LDA module lies at the very core of the analysis we perform on each uploaded publication to figure out what it’s all about.
| Sports Authority | ![sports-authority](docs/src/readme_images/sports-authority.png) | [sportsauthority.com](https://en.wikipedia.org/wiki/Sports_Authority)| Text mining of customer surveys and social media sources |
| Search Metrics | ![search-metrics](docs/src/readme_images/search-metrics.png) | [searchmetrics.com](http://www.searchmetrics.com/)| Gensim word2vec used for entity disambiguation in Search Engine Optimisation
| Cisco Security | ![cisco](docs/src/readme_images/cisco.png) | [cisco.com](http://www.cisco.com/c/en/us/products/security/index.html)| Large-scale fraud detection
| 12K Research | ![12k](docs/src/readme_images/12k.png)| [12k.co](https://12k.co/)| Document similarity analysis on media articles
| National Institutes of Health | ![nih](docs/src/readme_images/nih.png) | [github/NIHOPA](https://github.com/NIHOPA/pipeline_word2vec)| Processing grants and publications with word2vec
| Codeq LLC | ![codeq](docs/src/readme_images/codeq.png) | [codeq.com](https://codeq.com)| Document classification with word2vec
| Mass Cognition | ![mass-cognition](docs/src/readme_images/mass-cognition.png) | [masscognition.com](http://www.masscognition.com/) | Topic analysis service for consumer text data and general text data |
| Stillwater Supercomputing | ![stillwater](docs/src/readme_images/stillwater.png) | [stillwater-sc.com](http://www.stillwater-sc.com/) | Document comprehension and association with word2vec |
| Channel 4 | ![channel4](docs/src/readme_images/channel4.png) | [channel4.com](http://www.channel4.com/) | Recommendation engine |
| Amazon | ![amazon](docs/src/readme_images/amazon.png) | [amazon.com](http://www.amazon.com/) | Document similarity|
| SiteGround Hosting | ![siteground](docs/src/readme_images/siteground.png) | [siteground.com](https://www.siteground.com/) | An ensemble search engine which uses different embeddings models and similarities, including word2vec, WMD, and LDA. |
| Juju | ![juju](docs/src/readme_images/juju.png) | [www.juju.com](http://www.juju.com/) | Provide non-obvious related job suggestions. |
| NLPub | ![nlpub](docs/src/readme_images/nlpub.png) | [nlpub.org](https://nlpub.org/) | Distributional semantic models including word2vec. |
|Capital One | ![capitalone](docs/src/readme_images/capitalone.png) | [www.capitalone.com](https://www.capitalone.com/) | Topic modeling for customer complaints exploration. |
| Company | Logo | Industry | Use of Gensim |
|---------|------|----------|---------------|
| [RARE Technologies](http://rare-technologies.com) | ![rare](docs/src/readme_images/rare.png) | ML & NLP consulting | Creators of Gensim – this is us! |
| [Amazon](http://www.amazon.com/) | ![amazon](docs/src/readme_images/amazon.png) | Retail | Document similarity. |
| [National Institutes of Health](https://github.com/NIHOPA/pipeline_word2vec) | ![nih](docs/src/readme_images/nih.png) | Health | Processing grants and publications with word2vec. |
| [Cisco Security](http://www.cisco.com/c/en/us/products/security/index.html) | ![cisco](docs/src/readme_images/cisco.png) | Security | Large-scale fraud detection. |
| [Mindseye](http://www.mindseyesolutions.com/) | ![mindseye](docs/src/readme_images/mindseye.png) | Legal | Similarities in legal documents. |
| [Channel 4](http://www.channel4.com/) | ![channel4](docs/src/readme_images/channel4.png) | Media | Recommendation engine. |
| [Talentpair](http://talentpair.com) | ![talent-pair](docs/src/readme_images/talent-pair.png) | HR | Candidate matching in high-touch recruiting. |
| [Juju](http://www.juju.com/) | ![juju](docs/src/readme_images/juju.png) | HR | Provide non-obvious related job suggestions. |
| [Tailwind](https://www.tailwindapp.com/) | ![tailwind](docs/src/readme_images/tailwind.png) | Media | Post interesting and relevant content to Pinterest. |
| [Issuu](https://issuu.com/) | ![issuu](docs/src/readme_images/issuu.png) | Media | Gensim's LDA module lies at the very core of the analysis we perform on each uploaded publication to figure out what it's all about. |
| [Search Metrics](http://www.searchmetrics.com/) | ![search-metrics](docs/src/readme_images/search-metrics.png) | Content Marketing | Gensim word2vec used for entity disambiguation in Search Engine Optimisation. |
| [12K Research](https://12k.co/) | ![12k](docs/src/readme_images/12k.png)| Media | Document similarity analysis on media articles. |
| [Stillwater Supercomputing](http://www.stillwater-sc.com/) | ![stillwater](docs/src/readme_images/stillwater.png) | Hardware | Document comprehension and association with word2vec. |
| [SiteGround](https://www.siteground.com/) | ![siteground](docs/src/readme_images/siteground.png) | Web hosting | An ensemble search engine which uses different embeddings models and similarities, including word2vec, WMD, and LDA. |
| [Capital One](https://www.capitalone.com/) | ![capitalone](docs/src/readme_images/capitalone.png) | Finance | Topic modeling for customer complaints exploration. |

-------

5 changes: 5 additions & 0 deletions appveyor.yml
@@ -28,6 +28,11 @@ environment:
PYTHON_ARCH: "64"
TOXENV: "py36-win"

+- PYTHON: "C:\\Python37-x64"
+PYTHON_VERSION: "3.7.0"
+PYTHON_ARCH: "64"
+TOXENV: "py37-win"

init:
- "ECHO %PYTHON% %PYTHON_VERSION% %PYTHON_ARCH%"
- "ECHO \"%APPVEYOR_SCHEDULED_BUILD%\""
152 changes: 152 additions & 0 deletions docs/fasttext-notes.md
@@ -0,0 +1,152 @@
FastText Notes
==============

The implementation is split across several submodules:

- models.fasttext
- models.keyedvectors (includes FastText-specific code, which is a poor separation of concerns)
- models.word2vec (superclasses)
- models.base_any2vec (superclasses)

The implementation consists of several key classes:

1. models.fasttext.FastTextVocab: the vocabulary
2. models.keyedvectors.FastTextKeyedVectors: the vectors
3. models.fasttext.FastTextTrainables: the underlying neural network
4. models.fasttext.FastText: ties everything together
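
As a quick sketch of where these pieces surface on a trained model (a toy corpus; attribute names as in the gensim 3.7-era API):

```python
from gensim.models import FastText

sentences = [["hello", "world"], ["machine", "learning", "rocks"]]
model = FastText(sentences, size=10, min_count=1, min_n=3, max_n=6)

print(type(model))             # models.fasttext.FastText
print(type(model.wv))          # models.keyedvectors.FastTextKeyedVectors
print(type(model.vocabulary))  # models.fasttext.FastTextVocab
print(type(model.trainables))  # models.fasttext.FastTextTrainables
```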

FastTextVocab
-------------

Seems to be an entirely redundant class.
Inherits from models.word2vec.Word2VecVocab, adding no new functionality.

FastTextKeyedVectors
--------------------

Inheritance hierarchy:

1. FastTextKeyedVectors
2. WordEmbeddingsKeyedVectors. Implements word similarity, e.g. cosine similarity and WMD.
3. BaseKeyedVectors (abstract base class)
4. utils.SaveLoad
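
For instance, the similarity functionality from WordEmbeddingsKeyedVectors is available directly on model.wv (continuing the toy model above; wmdistance additionally needs the pyemd package installed):

```python
print(model.wv.similarity("machine", "learning"))  # cosine similarity
print(model.wv.most_similar("learning", topn=2))   # nearest neighbours
# Word Mover's Distance between two token lists (requires pyemd):
print(model.wv.wmdistance(["machine", "learning"], ["hello", "world"]))
```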

There are many attributes.

Inherited from BaseKeyedVectors:

- vectors: a 2D numpy array. Flexible number of rows (0 by default). Number of columns equals vector dimensionality.
- vocab: a dictionary. Keys are words. Values are Vocab instances: these are essentially namedtuples that contain an index and a count. The former is the position of the term in the entire vocab; the latter is the number of times the term occurs in the corpus.
- vector_size (dimensionality)
- index2entity

Inherited from WordEmbeddingsKeyedVectors:

- vectors_norm
- index2word
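
Concretely, for the toy model above (shapes and counts depend on the corpus):

```python
print(model.wv.vectors.shape)     # (vocab_size, vector_size), here (5, 10)
entry = model.wv.vocab["machine"]
print(entry.index, entry.count)   # position in the vocab, corpus frequency
print(model.wv.index2word[entry.index])  # back to the word: 'machine'
```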

Added by FastTextKeyedVectors:

- vectors_vocab: 2D array. Rows are vectors. Columns correspond to vector dimensions. Initialized in FastTextTrainables.init_ngrams_weights. Reset in reset_ngrams_weights. Referred to as syn0_vocab in fasttext_inner.pyx. These are vectors for every word in the vocabulary.
- vectors_vocab_norm: looks unused, see _clear_post_train method.
- vectors_ngrams: 2D array. Each row is a bucket. Columns correspond to vector dimensions. Initialized in init_ngrams_weights function. Initialized in _load_vectors method when reading from native FB binary. Modified in reset_ngrams_weights method. This is the first matrix loaded from the native binary files.
- vectors_ngrams_norm: looks unused, see _clear_post_train method.
- buckets_word: A hashmap. Keyed by the index of a term in the vocab. Each value is an array, where each element is an integer that corresponds to a bucket. Initialized in init_ngrams_weights function
- hash2index: A hashmap. Keys are hashes of ngrams (modulo the number of buckets). Values appear to be compact indices into vectors_ngrams, so only buckets actually seen get a row. Initialized in init_ngrams_weights function.
- min_n: minimum ngram length
- max_n: maximum ngram length
- num_ngram_vectors: initialized in the init_ngrams_weights function

The init_ngrams_weights method looks like an internal method of FastTextTrainables.
It gets called as part of the prepare_weights method, which is effectively part of the FastText model constructor.

The above attributes are initialized to None in the FastTextKeyedVectors class constructor.
Unfortunately, their real initialization happens in an entirely different module, models.fasttext - another indication of poor separation of concerns.

Some questions:

- What is the x_lockf stuff? Why is it used only by the fast C implementation?
- How are vectors_vocab and vectors_ngrams different?

vectors_vocab contains vectors for entire vocabulary.
vectors_ngrams contains vectors for each _bucket_.
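
A minimal sketch of how the two matrices combine into a word vector. The bucket function here is a stand-in: gensim's real ngram hashing lives in the compiled _utils_any2vec module and Python's built-in hash() will not reproduce its buckets.

```python
import numpy as np

def ft_ngrams(word, min_n, max_n):
    # Character ngrams of '<word>', as in FastText.
    w = "<%s>" % word
    return [w[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(w) - n + 1)]

def compose_vector(word, wv, bucket_fn=lambda ng: hash(ng) % 2000000):
    # bucket_fn is a placeholder for gensim's FastText-compatible hash.
    parts = []
    if word in wv.vocab:  # in-vocab words start from their whole-word vector
        parts.append(wv.vectors_vocab[wv.vocab[word].index])
    for ngram in ft_ngrams(word, wv.min_n, wv.max_n):
        bucket = bucket_fn(ngram)
        if bucket in wv.hash2index:  # remap raw bucket to a compact row
            parts.append(wv.vectors_ngrams[wv.hash2index[bucket]])
    return np.mean(parts, axis=0)  # the composed word vector
```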


FastTextTrainables
------------------

[Link](https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.FastTextTrainables)

This is a neural network that learns the vectors for the FastText embedding.
Mostly inherits from its [Word2Vec parent](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2VecTrainables).
Adds logic for calculating and maintaining ngram weights.

Key attributes:

- hashfxn: function for randomly initializing weights. Defaults to the built-in hash()
- layer1_size: The size of the inner layer of the NN. Equal to the vector dimensionality. Set in the Word2VecTrainables constructor.
- seed: The random generator seed used in reset_weights and update_weights
- syn1: The inner layer of the NN. Each row corresponds to a term in the vocabulary. Columns correspond to weights of the inner layer. There are layer1_size such weights. Set in the reset_weights and update_weights methods, only if hierarchical softmax is used.
- syn1neg: Similar to syn1, but only set if negative sampling is used.
- vectors_lockf: A one-dimensional array with one element for each term in the vocab. Set in reset_weights to an array of ones.
- vectors_vocab_lockf: Similar to vectors_lockf: initialized to ones(len(model.trainables.vectors), dtype=REAL)
- vectors_ngrams_lockf: initialized to ones((self.bucket, wv.vector_size), dtype=REAL)

The lockf stuff looks like it gets used by the fast C implementation.
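
A rough sketch of what the lockf arrays seem to do inside those C routines: each entry scales the gradient update for one row, so 1.0 trains normally and 0.0 freezes that vector (illustrative only; the real update is in the Cython code):

```python
import numpy as np

def locked_sgd_step(vectors, lockf, row, grad, alpha):
    # lockf[row] acts as a per-vector learning-rate multiplier.
    vectors[row] += lockf[row] * alpha * grad

vectors = np.zeros((3, 4), dtype=np.float32)
lockf = np.ones(3, dtype=np.float32)
lockf[1] = 0.0  # freeze the second vector
locked_sgd_step(vectors, lockf, 0, np.ones(4, dtype=np.float32), 0.025)
locked_sgd_step(vectors, lockf, 1, np.ones(4, dtype=np.float32), 0.025)
print(vectors[0])  # updated by 0.025 in every dimension
print(vectors[1])  # unchanged: its lockf entry is 0
```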

The inheritance hierarchy here is:

1. FastTextTrainables
2. Word2VecTrainables
3. utils.SaveLoad

FastText
--------

Inheritance hierarchy:

1. FastText
2. BaseWordEmbeddingsModel: vocabulary management plus a ton of deprecated attrs
3. BaseAny2VecModel: logging and training functionality
4. utils.SaveLoad: for loading and saving

Lots of attributes (many inherited from superclasses).

From BaseAny2VecModel:

- workers
- vector_size
- epochs
- callbacks
- batch_words
- kv
- vocabulary
- trainables

From BaseWordEmbeddingsModel:

- alpha
- min_alpha
- min_alpha_yet_reached
- window
- random
- hs
- negative
- ns_exponent
- cbow_mean
- compute_loss
- running_training_loss
- corpus_count
- corpus_total_words
- neg_labels

FastText attributes:

- wv: FastTextKeyedVectors. Used instead of .kv

Logging
-------

The logging seems to be inheritance-based.
It may be better to refactor this using aggregation instead of inheritance in the future.
The benefits would be leaner classes with fewer responsibilities and better separation of concerns.
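
A hedged sketch of the aggregation alternative (hypothetical class names, not gensim API):

```python
class TrainingLogger:
    """Owns progress-reporting state; injected rather than inherited."""
    def log_epoch(self, epoch, loss):
        print("epoch %d: loss %.4f" % (epoch, loss))

class Trainer:
    def __init__(self, logger=None):
        self.logger = logger or TrainingLogger()  # aggregation, not a superclass

    def train_epoch(self, epoch):
        loss = 0.0  # ...real training would compute this...
        self.logger.log_epoch(epoch, loss)
```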
2 changes: 1 addition & 1 deletion docs/notebooks/FastText_Tutorial.ipynb
@@ -134,7 +134,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Hyperparameters for training the model follow the same pattern as Word2Vec. FastText supports the folllowing parameters from the original word2vec - \n",
"Hyperparameters for training the model follow the same pattern as Word2Vec. FastText supports the following parameters from the original word2vec - \n",
" - model: Training architecture. Allowed values: `cbow`, `skipgram` (Default `cbow`)\n",
" - size: Size of embeddings to be learnt (Default 100)\n",
" - alpha: Initial learning rate (Default 0.025)\n",
2 changes: 1 addition & 1 deletion docs/notebooks/Poincare Evaluation.ipynb
@@ -1706,7 +1706,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"1. The model can be investigated further to understand why it doesn't produce results as good as the paper. It is possible that this might be due to training details not present in the paper, or due to us incorrectly interpreting some ambiguous parts of the paper. We have not been able to clarify all such ambiguitities in communication with the authors.\n",
"1. The model can be investigated further to understand why it doesn't produce results as good as the paper. It is possible that this might be due to training details not present in the paper, or due to us incorrectly interpreting some ambiguous parts of the paper. We have not been able to clarify all such ambiguities in communication with the authors.\n",
"2. Optimizing the training process further - with a model size of 50 dimensions and a dataset with ~700k relations and ~80k nodes, the Gensim implementation takes around 45 seconds to complete an epoch (~15k relations per second), whereas the open source C++ implementation takes around 1/6th the time (~95k relations per second).\n",
"3. Implementing the variant of the model mentioned in the paper for symmetric graphs and evaluating on the scientific collaboration datasets described earlier in the report."
]
