Feature/remove mongodict #179

flavioamieiro · 2015-09-15T02:47:53Z

This removes MongoDict from the backend. After this is merged, we'll need to adapt pypln.web to read from mongo directly.

@fccoelho and @israelst, please take a close look at this PR, since I had to make a lot of choices here, especially in 72a54a5.

The idea behind this is to have fewer queries and to make the storage format less opaque. This should bring a performance improvement and also make it easier to inspect the database directly. Tests are still broken because they rely on the old format.
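To make the "fewer queries, less opaque" point concrete, here is a minimal in-memory sketch. Both layouts are assumptions made up for illustration (the key pattern, the field names and the helper functions are hypothetical, not the actual MongoDict or pypln.backend schema); plain dicts stand in for the database.

```python
import pickle

# Hypothetical opaque layout: one pickled entry per (document, property).
opaque_store = {
    "id:1:text": pickle.dumps("The quick brown fox"),
    "id:1:language": pickle.dumps("en"),
}

# Hypothetical document layout: one plain, readable record per document.
document_store = {
    1: {"_id": 1, "text": "The quick brown fox", "language": "en"},
}


def read_opaque(doc_id, properties):
    """One lookup (and one unpickle) per requested property."""
    lookups = 0
    result = {}
    for name in properties:
        key = "id:%d:%s" % (doc_id, name)
        result[name] = pickle.loads(opaque_store[key])
        lookups += 1
    return result, lookups


def read_document(doc_id, properties):
    """A single lookup returns every property at once."""
    record = document_store[doc_id]
    return {name: record[name] for name in properties}, 1
```

In this toy version the opaque layout needs one lookup per property and its stored values are unreadable bytes, while the document layout needs a single lookup and can be inspected as-is.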

Includes changes for the freqdist worker tests.

PyPLNTask wasn't filtering by the document's "_id". It was returning the first document it found. This commit includes a regression test and fixes the bug.
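For anyone reading along, this is roughly what that kind of bug looks like. The sketch below uses a list of dicts and a hypothetical `find_one` helper that mimics first-match semantics; it is not the actual PyPLNTask code, and the function names are made up.

```python
def find_one(collection, query=None):
    """Return the first document matching every key in `query`, else None."""
    query = query or {}
    for doc in collection:
        if all(doc.get(key) == value for key, value in query.items()):
            return doc
    return None


collection = [
    {"_id": 1, "text": "first document"},
    {"_id": 2, "text": "second document"},
]


def load_buggy(collection, doc_id):
    # No filter: always returns the first document, whatever doc_id is.
    return find_one(collection)


def load_fixed(collection, doc_id):
    # Filtering by "_id" returns the requested document (or None).
    return find_one(collection, {"_id": doc_id})
```

A regression test for this needs at least two documents in the collection; with a single document both versions return the same result.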

Storing the file contents as a base64-encoded string is necessary because the contents might be binary data (such as a PDF), and we don't want to change them by encoding/decoding. This was not an issue before because MongoDict used to pickle the value before storing it. This kind of adaptation is probably going to happen in other places where the mapping from Python objects to JSON is not straightforward.
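A rough illustration of why binary contents need a text-safe encoding before they go into a JSON-like document (base64 is the encoding mentioned for the file contents in this PR). The helper names and the record fields are hypothetical; only the standard library is used.

```python
import base64
import json


def encode_contents(raw_bytes):
    """Turn arbitrary bytes into an ASCII string that is safe to store."""
    return base64.b64encode(raw_bytes).decode("ascii")


def decode_contents(stored):
    """Recover the original bytes from the stored string."""
    return base64.b64decode(stored)


# Bytes that are not valid UTF-8, as you would find in a PDF.
pdf_like = b"%PDF-1.4\n\xff\xfe\x00binary payload"

record = {"filename": "example.pdf", "contents": encode_contents(pdf_like)}
serialized = json.dumps(record)  # works because "contents" is plain text
restored = decode_contents(json.loads(serialized)["contents"])
```

The round trip gives back exactly the original bytes, which is the property we care about for the extractor.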


Adapting the Extractor means not only fixing tests, but also decoding the base64-encoded string that represents the contents of the file (see 38e3f7e for more info). Thanks @israelst for the help discussing this.


This commit was co-authored by Israel Teixeira <[email protected]>.

We needed to store the trigram results as strings (since that's the only possible type for mongo keys). We decided to turn the tuples into strings joined by spaces because spaces are never going to be part of a token. Also, mongo keys can contain neither `.` nor `$`, so we decided to replace those with `\dot` and `\dollarsign` respectively. Thanks a lot @israelst for the help!
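The key scheme described above can be sketched as follows. The function names are hypothetical and this is not the worker's actual code, just the same join-and-replace idea.

```python
def ngram_to_key(ngram):
    """Join the tokens with spaces and escape characters mongo keys reject."""
    key = " ".join(ngram)
    return key.replace(".", "\\dot").replace("$", "\\dollarsign")


def key_to_ngram(key):
    """Reverse the escaping and split back into a tuple of tokens."""
    restored = key.replace("\\dollarsign", "$").replace("\\dot", ".")
    return tuple(restored.split(" "))
```

For example, `ngram_to_key(("U.S.", "$", "5"))` gives the string `U\dotS\dot \dollarsign 5`, and `key_to_ngram` turns it back into the original tuple. Note that the reverse step assumes no token literally contains `\dot` or `\dollarsign`.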

fccoelho · 2015-09-15T16:31:24Z

This looks OK to me, apart from the few comments I've made on specific commits. As soon as all tests are passing, we can merge.

flavioamieiro · 2015-09-30T02:07:01Z

I've added tests for Bigrams and Trigrams with dollar signs and dots. I still haven't figured out why the Bigrams worker is not affected by the same issue.
With this, I think this PR is ready to merge, unless we decide to follow @israelst's (good) suggestion and replace `$` and `.` close to the db for all workers. In any case, we should not merge this branch until pypln.web catches up.
I'll start working on pypln.web's side of things now.

fccoelho · 2015-09-30T10:14:04Z

I think we should go ahead and merge it as is.

Regarding the bigram/trigram bug mystery, I think the answer may lie within NLTK, in the way the collocation finder objects work. We should run some tests in the Python console and look at the output, not only at whether it raises an exception.


As @fccoelho pointed out, the tests were just running the same code as the worker. If there was an error in the way we call NLTK (and we got `None`, for example), the assertion would still pass. We now use a known value and test against that.
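To show the difference between the two testing styles, here is a small sketch. It uses a toy pure-Python bigram counter instead of NLTK, and every name in it is hypothetical; the point is only how the assertion is built.

```python
from collections import Counter


def bigram_worker(tokens):
    """Toy stand-in for a worker: count adjacent token pairs."""
    return Counter(zip(tokens, tokens[1:]))


def broken_worker(tokens):
    """Simulates a wrong call into the library that silently returns None."""
    return None


TOKENS = ["to", "be", "or", "not", "to", "be"]

# Known value, written out by hand rather than computed by the worker.
EXPECTED = {("to", "be"): 2, ("be", "or"): 1, ("or", "not"): 1, ("not", "to"): 1}


def tautological_check(worker):
    # Expected value comes from the same code under test: always passes.
    expected = worker(TOKENS)
    return worker(TOKENS) == expected


def known_value_check(worker):
    # Expected value is fixed in advance, so a broken worker is caught.
    result = worker(TOKENS)
    return result is not None and dict(result) == EXPECTED
```

With these definitions the tautological check passes even for `broken_worker`, while the known-value check only passes for the worker that produces the right counts.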

flavioamieiro · 2015-09-30T15:17:56Z

@fccoelho I've changed the tests for bigrams and trigrams. I agree with you on that.

I also think this is ready now but, as I said, let's leave the PR open for the moment. If we merge this into develop now, pypln.web won't be up to date with pypln.backend, which would mean we could not deploy quickly if we needed to. I say we leave this branch here and merge it as soon as pypln.web is also ready.

flavioamieiro · 2015-10-22T15:18:56Z

@fccoelho and @israelst: for comparison, I've created a small benchmarking script and run it against this branch and the current develop branch. The results are pretty interesting (here they are: with mongodict and without mongodict). This branch is definitely faster than what we had before, not to mention the simplification of the code base.

fccoelho · 2015-10-22T19:36:25Z

Great results! Now let's merge and run our favorite load test (the mediacloud collection) ;-)


flavioamieiro added 21 commits August 6, 2015 17:59

Removes MongoDictAdapter

f8b4b87


Adapts test utils for the new database format

f8ecdd4


Fixes the name of the property for the index id in elasticindexer

acca8e8

Adapts elasticsearch worker tests

3b735cd

Adapts tests for bigrams worker to the removal of MongoDict

4ae2654

Fixes huge bug in PyPLNTask

ad901fb


Adapts Extractor to the removal of MongoDict

2c7bb07


Adapts GridFSFileDeleter to the removal of MongoDict

d3ca42e

Adapts Lemmatizer tests to the removal of MongoDict

a1880b8

Adapts NounPhrase worker to the removal of MongoDict

077d6a9

Adapts WordCloud worker to the removal of MongoDict

23db880

Removes trailing _ from Trigram worker test name

02c620d

Adapts Tokenizer test to the removal of MongoDict

565c380

Adapts the Statistics worker tests to the removal of MongoDict

d8f7435

Adapts Spellchecker worker tests to the removal of MongoDict

dfc69f0

Adapts POS worker tests to the removal of MongoDict

bc9a174

Adapts SemanticTagger worker tests to the removal of MongoDict

ba8b40f

Adapts PalavrasRaw worker tests to the removal of MongoDict

53515b4

Removes last references to MongoDict

48d41a7

flavioamieiro added 3 commits September 16, 2015 16:50

Adds w=1 to all inserts in tests as suggested by @fccoelho

b12833e

Adds test for trigrams with dots and dollar signs

1871148

Adds test for bigrams with dots and dollar signs

3bd3896

flavioamieiro mentioned this pull request Oct 19, 2015

Feature/remove mongodict NAMD/pypln.web#137

Merged

israelst added a commit that referenced this pull request Oct 28, 2015

Merge pull request #179 from flavioamieiro/feature/remove_mongodict

d3e1d78

Feature/remove mongodict

israelst merged commit d3e1d78 into NAMD:develop Oct 28, 2015
