Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/remove mongodict #179

Merged
merged 25 commits into from
Oct 28, 2015
Merged

Conversation

flavioamieiro
Copy link
Member

This removes MongoDict from the backend. After this is merged, we'll need to adapt pypln.web to read from mongo directly.

@fccoelho and @israelst please take a close look at this PR, since I had to make a lot of choices here. Specially on 72a54a5

The idea behind this is to have less queries and to make the storing format
less opaque. This should bring a performance improvement and also make it
easier to inspect the database directly.

Tests are still broken because they rely on the old format.
Includes changes for the freqdist worker tests.
PyPLNTask wasn't filtering by the document's "_id". It was returning the first
document it found. This commit includes a regression test and fixes the bug.
This is necessary because the content of the file might be binary data (such as
a pdf), and we don't want to change it by encoding/decoding. This was not an
issue before because MongoDict used to pickle this before storing. This kind of
adaptation is probably going to happen in other places where the mapping from
python objects to json is not straight forward.
Not only this means fixing tests, but also decoding the base64 encoded string
that represents the contents of the file (see 38e3f7e for more info).

Thanks @israelst for the help discussing this.
This commit was co-authored by Israel Teixeira <[email protected]>

We needed to store the trigram results as strings (since that's the only
possible type for mongo keys). We decided to turn the tuples into strings
joined by spaces because spaces are never going to be part of a token.

Also, mongo keys cannot contain neither `.` nor `$`, so we decided to replace
those with `\dot` and `\dollarsign` respectively.

Thanks a lot @israelst for the help!
@fccoelho
Copy link
Member

For me this looks ok, apart from the few comments I've made on specific commits. As soon as all tests are passing, we can merge

@flavioamieiro
Copy link
Member Author

I've added tests for Bigrams and Trigrams with dollar signs and dots. I still didn't figure out why the Bigrams worker is not affected by the same issue.
With this, I think that this PR is ready to merge, unless we decide to follow @israelst 's (good) suggestion and replace $ and . close to the db for all workers. In any case, we should not merge this branch until pypln.web catches up.
I'll start working on pypln.web's side of things now.

@fccoelho
Copy link
Member

I Think we should go ahead and merge it as is.

Regarding the bigram/trigram bug mystery, I think the answer may lie within NLTK, in the way the colocationfider objects work. We shoul run some tests in the python console and look at the output, and not only if it generates an exception or not.

As @fccoelho pointed out, we were just running the same code as the worker. If
there was an error in the way we call nltk (and we got 'None' for example), we
would still have a valid assertion. We now get a known value and tests against
that.
@flavioamieiro
Copy link
Member Author

@fccoelho I've changed the tests for bigrams and trigrams. I agree with you on that.

I also think this is ready now but, as I said, let's leave the PR open for now. If we merge this in develop now, pypln.web won't be up-to-date with pypln.backend. This would mean we would not be able to deploy quickly if we needed to. I say we leave this branch here and, as soon as pypln.web is also ready, merge it.

@flavioamieiro
Copy link
Member Author

@fccoelho and @israelst : for a comparison, I've created a small benchmarking script and ran it against this branch and the current develop branch. The results are pretty interesting (here they are: with mongodict and without mongodict). This branch is definitely faster than what we had before, not to mention the simplification of the code base.

@fccoelho
Copy link
Member

Great results! now let's merge and run our favorite load test (the mediacloud collection) ;-)

israelst added a commit that referenced this pull request Oct 28, 2015
@israelst israelst merged commit d3e1d78 into NAMD:develop Oct 28, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants