Feature/remove mongodict #179
The idea behind this is to issue fewer queries and to make the storage format less opaque. This should bring a performance improvement and also make it easier to inspect the database directly. Tests are still broken because they rely on the old format.
Includes changes for the freqdist worker tests.
PyPLNTask wasn't filtering by the document's "_id". It was returning the first document it found. This commit includes a regression test and fixes the bug.
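The bug described above can be illustrated with a minimal sketch. This is not the PyPLNTask code: `find_one` here is a hypothetical in-memory stand-in for a collection lookup, and the sample documents are made up for illustration.

```python
def find_one(documents, query=None):
    """Hypothetical stand-in for a collection lookup.

    Returns the first document whose fields match every key in `query`,
    or simply the first document when no query is given.
    """
    query = query or {}
    for doc in documents:
        if all(doc.get(key) == value for key, value in query.items()):
            return doc
    return None


# Made-up documents, only for illustration.
documents = [
    {"_id": 1, "text": "first"},
    {"_id": 2, "text": "second"},
]

# Without a filter, the lookup returns whatever document comes first.
buggy = find_one(documents)

# Filtering by "_id" returns the document the task was actually given.
fixed = find_one(documents, {"_id": 2})
```

A regression test for this kind of bug needs at least two documents in the collection; with a single document, both lookups return the same result and the bug stays hidden.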
This is necessary because the content of the file might be binary data (such as a PDF), and we don't want to change it by encoding/decoding. This was not an issue before because MongoDict used to pickle this before storing. This kind of adaptation is probably going to happen in other places where the mapping from Python objects to JSON is not straightforward.
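To show why binary content needs special handling, here is a small sketch. It assumes base64 as the storage encoding, which is only one possible choice and not necessarily what this commit does; the sample bytes are made up.

```python
import base64

# Made-up bytes that are not valid UTF-8, standing in for a PDF upload.
pdf_like = b"%PDF-1.4\n\xff\xfe\x00binary"

# Treating the bytes as text silently changes them: invalid sequences
# are replaced, so encoding back does not give the original content.
lossy = pdf_like.decode("utf-8", errors="replace").encode("utf-8")

# Assumption: base64 is used to get a JSON-safe string. Decoding it
# returns exactly the bytes that were stored.
stored = base64.b64encode(pdf_like).decode("ascii")
restored = base64.b64decode(stored)
```

The trade-off is that the stored value is no longer human-readable when inspecting the database, so it makes sense to apply this only to fields that can hold arbitrary bytes.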
This commit was co-authored by Israel Teixeira <[email protected]>. We needed to store the trigram results as strings (since that's the only possible type for mongo keys). We decided to turn the tuples into strings joined by spaces because spaces are never going to be part of a token. Also, mongo keys cannot contain either `.` or `$`, so we decided to replace those with `\dot` and `\dollarsign` respectively. Thanks a lot @israelst for the help!
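The key conversion described above could look roughly like the sketch below. The function names `encode_ngram_key` and `decode_ngram_key` are hypothetical, not taken from the commit, and the sketch assumes that no token contains a space or the literal sequences `\dot` or `\dollarsign`.

```python
def encode_ngram_key(ngram):
    """Turn a tuple of tokens into a string usable as a mongo key.

    Tokens are joined by spaces, then `$` and `.` are replaced because
    they are not allowed in keys.
    """
    key = " ".join(ngram)
    return key.replace("$", "\\dollarsign").replace(".", "\\dot")


def decode_ngram_key(key):
    """Reverse encode_ngram_key, returning the original tuple of tokens."""
    restored = key.replace("\\dollarsign", "$").replace("\\dot", ".")
    return tuple(restored.split(" "))


# Made-up trigram, only for illustration.
trigram = ("U.S.", "costs", "$5")
key = encode_ngram_key(trigram)
```

Having a decode function alongside the encoder lets readers of the database (such as pypln.web) recover the original tuples, as long as the assumptions above hold.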
This looks OK to me, apart from the few comments I've made on specific commits. As soon as all tests are passing, we can merge.
I've added tests for Bigrams and Trigrams with dollar signs and dots. I still haven't figured out why the Bigrams worker is not affected by the same issue.
I think we should go ahead and merge it as is. Regarding the bigram/trigram bug mystery, I think the answer may lie within NLTK, in the way the collocation finder objects work. We should run some tests in the Python console and look at the output, and not only at whether it raises an exception.
As @fccoelho pointed out, we were just running the same code as the worker. If there was an error in the way we call nltk (and we got 'None', for example), the assertion would still pass. We now get a known value and test against that.
@fccoelho I've changed the tests for bigrams and trigrams. I agree with you on that. I also think this is ready now but, as I said, let's leave the PR open for now. If we merge this in
@fccoelho and @israelst: for comparison, I've created a small benchmarking script and ran it against this branch and the current develop branch. The results are pretty interesting (here they are: with mongodict and without mongodict). This branch is definitely faster than what we had before, not to mention the simplification of the code base.
Great results! Now let's merge and run our favorite load test (the mediacloud collection) ;-)
This removes MongoDict from the backend. After this is merged, we'll need to adapt pypln.web to read from mongo directly.
@fccoelho and @israelst please take a close look at this PR, since I had to make a lot of choices here, especially on 72a54a5.