Add a Bitdeli Badge to README #178
Merged
For some reason, in some pypln installations the text extracted from the document was not reaching the `PalavrasRaw` worker as unicode. This may be due to earlier errors in the decoding process that have since been fixed. As a result, when we got a non-unicode string, Python would try to decode it using the default codec (ascii) in `text.encode(PALAVRAS_ENCODING)`. Since we know the text came from mongodb, we can just decode it as utf-8 to make sure we have a unicode object.
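A minimal sketch of the explicit decode described above, written for Python 3 (where the byte string type is `bytes`). The helper name `ensure_text` and the value given to `PALAVRAS_ENCODING` are assumptions for illustration, not the worker's actual code.

```python
PALAVRAS_ENCODING = "utf-8"  # assumed value, for illustration only


def ensure_text(value, encoding="utf-8"):
    """Return `value` as a text string, decoding byte strings explicitly."""
    if isinstance(value, bytes):
        # The stored text is assumed to be UTF-8, so decode it explicitly
        # instead of letting the interpreter fall back to a default codec.
        return value.decode(encoding)
    return value


# A byte string, as it might arrive at the worker.
raw = "ação".encode("utf-8")
text = ensure_text(raw)
# Encoding now starts from a real text object, so no implicit decode happens.
encoded = text.encode(PALAVRAS_ENCODING)
```

Values that are already text pass through unchanged, so the helper can be called unconditionally before encoding.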
In some installations (on our production server, for example) `sys.getfilesystemencoding()` was not returning the correct encoding when the worker was run by pypelinin's broker. This meant that, when sending text to palavras' stdin, we were using the wrong encoding, resulting in a `UnicodeEncodeError`. This commit forces utf-8 as the encoding used when communicating with the process and also adds some code to make sure we use the correct encoding everywhere in the workers that depend on palavras.
I'm sorry about so many commits, but in every one of them I thought I had fixed the bug, only to be surprised after deploy. It is incredibly frustrating to deal with bugs that only show up in production.
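The idea of forcing one encoding on both sides of the pipe can be sketched as follows. This is an illustration under assumptions: `send_text` is a hypothetical helper, the value of `PALAVRAS_ENCODING` is assumed, and a small Python child process that echoes its stdin stands in for the palavras command, which is not shown in this conversation.

```python
import subprocess
import sys

PALAVRAS_ENCODING = "utf-8"  # assumed value; the commit only says utf-8 is forced


def send_text(command, text):
    """Send `text` to the stdin of `command` and return its decoded stdout."""
    process = subprocess.Popen(
        command, stdin=subprocess.PIPE, stdout=subprocess.PIPE
    )
    # Encode explicitly instead of trusting sys.getfilesystemencoding(),
    # which may differ depending on how the worker was started.
    stdout, _ = process.communicate(text.encode(PALAVRAS_ENCODING))
    return stdout.decode(PALAVRAS_ENCODING)


# Stand-in child process that copies stdin to stdout byte for byte.
ECHO_COMMAND = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())",
]
echoed = send_text(ECHO_COMMAND, "ação")
```

Because the bytes are produced and consumed with the same named codec, the result no longer depends on the environment the broker happens to provide.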
This will make it easier to know if palavras didn't run because it shouldn't (if the document is not in portuguese, for example) or if there was an error.
Feature/spellchecker
Merging, as there seem to be no objections left.
One reason it was failing was that the test expectations were not up to date with the code. For a while now we have been returning an empty string when we can't coerce the content to unicode.
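The fallback behaviour mentioned here could look roughly like the sketch below. The function name `coerce_to_text` and the choice of utf-8 as the only codec tried are assumptions, not the project's actual implementation.

```python
def coerce_to_text(content, encoding="utf-8"):
    """Return `content` as text, or an empty string if it cannot be coerced."""
    if isinstance(content, str):
        return content
    try:
        return content.decode(encoding)
    except (UnicodeDecodeError, AttributeError):
        # Undecodable bytes, or an object with no decode() method at all:
        # return an empty string rather than raising.
        return ""
```

A test for this behaviour would then expect `""`, not an exception, for content that is not valid in the chosen encoding.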
This is really a work in progress. The idea is to write a few workers this way, to extract common behaviour.
For some reason, using `apply` only worked after using `delay` once and rerunning the tests.
This will keep the logic of getting the keys by the document id in mongodict.
This commit makes sure we:
- Use CELERY_ALWAYS_EAGER to run the tests synchronously.
- Drop the collection in between test cases.
- Delete the testing database after the entire test suite has run.
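The cleanup lifecycle listed above can be sketched with plain `unittest` fixtures. Everything here is a stand-in under assumptions: `FakeDatabaseServer` is an in-memory substitute for the MongoDB server, the database and collection names are hypothetical, and `CELERY_ALWAYS_EAGER` is shown only as the setting named in the commit message. Note that `tearDownClass` runs once per test class, so a real suite-wide cleanup would need a module-level or runner-level hook.

```python
import unittest

# Setting named in the commit message; with it enabled, Celery runs tasks
# synchronously in the calling process. It is not used by this sketch.
CELERY_ALWAYS_EAGER = True


class FakeDatabaseServer:
    """In-memory stand-in for a database server: {database: {collection: [docs]}}."""

    def __init__(self):
        self.databases = {}

    def insert(self, database, collection, document):
        collections = self.databases.setdefault(database, {})
        collections.setdefault(collection, []).append(document)

    def drop_collection(self, database, collection):
        self.databases.get(database, {}).pop(collection, None)

    def drop_database(self, database):
        self.databases.pop(database, None)


SERVER = FakeDatabaseServer()


class TaskTestCase(unittest.TestCase):
    database = "pypln_backend_test"  # hypothetical name
    collection = "analysis"          # hypothetical name

    def tearDown(self):
        # Drop the collection between test cases so no state leaks.
        SERVER.drop_collection(self.database, self.collection)

    @classmethod
    def tearDownClass(cls):
        # Delete the testing database once the tests in this class are done.
        SERVER.drop_database(cls.database)

    def test_first_insert_is_isolated(self):
        SERVER.insert(self.database, self.collection, {"_id": 1})
        documents = SERVER.databases[self.database][self.collection]
        self.assertEqual(len(documents), 1)

    def test_second_insert_is_isolated(self):
        SERVER.insert(self.database, self.collection, {"_id": 2})
        documents = SERVER.databases[self.database][self.collection]
        self.assertEqual(len(documents), 1)
```

Each test sees exactly one document because the collection is dropped after the previous test, whatever order the tests run in.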
This class encapsulates the logic of getting document data from mongo and saving it back, leaving to the tasks themselves only the logic of processing the data.
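As a rough sketch of that separation, the base class below loads a document, hands it to a `process` method, and saves the result back. The names `DocumentTask` and `TokenCounter` are hypothetical, and a plain dict stands in for the mongo collection; this is not the class introduced by the commit.

```python
class DocumentTask:
    """Base class: load a document, delegate to process(), save the result."""

    def __init__(self, store):
        # `store` is a dict standing in for the database collection.
        self.store = store

    def process(self, document):
        """Return a dict of new fields; subclasses implement only this."""
        raise NotImplementedError

    def run(self, document_id):
        document = self.store[document_id]
        result = self.process(document)
        # Merge the task's output into the stored document.
        document.update(result)
        self.store[document_id] = document
        return document_id


class TokenCounter(DocumentTask):
    """Example task: knows nothing about storage, only about the data."""

    def process(self, document):
        return {"token_count": len(document["text"].split())}


store = {"doc-1": {"text": "uma frase curta"}}
TokenCounter(store).run("doc-1")
```

With this shape, swapping the storage backend only touches the base class, while each task stays a small function of the document data.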
fixed minor typos
…lk insertions are not supported yet.
`pymongo.Connection` was removed in pymongo>=3. Since we will probably remove MongoDict as a dependency, we will use version 2.8.1 for now.
…in a test
This happened whenever the 'test' index didn't exist in the elasticsearch server.
This commit introduces a more specific index name. `test` is too generic, with `test_pypln` we have less chance of stepping on someone else's toes.
Feature/elastic indexer seems fine to me. Merging.
The idea is to use this for the indexing pipeline. Since the document will be stored in the elastic index anyway, it's better not to have it replicated.
Adds a worker that deletes a file from GridFS
Seems fine. Merging.
We should not index the original file contents for two reasons: 1) they are not relevant to the search, since the `text` attribute should include the relevant content; and 2) they may be in a binary format that will not be serializable. Fixes #176
Fixes ElasticIndexer for binary files
Seems ok. Merging.
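One way to picture the fix is to filter the document before building the body sent to the index. This is only a sketch: the function name `build_index_body` and the key name `contents` are assumptions, and `json.dumps` stands in for whatever serialization the indexing client performs.

```python
import json


def build_index_body(document, excluded_keys=("contents",)):
    """Return a copy of `document` without keys that should not be indexed."""
    return {
        key: value
        for key, value in document.items()
        if key not in excluded_keys
    }


document = {
    "filename": "report.pdf",
    # Raw file bytes: irrelevant to search and not JSON serializable.
    "contents": b"%PDF-1.4 \xff\xfe",
    "text": "Relatório anual",
}
body = build_index_body(document)
# Serializing the filtered body works; the raw bytes would have failed.
serialized = json.dumps(body)
```

The extracted `text` attribute is kept, so search results are unaffected by leaving the original bytes out.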
flavioamieiro added a commit that referenced this pull request on Jul 7, 2015: Add a Bitdeli Badge to README
Pull request made by @fccoelho at https://bitdeli.com