Add a Bitdeli Badge to README #178

bitdeli-chef · 2015-07-07T12:55:06Z

Pull request made by @fccoelho at https://bitdeli.com

For some reason, in some pypln installations the text extracted from the document was not reaching the `PalavrasRaw` worker as unicode. This may be due to errors in the decoding process that we fixed earlier. As a result, when we got a non-unicode string, Python would try to decode it using the default codec (ascii) in `text.encode(PALAVRAS_ENCODING)`. Since we know the text came from MongoDB, we can decode it using utf-8 to make sure we have a unicode object.
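The fix described above can be sketched roughly as follows. This is a minimal illustration written for Python 3, where `bytes` and `str` play the roles of Python 2's `str` and `unicode`. The helper names `ensure_text` and `encode_for_palavras`, and the value given to `PALAVRAS_ENCODING`, are assumptions made for the example and are not taken from the pypln code.

```python
PALAVRAS_ENCODING = "utf-8"  # assumed value; the real constant lives in the worker


def ensure_text(value, source_encoding="utf-8"):
    """Return `value` as text, decoding byte strings explicitly."""
    if isinstance(value, bytes):
        # Decode explicitly instead of relying on an implicit default codec.
        return value.decode(source_encoding)
    return value


def encode_for_palavras(text):
    """Encode text for the external process, whatever type we were handed."""
    return ensure_text(text).encode(PALAVRAS_ENCODING)
```

Calling `encode_for_palavras` with either a byte string or a text string yields the same encoded bytes, which is the property the commit relies on.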

In some installations (our production server, for example) `sys.getfilesystemencoding()` was not returning the correct encoding when the worker was run by pypelinin's broker. As a result, we were using the wrong encoding when sending text to palavras' stdin, which raised a `UnicodeEncodeError`. This commit forces utf-8 as the encoding used to communicate with the process, and also adds code to make sure we use the correct encoding everywhere in the workers that depend on palavras.
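As a rough sketch of the approach, the snippet below fixes the encoding in one constant and uses it for both directions of the pipe, rather than asking the environment. It targets Python 3 and uses a child Python process that echoes stdin as a stand-in for the palavras binary; `PROCESS_ENCODING`, `run_with_forced_encoding` and `ECHO_COMMAND` are hypothetical names, not the worker's actual code.

```python
import subprocess
import sys

PROCESS_ENCODING = "utf-8"  # forced, instead of sys.getfilesystemencoding()


def run_with_forced_encoding(command, text):
    """Send `text` to `command` on stdin and return its decoded stdout."""
    process = subprocess.Popen(
        command, stdin=subprocess.PIPE, stdout=subprocess.PIPE
    )
    # Encode and decode with the same explicit codec on both sides of the pipe.
    stdout, _ = process.communicate(text.encode(PROCESS_ENCODING))
    return stdout.decode(PROCESS_ENCODING)


# Stand-in for the palavras binary: a child Python process that echoes stdin.
ECHO_COMMAND = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())",
]
```

With this in place, `run_with_forced_encoding(ECHO_COMMAND, "coração")` round-trips the accented text regardless of the locale the broker was started with.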

fccoelho and others added 30 commits February 22, 2013 17:07

Basic spellchecker worker ready

bdd2ae5

Tests passing on spellchecker worker

6c7657b

Added pre-instancing of checkers, as per suggestion of @turicas.

3a16e27

Added handling of unsupported language by returning None instead of l…

a0a23fb

…ist of errors.

added a test for english

33ff882

Removes workaround for nltk not returning unicode stopwords

c5aac9e

This was issue nltk/nltk#122

Merge branch 'bugfix/remove_workaround_for_nltk_issue' into develop

6c5b022

Merge branch 'bugfix/fix_unicodedecodeerror_in_palavras_worker' into …

f196c62

…develop

Fixes even more Unicode{En,De}codeErrors

48360d8

I'm sorry about so many commits, but in every one of them I thought I had fixed the bug, only to be surprised after deploying. It is incredibly frustrating to deal with bugs that only show up in production.

Adds a property to tell if PalavrasRaw was run for this document

63e430c

This will make it easier to know if palavras didn't run because it shouldn't (if the document is not in portuguese, for example) or if there was an error.

Makes sure workers that depend directly on palavras don't run if pala…

ab61046

…vras didn't run
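One way to picture the two commits above is a flag stored next to the palavras output, which dependent workers check before doing anything. The sketch below is an assumption about the general shape only: the key names `palavras_raw` and `palavras_raw_ran`, the `"pt"` language code, and both function names are hypothetical, and the parser call is replaced by a placeholder string.

```python
def palavras_raw_result(text, language):
    """Return the worker output, always stating whether palavras ran."""
    if language != "pt":
        # Not an error: the document is not in Portuguese, so palavras is skipped.
        return {"palavras_raw": "", "palavras_raw_ran": False}
    parsed = "<parsed:{}>".format(text)  # placeholder for the real parser call
    return {"palavras_raw": parsed, "palavras_raw_ran": True}


def should_run_dependent_worker(document):
    """Workers that depend on palavras output check the flag first."""
    return bool(document.get("palavras_raw_ran", False))
```

Under this layout, a deliberate skip is recorded as `False`, while a failure would surface as an exception from the worker instead of an empty result.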

Merge branch 'feature/fix_palavras_exceptions' into develop

5680345

Merge pull request #146 from fccoelho/feature/spellchecker

c7fe275

Feature/spellchecker. Merging, as there seem to be no objections left.

Fixes tokenizer test

35908ca

Fixes extractor test

cf36ac2

One of the reasons it was failing was that the test expectations were out of date with the code. For a while now we have been returning an empty string when we can't coerce the content to unicode.

Adds pyenchant to requirements

3eef9df

Updates palavras related tests

641957f

Merge branch 'fix_tests' into develop

d049be3

WIP - First draft of worker using a celery task

1d18679

This is really a work in progress. The idea is to write a few workers this way, to extract common behaviour.

Removes unused code from the tokenizer (and its test)

6b5706d

Uses task.delay().get() instead of task.apply()

acb90f1

For some reason, using `apply` only worked after using `delay` once and rerunning the tests.

Adapts freqdist worker to use celery

2eeeff9

Adds first draft of a mongodict subclass to represent documents

a9c3ab4

This will keep the logic of getting the keys by the document id in mongodict.
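To illustrate the idea of keeping the id-to-key logic in one place, here is a small sketch. It wraps a plain mapping instead of subclassing MongoDict, and the key layout `id:<doc_id>:<property>` as well as the class name `DocumentKeyAdapter` are assumptions made for the example.

```python
class DocumentKeyAdapter(object):
    """Dict-like view of one document's properties in a shared key-value store."""

    def __init__(self, store, doc_id):
        self.store = store  # any mapping; MongoDict in the real code
        self.doc_id = doc_id

    def _full_key(self, key):
        # Hypothetical key layout: "id:<doc_id>:<property>"
        return "id:{}:{}".format(self.doc_id, key)

    def __getitem__(self, key):
        return self.store[self._full_key(key)]

    def __setitem__(self, key, value):
        self.store[self._full_key(key)] = value

    def __contains__(self, key):
        return self._full_key(key) in self.store
```

A worker can then read and write `document["text"]` without knowing how keys are composed in the underlying store.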

adds the file that defines the celery app (this was missing from prev…

f7ca6dc

…ious commits)

Renames MongoDictById to MongoDictAdapter

3184f65

Finishes MongoDictAdapter

9e3d73a

Improve tests for freqdist worker

3244460

This commit makes sure we:
- Use CELERY_ALWAYS_EAGER to run the tests synchronously.
- Drop the collection between test cases.
- Delete the testing database after the entire test suite has run.

Creates a base class for all our workers

ec8ba0d

This class encapsulates the logic of getting document data from mongo and saving it back, leaving to the tasks themselves only the logic of processing the data.
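The split of responsibilities described here can be sketched as below. The sketch leaves celery out and uses an in-memory dict as a stand-in for the mongo collection; `BaseWorker`, `TokenCounter` and the `token_count` property are hypothetical names chosen for illustration.

```python
class BaseWorker(object):
    """Loads a document, delegates to `process`, and saves the result."""

    def __init__(self, store):
        self.store = store  # in-memory stand-in for the mongo collection

    def process(self, document):
        raise NotImplementedError("subclasses implement only this method")

    def run(self, doc_id):
        document = self.store.setdefault(doc_id, {})
        result = self.process(document)
        document.update(result)  # "save" the new properties back to the store
        return doc_id


class TokenCounter(BaseWorker):
    """Example worker: only the processing logic lives here."""

    def process(self, document):
        return {"token_count": len(document["text"].split())}
```

Each new worker then only needs a `process` method, and changes to how documents are loaded or saved stay in the base class.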

flavioamieiro and others added 28 commits May 7, 2015 16:26

Adds script to run celery in production

3d06fb3

Makes run_celery.sh script executable

25ab920

Makes sure GridFSDataRetriever connects to the correct mongo database

e5fee58

Makes sure we use the correct hostname and port when using MongoDictA…

5ed6e42

…dapter

Merge branch 'feature/celery' into develop

217903d

Gets pypln storage configuration from config file if available

02dd767

Adds a small section to README.rst about creating new workers

fc9013a

Update README.rst

44454fe

fixed minor typos

Implements a worker to index documents in an elasticsearch server. Bu…

b609af4

…lk insertions are not supported yet.

Added test for elastic_indexer

6b032da

Pins the pymongo version for now

6d973c6

pymongo.Connection was removed in pymongo>=3. Since we will probably remove MongoDict as a dependency, we will use version 2.8.1 for now

Adds configuration for the result backend and the message broker

129c28c

Adds celery username and password to configuration

8e3324f

Adds index_name as a parameter to the indexing call

e1475a2

Ignores error when trying to delete an index that still doesn't exist …

9a7602f

…in a test. This happened whenever the 'test' index didn't exist in the elasticsearch server.

Fixes typo in the Indexer test name and removes trailing whitespace

c8459c0

Changes test index name

b91523d

This commit introduces a more specific index name. `test` is too generic; with `test_pypln` we have less chance of stepping on someone else's toes.

Adds ElasticIndexer to the list of exported workers

1c693f2

Uses the file_id generated by gridfs instead of id generated by postgres

2aaa10f

Fixes ElasticIndexer test

e2d7748

Removes unnecessary trailing lines in elastic_indexer.py

21debe8

Merge pull request #174 from flavioamieiro/feature/elastic-indexer

983fcb4

Feature/elastic indexer seems fine to me. Merging.

Adds a worker that deletes a file from GridFS

e4d0cb8

The idea is to use this for the indexing pipeline. Since the document will be stored in the elastic index anyway, it's better not to have it replicated.

Merge pull request #175 from flavioamieiro/feature/delete-file-worker

9ae6f25

Adds a worker that deletes a file from GridFS. Seems fine. Merging.

Removes unused variable declaration

8ab3b3e

Fixes ElasticIndexer for binary files

579e0bc

We should not index the original file contents for two reasons: 1) they are not relevant to the search, since the `text` attribute should already include the relevant content; and 2) they may be in a binary format that is not serializable. Fixes #176
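A minimal sketch of this kind of filtering is shown below. The key name `contents` is an assumption about how the raw file is stored in the document, and `build_index_body` is a hypothetical helper, not the ElasticIndexer code itself.

```python
import json

EXCLUDED_KEYS = ("contents",)  # hypothetical name for the raw file contents


def build_index_body(document):
    """Return a JSON-serializable copy of `document` without the raw contents."""
    body = {k: v for k, v in document.items() if k not in EXCLUDED_KEYS}
    # Fail early (TypeError) if something left in the body is not serializable.
    json.dumps(body)
    return body
```

The original document is left untouched, so other workers can still reach the raw contents if they need them.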

Merge pull request #177 from flavioamieiro/bugfix/indexing_contents

cba555a

Fixes ElasticIndexer for binary files. Seems ok. Merging.

Add a Bitdeli badge to README

97ffeb1

flavioamieiro added a commit that referenced this pull request Jul 7, 2015

Merge pull request #178 from bitdeli-chef/master

c5732ec

Add a Bitdeli Badge to README

flavioamieiro merged commit c5732ec into NAMD:master Jul 7, 2015
