
Add a Bitdeli Badge to README #178

Merged: 92 commits, merged Jul 7, 2015
Commits
bdd2ae5
Basic spellchecker worker ready
fccoelho Feb 22, 2013
6c7657b
Tests passing on spellchecker worker
fccoelho Feb 22, 2013
3a16e27
Added pre-instancing of checkers, as per sugestion of @turicas.
fccoelho Feb 26, 2013
a0a23fb
Added handling of unsupported language by returning None instead of l…
fccoelho Feb 27, 2013
33ff882
added a test for english
fccoelho Feb 27, 2013
c5aac9e
Removes workaround for nltk not returning unicode stopwords
flavioamieiro Nov 5, 2014
6c5b022
Merge branch 'bugfix/remove_workaround_for_nltk_issue' into develop
flavioamieiro Nov 5, 2014
d4c6873
Fixes `UnicodeDecodeError` in `PalavrasRaw` worker
flavioamieiro Nov 11, 2014
f196c62
Merge branch 'bugfix/fix_unicodedecodeerror_in_palavras_worker' into …
flavioamieiro Nov 11, 2014
43048ac
Makes sure we use the correct codec in all Palavras workers
flavioamieiro Nov 11, 2014
48360d8
Fixes even more `Unicode{En,De}codeError`s
flavioamieiro Nov 11, 2014
63e430c
Adds a property to tell if PalavrasRaw was run for this document
flavioamieiro Jan 16, 2015
ab61046
Makes sure workers that depend directly on palavras don't run if pala…
flavioamieiro Jan 16, 2015
5680345
Merge branch 'feature/fix_palavras_exceptions' into develop
flavioamieiro Jan 16, 2015
c7fe275
Merge pull request #146 from fccoelho/feature/spellchecker
fccoelho Mar 22, 2015
35908ca
Fixes tokenizer test
flavioamieiro Apr 14, 2015
cf36ac2
Fixes extractor test
flavioamieiro Apr 14, 2015
3eef9df
Adds pyenchant to requirements
flavioamieiro Apr 14, 2015
641957f
Updates palavras related tests
flavioamieiro Apr 14, 2015
d049be3
Merge branch 'fix_tests' into develop
flavioamieiro Apr 14, 2015
1d18679
WIP - First draft of worker using a celery task
flavioamieiro Apr 14, 2015
6b5706d
Removes unused code from the tokenizer (and it's test)
flavioamieiro Apr 15, 2015
acb90f1
Uses `task.delay().get()` instead of `task.apply()`
flavioamieiro Apr 15, 2015
2eeeff9
Adapts freqdist worker to use celery
flavioamieiro Apr 15, 2015
a9c3ab4
Adds first draft of a mongodict subclass to represent documents
flavioamieiro Apr 15, 2015
f7ca6dc
adds the file that defines the celery app (this was missing from prev…
flavioamieiro Apr 15, 2015
3184f65
Renames `MongoDictById` to `MongoDictAdapter`
flavioamieiro Apr 16, 2015
9e3d73a
Finishes `MongoDictAdapter`
flavioamieiro Apr 16, 2015
3244460
Improve tests for freqdist worker
flavioamieiro Apr 16, 2015
ec8ba0d
Creates a base class for all our workers
flavioamieiro Apr 16, 2015
d09f03f
Creates base test class for pypln tasks
flavioamieiro Apr 16, 2015
e8c511e
Uses fake_id consistently in freqdist test
flavioamieiro Apr 16, 2015
58112f7
Renames `tokenizer` -> `Tokenizer`
flavioamieiro Apr 16, 2015
0ef0079
Adds note about the import that is holding the app togheter
flavioamieiro Apr 16, 2015
aefb3aa
Migrates the tokenizer test to the new class based approach
flavioamieiro Apr 16, 2015
72c9e22
Migrates wordcloud worker to Celery
flavioamieiro Apr 18, 2015
8b15114
Migrates the Statistics worker to a Celery task
flavioamieiro Apr 19, 2015
6aefdaa
Migrates Bigrams worker to a Celery Task
flavioamieiro Apr 19, 2015
6b41dbc
Migrates `PalavrasRaw` worker to a Celery task
flavioamieiro Apr 20, 2015
97af711
Migrates palavras NounPhrase worker to a Celery task
flavioamieiro Apr 21, 2015
5fe3bf1
Migrates palavras SemmanticTagger worker to a Celery task
flavioamieiro Apr 21, 2015
9c370bf
Migrates POS worker to a Celery task
flavioamieiro Apr 21, 2015
d1a8b71
Adds test to check if POS worker routes portuguese documents to the p…
flavioamieiro Apr 21, 2015
8f3d0b7
Migrates Trigram worker to Celery task
flavioamieiro Apr 22, 2015
d23201c
Removes unnecessary import in `test_worker_wordcloud.py`
flavioamieiro Apr 22, 2015
e25eed3
Migrates Spellchecker worker to a Celery Task
flavioamieiro Apr 22, 2015
0dbdfb4
Migrates Lemmatizer worker to a Celery task
flavioamieiro Apr 22, 2015
51648d6
commented out pyrex from requirements/development.txt
fccoelho Apr 22, 2015
a028a73
Renames Celery app to 'pypln_workers'
flavioamieiro Apr 23, 2015
bd119d6
WIP: starts to change Extractor worker into a Celery task
flavioamieiro Apr 23, 2015
2f6fe45
Fixes and documents the issue with the app import
flavioamieiro Apr 23, 2015
f6173dd
Changes all the Extractor tests to use it as a Celery task
flavioamieiro Apr 23, 2015
f790d98
Adds Task to retrieve filedata from GridFS
flavioamieiro Apr 24, 2015
89a995b
Removes pypelinin structure
flavioamieiro Apr 24, 2015
01aed6a
Adds copyright notice in the files that didn't have it
flavioamieiro Apr 24, 2015
d73a821
Adds a config module
flavioamieiro Apr 28, 2015
807f115
Moves GridFS config to config module
flavioamieiro Apr 28, 2015
a26aad7
Substitutes Make target `run` by `run-celery`
flavioamieiro Apr 29, 2015
3c1a4ab
Uses a dictionary with mongodb configuration
flavioamieiro Apr 29, 2015
b4b15f2
Makes sure tests only run if the database name starts with `test`
flavioamieiro Apr 29, 2015
1ffd33d
Adds the possibility of having a local configuration module
flavioamieiro Apr 29, 2015
1ef7252
Fixes 'tests' and 'tests-x' make targets
flavioamieiro Apr 29, 2015
b0a7e43
Updates README to reflect the changes in the project
flavioamieiro Apr 29, 2015
e2f3ec3
Adds GridFSDataRetriever to the exported attributes of pypln.backend.…
flavioamieiro Apr 30, 2015
3d06fb3
Adds script to run celery in production
flavioamieiro May 7, 2015
25ab920
Makes run_celery.sh script executable
flavioamieiro May 7, 2015
e5fee58
Makes sure `GridFSDataRetriever` connects to the correct mongo database
flavioamieiro May 11, 2015
5ed6e42
Makes sure we use the correct hostname and port when using MongoDictA…
flavioamieiro May 12, 2015
217903d
Merge branch 'feature/celery' into develop
flavioamieiro May 18, 2015
02dd767
Gets pypln storage configuration from config file if available
flavioamieiro May 18, 2015
fc9013a
Adds a small section to `README.rst` about creating new workers
flavioamieiro May 19, 2015
44454fe
Update README.rst
fccoelho May 20, 2015
b609af4
Implements a worker to index documents in an elasticsearch server. Bu…
fccoelho May 20, 2015
6b032da
Added test for elastic_indexer
fccoelho May 20, 2015
6d973c6
Pins the pymongo version for now
flavioamieiro May 20, 2015
129c28c
Adds configuration for the result backend and the message broker
flavioamieiro May 20, 2015
8e3324f
Adds celery username and password to configuration
flavioamieiro May 20, 2015
e1475a2
Adds index_name as a parameter to the indexing call
flavioamieiro May 25, 2015
9a7602f
Ignores error when trying to delete a index that still doesn't exist …
flavioamieiro May 25, 2015
c8459c0
Fixes typo in the Indexer test name and removes trailing whitespace
flavioamieiro May 25, 2015
b91523d
Changes test index name
flavioamieiro May 25, 2015
1c693f2
Adds `ElasticIndexer` to the list of exported workers
flavioamieiro Jun 16, 2015
2aaa10f
Uses the file_id generated by gridfs instead of id generated by postgres
flavioamieiro Jun 16, 2015
e2d7748
Fixes ElasticIndexer test
flavioamieiro Jun 19, 2015
21debe8
Removes unnecessary trailing lines in `elastic_indexer.py`
flavioamieiro Jun 19, 2015
983fcb4
Merge pull request #174 from flavioamieiro/feature/elastic-indexer
fccoelho Jun 22, 2015
e4d0cb8
Adds a worker that deletes a file from GridFS
flavioamieiro Jun 22, 2015
9ae6f25
Merge pull request #175 from flavioamieiro/feature/delete-file-worker
fccoelho Jun 22, 2015
8ab3b3e
Removes unused variable declaration
flavioamieiro Jun 22, 2015
579e0bc
Fixes ElasticIndexer for binary files
flavioamieiro Jun 26, 2015
cba555a
Merge pull request #177 from flavioamieiro/bugfix/indexing_contents
fccoelho Jun 26, 2015
97ffeb1
Add a Bitdeli badge to README
bitdeli-chef Jul 7, 2015
1 change: 1 addition & 0 deletions .gitignore
@@ -13,3 +13,4 @@ MANIFEST
.directory
*.db
.env
local_config.py
10 changes: 5 additions & 5 deletions Makefile
@@ -18,15 +18,15 @@

test:
@clear
nosetests -dvs
nosetests -dvs tests/

test-workers:
@clear
nosetests -dsv tests/test_worker_*.py

test-x:
@clear
nosetests -dvsx
nosetests -dvsx tests/

doc:
@clear
@@ -37,8 +37,8 @@ clean:
find -regex '.*\.pyc' -exec rm {} \;
find -regex '.*~' -exec rm {} \;

run:
@./scripts/start_development_environment.sh
run-celery:
celery worker --app=pypln.backend.celery_app:app -l info


.PHONY: test test-x doc clean test-workers run
.PHONY: test test-x doc clean test-workers run-celery
38 changes: 31 additions & 7 deletions README.rst
@@ -2,13 +2,10 @@ PyPLN
=====

PyPLN is a distributed pipeline for natural language processing, made in Python.
We use `NLTK <http://nltk.org/>`_ and `ZeroMQ <http://www.zeromq.org/>`_ as
We use `NLTK <http://nltk.org/>`_ and `Celery <http://www.celeryproject.org>`_ as
our foundations. The goal of the project is to create an easy way to use NLTK
for processing big corpora, with a Web interface.

We don't have a production release yet, but it's scheduled on our
`next milestone <https://github.com/namd/pypln.backend/issues?milestone=1>`_.

PyPLN is sponsored by `Fundação Getulio Vargas <http://portal.fgv.br/>`_.

License
@@ -58,11 +55,38 @@ To run tests::

workon pypln.backend
pip install -r requirements/development.txt
echo "MONGODB_CONFIG = {'host': 'localhost', 'port': 27017, 'database': 'test_pypln_dev', 'gridfs_collection': 'files'}" >> pypln/backend/local_config.py
make test

See our `code guidelines <https://github.com/namd/pypln.backend/blob/develop/CONTRIBUTING.rst>`_.

.. TODO: The PYTHONPATH issue should be fixed once we organize the directory
structure. As soon as this is fixed, we must update these instructions.
Creating a new Task
~~~~~~~~~~~~~~~~~~~

All analyses in PyPLN are performed by our workers. Every worker is a Celery
task that can be included in the canvas that will run when a document is
received in pypln.web.

New workers are very easy to create. All you need to do is write a subclass of `PyPLNTask <https://github.com/NAMD/pypln.backend/blob/develop/pypln/backend/celery_task.py#L36>`_
that implements a "process" method. This method will receive the document as a
dictionary, and should return a dictionary that will be used to update the
existing document. As an example::


from pypln.backend.celery_task import PyPLNTask

class FreqDist(PyPLNTask):
def process(self, document):
value = document['value']
square = value ** 2
return {'squared_value': square}


This worker assumes that a previous worker has already included "value" in the
document, and uses it to add a new key called "squared_value".
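The dict-in/dict-out contract can also be tried outside a running pipeline. Below is a minimal standalone sketch in which `PyPLNTask` is a local stand-in for the real base class, so no Celery broker or MongoDB instance is needed:

```python
# Standalone sketch of the worker contract described above. This `PyPLNTask`
# is a local stand-in for pypln.backend.celery_task.PyPLNTask, so the example
# runs without Celery or MongoDB.
class PyPLNTask(object):
    def run(self, document):
        # The real base class loads the document from MongoDB by its id and
        # saves the result; here we just update the dictionary in place.
        document.update(self.process(document))
        return document


class FreqDist(PyPLNTask):
    def process(self, document):
        value = document['value']
        return {'squared_value': value ** 2}


doc = FreqDist().run({'value': 3})
print(doc['squared_value'])  # prints 9
```

The stand-in keeps only the shape of the real class: subclasses implement `process`, and the base class decides where the input comes from and where the output goes.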


.. image:: https://d2weczhvl823v0.cloudfront.net/NAMD/pypln.backend/trend.png
:alt: Bitdeli badge
:target: https://bitdeli.com/free

See our `code guidelines <https://github.com/namd/pypln.backend/blob/develop/CONTRIBUTING.rst>`_.
51 changes: 0 additions & 51 deletions pypln/backend/broker.py

This file was deleted.

28 changes: 28 additions & 0 deletions pypln/backend/celery_app.py
@@ -0,0 +1,28 @@
# coding: utf-8
#
# Copyright 2015 NAMD-EMAP-FGV
#
# This file is part of PyPLN. You can get more information at: http://pypln.org/.
#
# PyPLN is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# PyPLN is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with PyPLN. If not, see <http://www.gnu.org/licenses/>.

from celery import Celery
import config

app = Celery('pypln_workers', backend='mongodb',
broker='amqp://', include=['pypln.backend.workers'])
app.conf.update(
BROKER_URL=config.BROKER_URL,
CELERY_RESULT_BACKEND=config.CELERY_RESULT_BACKEND,
)
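For reference, the connection URLs this app ends up using have the shapes below. These are the same format strings used in `config.py`; the credentials and hosts shown are the stock RabbitMQ/MongoDB defaults, given purely as an illustration, not a real deployment:

```python
# Illustration of the URL formats assembled in config.py and consumed by the
# Celery app above. Values are the stock RabbitMQ/MongoDB defaults.
broker_url = 'amqp://{}:{}@{}:{}//'.format('guest', 'guest', 'localhost', 5672)
result_backend = 'mongodb://{}:{}'.format('localhost', 27017)

print(broker_url)      # amqp://guest:guest@localhost:5672//
print(result_backend)  # mongodb://localhost:27017
```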
70 changes: 70 additions & 0 deletions pypln/backend/celery_task.py
@@ -0,0 +1,70 @@
# coding: utf-8
#
# Copyright 2015 NAMD-EMAP-FGV
#
# This file is part of PyPLN. You can get more information at: http://pypln.org/.
#
# PyPLN is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# PyPLN is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with PyPLN. If not, see <http://www.gnu.org/licenses/>.

from celery import Task

from pypln.backend.mongodict_adapter import MongoDictAdapter

# This import may look like an unused import, but it is not.
# When our base task class is defined, the Celery app must have already been
# instantiated, otherwise when this code is imported elsewhere (like in a
# client that will call a task, for example) celery will fall back to the
# default app, and our configuration will be ignored. This is not an issue in
# the documented project layout, because there the app is imported in the
# modules that define the tasks (in order to use the `app.task` decorator).
from pypln.backend.celery_app import app

from pypln.backend import config


class PyPLNTask(Task):
"""
A base class for PyPLN tasks. It is in charge of getting the document
information based on the document id (that should be passed as an argument
by Celery), calling the `process` method, and saving this information on
the database. It will also return the document id, so the rest of the
pipeline has access to it.
"""

def run(self, document_id):
"""
This method is called by Celery, and should not be overridden.
It will call the `process` method with a dictionary containing all the
document information and will update the database with the results.
"""
document = MongoDictAdapter(doc_id=document_id,
host=config.MONGODB_CONFIG['host'],
port=config.MONGODB_CONFIG['port'],
database=config.MONGODB_CONFIG['database'])
# Create a dictionary out of our document. We could simply pass
# it on to the process method, but for now we won't let the user
# manipulate the MongoDict directly.
dic = {k: v for k, v in document.iteritems()}
result = self.process(dic)
document.update(result)
return document_id

def process(self, document):
"""
This method should be implemented by subclasses. It is responsible for
performing the analysis itself. It will receive a dictionary as a
parameter (containing all the current information on the document)
and must return a dictionary with the keys to be saved in the database.
"""
raise NotImplementedError
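The reason `run` returns the document id is that in a Celery chain each task's return value becomes the next task's argument. The sketch below mirrors that flow with a local dictionary standing in for MongoDB and a local base class standing in for `PyPLNTask`; `WordCount` is a made-up worker for illustration (the real pipeline has workers like `Tokenizer` and `FreqDist`):

```python
# Sketch of why run() returns the document id: in a chain, each task's return
# value is handed to the next task. STORE and PyPLNTask are local stand-ins,
# so this runs without Celery or MongoDB; WordCount is a hypothetical worker.
STORE = {42: {'contents': 'hello world'}}


class PyPLNTask(object):
    def run(self, document_id):
        document = STORE[document_id]
        document.update(self.process(dict(document)))
        return document_id  # handed to the next task in the pipeline


class Tokenizer(PyPLNTask):
    def process(self, document):
        return {'tokens': document['contents'].split()}


class WordCount(PyPLNTask):
    def process(self, document):
        return {'word_count': len(document['tokens'])}


doc_id = WordCount().run(Tokenizer().run(42))
print(STORE[doc_id]['word_count'])  # prints 2
```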
48 changes: 48 additions & 0 deletions pypln/backend/config.py
@@ -0,0 +1,48 @@
import os
import ConfigParser


def get_store_config():
config_filename = os.path.expanduser('~/.pypln_store_config')
defaults = {'host': 'localhost',
'port': '27017',
'database': 'pypln_dev',
'gridfs_collection': 'files',
}
config = ConfigParser.ConfigParser(defaults=defaults)
config.add_section('store')
config.read(config_filename)
store_config = dict(config.items('store'))
# The database port needs to be an integer, but ConfigParser will treat
# everything as a string unless you use the specific method to retrieve the
# value.
store_config['port'] = config.getint('store', 'port')
return store_config

MONGODB_CONFIG = get_store_config()
ELASTICSEARCH_CONFIG = {
'hosts': ['127.0.0.1', '172.16.4.46', '172.16.4.52'],
}

def get_broker_config():
defaults = {
"host": "localhost",
"port": "5672",
"user": "guest",
"password": "guest",
}
celery_config = ConfigParser.ConfigParser(defaults=defaults)
celery_config.add_section('broker')
celery_config.read(os.path.expanduser('~/.pypln_celery_config'))
return dict(celery_config.items('broker'))

CELERY_BROKER_CONFIG = get_broker_config()

BROKER_URL = 'amqp://{}:{}@{}:{}//'.format(
CELERY_BROKER_CONFIG['user'], CELERY_BROKER_CONFIG['password'],
CELERY_BROKER_CONFIG['host'], CELERY_BROKER_CONFIG['port'])

CELERY_RESULT_BACKEND = 'mongodb://{}:{}'.format(MONGODB_CONFIG['host'],
MONGODB_CONFIG['port'])
try:
from local_config import *
except ImportError:
pass
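The defaults-plus-file pattern in `get_store_config` can be sketched with Python 3's `configparser` (the Python 2 `ConfigParser` used above behaves the same way here): reading a missing file is silently ignored, so the defaults win, and `getint` handles the string-to-integer conversion the comment describes.

```python
# Sketch of the defaults-plus-override pattern from get_store_config(),
# ported to Python 3's configparser. Reading a nonexistent file is silently
# ignored, so the defaults apply.
import configparser

defaults = {'host': 'localhost', 'port': '27017',
            'database': 'pypln_dev', 'gridfs_collection': 'files'}
config = configparser.ConfigParser(defaults=defaults)
config.add_section('store')
config.read('/nonexistent/.pypln_store_config')  # missing file: no error
store_config = dict(config.items('store'))
# ConfigParser stores every value as a string; getint does the conversion.
store_config['port'] = config.getint('store', 'port')

print(store_config['host'], store_config['port'])  # localhost 27017
```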