Feature/bigrams #127

fccoelho · 2012-12-18T23:15:42Z

Guys,

I branched this feature off this repo instead of off my own fork by mistake. But the workers are working and tested. Could you please review this and merge?

thanks,

turicas · 2012-12-19T22:15:01Z

I've added a commit revising your code (PEP8 + legibility issues), but I don't know yet if storing the pickled object is the best option... I think we should discuss it before accepting this pull request. Some problems about this approach:

Needing of NLTK installed on pypln.web and I don't think it's a good idea;
If NLTK changes the interface, the pickled object won't work anymore;
Pull requests should be merged on develop only if the code inside it will be in the new release of the software, but as we don't have the visualization for these workers, we don't need them in the next release.

fccoelho · 2012-12-19T22:24:49Z

sure,

afterwards I thought the best option would be to return a json object with
all possible rankings for bigrams and trigrams, that way NLTK would not be
need in the web app.

AS for the visualizations for these workers, I think they wont be hard to
create.

On Wed, Dec 19, 2012 at 8:15 PM, Álvaro Justen [email protected]:

I've added a commit revising your code (PEP8 + legibility issues), but I
don't know yet if storing the pickled object is the best option... I think
we should discuss it before accepting this pull request. Some problems
about this approach:

Needing of NLTK installed on pypln.web and I don't think it's a good
idea;

If NLTK changes the interface, the pickled object won't work anymore;

Pull requests should be merged on develop only if the code inside it
will be in the new release of the software, but as we don't have the
visualization for these workers, we don't need them in the next release.

—
Reply to this email directly or view it on GitHubhttps://github.com/Feature/bigrams #127#issuecomment-11551287.

Flávio Codeço Coelho

+55(21) 3799-5567
Professor
Escola de Matemática Aplicada
Fundação Getúlio Vargas
Praia de Botafogo, 190 sala 312
Rio de Janeiro - RJ
22250-900
Brasil

turicas · 2012-12-19T22:36:30Z

@fccoelho, can you provide a possible JSON-structured format for these bigrams and trigrams?
If you can work on the visualizations and create a pull request this week, I think we can integrate these new features in the next release! :)

…feature/bigrams Conflicts: pypln/backend/workers/bigrams.py pypln/backend/workers/trigrams.py tests/test_worker_bigrams.py tests/test_worker_trigrams.py

fccoelho · 2012-12-22T15:18:41Z

Output ready to be saved in Mongodb and exported as a CSV is ready and tested. Need to work on the visualization now...

fccoelho · 2012-12-26T11:44:38Z

@turicas : I am still thinking about the visualization for this one: I am tending towards this one: http://bl.ocks.org/4063269

However, I think the visualization is much less important than the API. I think we should merge these features and get them accessible through our REST API. The visualization will be added later.
@flavioamieiro and @rsouza , what do you think?

rsouza · 2012-12-26T16:37:32Z

On n-gram visualizations, we have to deal with the fact that in a N
words sized text, we will have N-1 bigrams, N-2 trigrams, etc., and
there is always overlaps between each adjacent pair of n-grams.
For visualization, we will need something that let us inspect them using
a frequency criteria.

http://www.chrisharrison.net/index.php/Visualizations/WordSpectrum
http://www-958.ibm.com/software/data/cognos/manyeyes/visualizations/itinvolve-bigram

Can we discuss this in our next meeting?
Cheers!

Em 26-12-2012 09:44, Flávio Codeço Coelho escreveu:

@turicas https://github.com/turicas : I am still thinking about the
visualization for this one: I tending towards this one:
http://bl.ocks.org/4063269

However, I think the visualization is much less important than the
API. I think we should merge these features and get them accessible
through our REST API. The visualization will be added later.
@flavioamieiro https://github.com/flavioamieiro and @rsouza
https://github.com/rsouza , what do you think?

—
Reply to this email directly or view it on GitHub
#127 (comment).

Renato Rocha Souza

+55(21)3799-5529
Professor/Pesquisador
Escola de Matemática Aplicada
Fundação Getúlio Vargas

fccoelho · 2012-12-26T16:49:50Z

Sure, let's discuss it, but I like to hear your opinion on getting the API
ready before the visualization.

Answering you question,in the results generated by the workers, raw
frequency, is but one of the metrics generated for the bi/trigrams.

but since you mentioned the analysis, I think one thing we need to do to
improve bigram and trigram extraction is to support the ranking ata the
corpus level. Currently we have only the document level analysis.

On Wed, Dec 26, 2012 at 2:37 PM, Renato Rocha Souza <
[email protected]> wrote:

On n-gram visualizations, we have to deal with the fact that in a N
words sized text, we will have N-1 bigrams, N-2 trigrams, etc., and
there is always overlaps between each adjacent pair of n-grams.
For visualization, we will need something that let us inspect them using
a frequency criteria.

http://www.chrisharrison.net/index.php/Visualizations/WordSpectrum

http://www-958.ibm.com/software/data/cognos/manyeyes/visualizations/itinvolve-bigram

Can we discuss this in our next meeting?
Cheers!

Em 26-12-2012 09:44, Flávio Codeço Coelho escreveu:

@turicas https://github.com/turicas : I am still thinking about the
visualization for this one: I tending towards this one:
http://bl.ocks.org/4063269

However, I think the visualization is much less important than the
API. I think we should merge these features and get them accessible
through our REST API. The visualization will be added later.
@flavioamieiro https://github.com/flavioamieiro and @rsouza
https://github.com/rsouza , what do you think?

—
Reply to this email directly or view it on GitHub
#127 (comment).

Renato Rocha Souza

+55(21)3799-5529
Professor/Pesquisador

Escola de Matemática Aplicada
Fundação Getúlio Vargas

—

Reply to this email directly or view it on GitHubhttps://github.com//pull/127#issuecomment-11689399.

Flávio Codeço Coelho

+55(21) 3799-5567
Professor
Escola de Matemática Aplicada
Fundação Getúlio Vargas
Praia de Botafogo, 190 sala 312
Rio de Janeiro - RJ
22250-900
Brasil

flavioamieiro · 2012-12-27T09:44:48Z

@fccoelho regarding the API, I think it's a good idea to get it ready as soon as possible, but I think we have to be careful not to rush things. We'll need an API for the entire website, so we probably shouldn't have an ad hoc solution in this case. That being said, I see no reason for the bigrams/trigams to be the first use case for this API.

It's also important to remember that the somewhat major refactoring that I'm doing on the frontend (flavioamieiro/pypln.web@NAMD:develop...flavioamieiro:feature/refactor-document-form) may impact the API (mostly in the "upload through API" case, but also because I think the models might change a little bit, which is another point we should discuss).

Also, remember that we can already have a json or csv visualization for this worker, as with any other. We just need a template for the visualization we want, and in the html one (which will show up if the user clicks on the tab in the document view page) we can just put a download link for the available visualization. I think this may be a temporary solution while we work to get the API right.

fccoelho · 2012-12-27T11:21:58Z

The API will be at the top of the issues to discusse in our first meeting
of 2013. We need to define it soon, but with expansibility in mind, so
that we can roll it out incrementally.

we need to aim for greater uncoupling between the API and the web app, so
that we can develop each more freely. I know that it can be tricky with
restful APIs, but we need to search for best practices in this regard.

As for the text visualization for the bi/trigrams workers output, I think
they are pretty straight forward sinc they are just tables... let's just
use the same table widgets we already use for the documents list.

On Thu, Dec 27, 2012 at 7:44 AM, Flávio Amieiro [email protected]:

@fccoelho https://github.com/fccoelho regarding the API, I think it's a
good idea to get it ready as soon as possible, but I think we have to be
careful not to rush things. We'll need an API for the entire website, so we
probably shouldn't have an ad hoc solution in this case. That being said, I
see no reason for the bigrams/trigams to be the first use case for this API.

It's also important to remember that the somewhat major refactoring that
I'm doing on the frontend (
flavioamieiro/pypln.web@NAMD:develop...flavioamieiro:feature/refactor-document-form)
may impact the API (mostly in the "upload through API" case, but also
because I think the models might change a little bit, which is another
point we should discuss).

Also, remember that we can already have a json or csv visualization for
this worker, as with any other. We just need a template for the
visualization we want, and in the html one (which will show up if the user
clicks on the tab in the document view page) we can just put a download
link for the available visualization. I think this may be a temporary
solution while we work to get the API right.

—
Reply to this email directly or view it on GitHubhttps://github.com//pull/127#issuecomment-11704438.

Flávio Codeço Coelho

+55(21) 3799-5567
Professor
Escola de Matemática Aplicada
Fundação Getúlio Vargas
Praia de Botafogo, 190 sala 312
Rio de Janeiro - RJ
22250-900
Brasil

turicas · 2012-12-30T16:44:33Z

I think we should not merge it now since it's not complete (it's not a feature: we have just workers, we don't have the API code to answer HTTP requests with information generated by these workers). We should add the REST API as the priority to the next releases and accept this pull request when it becames a feature (there's no reason to merge branches into develop of something that will not be in the next release).

fccoelho · 2012-12-30T20:36:41Z

Agreed,

while we can't consume the results, there is no point in having it in the
development branch. The API will have to be in the next release though or
we will not be able to tackle any serious project.

On Sun, Dec 30, 2012 at 2:44 PM, Álvaro Justen [email protected]:

I think we should not merge it now since it's not complete (it's not a
feature: we have just workers, we don't have the API code to answer HTTP
requests with information generated by these workers). We should add the
REST API as the priority to the next releases and accept this pull request
when it becames a feature (there's no reason to merge branches into develop
of something that will not be in the next release).

—
Reply to this email directly or view it on GitHubhttps://github.com//pull/127#issuecomment-11766239.

Flávio Codeço Coelho

+55(21) 3799-5567
Professor
Escola de Matemática Aplicada
Fundação Getúlio Vargas
Praia de Botafogo, 190 sala 312
Rio de Janeiro - RJ
22250-900
Brasil

flavioamieiro · 2013-01-15T20:21:19Z

I was trying to write a very simple visualization for this and ran into a problem. The brokers pickles the worker output when it does sock.send(result). For some reason this fails when trying to pickle the bigram rank (https://github.com/NAMD/pypln.backend/pull/127/files#L0R52). The error message says: "TypeError: can't pickle function objects", but AFAICT there is no function in bigram_rank. It is a defauldict with tuples as keys and lists as values, unless I'm mistaken. I'll look into it a little longer, but thought one of you might have an answer ready.

fccoelho · 2013-01-16T00:31:43Z

I forgot that workers where run via multiprocessing, which implies pickling of all input and output. It appears lambda functions can't be puckled. I can remove the defaultdicts, but this pickling "envelope" may come back to haunt us in the future on another worker... @turicas can you think of an alternative to spawn the workers?

flavioamieiro · 2013-01-16T15:23:27Z

Mongodb is treating the bigram_rank dictionary as a document, and documents need a string as the key, but we have a tuple: "Exception: documents must have only string keys, key was (u',', u'which')"

We could use repr(key), since repr(object) is supposed to be an 'official' representation of the object, and reversible with eval(). We could also use a string representation of what a bigram is (maybe just ", which" in the example is enough?).

Or we could just have brigram_rank be a list of tuples, instead of a document, so instead of something like

'bigram_rank': {(u'Esse', u'arquivo'): [6.0, 1.0, 1.0, 10.813469012791312, 1.0, 1.0, 2.584962500721156, 1.584962500721156, 0.16666666666666666, 0.8333333333333334], ...}

We'd have

'bigram_rank': [((u'Esse', u'arquivo'): [6.0, 1.0, 1.0, 10.813469012791312, 1.0, 1.0, 2.584962500721156, 1.584962500721156, 0.16666666666666666, 0.8333333333333334]), ...]

@fccoelho any thoughts on that?

Also these small errors stress the need for integration tests of all of our workers. Maybe we should have a test run a minimal pypeline with each worker before we consider it done. What do you think @turicas?

fccoelho · 2013-01-16T15:58:34Z

@flavioamieiro , I Think using repr is the best solution for the reasons you mentioned and also maintains readability for humans.

flavioamieiro · 2013-01-16T16:46:28Z

@fccoelho the repr solution will not work because document keys can't have '.' in them (and there are probably other restrictions as well) =(

I think we need another data structure. Having a list of (bigram, rank) tuples should work, but if we use that we can't find a specific bigram easily.

I'll start working on some of the bugs we found yesterday because they are more urgent than this (they are online right now, in fact). I'll come back to this later, but please share any thoughts.

fccoelho · 2013-01-16T23:07:25Z

You are definetly right @flavioamieiro . I'll give it some thougth. Unfortunately Json structure inherits some of the bad design of javascript as a whole. I think we won't need to search the bigram ranking inside Mongodb. We can store them as an array of arrays, and then turn it into a dict once the data is in Python. But thearray would have to be like this:

[["token1",[score1,score2,...]],
["toke2",[score1,score2,...]],
...
]

fccoelho and others added 5 commits December 18, 2012 19:38

Added bigram worker: returns a pickled bigram finder object

201d6f8

Added trigram worker as well

dee05e4

added tests for bigram and trigram workers

f105570

fixed bug, tests passing

caf235e

Fix PEP8 + legibility in {bi,tri}gram workers

e986ea9

fccoelho added 3 commits December 22, 2012 13:03

changed bigram and trigram workers to return tables in jSON format

38a2b55

Merge branch 'feature/bigrams' of github.com:NAMD/pypln.backend into …

79b4f3a

…feature/bigrams Conflicts: pypln/backend/workers/bigrams.py pypln/backend/workers/trigrams.py tests/test_worker_bigrams.py tests/test_worker_trigrams.py

fix docstrings

92ed922

flavioamieiro added 2 commits January 15, 2013 15:21

Merge branch 'develop' into test_bigrams

39da043

Adds Bigrams to available workers

8149731

flavioamieiro added 2 commits January 16, 2013 12:35

Adds regression tests for pickling of bigram worker output

86d018b

Cast defaultdict to dict to avoid pickling the generator function

f583dd8

flavioamieiro added 2 commits January 16, 2013 13:48

Fixes typo in test_mongo_store

b1f33d0

Adds regression test to the error when saving worker output

f85afbb

Convert bigram worker output into a list of tuples so it can be saved

ea6ca3d

flavioamieiro merged commit ea6ca3d into develop Jan 17, 2013

turicas deleted the feature/bigrams branch January 27, 2013 05:13

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature/bigrams #127

Feature/bigrams #127

fccoelho commented Dec 18, 2012

turicas commented Dec 19, 2012

fccoelho commented Dec 19, 2012

turicas commented Dec 19, 2012

fccoelho commented Dec 22, 2012

fccoelho commented Dec 26, 2012

rsouza commented Dec 26, 2012

fccoelho commented Dec 26, 2012

Renato Rocha Souza

flavioamieiro commented Dec 27, 2012

fccoelho commented Dec 27, 2012

turicas commented Dec 30, 2012

fccoelho commented Dec 30, 2012

flavioamieiro commented Jan 15, 2013

fccoelho commented Jan 16, 2013

flavioamieiro commented Jan 16, 2013

fccoelho commented Jan 16, 2013

flavioamieiro commented Jan 16, 2013

fccoelho commented Jan 16, 2013

Feature/bigrams #127

Feature/bigrams #127

Conversation

fccoelho commented Dec 18, 2012

turicas commented Dec 19, 2012

fccoelho commented Dec 19, 2012

Flávio Codeço Coelho

turicas commented Dec 19, 2012

fccoelho commented Dec 22, 2012

fccoelho commented Dec 26, 2012

rsouza commented Dec 26, 2012

Renato Rocha Souza

fccoelho commented Dec 26, 2012

Renato Rocha Souza

Flávio Codeço Coelho

flavioamieiro commented Dec 27, 2012

fccoelho commented Dec 27, 2012

Flávio Codeço Coelho

turicas commented Dec 30, 2012

fccoelho commented Dec 30, 2012

Flávio Codeço Coelho

flavioamieiro commented Jan 15, 2013

fccoelho commented Jan 16, 2013

flavioamieiro commented Jan 16, 2013

fccoelho commented Jan 16, 2013

flavioamieiro commented Jan 16, 2013

fccoelho commented Jan 16, 2013