Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/bigrams #127

Merged
merged 15 commits into from
Jan 17, 2013
Merged

Feature/bigrams #127

merged 15 commits into from
Jan 17, 2013

Conversation

fccoelho
Copy link
Member

Guys,

I branched this feature off this repo instead of off my own fork by mistake. But the workers are working and tested. Could you please review this and merge?

thanks,

@turicas
Copy link
Contributor

turicas commented Dec 19, 2012

I've added a commit revising your code (PEP8 + legibility issues), but I don't know yet if storing the pickled object is the best option... I think we should discuss it before accepting this pull request. Some problems about this approach:

  • Needing of NLTK installed on pypln.web and I don't think it's a good idea;
  • If NLTK changes the interface, the pickled object won't work anymore;
  • Pull requests should be merged on develop only if the code inside it will be in the new release of the software, but as we don't have the visualization for these workers, we don't need them in the next release.

@fccoelho
Copy link
Member Author

sure,

afterwards I thought the best option would be to return a json object with
all possible rankings for bigrams and trigrams, that way NLTK would not be
need in the web app.

AS for the visualizations for these workers, I think they wont be hard to
create.

On Wed, Dec 19, 2012 at 8:15 PM, Álvaro Justen [email protected]:

I've added a commit revising your code (PEP8 + legibility issues), but I
don't know yet if storing the pickled object is the best option... I think
we should discuss it before accepting this pull request. Some problems
about this approach:

  • Needing of NLTK installed on pypln.web and I don't think it's a good
    idea;

  • If NLTK changes the interface, the pickled object won't work anymore;

  • Pull requests should be merged on develop only if the code inside it
    will be in the new release of the software, but as we don't have the
    visualization for these workers, we don't need them in the next release.


    Reply to this email directly or view it on GitHubhttps://github.com/Feature/bigrams #127#issuecomment-11551287.

Flávio Codeço Coelho

+55(21) 3799-5567
Professor
Escola de Matemática Aplicada
Fundação Getúlio Vargas
Praia de Botafogo, 190 sala 312
Rio de Janeiro - RJ
22250-900
Brasil

@turicas
Copy link
Contributor

turicas commented Dec 19, 2012

@fccoelho, can you provide a possible JSON-structured format for these bigrams and trigrams?
If you can work on the visualizations and create a pull request this week, I think we can integrate these new features in the next release! :)

…feature/bigrams

Conflicts:
	pypln/backend/workers/bigrams.py
	pypln/backend/workers/trigrams.py
	tests/test_worker_bigrams.py
	tests/test_worker_trigrams.py
@fccoelho
Copy link
Member Author

Output ready to be saved in Mongodb and exported as a CSV is ready and tested. Need to work on the visualization now...

@fccoelho
Copy link
Member Author

@turicas : I am still thinking about the visualization for this one: I am tending towards this one: http://bl.ocks.org/4063269

However, I think the visualization is much less important than the API. I think we should merge these features and get them accessible through our REST API. The visualization will be added later.
@flavioamieiro and @rsouza , what do you think?

@rsouza
Copy link
Member

rsouza commented Dec 26, 2012

On n-gram visualizations, we have to deal with the fact that in a N
words sized text, we will have N-1 bigrams, N-2 trigrams, etc., and
there is always overlaps between each adjacent pair of n-grams.
For visualization, we will need something that let us inspect them using
a frequency criteria.

http://www.chrisharrison.net/index.php/Visualizations/WordSpectrum
http://www-958.ibm.com/software/data/cognos/manyeyes/visualizations/itinvolve-bigram

Can we discuss this in our next meeting?
Cheers!

Em 26-12-2012 09:44, Flávio Codeço Coelho escreveu:

@turicas https://github.com/turicas : I am still thinking about the
visualization for this one: I tending towards this one:
http://bl.ocks.org/4063269

However, I think the visualization is much less important than the
API. I think we should merge these features and get them accessible
through our REST API. The visualization will be added later.
@flavioamieiro https://github.com/flavioamieiro and @rsouza
https://github.com/rsouza , what do you think?


Reply to this email directly or view it on GitHub
#127 (comment).

Renato Rocha Souza

+55(21)3799-5529
Professor/Pesquisador
Escola de Matemática Aplicada
Fundação Getúlio Vargas

@fccoelho
Copy link
Member Author

Sure, let's discuss it, but I like to hear your opinion on getting the API
ready before the visualization.

Answering you question,in the results generated by the workers, raw
frequency, is but one of the metrics generated for the bi/trigrams.

but since you mentioned the analysis, I think one thing we need to do to
improve bigram and trigram extraction is to support the ranking ata the
corpus level. Currently we have only the document level analysis.

On Wed, Dec 26, 2012 at 2:37 PM, Renato Rocha Souza <
[email protected]> wrote:

On n-gram visualizations, we have to deal with the fact that in a N
words sized text, we will have N-1 bigrams, N-2 trigrams, etc., and
there is always overlaps between each adjacent pair of n-grams.
For visualization, we will need something that let us inspect them using
a frequency criteria.

http://www.chrisharrison.net/index.php/Visualizations/WordSpectrum

http://www-958.ibm.com/software/data/cognos/manyeyes/visualizations/itinvolve-bigram

Can we discuss this in our next meeting?
Cheers!

Em 26-12-2012 09:44, Flávio Codeço Coelho escreveu:

@turicas https://github.com/turicas : I am still thinking about the
visualization for this one: I tending towards this one:
http://bl.ocks.org/4063269

However, I think the visualization is much less important than the
API. I think we should merge these features and get them accessible
through our REST API. The visualization will be added later.
@flavioamieiro https://github.com/flavioamieiro and @rsouza
https://github.com/rsouza , what do you think?


Reply to this email directly or view it on GitHub
#127 (comment).

Renato Rocha Souza

+55(21)3799-5529
Professor/Pesquisador

Escola de Matemática Aplicada
Fundação Getúlio Vargas

Reply to this email directly or view it on GitHubhttps://github.com//pull/127#issuecomment-11689399.

Flávio Codeço Coelho

+55(21) 3799-5567
Professor
Escola de Matemática Aplicada
Fundação Getúlio Vargas
Praia de Botafogo, 190 sala 312
Rio de Janeiro - RJ
22250-900
Brasil

@flavioamieiro
Copy link
Member

@fccoelho regarding the API, I think it's a good idea to get it ready as soon as possible, but I think we have to be careful not to rush things. We'll need an API for the entire website, so we probably shouldn't have an ad hoc solution in this case. That being said, I see no reason for the bigrams/trigams to be the first use case for this API.

It's also important to remember that the somewhat major refactoring that I'm doing on the frontend (flavioamieiro/pypln.web@NAMD:develop...flavioamieiro:feature/refactor-document-form) may impact the API (mostly in the "upload through API" case, but also because I think the models might change a little bit, which is another point we should discuss).

Also, remember that we can already have a json or csv visualization for this worker, as with any other. We just need a template for the visualization we want, and in the html one (which will show up if the user clicks on the tab in the document view page) we can just put a download link for the available visualization. I think this may be a temporary solution while we work to get the API right.

@fccoelho
Copy link
Member Author

The API will be at the top of the issues to discusse in our first meeting
of 2013. We need to define it soon, but with expansibility in mind, so
that we can roll it out incrementally.

we need to aim for greater uncoupling between the API and the web app, so
that we can develop each more freely. I know that it can be tricky with
restful APIs, but we need to search for best practices in this regard.

As for the text visualization for the bi/trigrams workers output, I think
they are pretty straight forward sinc they are just tables... let's just
use the same table widgets we already use for the documents list.

On Thu, Dec 27, 2012 at 7:44 AM, Flávio Amieiro [email protected]:

@fccoelho https://github.com/fccoelho regarding the API, I think it's a
good idea to get it ready as soon as possible, but I think we have to be
careful not to rush things. We'll need an API for the entire website, so we
probably shouldn't have an ad hoc solution in this case. That being said, I
see no reason for the bigrams/trigams to be the first use case for this API.

It's also important to remember that the somewhat major refactoring that
I'm doing on the frontend (
flavioamieiro/pypln.web@NAMD:develop...flavioamieiro:feature/refactor-document-form)
may impact the API (mostly in the "upload through API" case, but also
because I think the models might change a little bit, which is another
point we should discuss).

Also, remember that we can already have a json or csv visualization for
this worker, as with any other. We just need a template for the
visualization we want, and in the html one (which will show up if the user
clicks on the tab in the document view page) we can just put a download
link for the available visualization. I think this may be a temporary
solution while we work to get the API right.


Reply to this email directly or view it on GitHubhttps://github.com//pull/127#issuecomment-11704438.

Flávio Codeço Coelho

+55(21) 3799-5567
Professor
Escola de Matemática Aplicada
Fundação Getúlio Vargas
Praia de Botafogo, 190 sala 312
Rio de Janeiro - RJ
22250-900
Brasil

@turicas
Copy link
Contributor

turicas commented Dec 30, 2012

I think we should not merge it now since it's not complete (it's not a feature: we have just workers, we don't have the API code to answer HTTP requests with information generated by these workers). We should add the REST API as the priority to the next releases and accept this pull request when it becames a feature (there's no reason to merge branches into develop of something that will not be in the next release).

@fccoelho
Copy link
Member Author

Agreed,

while we can't consume the results, there is no point in having it in the
development branch. The API will have to be in the next release though or
we will not be able to tackle any serious project.

On Sun, Dec 30, 2012 at 2:44 PM, Álvaro Justen [email protected]:

I think we should not merge it now since it's not complete (it's not a
feature: we have just workers, we don't have the API code to answer HTTP
requests with information generated by these workers). We should add the
REST API as the priority to the next releases and accept this pull request
when it becames a feature (there's no reason to merge branches into develop
of something that will not be in the next release).


Reply to this email directly or view it on GitHubhttps://github.com//pull/127#issuecomment-11766239.

Flávio Codeço Coelho

+55(21) 3799-5567
Professor
Escola de Matemática Aplicada
Fundação Getúlio Vargas
Praia de Botafogo, 190 sala 312
Rio de Janeiro - RJ
22250-900
Brasil

@flavioamieiro
Copy link
Member

I was trying to write a very simple visualization for this and ran into a problem. The brokers pickles the worker output when it does sock.send(result). For some reason this fails when trying to pickle the bigram rank (https://github.com/NAMD/pypln.backend/pull/127/files#L0R52). The error message says: "TypeError: can't pickle function objects", but AFAICT there is no function in bigram_rank. It is a defauldict with tuples as keys and lists as values, unless I'm mistaken. I'll look into it a little longer, but thought one of you might have an answer ready.

@fccoelho
Copy link
Member Author

I forgot that workers where run via multiprocessing, which implies pickling of all input and output. It appears lambda functions can't be puckled. I can remove the defaultdicts, but this pickling "envelope" may come back to haunt us in the future on another worker... @turicas can you think of an alternative to spawn the workers?

@flavioamieiro
Copy link
Member

Mongodb is treating the bigram_rank dictionary as a document, and documents need a string as the key, but we have a tuple: "Exception: documents must have only string keys, key was (u',', u'which')"

We could use repr(key), since repr(object) is supposed to be an 'official' representation of the object, and reversible with eval(). We could also use a string representation of what a bigram is (maybe just ", which" in the example is enough?).

Or we could just have brigram_rank be a list of tuples, instead of a document, so instead of something like

'bigram_rank': {(u'Esse', u'arquivo'): [6.0, 1.0, 1.0, 10.813469012791312, 1.0, 1.0, 2.584962500721156, 1.584962500721156, 0.16666666666666666, 0.8333333333333334], ...}

We'd have

'bigram_rank': [((u'Esse', u'arquivo'): [6.0, 1.0, 1.0, 10.813469012791312, 1.0, 1.0, 2.584962500721156, 1.584962500721156, 0.16666666666666666, 0.8333333333333334]), ...]

@fccoelho any thoughts on that?

Also these small errors stress the need for integration tests of all of our workers. Maybe we should have a test run a minimal pypeline with each worker before we consider it done. What do you think @turicas?

@fccoelho
Copy link
Member Author

@flavioamieiro , I Think using repr is the best solution for the reasons you mentioned and also maintains readability for humans.

@flavioamieiro
Copy link
Member

@fccoelho the repr solution will not work because document keys can't have '.' in them (and there are probably other restrictions as well) =(

I think we need another data structure. Having a list of (bigram, rank) tuples should work, but if we use that we can't find a specific bigram easily.

I'll start working on some of the bugs we found yesterday because they are more urgent than this (they are online right now, in fact). I'll come back to this later, but please share any thoughts.

@fccoelho
Copy link
Member Author

You are definetly right @flavioamieiro . I'll give it some thougth. Unfortunately Json structure inherits some of the bad design of javascript as a whole. I think we won't need to search the bigram ranking inside Mongodb. We can store them as an array of arrays, and then turn it into a dict once the data is in Python. But thearray would have to be like this:

[["token1",[score1,score2,...]],
["toke2",[score1,score2,...]],
...
]

@flavioamieiro flavioamieiro merged commit ea6ca3d into develop Jan 17, 2013
@turicas turicas deleted the feature/bigrams branch January 27, 2013 05:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants