Feature/bigrams #127
I've added a commit revising your code (PEP8 + legibility issues), but I don't know yet if storing the pickled object is the best option... I think we should discuss it before accepting this pull request. Some problems about this approach:
|
Sure. Afterwards I thought the best option would be to return a JSON object with AS for the visualizations for these workers, I think they won't be hard to On Wed, Dec 19, 2012 at 8:15 PM, Álvaro Justen [email protected]:
Flávio Codeço Coelho+55(21) 3799-5567 |
@fccoelho, can you provide a possible JSON-structured format for these bigrams and trigrams? |
…feature/bigrams Conflicts: pypln/backend/workers/bigrams.py pypln/backend/workers/trigrams.py tests/test_worker_bigrams.py tests/test_worker_trigrams.py
The output is ready and tested: it can be saved in MongoDB and exported as CSV. I need to work on the visualization now... |
@turicas: I am still thinking about the visualization for this one, and I am leaning towards http://bl.ocks.org/4063269. However, I think the visualization is much less important than the API. I think we should merge these features and make them accessible through our REST API. The visualization can be added later. |
On n-gram visualizations, we have to deal with the fact that in a N http://www.chrisharrison.net/index.php/Visualizations/WordSpectrum Can we discuss this in our next meeting? On 26-12-2012 09:44, Flávio Codeço Coelho wrote:
Renato Rocha Souza+55(21)3799-5529 |
Sure, let's discuss it, but I'd like to hear your opinion on getting the API Answering your question, in the results generated by the workers, raw but since you mentioned the analysis, I think one thing we need to do to On Wed, Dec 26, 2012 at 2:37 PM, Renato Rocha Souza <
Flávio Codeço Coelho+55(21) 3799-5567 |
@fccoelho regarding the API, I think it's a good idea to get it ready as soon as possible, but I think we have to be careful not to rush things. We'll need an API for the entire website, so we probably shouldn't have an ad hoc solution in this case. That being said, I see no reason for the bigrams/trigrams to be the first use case for this API. It's also important to remember that the somewhat major refactoring that I'm doing on the frontend (flavioamieiro/pypln.web@NAMD:develop...flavioamieiro:feature/refactor-document-form) may impact the API (mostly in the "upload through API" case, but also because I think the models might change a little bit, which is another point we should discuss). Also, remember that we can already have a JSON or CSV visualization for this worker, as with any other. We just need a template for the visualization we want, and in the HTML one (which will show up if the user clicks on the tab in the document view page) we can just put a download link for the available visualization. I think this may be a temporary solution while we work to get the API right. |
The API will be at the top of the issues to discuss in our first meeting we need to aim for greater uncoupling between the API and the web app, so As for the text visualization for the bi/trigrams workers output, I think On Thu, Dec 27, 2012 at 7:44 AM, Flávio Amieiro [email protected]:
Flávio Codeço Coelho+55(21) 3799-5567 |
I think we should not merge it now since it's not complete (it's not a feature: we have just the workers, we don't have the API code to answer HTTP requests with information generated by these workers). We should make the REST API the priority for the next releases and accept this pull request when it becomes a feature (there's no reason to merge into develop a branch for something that will not be in the next release). |
Agreed, while we can't consume the results, there is no point in having it in the On Sun, Dec 30, 2012 at 2:44 PM, Álvaro Justen [email protected]:
Flávio Codeço Coelho+55(21) 3799-5567 |
I was trying to write a very simple visualization for this and ran into a problem. The broker pickles the worker output when it does sock.send(result). For some reason this fails when trying to pickle the bigram rank (https://github.com/NAMD/pypln.backend/pull/127/files#L0R52). The error message says: "TypeError: can't pickle function objects", but AFAICT there is no function in bigram_rank. It is a defaultdict with tuples as keys and lists as values, unless I'm mistaken. I'll look into it a little longer, but thought one of you might have an answer ready. |
I forgot that workers are run via multiprocessing, which implies pickling of all input and output. It appears lambda functions can't be pickled. I can remove the defaultdicts, but this pickling "envelope" may come back to haunt us in the future on another worker... @turicas can you think of an alternative way to spawn the workers? |
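A minimal sketch of the failure described above, assuming the defaultdict was created with a lambda as its default factory (the thread implies this but does not show that code). It is written for Python 3, where the exception type and message differ from the Python 2 error quoted earlier; the bigram and score are made up, and `can_pickle` is a hypothetical helper.

```python
import pickle
from collections import defaultdict

# Assumed shape of the worker output: tuple keys, list values.
with_lambda = defaultdict(lambda: [])
with_lambda[(",", "which")].append(1.5)

with_builtin = defaultdict(list)
with_builtin[(",", "which")].append(1.5)


def can_pickle(obj):
    """Return True if pickle.dumps(obj) succeeds, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


# pickle stores the default factory along with the items, so the lambda
# is what breaks, not the tuple keys or the list values.
lambda_ok = can_pickle(with_lambda)
builtin_ok = can_pickle(with_builtin)
plain_ok = can_pickle(dict(with_lambda))  # a plain dict has no factory
```

Under these assumptions, switching the factory to the built-in `list`, or converting to a plain dict before returning, would avoid the error without changing how the workers are spawned.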
MongoDB is treating the bigram_rank dictionary as a document, and documents need strings as keys, but we have a tuple: "Exception: documents must have only string keys, key was (u',', u'which')" We could use repr(key), since repr(object) is supposed to be an 'official' representation of the object, and reversible with eval(). We could also use a string representation of what a bigram is (maybe just ", which" in the example is enough?). Or we could just have bigram_rank be a list of tuples, instead of a document, so instead of something like
We'd have
@fccoelho any thoughts on that? Also these small errors stress the need for integration tests of all of our workers. Maybe we should have a test run a minimal pypeline with each worker before we consider it done. What do you think @turicas? |
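One possible shape for such a check, as a sketch only: it covers the two failures seen in this thread (output that cannot be pickled, and dict keys that are not strings). `check_worker_output` and `find_non_string_keys` are hypothetical names, not part of the code base, and the sample outputs are made up.

```python
import pickle


def find_non_string_keys(value, path="result"):
    """Recursively collect the paths of dict keys that are not strings."""
    problems = []
    if isinstance(value, dict):
        for key, item in value.items():
            if not isinstance(key, str):
                problems.append(f"{path}[{key!r}]")
            problems.extend(find_non_string_keys(item, f"{path}[{key!r}]"))
    elif isinstance(value, (list, tuple)):
        for index, item in enumerate(value):
            problems.extend(find_non_string_keys(item, f"{path}[{index}]"))
    return problems


def check_worker_output(result):
    """Hypothetical check: output must pickle and use only string keys."""
    pickle.loads(pickle.dumps(result))  # raises if the output cannot be pickled
    return find_non_string_keys(result)


# Made-up worker outputs: one keyed by tuples, one using a list of pairs.
bad = {"bigram_rank": {(",", "which"): [1.5, 2.0]}}
good = {"bigram_rank": [[[",", "which"], [1.5, 2.0]]]}

bad_problems = check_worker_output(bad)
good_problems = check_worker_output(good)
```

A real integration test would still need to run each worker through the broker; this only checks the returned value.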
@flavioamieiro, I think using repr is the best solution, both for the reasons you mentioned and because it keeps the keys readable for humans. |
@fccoelho the repr solution will not work because document keys can't have '.' in them (and there are probably other restrictions as well) =( I think we need another data structure. Having a list of (bigram, rank) tuples should work, but if we use that we can't find a specific bigram easily. I'll start working on some of the bugs we found yesterday because they are more urgent than this (they are online right now, in fact). I'll come back to this later, but please share any thoughts. |
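To make the objection concrete, here is a small sketch. The bigram is made up, and `is_safe_key` is a hypothetical helper: besides the '.' restriction named above, it also rejects a leading '$' and null characters, which is an assumption about the "other restrictions" rather than something stated in this thread.

```python
import ast


def is_safe_key(key):
    """Hypothetical check for the document-key restrictions assumed here."""
    return "." not in key and not key.startswith("$") and "\0" not in key


bigram = ("U.S.", "economy")         # made-up bigram whose token contains dots
key = repr(bigram)                   # string form of the tuple
round_trip = ast.literal_eval(key)   # literal_eval is a safer stand-in for eval()
key_is_safe = is_safe_key(key)       # the dots in the token make this key unusable
```

So repr() does round-trip, but any token containing a dot would produce a key the database rejects.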
You are definitely right, @flavioamieiro. I'll give it some thought. Unfortunately, the JSON structure inherits some of the bad design of JavaScript as a whole. I think we won't need to search the bigram ranking inside MongoDB. We can store it as an array of arrays, and then turn it into a dict once the data is in Python. But the array would have to be like this:
[["token1", [score1, score2, ...]],
 ["token2", [score1, score2, ...]],
 ...
] |
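A minimal sketch of that round trip, under some assumptions: the scores are made up, each bigram is stored as a two-element list rather than a single string, and a JSON dump/load stands in for writing to and reading from the database.

```python
import json

# Made-up ranking: bigram tuple -> list of scores.
bigram_rank = {(",", "which"): [1.5, 2.0], ("of", "the"): [3.0, 0.5]}

# dict -> array of [bigram, scores] pairs, with no dict keys involved
as_pairs = [[list(bigram), scores] for bigram, scores in bigram_rank.items()]

# The JSON round trip stands in for storing and fetching the array.
stored = json.loads(json.dumps(as_pairs))

# array of pairs -> dict again, restoring the tuple keys in Python
restored = {tuple(bigram): scores for bigram, scores in stored}
```

Lookup by bigram then happens on the Python side, which matches the point that the ranking does not need to be searched inside the database.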
Guys,
I branched this feature off this repo instead of off my own fork by mistake. But the workers are working and tested. Could you please review this and merge?
thanks,