Implemented basic Sphinxsearch indexer worker. #167
Requires Sphinxsearch to be installed on a reachable host.
@flavioamieiro, @turicas, could you have a look at this code?
cursor = connection.cursor()

def process(self, document):
    ID = int('0x'+str(document['_id']), 16)
We may have here the same problem we had in mediacloud. There we decided to reindex the entire collection every time, but in PyPLN's case this makes less sense (because of the way documents are usually inserted). We may have to find another solution for this.
Other than the id issue I just commented on, I see no problem with this code.
I have a suggestion for the id issue: since we are not going to use this ID for anything, and we have MongoDB's id as an attribute anyway, we could use the ordinal position of the document in the collection (sorted by ID) as the id for Sphinx. But I have no idea how the indexing worker could get hold of this information.
By the way, the strategy of using the _id fails because it is a 12-byte number (96 bits) and Sphinx only accepts ids of up to 64 bits. Let's just use a 64-bit slice of the ObjectId.
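For illustration, a minimal sketch of that slicing, assuming we keep the low 64 bits of the ObjectId's 24-character hex form. The helper name `sphinx_id_from_object_id` is hypothetical and not part of this PR. Note that dropping 32 of the 96 bits means uniqueness is no longer guaranteed, only likely.

```python
MASK_64 = (1 << 64) - 1  # keeps the low 64 bits of an integer


def sphinx_id_from_object_id(object_id):
    """Return the low 64 bits of a 96-bit ObjectId as an unsigned int.

    `object_id` can be a bson ObjectId or its hex string; str() of either
    yields 24 hex characters. This helper is a hypothetical sketch.
    """
    hex_string = str(object_id)
    if len(hex_string) != 24:
        raise ValueError("expected a 24-character hex ObjectId")
    # Parse the full 96-bit value, then mask off the top 32 bits.
    return int(hex_string, 16) & MASK_64
```

Since the full MongoDB id would still be stored as an attribute, the truncated value only needs to serve as Sphinx's own document id.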
def process(self, document):
    ID = int('0x'+str(document['_id']), 16)
    self.cursor().execute("INSERT INTO {} (id, text) values ({}, {})".format(SPHINX_INDEX, ID, document['text']))
self.cursor is already a cursor object, so we don't need to call it; just use its execute method.
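A minimal sketch of what the corrected call could look like, assuming the connection comes from a DB-API driver that uses the `%s` parameter style (MySQLdb, for example, which is an assumption since the driver is not shown here). The function name `index_document` is hypothetical. Binding the values also avoids interpolating the unquoted document text straight into the statement, as the format string above does.

```python
def index_document(cursor, index_name, doc_id, text):
    """Insert one document into a Sphinx real-time index (hypothetical helper).

    `cursor` is an existing DB-API cursor: it is used directly, not called.
    """
    # The index name cannot be bound as a parameter, so it is formatted in;
    # it should come from trusted configuration, never from user input.
    sql = "INSERT INTO {} (id, text) VALUES (%s, %s)".format(index_name)
    # The id and text are passed separately so the driver escapes them.
    cursor.execute(sql, (doc_id, text))
```

Inside the worker this would be called as `index_document(self.cursor, SPHINX_INDEX, ID, document['text'])`.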
fixed in 492ea8e
We need to add an index configuration file (sphinx.conf) somewhere. @turicas, where should it go?
@fccoelho I think we should have a broker-wide configuration file and specify the path of the index configuration file inside it (we still don't have a broker configuration file). @flavioamieiro, what do you think?
This sounds like a good idea.
+1 to this.
I think this PR should be abandoned, as I don't think that indexing belongs in PyPLN anymore. Do you all agree?
I do. I think that, since we are focusing PyPLN more on document analysis than on corpora, indexing is out of scope, at least for now.
Change of plans. Given the possible establishment of indexing services through the API, let's try to finish this feature later.
Abandoning this line of work.
Other fields can be indexed as well; suggestions are welcome.