Implemented basic Sphinxsearch indexer worker. #167

Closed · wants to merge 2 commits

Conversation

@fccoelho (Member):

Requires Sphinxsearch to be installed on a reachable host.

Other fields can be indexed as well; suggestions are welcome.

@fccoelho (Member, Author):

@flavioamieiro, @turicas, could you have a look at this code?

cursor = connection.cursor()

def process(self, document):
    ID = int('0x'+str(document['_id']), 16)
Inline review comment (Member), on the excerpt above:

We may have the same problem here that we had in mediacloud. There, we decided to reindex the entire collection every time, but in PyPLN's case this makes less sense (because of the way documents are usually inserted). We may have to find another solution for this.

@flavioamieiro (Member):

Other than the id issue I just commented on, I see no problem with this code.

@fccoelho (Member, Author):

I have a suggestion for the id issue: since we are not going to use this ID for anything, and we have MongoDB's id as an attribute anyway, we could use the ordinal position of the document in the collection (sorted by ID) as the id for Sphinx. But I have no idea how the indexing worker could get hold of this information.

@fccoelho (Member, Author):

By the way, the reason the strategy of using the _id is failing is that it is a 12-byte number (96 bits), and Sphinx accepts ids of up to 64 bits only.

Let's just use a 64-bit slice of the ObjectId.
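
A minimal sketch of that idea (editor's illustration, not code from this PR; the helper name is made up). A MongoDB ObjectId is 12 bytes (96 bits), while Sphinx document ids are limited to 64 bits, so the sketch keeps only the low 64 bits:

    def objectid_to_sphinx_id(object_id):
        # The ObjectId's 24-character hex representation encodes 96 bits;
        # masking keeps only the low 64 bits so the value fits a Sphinx id.
        # Note that truncating this way can, in principle, produce collisions.
        full_id = int(str(object_id), 16)
        return full_id & 0xFFFFFFFFFFFFFFFF

The worker's process() could then compute ID = objectid_to_sphinx_id(document['_id']) instead of converting the full 96-bit value.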


def process(self, document):
    ID = int('0x'+str(document['_id']), 16)
    self.cursor().execute("INSERT INTO {} (id, text) values ({}, {})".format(SPHINX_INDEX, ID, document['text']))
Inline review comment (Contributor), on the excerpt above:

self.cursor is already a cursor object, so we don't need to call it; just use its execute method.
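
For reference, a sketch of what the corrected call could look like (editor's illustration; the actual fix mentioned in the reply below may differ, and the %s placeholder style assumes a MySQLdb/pymysql-style driver behind the SphinxQL connection):

    # Drop-in replacement for the INSERT line in the excerpt above: use the
    # cursor directly and let the driver quote the document text instead of
    # formatting it into the SQL string.
    self.cursor.execute(
        "INSERT INTO {} (id, text) VALUES (%s, %s)".format(SPHINX_INDEX),
        (ID, document['text']))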

Reply (Member, Author):

fixed in 492ea8e

@fccoelho (Member, Author):

We need to add an index configuration file -- sphinx.conf -- somewhere. @turicas, where should it go?

@turicas (Contributor), Jun 29, 2014:

@fccoelho I think we should have a broker-wide configuration file and specify the index configuration file path inside this broker configuration file (we still don't have a broker configuration file). @flavioamieiro, what do you think?
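
A minimal sketch of that arrangement (editor's illustration; the file path, section, and option names are assumptions, not decisions from this PR):

    # Hypothetical contents of a broker-wide config file, e.g. /etc/pypln/broker.conf:
    #
    #   [sphinx]
    #   config_file = /etc/sphinxsearch/sphinx.conf
    #   index = pypln
    #
    from ConfigParser import ConfigParser  # Python 2, current when this PR was written

    broker_config = ConfigParser()
    broker_config.read('/etc/pypln/broker.conf')
    SPHINX_CONF_PATH = broker_config.get('sphinx', 'config_file')  # path to sphinx.conf
    SPHINX_INDEX = broker_config.get('sphinx', 'index')            # index name used by the worker

This keeps sphinx.conf wherever the deployment prefers, with the broker config only recording its location.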

@flavioamieiro (Member):

This sounds like a good idea.

@fccoelho (Member, Author):

+1 to this.

@flavioamieiro modified the milestone: Next Release (Jul 7, 2014)
@fccoelho (Member, Author):

I think this PR should be abandoned, as I don't think that indexing belongs in PyPLN anymore.

Do you all agree?

@flavioamieiro (Member):

I do. I think that, since we are focusing PyPLN more on document analysis than on corpora, indexing is out of scope, at least for now.

@fccoelho (Member, Author):

Change of plans: given the possible establishment of indexing services through the API, let's try to finish this feature later.

@fccoelho (Member, Author), Dec 1, 2016:

Abandoning this line of work

@fccoelho closed this on Dec 1, 2016.