Implemented basic Sphinxsearch indexer worker. #167

fccoelho · 2014-04-13T11:54:38Z

Requires Sphinxsearch to be installed on a reachable host.

Other fields can be indexed as well; suggestions are welcome.


fccoelho · 2014-04-24T17:05:33Z

@flavioamieiro, @turicas, could you have a look at this code?

flavioamieiro · 2014-04-24T17:43:49Z

pypln/backend/workers/sphinxsearch.py

+    cursor = connection.cursor()
+
+    def process(self, document):
+        ID = int('0x'+str(document['_id']), 16)


We may have the same problem here that we had in mediacloud. There we decided to reindex the entire collection every time, but in PyPLN's case this makes less sense (because of the way documents are usually inserted). We may have to find another solution for this.

flavioamieiro · 2014-04-24T17:44:13Z

Other than the id issue I just commented on I see no problem with this code.

fccoelho · 2014-04-24T18:28:57Z

I have a suggestion for the id issue: since we are not going to use this ID for anything, and we have MongoDB's id as an attribute anyway, we could use the ordinal position of the document in the collection (sorted by ID) as the id for Sphinx. But I have no idea how the indexing worker could get hold of this information.

fccoelho · 2014-04-24T18:43:36Z

By the way, the strategy of using the _id fails because it is a 12-byte number (96 bits), and Sphinx only accepts ids of up to 64 bits.

Let's just use a 64-bit slice of the ObjectId.
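A minimal sketch of what such a slice could look like, assuming the id arrives as an ObjectId (or its 24-character hex string) and that keeping the low 8 bytes is acceptable. The function name is hypothetical and not part of this PR. Since 4 of the 12 bytes are discarded, two different ObjectIds could in principle map to the same Sphinx id, so this is only safe if that risk is tolerable.

```python
def sphinx_id_from_object_id(object_id):
    """Map a MongoDB ObjectId (or its hex string) to an integer that fits in 64 bits."""
    hex_id = str(object_id)
    if len(hex_id) != 24:
        raise ValueError("expected a 24-character hex ObjectId, got %r" % hex_id)
    # Keep the low 8 bytes (16 hex digits); the leading 4 bytes are dropped.
    return int(hex_id[-16:], 16)


example = sphinx_id_from_object_id("507f1f77bcf86cd799439011")
print(example < 2 ** 64)
```

The full ObjectId can still be stored as an attribute, as suggested earlier in this thread, so the truncated value only has to serve as Sphinx's document id.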

turicas · 2014-04-29T04:46:10Z

pypln/backend/workers/sphinxsearch.py

+
+    def process(self, document):
+        ID = int('0x'+str(document['_id']), 16)
+        self.cursor().execute("INSERT INTO {} (id, text) values ({}, {})".format(SPHINX_INDEX, ID, document['text']))


self.cursor is already a cursor object, so we don't need to call it; just use its execute method.

fixed in 492ea8e
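For reference, a hedged sketch of how the call could look with the cursor used directly and the values passed as query parameters instead of being formatted into the SQL string. It assumes a DB-API driver that accepts `%s` placeholders (as MySQLdb and PyMySQL do). `SPHINX_INDEX`'s value, `RecordingCursor` and `SphinxIndexer` are hypothetical names used only so the example runs without a Sphinx server; this is not the code from 492ea8e.

```python
SPHINX_INDEX = "pypln_rt"  # hypothetical index name, not taken from the PR


class RecordingCursor(object):
    """Stand-in for a DB-API cursor; it only records what it is asked to run."""

    def __init__(self):
        self.calls = []

    def execute(self, sql, params=None):
        self.calls.append((sql, params))


class SphinxIndexer(object):
    def __init__(self, cursor):
        # self.cursor is a cursor object, so it is used as-is, never called.
        self.cursor = cursor

    def process(self, document):
        # Low 64 bits of the 96-bit ObjectId, as discussed in this thread.
        doc_id = int(str(document['_id'])[-16:], 16)
        # The index name cannot be a bound parameter, so only it is interpolated.
        sql = "INSERT INTO {} (id, text) VALUES (%s, %s)".format(SPHINX_INDEX)
        self.cursor.execute(sql, (doc_id, document['text']))
        return doc_id


cursor = RecordingCursor()
SphinxIndexer(cursor).process({'_id': '507f1f77bcf86cd799439011', 'text': "it's a test"})
print(cursor.calls[0][0])
```

Passing the text as a parameter leaves quoting to the driver, which matters here because document text will routinely contain quotes.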

fccoelho · 2014-06-26T14:26:22Z

We need to add an index configuration file -- sphinx.conf -- somewhere. @turicas, where should it go?

turicas · 2014-06-29T21:17:41Z

@fccoelho I think we should have a broker-wide configuration file and specify the index configuration file path inside this broker configuration file (we still don't have a broker configuration file). @flavioamieiro, what do you think?
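As a rough illustration of the idea, a broker-wide file could carry a section that points at the index configuration. Everything in this sketch is an assumption, since no broker configuration file exists yet: the INI format, the `[sphinx]` section, the key names, the host and port values, and the path are all hypothetical.

```python
import configparser

# Hypothetical broker configuration; no such file exists in PyPLN yet.
BROKER_CONFIG = """
[sphinx]
host = 127.0.0.1
port = 9306
index_config = /etc/pypln/sphinx.conf
"""


def load_sphinx_settings(text):
    """Read the (hypothetical) [sphinx] section of a broker configuration."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["sphinx"]
    return {
        "host": section.get("host"),
        "port": section.getint("port"),
        "index_config": section.get("index_config"),
    }


settings = load_sphinx_settings(BROKER_CONFIG)
print(settings["index_config"])
```

With something along these lines, the indexer worker would only need the broker configuration to find both the Sphinx host and the sphinx.conf path.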

flavioamieiro · 2014-06-30T14:32:55Z

This sounds like a good idea.

fccoelho · 2014-06-30T23:05:31Z

+1 to this.

fccoelho · 2015-03-22T18:17:19Z

I think this PR should be abandoned, as I don't think that indexing belongs in PyPLN anymore.

Do you all agree?

flavioamieiro · 2015-03-24T22:28:40Z

I do. I think that, since we are focusing PyPLN more on document analysis than on corpora, indexing is out of scope, at least for now.

fccoelho · 2015-03-27T16:10:21Z

Change of plans. Given the possible establishment of indexing services through the API, let's try to finish this feature later.

fccoelho · 2016-12-01T11:44:26Z

Abandoning this line of work

Implemented basic Sphinxsearch indexer worker.

b333fc6

requires Sphinxsearch to be installed on a reachable host

flavioamieiro reviewed Apr 24, 2014

turicas reviewed Apr 29, 2014

fixed issues pointed out by @turicas

492ea8e

flavioamieiro modified the milestone: Next Release Jul 7, 2014

fccoelho closed this Dec 1, 2016
