Indexing PDF documents does not work. #176

flavioamieiro · 2015-06-25T16:39:40Z

When trying to index a PDF file in the new indexing pipeline, the worker raises an error. The full stack trace (https://gist.github.com/flavioamieiro/d95b2813a9cdd9a8645c) shows that the problem is that the worker sends all the properties of the document to Elasticsearch. In the case of a PDF, the `contents` property holds binary data, and the worker cannot serialize it in order to send it.

I think we should probably not send `contents` together with the rest of the data, but that would mean we would lose the original file contents (since in e4d0cb8 we started deleting the file from GridFS).

A shorter version of the stack trace (with the `contents` property snipped) is below:

[2015-06-25 18:20:55,521: ERROR/MainProcess] Task
pypln.backend.workers.elastic_indexer.ElasticIndexer[76b5682a-660b-480f-b533-48f619246362]
raised unexpected: SerializationError({u'mimetype': 'application/pdf',
u'upload_date': datetime.datetime(2015, 6, 25, 16, 20, 54, 169000),
u'forced_decoding': False, u'language': 'un', u'text': u'This is a minimal
pdf.\n1', u'filename': u'minimal_pdf.pdf_1435249254.17', u'length': 12876,
u'file_id': '558c2a66798ebd634b0b249f', u'file_metadata':
{'UserProperties': 'no', 'Tagged': 'no', 'Form': 'none', 'Producer':
'pdfTeX-1.40.15', 'Creator': 'TeX', 'Encrypted': 'no', 'JavaScript': 'no',
'Suspects': 'no', 'Optimized': 'no', 'PDF version': '1.5', 'ModDate': 'Thu
Jun 25 18:17:25 2015', 'Page size': '612 x 792 pts (letter)',
'CreationDate': 'Thu Jun 25 18:17:25 2015', 'Pages': '1', 'Page rot': '0'},
u'contents': '%PDF-1.5\n%\xd0\xd4\xc5\xd8\n3 0 obj\n<<\n/Length 105
\n/Filter /FlateDecode\n>>\nstream\nx\xda%\x8c\xbb\n\x8 [...]
dobj\nstartxref\n12576\n%%EOF\n', 10, 11, 'invalid continuation byte'))
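
The trailing `10, 11, 'invalid continuation byte'` in the trace looks like the tail of a UTF-8 decode error raised on the first bytes of the PDF header. The sketch below reproduces that error under stated assumptions: it runs on Python 3 and decodes the leading bytes of `contents` explicitly, whereas the worker (which appears to run on Python 2, judging by the `u''` prefixes) presumably hits the same decode implicitly while serializing. The variable names are illustrative only.

```python
# First bytes of the `contents` value shown in the trace above.
contents = b"%PDF-1.5\n%\xd0\xd4\xc5\xd8\n3 0 obj\n"

error = None
try:
    # Treating the raw PDF bytes as UTF-8 text fails: 0xd0 starts a
    # two-byte sequence, but 0xd4 is not a valid continuation byte.
    contents.decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc

# Position and reason of the failure, matching the end of the trace.
print(error.start, error.end, error.reason)
```

The offending byte sits right after the `%PDF-1.5\n%` prefix, so any PDF with the usual binary marker line in its header would likely trigger the same failure.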

@fccoelho @israelst any ideas on how to handle this?

fccoelho · 2015-06-25T17:51:05Z

I think only the indexable fields should be persisted in Elasticsearch. PyPLN is not and should not be a document store. The responsibility for persisting the original document is the user's.

We should not index the original file contents for two reasons: 1) they are not relevant to the search, since the `text` attribute should include the relevant content, and 2) they may be in a binary format that is not serializable. Fixes NAMD#176
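
The idea in that message can be sketched as follows. This is not the actual change from #177, only a minimal illustration assuming Python 3. The helper `indexable_fields` and the constant `EXCLUDED_FIELDS` are hypothetical names, and the sample document is trimmed from the trace above.

```python
import json

# Hypothetical names: neither the constant nor the helper comes from the
# PyPLN code base.
EXCLUDED_FIELDS = frozenset(["contents"])


def indexable_fields(document, excluded=EXCLUDED_FIELDS):
    """Return a copy of `document` without the fields that should not be indexed."""
    return {key: value for key, value in document.items() if key not in excluded}


# Sample document, trimmed from the trace in the issue description.
document = {
    "mimetype": "application/pdf",
    "filename": "minimal_pdf.pdf_1435249254.17",
    "text": "This is a minimal pdf.\n1",
    "contents": b"%PDF-1.5\n%\xd0\xd4\xc5\xd8\n",
}

payload = indexable_fields(document)

# With the binary field dropped, the payload serializes without errors.
serialized = json.dumps(payload)
```

Filtering into a copy leaves the original document untouched, so other workers that still need `contents` are not affected by what the indexer sends.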

flavioamieiro · 2015-06-26T15:34:09Z

@fccoelho I agree. I've opened #177 which fixes the issue.

flavioamieiro added bug pypln-backend worker labels Jun 25, 2015

flavioamieiro mentioned this issue Jun 26, 2015

Fixes ElasticIndexer for binary files #177

Merged

fccoelho closed this as completed in #177 Jun 26, 2015
