Indexing PDF documents does not work. #176

Closed
flavioamieiro opened this issue Jun 25, 2015 · 2 comments

Comments

@flavioamieiro
Member

Indexing a PDF file in the new indexing pipeline makes the worker raise an error. The full stack trace (https://gist.github.com/flavioamieiro/d95b2813a9cdd9a8645c) shows that the worker sends every property of the document to Elasticsearch. For a PDF, the contents property holds binary data, which the worker cannot serialize in order to send it.

I think we should probably not send contents together with the rest of the data, but that would mean we would lose the original file contents (since in e4d0cb8 we started deleting the file from GridFS).

A shorter version of the stack trace (with the contents property snipped) is below:

[2015-06-25 18:20:55,521: ERROR/MainProcess] Task
pypln.backend.workers.elastic_indexer.ElasticIndexer[76b5682a-660b-480f-b533-48f619246362]
raised unexpected: SerializationError({u'mimetype': 'application/pdf',
u'upload_date': datetime.datetime(2015, 6, 25, 16, 20, 54, 169000),
u'forced_decoding': False, u'language': 'un', u'text': u'This is a minimal
pdf.\n1', u'filename': u'minimal_pdf.pdf_1435249254.17', u'length': 12876,
u'file_id': '558c2a66798ebd634b0b249f', u'file_metadata':
{'UserProperties': 'no', 'Tagged': 'no', 'Form': 'none', 'Producer':
'pdfTeX-1.40.15', 'Creator': 'TeX', 'Encrypted': 'no', 'JavaScript': 'no',
'Suspects': 'no', 'Optimized': 'no', 'PDF version': '1.5', 'ModDate': 'Thu
Jun 25 18:17:25 2015', 'Page size': '612 x 792 pts (letter)',
'CreationDate': 'Thu Jun 25 18:17:25 2015', 'Pages': '1', 'Page rot': '0'},
u'contents': '%PDF-1.5\n%\xd0\xd4\xc5\xd8\n3 0 obj\n<<\n/Length 105
\n/Filter /FlateDecode\n>>\nstream\nx\xda%\x8c\xbb\n\x8 [...]
dobj\nstartxref\n12576\n%%EOF\n', 10, 11, 'invalid continuation byte'))

@fccoelho @israelst any ideas on how to handle this?
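
The trailing `10, 11, 'invalid continuation byte'` in the trace looks like the arguments of a `UnicodeDecodeError`, which suggests the serializer tried to decode the raw PDF bytes as UTF-8. The snippet below is a minimal Python 3 sketch of that failure using only the first bytes of the `contents` value shown above; it does not go through the worker or the Elasticsearch client, and the variable names are made up for illustration.

```python
# First bytes of the `contents` value from the trace (the rest is elided there).
pdf_header = b"%PDF-1.5\n%\xd0\xd4\xc5\xd8\n3 0 obj\n"

error = None
try:
    pdf_header.decode("utf-8")
except UnicodeDecodeError as exc:
    # 0xd0 opens a two-byte UTF-8 sequence, but the next byte (0xd4)
    # is not a valid continuation byte, so decoding stops at offset 10.
    error = exc

print(error.start, error.end, error.reason)
# prints: 10 11 invalid continuation byte
```

The offsets match the ones in the trace, which is consistent with the binary `contents` value being the field that breaks serialization.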

@fccoelho
Member

I think only the indexable fields should be persisted in Elasticsearch. PyPLN is not, and should not be, a document store. Persisting the original document is the user's responsibility.

flavioamieiro added a commit to flavioamieiro/pypln.backend that referenced this issue Jun 26, 2015
We should not index the original file contents for two reasons: 1) they are not
relevant to the search. The `text` attribute should include the relevant
content and 2) they may be in a binary format that will not be serializable.

Fixes NAMD#176
@flavioamieiro
Member Author

@fccoelho I agree. I've opened #177 which fixes the issue.
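
For readers who land here with the same error, the idea in the commit message (leave the original file contents out of what gets indexed) can be sketched as follows. This is not the code from #177: `NON_INDEXABLE_FIELDS` and `indexable_fields` are hypothetical names, the sample document is trimmed from the trace above, and `json.dumps` stands in for whatever serializer the Elasticsearch client uses.

```python
import json

# Hypothetical: keys that should never be sent to the search index.
NON_INDEXABLE_FIELDS = frozenset({"contents"})


def indexable_fields(document, excluded=NON_INDEXABLE_FIELDS):
    """Return a copy of `document` without the excluded keys."""
    return {key: value for key, value in document.items() if key not in excluded}


# Trimmed-down version of the document from the trace.
document = {
    "mimetype": "application/pdf",
    "filename": "minimal_pdf.pdf_1435249254.17",
    "text": "This is a minimal pdf.\n1",
    "contents": b"%PDF-1.5\n%\xd0\xd4\xc5\xd8\n",  # raw binary, not serializable
}

body = indexable_fields(document)
payload = json.dumps(body)  # works because the bytes value is no longer present
print(sorted(body))
```

The original `document` is left untouched, so a caller that still needs the raw bytes for another step keeps access to them; only the copy sent for indexing drops `contents`.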
