Indexing PDF documents does not work. #176
Comments
I think only the indexable fields should be persisted in elastic. Pypln is
We should not index the original file contents, for two reasons: 1) they are not relevant to the search, since the `text` attribute should already include the relevant content; and 2) they may be in a binary format that is not serializable. Fixes NAMD#176
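A minimal sketch of the idea in the commit message above, assuming Python 3 and a plain dict as the document. The `INDEXABLE_FIELDS` whitelist and the `indexable_view` helper are hypothetical names used for illustration, not PyPLN's actual code, and the sample field values are made up. It only shows that dropping `contents` leaves a payload that serializes cleanly.

```python
import json

# Hypothetical whitelist: the real set of indexable fields is not given in this thread.
INDEXABLE_FIELDS = ("text", "filename")


def indexable_view(document, fields=INDEXABLE_FIELDS):
    """Return a copy of `document` restricted to the fields meant for search."""
    return {key: document[key] for key in fields if key in document}


# Illustrative document: `contents` holds raw bytes, as it would for a PDF.
doc = {
    "filename": "paper.pdf",
    "contents": b"%PDF-1.4\x00\xff binary payload",
    "text": "Extracted plain text of the PDF.",
}

payload = indexable_view(doc)

# The filtered payload has no bytes values, so JSON serialization succeeds.
body = json.dumps(payload)

# Serializing the full document fails because of the binary `contents` value.
try:
    json.dumps(doc)
    full_document_serializable = True
except TypeError:
    full_document_serializable = False
```

Whitelisting rather than blacklisting `contents` means any other non-serializable property added to the document later is also kept out of the index by default.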
When trying to index a PDF file in the new indexing pipeline, the worker raises an error. Looking at the full stack trace, we can see that the problem is that the worker sends all the properties of the document to elastic. In the case of a PDF, the `contents` property contains binary data, and the worker cannot serialize it properly in order to send it.

I think we should probably not send `contents` together with the rest of the data, but that would mean we would lose the original file contents (since in e4d0cb8 we started deleting the file from GridFS).

A shorter version of the stack trace (with the `contents` property snipped) is below:

@fccoelho @israelst any ideas on how to handle this?