Nokogiri XML Namespaces and gzip decoding #13
nathanstitt wants to merge 2 commits into abhay:master from
Either a newer version of Nokogiri is stricter about parsing attribute namespaces, or Calais has radically changed their schema. Either way, just about all the parts of the code that read attribute values were broken. This is a first pass at correcting the issues, and it gets to the point where all the tests pass. There are probably additional issues lurking that the tests aren't covering. I'll fix those as DocumentCloud encounters them.
I don't think anything we're doing is triggering this. I've verified that
the Accept-Encoding header isn't present, but Calais is still sending
gzipped XML as the reply.
Probably a misconfiguration on their end, but it's easy enough to handle.
A future enhancement might be to set the {"Accept-Encoding" => "gzip"}
header on the request, so that we get gzipped data all the time.
@nathanstitt, I'm looking for someone to properly take ownership of this project since I don't have the cycles to do it myself. Any thoughts on DocumentCloud or yourself taking this on? I could see you guys actually running with it.
@abhay I totally understand, stuff can get crazy and sometimes there just aren't enough days in the week. We'd be very interested in taking over the project. I think it would fit very well with DocumentCloud's mission since we depend on it quite a bit for our entity support. I'm not 100% sure how that would go down, but I'm assuming you could just transfer the repo to DocumentCloud's GitHub account and transfer the Ruby gem to us. Feel free to email me directly, or swing by the #documentcloud IRC channel if you'd like to discuss in real time.
Hi Abhay, have you given any further thought to allowing DocumentCloud to take over support of the gem? We're still attempting to clean up the Gemfile. Please let us know if we can help further. Thanks very much.
This is needed for fixing a few issues that DocumentCloud has encountered while using calais for our entity extraction.
The first is that Calais sometimes returns gzipped content. When that occurs, an exception is thrown since the content can't be decoded (of course). This may have been an intermittent issue with the API, but we figured it can't hurt to attempt to handle it. A further enhancement would be to ask for gzip encoding on the request, which would be more efficient.
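For illustration, here is a minimal sketch of the kind of defensive decoding described above. It is not the code in this branch: `maybe_gunzip` and `GZIP_MAGIC` are hypothetical names, and the sketch assumes the response body is available as a plain string. It only checks for the two-byte gzip magic number and inflates the body when it is present.

```ruby
require "zlib"
require "stringio"

# The first two bytes of every gzip stream (hypothetical constant name).
GZIP_MAGIC = "\x1f\x8b".b

# Hypothetical helper: return the body unchanged unless it looks gzipped,
# in which case inflate it first.
def maybe_gunzip(body)
  bytes = body.to_s.b
  return body unless bytes.start_with?(GZIP_MAGIC)

  Zlib::GzipReader.new(StringIO.new(bytes)).read
end

plain   = "<doc>hello</doc>"
gzipped = Zlib.gzip(plain)

decoded_plain   = maybe_gunzip(plain)    # passed through untouched
decoded_gzipped = maybe_gunzip(gzipped)  # inflated back to the XML text
```

Sniffing the magic bytes rather than trusting a Content-Encoding header fits the situation described here, where the server sends gzipped data without being asked to.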
The second is more pressing. It has to do with newer versions of Nokogiri differing in how they handle namespace prefixes. I believe issues #10 and #11 are attempting to fix the same bug. #11 indicates that the bug started with Nokogiri 1.5.6, but I haven't tracked down when the change occurred.
DocumentCloud has been running with this branch in production for several months now without issue (https://github.com/documentcloud/documentcloud/blob/master/Gemfile#L5). We'd really like to get it merged and a new gem cut so we can remove the "git" references from our Gemfile.
Thanks for the excellent job you've done with the gem thus far. If I can help with any further testing or merging, please let me know.