FITS files that match glob expressions but contain empty HDU tables (NAXIS=0) cause a fatal error #81
Here is another idea I had that could be a nice solution in spark-fits: would it be possible to have something similar to the `mode` option used in the Databricks CSV reader? This would be amazing in spark-fits, as it would allow the user to handle bad FITS files. From the spark-csv documentation: `mode`: the parsing mode. By default it is PERMISSIVE. Possible values are PERMISSIVE (tries to parse all lines, inserting nulls for missing tokens), DROPMALFORMED (drops malformed lines), and FAILFAST (aborts with an exception on any malformed line).
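A hypothetical sketch of how such an option could look from the user side, modeled on the spark-csv reader — neither the `mode` option nor its values exist in spark-fits today, and the path is made up:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("fits-mode-sketch").getOrCreate()

// Hypothetical API: a "mode" option for spark-fits, mirroring spark-csv.
val df = spark.read
  .format("fits")
  .option("hdu", 1)
  .option("mode", "DROPMALFORMED") // skip files with empty HDUs instead of failing
  .load("/data/catalogs/*.fits")
```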
Cheers,
Hi @jacobic, thanks for the detailed report as usual :-) I agree that dealing with empty files would be a nice feature to have, and I will have a deeper look early next week. Having said that, this will require some changes to the codebase, as the header checks are currently performed on the driver and not on the executors. We used to perform extensive checks on headers, but they introduced huge latency when dealing with thousands of files (see #56). If we want such a feature, the header checks would need to be distributed.
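A minimal sketch of that idea, not the spark-fits implementation: `readNAxis` is a hypothetical helper that parses a file's FITS header and returns its NAXIS keyword, and the file list stands in for the paths matched by the glob.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper: open the header of the target HDU and return NAXIS.
def readNAxis(path: String): Int = ??? // to be backed by a real header parser

val spark = SparkSession.builder.getOrCreate()
val files: Seq[String] = Seq("file1.fits", "file2.fits") // paths matched by the glob

// Distribute the header checks over the executors instead of looping
// over the files on the driver.
val goodFiles = spark.sparkContext
  .parallelize(files)
  .filter(path => readNAxis(path) > 0) // discard empty HDUs (NAXIS = 0)
  .collect()
```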
Thanks @JulienPeloton, I am really looking forward to this enhancement, as it is somewhat of a blocker for the project I am working on. Please let me know if you need any assistance with testing, as I am more than happy to help out. Cheers,
Issue 81: discarding empty HDU without failing
Fixed in #82
Hi @JulienPeloton,
I have some additional feedback about spark-fits. Perhaps this issue is too specific to my particular use case, but I thought it was worth mentioning anyway in case there is an opportunity to improve stability :)
I am loading many files at the same time using a glob expression with spark-fits. For 99.9% of these files, my Spark pipeline runs smoothly. In some rare cases, however, the whole pipeline is brought to a halt by files which match the glob expression but have `NAXIS = 0` in their header, i.e. an empty table. These files have an almost identical format to all the other files that I want to load into my master dataframe, but when the data is actually loaded from such a file (along with all the other good files with `NAXIS = 2`), a fatal error occurs. This is clearly because `NAXIS` is not expected to be 0 in this section of the code: `spark-fits/src/main/scala/com/astrolabsoftware/sparkfits/FitsHduImage.scala`, line 76 at commit `1cc93d6`.
The offending part of the header, by contrast, declares an empty data array.
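A representative FITS header card for such an empty HDU, reconstructed from the NAXIS value reported above rather than copied from the attached files:

```
NAXIS   =                    0 / number of data axes
```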
As I am ingesting hundreds of thousands of files, it is very tricky for me to manually find out which ones contain the empty tables: it is like finding needles in a haystack (and spark-fits does not point to the offending file). The only workarounds I can think of are to filter by file size or to check all the headers in advance and remove the offending files, both of which are cumbersome.
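For illustration, a rough sketch of the file-size workaround; the threshold and paths are assumptions to be tuned against known-good files:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()

// Pre-filter the glob by file size before handing paths to spark-fits.
// Assumption: empty-HDU files are much smaller than normal ones, so
// anything below the threshold is treated as suspect and dropped.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val minBytes = 20480L // assumed threshold; tune against known-good files
val goodPaths = fs.globStatus(new Path("/data/catalogs/*.fits"))
  .filter(_.getLen >= minBytes)
  .map(_.getPath.toString)

// Load each surviving file and union them into one dataframe.
val df = goodPaths
  .map(p => spark.read.format("fits").option("hdu", 1).load(p))
  .reduce(_ union _)
```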
As an illustration, attached are some example FITS files that are (1) normal and (2) empty. The latter are the kind of files that I would like spark-fits to handle (and preferably warn about) without crashing.
normal_and_empty_table_example.zip
Would it be possible to add a try/catch block in `FitsHduImage.scala` such that spark-fits is able to ignore empty tables, and/or to provide some sort of warning (with verbose=True in spark-fits) so the user is aware that such files are being skipped, or at least that such files are the ones causing the error? This sort of behaviour would be very useful. Please let me know what you think.
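Something along these lines, purely as an illustrative sketch and not the actual spark-fits internals — `rowsForHdu`, `decodeRows`, the `verbose` flag, and the logger name are all assumptions:

```scala
import org.apache.log4j.Logger

// Illustrative sketch of the requested behaviour: guard the per-HDU decoding
// so an empty HDU (NAXIS = 0) yields no rows plus a warning, instead of
// crashing the job. `decodeRows` stands in for the real decoding logic.
def rowsForHdu(file: String, naxis: Int, verbose: Boolean)
              (decodeRows: => Iterator[Seq[Any]]): Iterator[Seq[Any]] = {
  val log = Logger.getLogger("spark-fits")
  if (naxis == 0) {
    if (verbose) log.warn(s"Skipping $file: empty HDU (NAXIS = 0)")
    Iterator.empty
  } else {
    decodeRows
  }
}
```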
Thanks as always, keep up the good work!
Jacob