Improve speed by using `.section` attribute rather than `.data` attribute #132
Conversation
Accessing the `.data` attribute on a compressed image causes the entire image to be decompressed and read into memory. That defeats the ability of astrocut to reduce the IO requirements by reading only the necessary data.

This change instead uses the `.section` attribute to access the data, both when reading the pixels and when checking to see which extensions have images. The only remaining use of the `.data` attribute is for access to the cutout pixels. The implementation of the `.section` attribute makes it efficient for accessing data in AWS S3 buckets as well. The speedup ranges from a factor of 2 (for large cutouts) to at least a factor of 10 (for small cutouts).

Another benefit of the change is that it greatly reduces the memory requirements for the returned cutout. The current version using the `.data` attribute returns a subset of the full image array, which winds up dragging the entire array along with it. The `.section` approach does not include the entire array.

This change also fixes a bug in the application of the `memory_only` flag. When true, it should not write any output files to disk. However, that was ignored when `single_outfile` is False. This fixes that bug. It was already working correctly when `single_outfile` is True.
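For context, a minimal sketch of the access pattern the PR switches to (the file name, extension index, and cutout bounds here are hypothetical, and `.section` on compressed HDUs requires a recent astropy):

```python
from astropy.io import fits

# Hypothetical compressed FITS file and cutout bounds, for illustration only.
with fits.open("compressed_image.fits", mode="denywrite", memmap=True) as hdulist:
    # .data decompresses and loads the ENTIRE image before slicing:
    cutout_slow = hdulist[1].data[100:356, 200:456]
    # .section decompresses only the tiles covering the requested region:
    cutout_fast = hdulist[1].section[100:356, 200:456]
```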
@snbianco @falkben Just tagging you to keep this fix on your radar. I added a bit of text to the description. I only recently realized this is not just an impact on speed. Cutouts returned in memory also apparently return a view into the full (uncompressed) array, meaning the memory usage of the underlying data can be much larger than the cutout. I have an example that extracts 250 cutouts of size 256x256 pixels (total size about 65 MB). The resulting list from the current code was going to take 39 GB (!), which overflowed the 15 GB memory limit on the machine I was using. The new code using `.section` does not have this problem.

I think we should try to get this merged as soon as possible since it has some serious impacts on performance.
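To see why a view is so costly, here is a standalone NumPy sketch (the sizes are illustrative, not from astrocut): slicing an in-memory array returns a view that keeps the whole parent array alive.

```python
import numpy as np

full = np.zeros((10_000, 10_000), dtype=np.float32)  # ~400 MB parent array
cutout = full[:256, :256]       # a view: holds a reference to the parent
print(cutout.base is full)      # True - the full 400 MB stays alive
detached = cutout.copy()        # an independent 256x256 array (~256 KB)
print(detached.base is None)    # True
```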
Hi Rick! Sorry it's taken me a while to get to this - I've been at ADASS this past week. Thank you for this PR; I'm excited about these changes and this was a really great catch. I'm a little concerned that we never noticed the bug with the `memory_only` flag, so I'll probably make another PR to add test cases sometime next week.

I had one minor comment/question, but these changes look good to me.
astrocut/cutouts.py (Outdated)

```diff
@@ -279,7 +279,7 @@ def fits_cut(input_files, coordinates, cutout_size, correct_wcs=False, extension
     with fits.open(in_fle, mode='denywrite', memmap=True, fsspec_kwargs=fsspec_kwargs) as hdulist:

         # Sorting out which extension(s) to cutout
-        all_inds = np.where([x.is_image and (x.data is not None) for x in hdulist])[0]
+        all_inds = np.where([x.is_image and (x.section.shape != ()) for x in hdulist])[0]
```
For this line and 540, what was the rationale for changing this comparison? Could we use `x.section is not None` or `not x.section.size`?
Hi Sam! This is the line I wondered about most myself. I did some testing with `.section` and found that `x.section` is not None, even for empty primary HDUs. So I know a simple `x.section is not None` won't work. The `x.section.shape != ()` weeds out the empty primary HDU. But `x.size != 0` would also eliminate the empty primary HDU.

It is necessary to include the `x.is_image` test to avoid selecting table HDUs. I tested a file with a BinTableHDU and found that (1) `x.size` is not zero, (2) `x.is_image` is False, and (3) `x.section` is not defined (raises an AttributeError).

So that's how I came up with my code. I think either of these could work:

`x.is_image and (x.section.shape != ())`

or

`x.is_image and (x.size > 0)`

I do kinda like the `x.size` version of that better. The paired parentheses `()` are ugly and hard to read. So I'd be happy with that change.
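A small self-contained sketch of the three cases described above (built with in-memory HDUs; the names and data are illustrative, but the behavior matches the file-based tests described in this thread):

```python
import numpy as np
from astropy.io import fits

hdulist = fits.HDUList([
    fits.PrimaryHDU(),                                    # empty primary HDU
    fits.ImageHDU(data=np.ones((5, 5))),                  # real image data
    fits.BinTableHDU.from_columns(
        [fits.Column(name="a", format="E", array=np.arange(3.0))]),
])

for i, x in enumerate(hdulist):
    # x.size is computed from the header, so it never loads the data.
    print(i, type(x).__name__, x.is_image and x.size > 0)
# 0 PrimaryHDU False   (empty: x.size == 0 and x.section.shape == ())
# 1 ImageHDU True
# 2 BinTableHDU False  (x.is_image is False; x.section raises AttributeError)
```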
I just made that change - see what you think.
Looks great! I agree that it's a bit more readable.

Now that I think about it more, I don't think that the innermost parentheses around `x.size > 0` are good style, so we may as well remove those while we're working on these lines. I'll make some suggestions on the relevant lines.
Looks good to me - I committed those changes.
Co-authored-by: Sam Bianco <[email protected]>
Codecov Report

Attention: Patch coverage is …

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main     #132      +/-   ##
==========================================
+ Coverage   94.61%   95.82%   +1.20%
==========================================
  Files           9       11       +2
  Lines        1524     1653     +129
==========================================
+ Hits         1442     1584     +142
+ Misses         82       69      -13
```

☔ View full report in Codecov by Sentry.
Looks good to me! I'll make another PR with updated tests to address the coverage issue. Thanks again for this PR, this is a huge help!
Very nice. Sorry it took a while for me to take a look at this. I wonder if we need a similar change at https://github.com/spacetelescope/astrocut/blob/main/astrocut/cutout_processing.py#L463, and maybe that would speed up the moving target cutouts considerably?
@falkben That's a good question. I think the answer is yes, we should make the same change there. If this code that pieces together images from different skycells gets applied to Pan-STARRS images (which is on my list of things to work on), it will be much faster if it uses `.section`.

I believe these changes will not make much difference in speed for uncompressed images (e.g., TESS data cubes and HAP-MVM images). But I have not actually done timing tests on uncompressed images in the S3 bucket to see whether it has an effect on those images too. And it is possible that the …
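For reference, a sketch of the cloud access pattern where `.section` pays off (the S3 URI is hypothetical; astropy >= 5.2 with fsspec/s3fs installed is assumed) - only the byte ranges covering the requested pixels are downloaded:

```python
from astropy.io import fits

uri = "s3://some-bucket/path/to/image.fits"  # hypothetical S3 location
with fits.open(uri, mode="denywrite", fsspec_kwargs={"anon": True}) as hdulist:
    # Fetches only the byte ranges covering this 256x256 region:
    cutout = hdulist[1].section[0:256, 0:256]
```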
Agreed! I'll make this change on the same MR as the updated unit tests and do some benchmarking as well.
@rlwastro @falkben After looking into this a bit more, I don't think that changing line 463 in `cutout_processing.py` is necessary after all.
@snbianco I have not quite been able to figure out what the parameters are for that function, but

`img_arrs = np.array([hdu.data for hdu in template_hdu_arr])`

only really makes sense if the entire image arrays are needed anyway. So yes, I agree with you!
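A tiny sketch of why `.section` would not help at that spot (the HDU list here is fabricated; only the quoted line comes from the linked code): every pixel of every HDU is consumed, so the full data must be read regardless.

```python
import numpy as np
from astropy.io import fits

# Fabricated stand-in for template_hdu_arr from cutout_processing.py.
template_hdu_arr = [fits.ImageHDU(data=np.full((4, 4), i)) for i in range(3)]

# The quoted line stacks ALL pixels of every HDU, so swapping .data for
# .section[...] would still end up reading the entire arrays:
img_arrs = np.array([hdu.data for hdu in template_hdu_arr])
print(img_arrs.shape)  # (3, 4, 4)
```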