
Improve speed by using .section attribute rather than .data attribute #132

Merged: 3 commits merged into spacetelescope:main from rlwastro-section on Nov 18, 2024

Conversation

@rlwastro (Contributor) commented Nov 11, 2024

Accessing the .data attribute on a compressed image causes the entire image to be decompressed and read into memory. That defeats the ability of astrocut to reduce the IO requirements by reading only the necessary data.

Another benefit of the change is that it greatly reduces the memory footprint of the returned cutout. The current version, which uses .data, returns a view into the full image array, so the cutout drags the entire array along with it. The .section approach does not retain the full array.

This change instead uses the .section attribute to access the data, both when reading the pixels and when checking to see which extensions have images. The only remaining use of the .data attribute is for access to the cutout pixels. The implementation of the section attribute makes it efficient for accessing data in AWS S3 buckets as well.

The speedup ranges from a factor of 2 (for large cutouts) to at least a factor of 10 (for small cutouts).

This change also fixes a bug in the handling of the memory_only flag: when True, no output files should be written to disk, but the flag was ignored when single_outfile was False. This fixes that bug; it was already working correctly when single_outfile was True.
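For context, a minimal sketch of the two access patterns (the file name, HDU index, and slice bounds here are made up for illustration):

```python
from astropy.io import fits

# Hypothetical tile-compressed image; any compressed image HDU behaves this way.
with fits.open("example_compressed.fits", mode="denywrite", memmap=True) as hdulist:
    # Old pattern: .data decompresses and materializes the FULL image
    # before slicing, so the whole array passes through memory:
    # cutout = hdulist[1].data[100:356, 200:456]

    # New pattern: .section reads and decompresses only the tiles that
    # overlap the requested region:
    cutout = hdulist[1].section[100:356, 200:456]
```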

@rlwastro (Contributor, Author) commented:
@snbianco @falkben Just tagging you to keep this fix on your radar. I added a bit of text to the description. I only recently realized this is not just an impact on speed. Cutouts returned in memory also apparently return a view into the full (uncompressed) array, meaning the memory usage of the underlying data can be much larger than the cutout. I have an example that extracts 250 cutouts of size 256x256 pixels (total size about 65 MB). The resulting list from the current code was going to take 39 GB (!), which overflowed the 15 GB memory limit on the machine I was using. The new code using .section works correctly and avoids the memory issues.
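The view behavior described here is plain NumPy, independent of FITS. A small sketch (the array sizes are made up):

```python
import numpy as np

full = np.zeros((10000, 10000), dtype=np.float32)  # ~400 MB
cutout = full[0:256, 0:256]                        # a view, not a copy

# The slice keeps the full array alive for as long as the cutout exists:
print(cutout.base is full)    # True
print(cutout.nbytes)          # 262144 (~0.25 MB), but ~400 MB stays referenced

detached = cutout.copy()      # an independent ~0.25 MB array
print(detached.base is None)  # True
```

Because .section assembles the returned array from only the requested region, the result carries no reference to the full image.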

I think we should try to get this merged as soon as possible since it has some serious impacts on performance.

@snbianco (Collaborator) left a comment:

Hi Rick! Sorry it's taken me a while to get to this - I've been at ADASS this past week. Thank you for this PR; I'm excited about these changes and this was a really great catch. I'm a little concerned that we never noticed the bug with the memory_only flag, so I'll probably make another PR to add test cases sometime next week.

I had one minor comment/question, but these changes look good to me.

```diff
@@ -279,7 +279,7 @@ def fits_cut(input_files, coordinates, cutout_size, correct_wcs=False, extension
     with fits.open(in_fle, mode='denywrite', memmap=True, fsspec_kwargs=fsspec_kwargs) as hdulist:

         # Sorting out which extension(s) to cutout
-        all_inds = np.where([x.is_image and (x.data is not None) for x in hdulist])[0]
+        all_inds = np.where([x.is_image and (x.section.shape != ()) for x in hdulist])[0]
```
@snbianco (Collaborator) commented Nov 15, 2024:

For this line and 540, what was the rationale for changing this comparison? Could we use x.section is not None or not x.section.size?

@rlwastro (Contributor, Author) commented Nov 15, 2024:

Hi Sam! This is the line I wondered about most myself. I did some testing with .section and found that x.section is not None, even for empty primary HDUs. So I know a simple x.section is not None won't work. The x.section.shape != () weeds out the empty primary HDU. But x.size != 0 would also eliminate the empty primary HDU.

It is necessary to include the x.is_image test to avoid selecting table HDUs. I tested a file with a BinTableHDU and found that (1) x.size is not zero, (2) x.is_image is False, and (3) x.section is not defined (raises an AttributeError).

So that's how I came up with my code. I think either of these could work:

x.is_image and (x.section.shape != ())

or

x.is_image and (x.size > 0)

I do kinda like the x.size version of that better. The paired parentheses () are ugly and hard to read. So I'd be happy with that change.
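A small reproduction of those observations with constructed HDUs (not the actual test file):

```python
import numpy as np
from astropy.io import fits

hdulist = fits.HDUList([
    fits.PrimaryHDU(),                       # empty primary HDU
    fits.ImageHDU(data=np.zeros((10, 10))),  # image extension
    fits.BinTableHDU.from_columns(
        [fits.Column(name="a", format="E", array=np.arange(5.0))]),
])

# PrimaryHDU:  is_image=True,  size == 0  -> rejected by x.size > 0
# ImageHDU:    is_image=True,  size > 0   -> selected
# BinTableHDU: is_image=False, size > 0   -> rejected by x.is_image,
#              so its missing .section attribute is never touched
all_inds = np.where([x.is_image and x.size > 0 for x in hdulist])[0]
print(all_inds)  # [1]
```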

@rlwastro (Contributor, Author) commented:

I just made that change, see what you think.

@snbianco (Collaborator) commented:

Looks great! I agree that it's a bit more readable.

Now that I think about it more, I don't think that the innermost parentheses around x.size > 0 are good style, so we may as well remove those while we're working on these lines. I'll make some suggestions on the relevant lines.

@rlwastro (Contributor, Author) commented:

Looks good to me - I committed those changes.

[Two review suggestions on astrocut/cutouts.py, now outdated and resolved]
codecov bot commented Nov 16, 2024

Codecov Report

Attention: Patch coverage is 87.50000% with 1 line in your changes missing coverage. Please review.

Project coverage is 95.82%. Comparing base (11b0567) to head (667fd83).
Report is 8 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| astrocut/cutouts.py | 87.50% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #132      +/-   ##
==========================================
+ Coverage   94.61%   95.82%   +1.20%     
==========================================
  Files           9       11       +2     
  Lines        1524     1653     +129     
==========================================
+ Hits         1442     1584     +142     
+ Misses         82       69      -13     


@snbianco (Collaborator) left a comment:

Looks good to me! I'll make another PR with updated tests to address the coverage issue. Thanks again for this PR, this is a huge help!

@snbianco merged commit 7430d79 into spacetelescope:main on Nov 18, 2024
10 checks passed
@rlwastro deleted the rlwastro-section branch on November 18, 2024 at 16:13
@falkben (Member) commented Nov 18, 2024:

Very nice. Sorry it took a while for me to take a look at this. I wonder if we need a similar change (.data -> .section) for this part of the code:

https://github.com/spacetelescope/astrocut/blob/main/astrocut/cutout_processing.py#L463

and maybe that would speed up the moving target cutouts considerably?

@rlwastro (Contributor, Author) commented:
@falkben That's a good question. I think the answer is yes, we should make the same change there. If this code that pieces together images from different skycells gets applied to Pan-STARRS images (which is on my list of things to work on), it will be much faster if .data is changed to .section in that code.

I believe these changes will not make much difference in speed for uncompressed images (e.g., TESS data cubes and HAP-MVM images). But I have not actually done timing tests on uncompressed images in the S3 bucket to see whether it has an effect on those images too. And it is possible that the .section approach will improve memory handling even for uncompressed images. I think it can't hurt in any case to make the change you suggest.
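If anyone wants to run those timing tests, a minimal harness along these lines would do it (the file path, HDU index, and slice bounds are placeholders):

```python
import time
from astropy.io import fits

def time_cutout(path, use_section):
    """Time one 256x256 cutout from extension 1 of a FITS file."""
    t0 = time.perf_counter()
    with fits.open(path, mode="denywrite", memmap=True) as hdulist:
        hdu = hdulist[1]
        if use_section:
            cut = hdu.section[1000:1256, 2000:2256]
        else:
            cut = hdu.data[1000:1256, 2000:2256].copy()  # copy to detach the view
    return time.perf_counter() - t0

for use_section in (False, True):
    print(use_section, time_cutout("some_image.fits", use_section))
```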

@snbianco (Collaborator) commented Nov 19, 2024:

Agreed! I'll make this change on the same MR as the updated unit tests and do some benchmarking as well.

@snbianco (Collaborator) commented:
@rlwastro @falkben After looking into this a bit more, I don't think that changing line 463 in cutout_processing.py will work. build_default_combine_function initializes a function to combine an array of cutouts, and the entirety of each cutout has to be loaded into memory to check for NaN values and compute the pixel weights.
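For illustration, a NaN-aware combine along the lines described might look like this (a hypothetical sketch, not astrocut's actual implementation); it makes clear why every cutout must be fully in memory:

```python
import numpy as np

def combine_mean_ignore_nan(img_arrs):
    """Mean-combine a stack of cutouts of shape (n_cutouts, ny, nx),
    ignoring NaN pixels. The NaN mask and per-pixel weights span every
    pixel of every cutout, so all the data must already be loaded."""
    valid = ~np.isnan(img_arrs)
    weights = valid.sum(axis=0)                     # valid contributions per pixel
    total = np.where(valid, img_arrs, 0.0).sum(axis=0)
    return total / np.maximum(weights, 1)           # guard against all-NaN pixels
```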

@rlwastro (Contributor, Author) commented Nov 22, 2024:

@snbianco I have not quite been able to figure out what the parameters are for build_default_combine_function. But assuming that they are already image cutouts that have been extracted from the larger images, I think you are correct. And in fact that line of code:

img_arrs = np.array([hdu.data for hdu in template_hdu_arr])

... only really makes sense if the data attributes are used rather than the section attributes. It implies that we are stacking up all the individual cutouts into a big cube.

So yes, I agree with you!
