
Improve speed by using .section attribute rather than .data attribute #132

Merged: 3 commits merged into spacetelescope:main from rlwastro-section on Nov 18, 2024

Conversation

@rlwastro (Contributor) commented Nov 11, 2024

Accessing the .data attribute on a compressed image causes the entire image to be decompressed and read into memory. That defeats the ability of astrocut to reduce the IO requirements by reading only the necessary data.

Another benefit of the change is that it greatly reduces the memory footprint of the returned cutout. The current version, which uses .data, returns a view into the full image array, so the cutout drags the entire array along with it. The .section approach does not retain the full array.

This change instead uses the .section attribute to access the data, both when reading the pixels and when checking to see which extensions have images. The only remaining use of the .data attribute is for access to the cutout pixels. The implementation of the section attribute makes it efficient for accessing data in AWS S3 buckets as well.

The speedup ranges from a factor of 2 (for large cutouts) to at least a factor of 10 (for small cutouts).

This change also fixes a bug in the handling of the memory_only flag: when True, no output files should be written to disk, but the flag was ignored when single_outfile was False. This fixes that bug; it was already working correctly when single_outfile was True.
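For context, a minimal sketch of the two access patterns (the file name, HDU index, and slice bounds here are made up for illustration):

```python
from astropy.io import fits

# Hypothetical tile-compressed image; any compressed image HDU behaves this way.
with fits.open("example_compressed.fits", mode="denywrite", memmap=True) as hdulist:
    # Old pattern: .data decompresses and materializes the FULL image
    # before slicing, so the whole array passes through memory:
    # cutout = hdulist[1].data[100:356, 200:456]

    # New pattern: .section reads and decompresses only the tiles that
    # overlap the requested region:
    cutout = hdulist[1].section[100:356, 200:456]
```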

@rlwastro (Contributor, Author) commented:
@snbianco @falkben Just tagging you to keep this fix on your radar. I added a bit of text to the description. I only recently realized this is not just an impact on speed. Cutouts returned in memory also apparently return a view into the full (uncompressed) array, meaning the memory usage of the underlying data can be much larger than the cutout. I have an example that extracts 250 cutouts of size 256x256 pixels (total size about 65 MB). The resulting list from the current code was going to take 39 GB (!), which overflowed the 15 GB memory limit on the machine I was using. The new code using .section works correctly and avoids the memory issues.
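The view behavior described here is plain NumPy, independent of FITS. A small sketch (the array sizes are made up):

```python
import numpy as np

full = np.zeros((10000, 10000), dtype=np.float32)  # ~400 MB
cutout = full[0:256, 0:256]                        # a view, not a copy

# The slice keeps the full array alive for as long as the cutout exists:
print(cutout.base is full)    # True
print(cutout.nbytes)          # 262144 (~0.25 MB), but ~400 MB stays referenced

detached = cutout.copy()      # an independent ~0.25 MB array
print(detached.base is None)  # True
```

Because .section assembles the returned array from only the requested region, the result carries no reference to the full image.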

I think we should try to get this merged as soon as possible since it has some serious impacts on performance.

@snbianco (Collaborator) left a comment:

Hi Rick! Sorry it's taken me a while to get to this - I've been at ADASS this past week. Thank you for this PR; I'm excited about these changes and this was a really great catch. I'm a little concerned that we never noticed the bug with the memory_only flag, so I'll probably make another PR to add test cases sometime next week.

I had one minor comment/question, but these changes look good to me.

```diff
@@ -279,7 +279,7 @@ def fits_cut(input_files, coordinates, cutout_size, correct_wcs=False, extension
     with fits.open(in_fle, mode='denywrite', memmap=True, fsspec_kwargs=fsspec_kwargs) as hdulist:

         # Sorting out which extension(s) to cutout
-        all_inds = np.where([x.is_image and (x.data is not None) for x in hdulist])[0]
+        all_inds = np.where([x.is_image and (x.section.shape != ()) for x in hdulist])[0]
```
@snbianco (Collaborator) commented Nov 15, 2024:

For this line and 540, what was the rationale for changing this comparison? Could we use x.section is not None or not x.section.size?

@rlwastro (Contributor, Author) commented Nov 15, 2024:

Hi Sam! This is the line I wondered about most myself. I did some testing with .section and found that x.section is not None, even for empty primary HDUs. So I know a simple x.section is not None won't work. The x.section.shape != () weeds out the empty primary HDU. But x.size != 0 would also eliminate the empty primary HDU.

It is necessary to include the x.is_image test to avoid selecting table HDUs. I tested a file with a BinTableHDU and found that (1) x.size is not zero, (2) x.is_image is False, and (3) x.section is not defined (raises an AttributeError).

So that's how I came up with my code. I think either of these could work:

x.is_image and (x.section.shape != ())

or

x.is_image and (x.size > 0)

I do kinda like the x.size version of that better. The paired parentheses () are ugly and hard to read. So I'd be happy with that change.
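A small reproduction of those observations with constructed HDUs (not the actual test file):

```python
import numpy as np
from astropy.io import fits

hdulist = fits.HDUList([
    fits.PrimaryHDU(),                       # empty primary HDU
    fits.ImageHDU(data=np.zeros((10, 10))),  # image extension
    fits.BinTableHDU.from_columns(
        [fits.Column(name="a", format="E", array=np.arange(5.0))]),
])

# PrimaryHDU:  is_image=True,  size == 0  -> rejected by x.size > 0
# ImageHDU:    is_image=True,  size > 0   -> selected
# BinTableHDU: is_image=False, size > 0   -> rejected by x.is_image,
#              so its missing .section attribute is never touched
all_inds = np.where([x.is_image and x.size > 0 for x in hdulist])[0]
print(all_inds)  # [1]
```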

@rlwastro (Contributor, Author) commented:

I just made that change, see what you think.

@snbianco (Collaborator) commented:

Looks great! I agree that it's a bit more readable.

Now that I think about it more, I don't think that the innermost parentheses around x.size > 0 are good style, so we may as well remove those while we're working on these lines. I'll make some suggestions on the relevant lines.

@rlwastro (Contributor, Author) commented:

Looks good to me - I committed those changes.

[Two review suggestions on astrocut/cutouts.py, now outdated and resolved]
codecov bot commented Nov 16, 2024

Codecov Report

Attention: Patch coverage is 87.50000% with 1 line in your changes missing coverage. Please review.

Project coverage is 95.82%. Comparing base (11b0567) to head (667fd83).
Report is 8 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| astrocut/cutouts.py | 87.50% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #132      +/-   ##
==========================================
+ Coverage   94.61%   95.82%   +1.20%     
==========================================
  Files           9       11       +2     
  Lines        1524     1653     +129     
==========================================
+ Hits         1442     1584     +142     
+ Misses         82       69      -13     


@snbianco (Collaborator) left a comment:

Looks good to me! I'll make another PR with updated tests to address the coverage issue. Thanks again for this PR, this is a huge help!

@snbianco merged commit 7430d79 into spacetelescope:main on Nov 18, 2024
10 checks passed
@rlwastro deleted the rlwastro-section branch on November 18, 2024 at 16:13
@falkben (Member) commented Nov 18, 2024:

Very nice. Sorry it took a while for me to take a look at this. I wonder if we need a similar change (.data -> .section) for this part of the code:

https://github.com/spacetelescope/astrocut/blob/main/astrocut/cutout_processing.py#L463

and maybe that would speed up the moving target cutouts considerably?

@rlwastro (Contributor, Author) commented:
@falkben That's a good question. I think the answer is yes, we should make the same change there. If this code that pieces together images from different skycells gets applied to Pan-STARRS images (which is on my list of things to work on), it will be much faster if .data is changed to .section in that code.

I believe these changes will not make much difference in speed for uncompressed images (e.g., TESS data cubes and HAP-MVM images). But I have not actually done timing tests on uncompressed images in the S3 bucket to see whether it has an effect on those images too. And it is possible that the .section approach will improve memory handling even for uncompressed images. I think it can't hurt in any case to make the change you suggest.
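If anyone wants to run those timing tests, a minimal harness along these lines would do it (the file path, HDU index, and slice bounds are placeholders):

```python
import time
from astropy.io import fits

def time_cutout(path, use_section):
    """Time one 256x256 cutout from extension 1 of a FITS file."""
    t0 = time.perf_counter()
    with fits.open(path, mode="denywrite", memmap=True) as hdulist:
        hdu = hdulist[1]
        if use_section:
            cut = hdu.section[1000:1256, 2000:2256]
        else:
            cut = hdu.data[1000:1256, 2000:2256].copy()  # copy to detach the view
    return time.perf_counter() - t0

for use_section in (False, True):
    print(use_section, time_cutout("some_image.fits", use_section))
```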

@snbianco (Collaborator) commented Nov 19, 2024:

Agreed! I'll make this change on the same MR as the updated unit tests and do some benchmarking as well.

@snbianco (Collaborator) commented:
@rlwastro @falkben After looking into this a bit more, I don't think that changing line 463 in cutout_processing.py will work. build_default_combine_function initializes a function to combine an array of cutouts, and the entirety of each cutout has to be loaded into memory to check for NaN values and compute the pixel weights.
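For illustration, a NaN-aware combine along the lines described might look like this (a hypothetical sketch, not astrocut's actual implementation); it makes clear why every cutout must be fully in memory:

```python
import numpy as np

def combine_mean_ignore_nan(img_arrs):
    """Mean-combine a stack of cutouts of shape (n_cutouts, ny, nx),
    ignoring NaN pixels. The NaN mask and per-pixel weights span every
    pixel of every cutout, so all the data must already be loaded."""
    valid = ~np.isnan(img_arrs)
    weights = valid.sum(axis=0)                     # valid contributions per pixel
    total = np.where(valid, img_arrs, 0.0).sum(axis=0)
    return total / np.maximum(weights, 1)           # guard against all-NaN pixels
```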

@rlwastro (Contributor, Author) commented Nov 22, 2024:

@snbianco I have not quite been able to figure out what the parameters are for build_default_combine_function. But assuming that they are already image cutouts that have been extracted from the larger images, I think you are correct. And in fact that line of code:

img_arrs = np.array([hdu.data for hdu in template_hdu_arr])

... only really makes sense if the data attributes are used rather than the section attributes. It implies that we are stacking up all the individual cutouts into a big cube.

So yes, I agree with you!
