Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

s3: open()/read() does too many HEAD requests #1407

Open
craigds opened this issue May 30, 2024 · 2 comments
Open

s3: open()/read() does too many HEAD requests #1407

craigds opened this issue May 30, 2024 · 2 comments

Comments

@craigds
Copy link
Contributor

craigds commented May 30, 2024

We've noticed that using S3Storage.open("file.x").read() does a lot of HEAD requests in addition to the GET:

"HEAD /file.x HTTP/1.1" 200 0
"HEAD /file.x HTTP/1.1" 200 0
"GET /file.x HTTP/1.1" 200 123

These are caused by:

  • first HEAD request is here during the S3File constructor - because opening a file doesn't necessarily mean you're going to read the file, and so a HEAD request is required to ensure the file actually exists
  • second HEAD request is actually caused by boto3 because we use download_fileobj here. That method is a high-level managed transfer. It does the HEAD request to find out the file size so it can potentially do a multipart download of large files using multiple threads.

When called in a tight loop these extra requests can slow things down a fair bit, especially for large numbers of small files.

I propose:

  1. Eliminate the request in the constructor by just hitting self.file (thus triggering the download_fileobj right away.). Probably most callers will be calling .read() immediately anyway. Add a config option (EAGER_DOWNLOAD?) to opt out if you really don't want to, but I don't see any common reason you wouldn't - If you don't want to read the file but just want object size or something, you don't need to call S3Storage.open() at all, you can use S3Storage.size()
  2. Eliminate the request in the download_fileobj by using get() instead of download_fileobj. This will probably be context-dependent (for larger files, download_fileobj may perform better), so it probably needs to be opt-in via a setting - what about USE_MULTIPART_DOWNLOAD?

Thanks for your consideration :)

@craigds
Copy link
Contributor Author

craigds commented May 31, 2024

to be clear i'm happy to submit a PR to implement this change :) just wanted to solicit some feedback on the ideas first

@jschneier
Copy link
Owner

Thanks for opening this, people also pay for these requests so best to minimize.

I strongly want to avoid adding settings where possible.

For option 1, would we still get an exception if you try to read a file that doesn't exist? As long as we maintain that invariant I think that is certainly the best way.

Am happy to accept a PR for this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants