upload using a ChunkReader #47
base: main
Conversation
This prevents the uploaded file from being completely loaded in memory. Instead it will be read chunk-by-chunk.
Thank you very much for this PR, it looks really useful! As I am not very proficient in Python (especially when it comes to async), I will try to find someone competent enough to review it.
@ifedapoolarewaju do you maybe have any feedback on this? If yes, I can take the time to update the code to the 1.0.0 version.
@oliverpool Would you mind having a look at the merge conflicts for your PR?
Sure, but only if someone is willing to take a look at the changes (I don't really want to invest time if this PR is going to stay ignored for another year ;-))
Thank you for the work. Dearly sorry for the delayed review 🙏🏽
I have left some suggestions. I'll be sure to be timely for any follow-up reviews.
```diff
 try:
     async with aiohttp.ClientSession(loop=self.io_loop) as session:
-        async with session.patch(self._url, data=chunk, headers=self._request_headers) as resp:
+        async with session.patch(self._url, data=self._chunk.reset().async_reader(8*1024), headers=self._request_headers) as resp:
```
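For context, a sketch of what an `async_reader` method like the one used above might look like (this is an assumption, not the PR's exact code): aiohttp accepts an async iterable as the request `data`, so yielding the tus chunk in small pieces streams the request body instead of buffering it.

```python
# Hypothetical sketch of a ChunkReader.async_reader method.
async def async_reader(self, chunk_size):
    while True:
        piece = self.read(chunk_size)  # at most chunk_size bytes
        if not piece:                  # b"" means the chunk is exhausted
            return
        yield piece
```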
could we move `8*1024` to a constant variable so it's re-used everywhere else as the internal chunk size?
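A minimal sketch of that suggestion (the constant name `INTERNAL_CHUNK_SIZE` is an assumption, not from the PR):

```python
INTERNAL_CHUNK_SIZE = 8 * 1024  # single source of truth for the internal chunk size

async with session.patch(
    self._url,
    data=self._chunk.reset().async_reader(INTERNAL_CHUNK_SIZE),
    headers=self._request_headers,
) as resp:
    ...
```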
```python
    uploader.get_request_length(),
)


def add_checksum(self, file):
```
could we add types? I believe the type we need here might be `io.BytesIO`
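A sketch of the requested annotation, assuming `io.BytesIO` as suggested (a broader `typing.BinaryIO` would also cover real file objects):

```python
import io

def add_checksum(self, file: io.BytesIO) -> None:  # return type is an assumption
    ...
```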
```python
class ChunkReader(object):
    def __init__(self, file, start, length):
```
could we add types for the params?
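Along the same lines, the constructor could be annotated like this (types inferred from how the params are used, with `io.BytesIO` per the suggestion above):

```python
import io

class ChunkReader(object):
    def __init__(self, file: io.BytesIO, start: int, length: int):
        ...
```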
```python
def read(self, size=-1):
    if self.remaining is None:
        raise Exception("reset() must be called before first read")

    if size == -1 or size > self.remaining:
        size = self.remaining
    data = self.file.read(size)
    self.remaining -= len(data)
    return data
```
I think it'd be better if we separate the class that mimics the file-stream behaviour from every other functionality. This is important so we can easily identify that `read(...)` specifically exists to implement stream readers. So what I mean is, `async_reader` could just be a stand-alone function `async_reader(chunk_reader)`. And `reset` may not be needed if we just used `ChunkReader` directly wherever we need it: for every place that the `ChunkReader` is needed, a new instance of it could be created. And then `ChunkReader` itself only contains the `read` method.
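A sketch of the suggested shape, with `reset` folded into the constructor and `async_reader` as a stand-alone function (names and type annotations are assumptions):

```python
import io


class ChunkReader(object):
    """Reads at most `length` bytes from `file`, starting at offset `start`."""

    def __init__(self, file: io.BytesIO, start: int, length: int):
        file.seek(start)  # position once, on construction (no reset() needed)
        self.file = file
        self.remaining = length

    def read(self, size: int = -1) -> bytes:
        if size == -1 or size > self.remaining:
            size = self.remaining
        data = self.file.read(size)
        self.remaining -= len(data)
        return data


async def async_reader(chunk_reader: ChunkReader, chunk_size: int = 8 * 1024):
    # Stand-alone async generator, as suggested: yields the chunk piece by piece.
    while True:
        piece = chunk_reader.read(chunk_size)
        if not piece:
            return
        yield piece
```

A fresh `ChunkReader` would then be constructed for every request that needs one, instead of calling `reset()` on a shared instance.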
Can someone point me to where the file being uploaded is completely loaded into memory? Having the file completely loaded into memory would be a blocker for the current use case that I'm working on.
```python
try:
    chunk = self.file.read(self._content_length)
```
Doesn't this mean that the file is already read in chunks?
Yes, each tus chunk will be completely loaded in memory (I shouldn’t have used the word “chunk” in my initial comment).
Still, if your chunk is 100 MB, it is suboptimal to load those 100 MB into memory…
I see, it's only the chunks that are fully read (and which this PR tries to avoid). Thanks for clarifying.
@oliverpool Did you see the feedback from @ifedapoolarewaju last year? Are you still interested in pushing this PR forward? From April onwards, we will have more capacity to review any changes to this PR.
This PR prevents the uploaded file from being completely loaded in memory. Instead, it will be read chunk-by-chunk. This can be particularly useful to upload files larger than the available RAM.

My implementation of a `ChunkReader` can probably be improved; thanks in advance for the feedback!
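For illustration, a hedged end-to-end sketch of how such a reader could drive a tus PATCH request (the `patch_chunk` helper and its parameters are hypothetical, and it assumes the stand-alone `async_reader` suggested in review):

```python
import aiohttp

async def patch_chunk(session: aiohttp.ClientSession, url: str,
                      file, offset: int, length: int, headers: dict):
    # One fresh ChunkReader per request; the body is streamed piece by piece,
    # so only a few KiB sit in memory at a time instead of the whole tus chunk.
    reader = ChunkReader(file, start=offset, length=length)
    async with session.patch(url, data=async_reader(reader), headers=headers) as resp:
        resp.raise_for_status()
        return resp.headers.get("Upload-Offset")
```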