upload using a ChunkReader #47

Open · wants to merge 1 commit into main

Conversation

oliverpool

This PR prevents the uploaded file from being completely loaded into memory.
Instead it will be read chunk by chunk.

This can be particularly useful for uploading files larger than the available RAM.

My implementation of a ChunkReader can probably be improved; thanks in advance for the feedback!
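For orientation, here is a minimal sketch of the idea, assuming an aiohttp-based uploader: the names (ChunkReader, reset, async_reader, file/start/length) follow the diff quoted in the review below, but the method bodies are reconstructed and are not the exact PR code.

import asyncio


class ChunkReader:
    # Exposes one tus chunk of a larger file as a small-buffer stream.
    def __init__(self, file, start, length):
        self.file = file        # underlying file object, opened in binary mode
        self.start = start      # byte offset where this tus chunk begins
        self.length = length    # size of the tus chunk in bytes
        self.remaining = None   # set by reset() before the first read

    def reset(self):
        # Seek back to the chunk start so the chunk can be (re)streamed.
        self.file.seek(self.start)
        self.remaining = self.length
        return self

    async def async_reader(self, block_size):
        # Async generator: aiohttp accepts this as the request body and pulls
        # one block at a time, so only ~block_size bytes are held in memory.
        while self.remaining > 0:
            block = self.file.read(min(block_size, self.remaining))
            if not block:
                break
            self.remaining -= len(block)
            yield block
            await asyncio.sleep(0)  # give the event loop a chance to run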

This prevents the uploaded file from being completely loaded in memory.
Instead it will be read chunk-by-chunk.
@Acconut
Member

Acconut commented Oct 15, 2021

Thank you very much for this PR, it looks really useful! As I am not very proficient in Python (especially when it comes to async), I will try to find a person competent enough to review it.

@oliverpool
Author

@ifedapoolarewaju do you maybe have feedback on this?

If yes, I can take the time to update the code to the 1.0.0 version.

@Acconut
Member

Acconut commented Sep 14, 2022

@oliverpool Would you mind having a look at the merge conflicts for your PR?

@oliverpool
Author

Sure, but only if someone is willing to take a look at the changes (I don't really want to invest time if this PR is going to stay ignored for another year ;-)

Contributor

@ifedapoolarewaju left a comment


Thank you for the work. Dearly sorry for the delayed review 🙏🏽
I have left some suggestions. I'll be sure to be timely for any follow-up reviews.

 try:
     async with aiohttp.ClientSession(loop=self.io_loop) as session:
-        async with session.patch(self._url, data=chunk, headers=self._request_headers) as resp:
+        async with session.patch(self._url, data=self._chunk.reset().async_reader(8*1024), headers=self._request_headers) as resp:
Contributor

could we move 8*1024 to a constant variable so it's re-used everywhere else as the internal chunk size
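A hedged sketch of that suggestion; the constant name and its placement are guesses, not part of the PR:

INTERNAL_CHUNK_SIZE = 8 * 1024  # single module-level constant for the internal streaming block size

# the call site above would then become:
# data=self._chunk.reset().async_reader(INTERNAL_CHUNK_SIZE)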

uploader.get_request_length(),
)

def add_checksum(self, file):
Contributor


could we add types? I believe the type we need here might be io.BytesIO
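A hedged sketch of the typed signature being asked for; io.BytesIO follows the reviewer's guess, and the enclosing class name here is only a stand-in, not the actual class touched by the PR:

import io


class Uploader:  # stand-in name for the uploader class in the diff
    def add_checksum(self, file: io.BytesIO) -> None:
        ...

typing.BinaryIO would be a looser alternative that also accepts regular binary files opened from disk.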



class ChunkReader(object):
    def __init__(self, file, start, length):
Contributor


could we add types for the params?
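One possible typed version; the concrete types are assumptions (typing.BinaryIO rather than io.BytesIO so on-disk files also fit), and the remaining attribute is inferred from the read() check shown below:

from typing import BinaryIO, Optional


class ChunkReader(object):
    def __init__(self, file: BinaryIO, start: int, length: int) -> None:
        self.file = file                      # source file, opened in binary mode
        self.start = start                    # byte offset where the chunk begins
        self.length = length                  # chunk size in bytes
        self.remaining: Optional[int] = None  # populated by reset()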

Comment on lines +133 to +141
def read(self, size=-1):
    if self.remaining is None:
        raise Exception("reset() must be called before first read")

    if size == -1 or size > self.remaining:
        size = self.remaining
    data = self.file.read(size)
    self.remaining -= len(data)
    return data
Contributor


I think it'd be better if we separated the class that mimics the file-stream behaviour from every other functionality. This is important so we can easily identify that read(...) specifically exists to implement stream readers.

So what I mean is, async_reader could just be a standalone function async_reader(chunk_reader). And reset may not be needed if we just used ChunkReader directly wherever we need it. So for every place the ChunkReader is needed, a new instance of it could be created, and ChunkReader itself would then only contain the read method.
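If I read the suggestion correctly, the resulting shape could look roughly like this; a hedged sketch, not the reviewer's exact proposal:

import asyncio


class ChunkReader(object):
    # Only mimics the read() side of a file object for a single chunk.
    def __init__(self, file, start, length):
        self.file = file
        self.remaining = length
        self.file.seek(start)  # a fresh instance replaces the old reset()

    def read(self, size=-1):
        if size == -1 or size > self.remaining:
            size = self.remaining
        data = self.file.read(size)
        self.remaining -= len(data)
        return data


async def async_reader(chunk_reader, block_size=8 * 1024):
    # Standalone async generator built on top of anything exposing read().
    while True:
        block = chunk_reader.read(block_size)
        if not block:
            break
        yield block
        await asyncio.sleep(0)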

@peterroelants

This PR prevents the uploaded file from being completely loaded in memory.
Instead it will be read chunk-by-chunk.

Can someone point me to where the file being uploaded is completely loaded into memory?

Having the file completely loaded into memory would be a blocker for the use case I'm currently working on (and using tus-py-client for). However, from what I understand of the code, the file is already read in chunks?

try:
    chunk = self.file.read(self._content_length)


Doesn't this mean that the file is already read in chunks?

Author


Yes, each tus chunk will be completely loaded into memory (I shouldn't have used the word "chunk" in my initial comment).

Still, if your chunk is 100 MB big, it is suboptimal to load those 100 MB into memory…
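To make the distinction concrete, a hedged illustration (the file name and sizes are made up): with a 100 MB tus chunk, the current code holds the entire chunk in memory at once, while the streamed variant only ever buffers one small block.

TUS_CHUNK = 100 * 1024 * 1024   # one 100 MB tus chunk
BLOCK = 8 * 1024                # internal streaming block size

# Current behaviour: the whole tus chunk is materialised in memory.
with open("large_video.mp4", "rb") as f:
    chunk = f.read(TUS_CHUNK)   # ~100 MB resident at once

# Streamed behaviour: the same 100 MB is consumed 8 KiB at a time.
with open("large_video.mp4", "rb") as f:
    remaining = TUS_CHUNK
    while remaining > 0:
        block = f.read(min(BLOCK, remaining))
        if not block:
            break
        remaining -= len(block)
        # each block would be handed to the HTTP client and then discarded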


I see, it's only the chunks that are fully read (and which this PR tries to avoid). Thanks for clarifying.

@Acconut
Member

Acconut commented Mar 26, 2023

@oliverpool Did you see the feedback from @ifedapoolarewaju last year? Are you still interested in pushing this PR forward? From April onwards, we will have more capacity to review any changes to this PR.
