-
Notifications
You must be signed in to change notification settings - Fork 191
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ORM: Add get_size_on_disk
method to RemoteData
#6584
ORM: Add get_size_on_disk
method to RemoteData
#6584
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #6584 +/- ##
==========================================
+ Coverage 77.92% 77.94% +0.02%
==========================================
Files 563 563
Lines 41671 41761 +90
==========================================
+ Hits 32467 32545 +78
- Misses 9204 9216 +12 ☔ View full report in Codecov by Sentry. |
get_total_size_on_disk
method to RemoteData
.get_total_size_on_disk
method to RemoteData
This is maybe machine-dependent, but rather than going via our API (that is more robust, but definitely going to be slower, I think) have a first "fast" option just running |
Note to self to run |
e801a78
to
e0be575
Compare
get_total_size_on_disk
method to RemoteData
get_size_on_disk
method to RemoteData
e0be575
to
b1a0560
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @GeigerJ2 please go through more pains that I imposed 😈
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure why my replies via the files
GH tab appear as a review, but whatever... ^^
Thanks for the review @khsrali. I implemented most of your proposed changes!
As to our in-person discussion if du
or lstat
is preferred, I think none of the two is ideal... lstat
giving the actual byte-sized content, which is neat and all, but won't correspond to the disk space that will actually be occupied locally by a file, due to the use of blocks for the file system. And du
giving the actual occupied disk space on the remote, which, however, might be different from the local file system (due to having a different file system on the local machine, different formatting, different block size, etc.). Hence, the big difference in the file size check in the test. For real-world use cases, with more and larger files, the difference is likely much smaller, and won't matter too much, I think. I'll add a test for a larger file, as well as modify the message that the given size is just an estimate, so that user are aware they should take the value with a grain of salt.
Maybe also @agoscinski with his actual computer science background can weigh in 🫶
f2f90f9
to
4fcb1ba
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot @GeigerJ2! Just a few minor comments..
f7dfe7e
to
ecfa53b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes seems like we should split the function into two or provide optional arguments. These two concepts seem to be distinguished by the terms "disk usage" and "apparent size" (see https://stackoverflow.com/a/569485). I think the output of du --apparent-size
is the same as with lstat
, so you do not need to use lstat
Thanks for the review, @agoscinski! I'm currently still working on this, will ping you once it's again ready for review.
The reason I'm providing |
58b3248
to
8bcf0f8
Compare
OK, this should be ready for a final review, @agoscinski and @khsrali. Also pinging, @mikibonacci, if you want to provide some feedback on the CLI/API for use in AiiDAlab? |
get_size_on_disk
method to RemoteData
get_size_on_disk
method to RemoteData
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @GeigerJ2 ,
Please consider my review, I haven't checked the tests thoroughly. I leave that to @agoscinski
f0bb1f6
to
192b0fa
Compare
Thanks again for the review, @khsrali, I implemented your proposed changes. |
In addition, extend the tests for `RemoteData` in `test_remote.py` for the methods added in this PR, as well as parametrize them to run on a `RemoteData` via local and ssh transport.
In addition, extend the tests for `RemoteData` in `test_remote.py` for the methods added in this PR, as well as parametrize them to run on a `RemoteData` via local and ssh transport.
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
… nested directory
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
09c30d0
to
870263a
Compare
This allows just defining one fixture and passing a mode parameter, to create different `RemoteData` instances, one with SSH and one without. In addition, it allows to pass different content directly to the fixture factory to create differently parametrized instances. Without a factory, this is not possible, as the instantiated `RemoteData` that is returned cannot be parametrized (by something like `(content=b'a')`), as it would be already the instantiated `RemoteData` object. In addition, this change also removes the need for the `request.getfixturevalue(fixture)` code, but allows just passing the fixture factory and the mode as a parameter.
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @GeigerJ2 , all good. Just two comments:
- I find it confusing that you call it
stat
while it's actually doing alstat
. - Maybe consider to raise
ValueError
instead ofNotimplemented
.
raise NotImplementedError( | ||
f'Specified method `{method}` for evaluating the size on disk not implemented.' | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it makes more sense for this to be a ValueError
, otherwise might be confusing when there's a typo, e.g. 'state' -> NotImplemented (?)
if you have any specific method in mind (like what actually? 🤔 ) that is legit but not implemented, you could single them out and raise NotImplemented
.
But in generic cases I'd raise ValueError
which is more clear.
raise NotImplementedError( | |
f'Specified method `{method}` for evaluating the size on disk not implemented.' | |
) | |
raise ValueError( | |
f'Specified method `{method}` is not an valid input. Please choose either 'du' or 'stat'.' | |
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see your point. Yeah, the idea was indeed to hint at the possibility of more options to evaluate the disk usage to be implemented in the future. Though, now that I think of it, I wouldn't know what that would be, honestly ^^ so I agree it's better to change it to a ValueError
to capture typos, so I modified it as you suggested. Thanks!
|
||
:param relpath: File or directory path for which the total size should be returned, relative to | ||
``self.get_remote_path()``. | ||
:param method: Method to be used to evaluate the directory/file size (either ``du`` or ``stat``). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did you change to stat
, in the end?
What you are doing here is actually lstat
.
Links are not followed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed it because I think people are more familiar with stat
, as it is also a Linux command line utility, whereas lstat
is not. Also, lstat
is just the same as os.stat(follow_symlinks=False)
, and I don't think the implementation detail should make the user-facing option less understandable. Lastly, I also think it's self-explanatory that, to get the size of a directory on disk, symlinks that point to files/directories that life somewhere else, should not be considered. That being said, we could also think about exposing the follow_symlinks
option to the user, though that would also mean updating the implementation of listdir_withattributes
of the Transport
, so I wouldn't do that now, this PR has been dragging on for way too long anyway :D
Thanks again for the review, @khsrali. I wrote down my reasoning for point 1 in my response to your comment in the code, and implemented point 2. Once CI passes here (hopefully), I'll squash-merge. |
Required for PR #6578.
By default, the
get_size_on_disk
method calls the method_get_size_on_disk_du
that usesdu
to obtain the total directory size in bytes. If the call todu
fails for whatever reason, recursivestat
is being used, though, that is discouraged asstat
returns the apparent size of files, not the actual disk usage.I further extended the existing tests for
RemoteData
to use both,LocalTransport
, as well asSshTransport
, and test for the functionality added in this PR.Pinging also @npaulish, as she's currently working on retrieving
RemoteData
objects.