ORM: Add get_size_on_disk method to RemoteData #6584

Merged
GeigerJ2 merged 24 commits into aiidateam:main from feature/remote-data-total-size
Dec 19, 2024

@GeigerJ2 GeigerJ2 commented Oct 15, 2024

Required for PR #6578.

By default, the get_size_on_disk method calls _get_size_on_disk_du, which uses du to obtain the total directory size in bytes. If the call to du fails for whatever reason, a recursive stat is used as a fallback. That fallback is discouraged, though, because stat returns the apparent size of files, not the actual disk usage.
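The du-first, stat-as-fallback behaviour described above can be sketched roughly as follows. This is not the code from the PR: it runs du locally through subprocess instead of going through an AiiDA transport, and the names size_via_du, size_via_lstat and get_size_on_disk_sketch are hypothetical.

```python
import os
import subprocess
from pathlib import Path


def size_via_du(path: Path) -> int:
    """Return the disk usage in bytes, as reported by ``du -sk`` (1024-byte units)."""
    result = subprocess.run(
        ['du', '-sk', str(path)], capture_output=True, text=True, check=True
    )
    # The output looks like "<kilobytes>\t<path>"; only the first field is needed.
    return int(result.stdout.split()[0]) * 1024


def size_via_lstat(path: Path) -> int:
    """Return the summed apparent size of all files in bytes, not following symlinks."""
    if not path.is_dir() or path.is_symlink():
        return os.lstat(path).st_size
    total = 0
    # os.walk does not descend into symlinked directories by default.
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.lstat(os.path.join(root, name)).st_size
    return total


def get_size_on_disk_sketch(path: Path) -> tuple[int, str]:
    """Try ``du`` first and fall back to the recursive lstat sum if it fails."""
    try:
        return size_via_du(path), 'du'
    except (OSError, subprocess.CalledProcessError, ValueError, IndexError):
        # du is missing, exited with an error, or printed something unexpected.
        return size_via_lstat(path), 'stat'
```

Returning the method name alongside the size lets a caller report which of the two numbers (disk usage or apparent size) it actually got.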

I further extended the existing tests for RemoteData to run with both LocalTransport and SshTransport, and to test the functionality added in this PR.

Pinging also @npaulish, as she's currently working on retrieving RemoteData objects.

codecov bot commented Oct 15, 2024

Codecov Report

Attention: Patch coverage is 82.60870% with 16 lines in your changes missing coverage. Please review.

Project coverage is 77.94%. Comparing base (c532b34) to head (9c3d2ba).
Report is 1 commits behind head on main.

Files with missing lines                             Patch %   Lines
src/aiida/orm/nodes/data/remote/base.py              88.24%    8 Missing ⚠️
src/aiida/cmdline/commands/cmd_data/cmd_remote.py    50.00%    7 Missing ⚠️
src/aiida/common/utils.py                            90.00%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6584      +/-   ##
==========================================
+ Coverage   77.92%   77.94%   +0.02%     
==========================================
  Files         563      563              
  Lines       41671    41761      +90     
==========================================
+ Hits        32467    32545      +78     
- Misses       9204     9216      +12     


@GeigerJ2 GeigerJ2 changed the title from "Add get_total_size_on_disk method to RemoteData." to "Add get_total_size_on_disk method to RemoteData" on Oct 17, 2024
@giovannipizzi
Member

This is maybe machine-dependent, but rather than going via our API (which is more robust, but I think definitely slower), how about a first "fast" option that just runs du -s and parses the output? Be careful about units, though: du reports "blocks", which on some systems are 512 or 2048 bytes! And if it fails, fall back to your solution?
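One way to sidestep the block-size ambiguity is to pass -k, which POSIX specifies as "report in 1024-byte units", and to parse only the first field of the output. A minimal sketch of the parsing step, assuming the command was du -sk <path>; the helper name parse_du_kilobytes is hypothetical and not part of the PR:

```python
def parse_du_kilobytes(output: str) -> int:
    """Convert the stdout of ``du -sk <path>`` into a size in bytes."""
    lines = output.strip().splitlines()
    if not lines:
        raise ValueError('du produced no output')
    # A line looks like "<kilobytes>\t<path>"; the path may itself contain
    # whitespace, so split only once and keep the leading numeric field.
    size_field = lines[0].split(maxsplit=1)[0]
    return int(size_field) * 1024
```

Because int() raises ValueError on a non-numeric field, a caller can catch that single exception type to trigger the slower fallback.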

@GeigerJ2
Contributor Author

GeigerJ2 commented Oct 17, 2024

Note to self to run du via exec_command_wait method from transport.

@GeigerJ2 GeigerJ2 requested a review from khsrali November 21, 2024 14:53
@GeigerJ2 GeigerJ2 force-pushed the feature/remote-data-total-size branch from e801a78 to e0be575 Compare November 21, 2024 14:54
@GeigerJ2 GeigerJ2 marked this pull request as ready for review November 21, 2024 14:54
@GeigerJ2 GeigerJ2 changed the title from "Add get_total_size_on_disk method to RemoteData" to "Add get_size_on_disk method to RemoteData" on Nov 21, 2024
@GeigerJ2 GeigerJ2 force-pushed the feature/remote-data-total-size branch from e0be575 to b1a0560 Compare November 21, 2024 15:14
Contributor

@khsrali khsrali left a comment


Thanks @GeigerJ2 please go through more pains that I imposed 😈

Contributor Author

@GeigerJ2 GeigerJ2 left a comment


Not sure why my replies via the files GH tab appear as a review, but whatever... ^^

Thanks for the review @khsrali. I implemented most of your proposed changes!

As to our in-person discussion of whether du or lstat is preferred, I think neither of the two is ideal... lstat gives the actual content size in bytes, which is neat and all, but won't correspond to the disk space that a file will actually occupy locally, because the file system allocates space in blocks. And du gives the disk space actually occupied on the remote, which, however, might differ from what the same files occupy on the local file system (due to a different file system on the local machine, different formatting, different block size, etc.). Hence the big difference in the file size check in the test. For real-world use cases, with more and larger files, the difference is likely much smaller and won't matter too much, I think. I'll add a test for a larger file, and modify the message to say that the given size is just an estimate, so that users are aware they should take the value with a grain of salt.

Maybe also @agoscinski with his actual computer science background can weigh in 🫶

@GeigerJ2 GeigerJ2 force-pushed the feature/remote-data-total-size branch from f2f90f9 to 4fcb1ba Compare November 25, 2024 15:41
Contributor

@khsrali khsrali left a comment


Thanks a lot @GeigerJ2! Just a few minor comments..

@GeigerJ2 GeigerJ2 force-pushed the feature/remote-data-total-size branch from f7dfe7e to ecfa53b Compare December 9, 2024 09:36
Contributor

@agoscinski agoscinski left a comment


Yes seems like we should split the function into two or provide optional arguments. These two concepts seem to be distinguished by the terms "disk usage" and "apparent size" (see https://stackoverflow.com/a/569485). I think the output of du --apparent-size is the same as with lstat, so you do not need to use lstat
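The distinction between "apparent size" and "disk usage" can be seen on a single file by comparing two fields of the stat result. The sketch below is only an illustration for Unix-like systems, where st_blocks counts allocated 512-byte units; the function name apparent_and_allocated is hypothetical.

```python
import os


def apparent_and_allocated(path: str) -> tuple[int, int]:
    """Return (apparent size, allocated size) of ``path`` in bytes."""
    info = os.lstat(path)
    # st_size is the number of bytes of content (the apparent size).
    # st_blocks is the number of 512-byte units allocated on disk (Unix only),
    # which is what du-style "disk usage" is based on.
    return info.st_size, info.st_blocks * 512
```

For a file holding a single byte, the first value is 1, while the second is typically one full file-system block, which is why the two approaches disagree most strongly on small files.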

@GeigerJ2
Contributor Author

Thanks for the review, @agoscinski! I'm currently still working on this, will ping you once it's again ready for review.

> I think the output of du --apparent-size is the same as with lstat, so you do not need to use lstat

The reason I'm providing lstat as a fallback option is for when du is not available (e.g., on macOS, as you mentioned, didn't know that ^^), or when exec_command_wait isn't available, which will be the case for FirecREST.

@GeigerJ2 GeigerJ2 force-pushed the feature/remote-data-total-size branch from 58b3248 to 8bcf0f8 Compare December 11, 2024 10:59
@GeigerJ2
Contributor Author

OK, this should be ready for a final review, @agoscinski and @khsrali. Also pinging @mikibonacci, in case you want to provide some feedback on the CLI/API for use in AiiDAlab.

@GeigerJ2 GeigerJ2 changed the title from "Add get_size_on_disk method to RemoteData" to "ORM: Add get_size_on_disk method to RemoteData" on Dec 11, 2024
Contributor

@khsrali khsrali left a comment


Thanks @GeigerJ2 ,
Please consider my review, I haven't checked the tests thoroughly. I leave that to @agoscinski

@GeigerJ2 GeigerJ2 force-pushed the feature/remote-data-total-size branch from f0bb1f6 to 192b0fa Compare December 12, 2024 11:44
@GeigerJ2
Contributor Author

Thanks again for the review, @khsrali, I implemented your proposed changes.

GeigerJ2 and others added 4 commits December 17, 2024 14:42
In addition, extend the tests for `RemoteData` in `test_remote.py`
for the methods added in this PR, as well as parametrize them to run on
a `RemoteData` via local and ssh transport.
@GeigerJ2 GeigerJ2 force-pushed the feature/remote-data-total-size branch from 09c30d0 to 870263a Compare December 17, 2024 13:42
GeigerJ2 and others added 2 commits December 17, 2024 16:00
This allows just defining one fixture and passing a mode parameter, to
create different `RemoteData` instances, one with SSH and one without.
In addition, it allows passing different content directly to the fixture
factory to create differently parametrized instances. Without a factory,
this is not possible, as the fixture would return an already instantiated
`RemoteData` object, which cannot be parametrized (by something like
`(content=b'a')`).

In addition, this change removes the need for the
`request.getfixturevalue(fixture)` code; the fixture factory and the mode
can simply be passed as parameters.
Contributor

@khsrali khsrali left a comment


Thanks @GeigerJ2 , all good. Just two comments:

  1. I find it confusing that you call it stat while it's actually doing an lstat.
  2. Maybe consider raising ValueError instead of NotImplementedError.

Comment on lines 235 to 237
raise NotImplementedError(
f'Specified method `{method}` for evaluating the size on disk not implemented.'
)
Contributor


I think it makes more sense for this to be a ValueError; otherwise it might be confusing when there's a typo, e.g. 'state' -> NotImplementedError (?)

If you have any specific method in mind (like what, actually? 🤔) that is legit but not implemented, you could single it out and raise NotImplementedError.
But in generic cases I'd raise ValueError, which is clearer.

Suggested change
raise NotImplementedError(
f'Specified method `{method}` for evaluating the size on disk not implemented.'
)
raise ValueError(
f"Specified method `{method}` is not a valid input. Please choose either 'du' or 'stat'."
)

Contributor Author


Ah, I see your point. Yeah, the idea was indeed to hint at the possibility of more options to evaluate the disk usage to be implemented in the future. Though, now that I think of it, I wouldn't know what that would be, honestly ^^ so I agree it's better to change it to a ValueError to capture typos, so I modified it as you suggested. Thanks!
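To illustrate the typo case discussed in this thread, here is a small standalone sketch of the validation. The helper validate_method and the constant VALID_METHODS are hypothetical names for illustration; in the PR the check lives inside the RemoteData method itself.

```python
VALID_METHODS = ('du', 'stat')


def validate_method(method: str) -> str:
    """Return ``method`` unchanged if it is supported, else raise ValueError."""
    if method not in VALID_METHODS:
        # A typo such as 'state' is bad input, not a missing feature,
        # so ValueError describes the situation more accurately.
        raise ValueError(
            f"Specified method `{method}` is not a valid input. "
            "Please choose either 'du' or 'stat'."
        )
    return method
```

With this, validate_method('state') raises a ValueError whose message names the two accepted options, instead of suggesting that a 'state' method might be implemented later.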


:param relpath: File or directory path for which the total size should be returned, relative to
``self.get_remote_path()``.
:param method: Method to be used to evaluate the directory/file size (either ``du`` or ``stat``).
Contributor


Why did you change to stat, in the end?
What you are doing here is actually lstat.
Links are not followed.

Contributor Author


I changed it because I think people are more familiar with stat, as it is also a Linux command-line utility, whereas lstat is not. Also, lstat is just the same as os.stat(path, follow_symlinks=False), and I don't think an implementation detail should make the user-facing option less understandable. Lastly, I also think it's self-explanatory that, to get the size of a directory on disk, symlinks that point to files/directories that live somewhere else should not be considered. That being said, we could also think about exposing the follow_symlinks option to the user, though that would also mean updating the implementation of listdir_withattributes of the Transport, so I wouldn't do that now; this PR has been dragging on for way too long anyway :D
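The equivalence mentioned here, and the effect of not following links, can be checked with the standard library alone. This is a minimal local sketch rather than anything from the PR, and the function name sizes_for_link is hypothetical.

```python
import os


def sizes_for_link(link_path: str) -> tuple[int, int]:
    """Return (size of the symlink itself, size of the file it points to)."""
    # Equivalent to os.lstat(link_path): the link is not followed.
    link_size = os.stat(link_path, follow_symlinks=False).st_size
    # Default behaviour: the link is followed and the target is inspected.
    target_size = os.stat(link_path).st_size
    return link_size, target_size
```

For a symlink pointing to a large file elsewhere, only the second value reflects the large file, which is why summing lstat results keeps data outside the directory from inflating the total.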

@GeigerJ2
Contributor Author

Thanks again for the review, @khsrali. I wrote down my reasoning for point 1 in my response to your comment in the code, and implemented point 2. Once CI passes here (hopefully), I'll squash-merge.

@GeigerJ2 GeigerJ2 merged commit 02cbe0c into aiidateam:main Dec 19, 2024
9 checks passed
@GeigerJ2 GeigerJ2 deleted the feature/remote-data-total-size branch December 19, 2024 09:20