ORM: Add `get_size_on_disk` method to `RemoteData` #6584

GeigerJ2 · 2024-10-15T16:00:05Z

Required for PR #6578.

By default, the `get_size_on_disk` method calls `_get_size_on_disk_du`, which uses `du` to obtain the total directory size in bytes. If the call to `du` fails for whatever reason, a recursive `stat` is used as a fallback. That fallback is discouraged, though, because `stat` returns the apparent size of files, not the actual disk usage.
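To make the control flow concrete, here is a minimal local-only sketch of "try `du`, fall back to a recursive `lstat`". It is not the code from this PR: the function names are hypothetical, it calls `du` through `subprocess` rather than through a transport, and the fallback sums regular entries only (directory inodes themselves are not counted), which is an assumption.

```python
import os
import subprocess
from pathlib import Path


def size_via_du(path: Path) -> int:
    """Return the disk usage in bytes, using ``du -sk`` (1024-byte units)."""
    result = subprocess.run(
        ['du', '-sk', str(path)], capture_output=True, text=True, check=True
    )
    # Output looks like "<kibibytes>\t<path>"; only the first field is needed.
    return int(result.stdout.split()[0]) * 1024


def size_via_lstat(path: Path) -> int:
    """Return the apparent size in bytes by summing ``st_size`` recursively."""
    if path.is_symlink() or not path.is_dir():
        return os.lstat(path).st_size
    total = 0
    # os.walk does not descend into symlinked directories by default.
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.lstat(os.path.join(root, name)).st_size
    return total


def get_size_on_disk(path: Path):
    """Try ``du`` first; fall back to recursive ``lstat`` if the call fails.

    Returns a ``(size_in_bytes, method_used)`` tuple.
    """
    try:
        return size_via_du(path), 'du'
    except (OSError, subprocess.CalledProcessError, ValueError, IndexError):
        return size_via_lstat(path), 'stat'
```

Returning the method alongside the size lets a caller tell the user whether the number is disk usage or apparent size.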

I also extended the existing tests for `RemoteData` to run with both `LocalTransport` and `SshTransport`, and to cover the functionality added in this PR.

Pinging also @npaulish, as she's currently working on retrieving RemoteData objects.

codecov · 2024-10-15T16:32:14Z

Codecov Report

Attention: Patch coverage is 82.60870% with 16 lines in your changes missing coverage. Please review.

Project coverage is 77.94%. Comparing base (c532b34) to head (9c3d2ba).
Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/aiida/orm/nodes/data/remote/base.py | 88.24% | 8 Missing ⚠️ |
| src/aiida/cmdline/commands/cmd_data/cmd_remote.py | 50.00% | 7 Missing ⚠️ |
| src/aiida/common/utils.py | 90.00% | 1 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #6584      +/-   ##
==========================================
+ Coverage   77.92%   77.94%   +0.02%     
==========================================
  Files         563      563              
  Lines       41671    41761      +90     
==========================================
+ Hits        32467    32545      +78     
- Misses       9204     9216      +12

giovannipizzi · 2024-10-17T12:14:19Z

This is maybe machine-dependent, but rather than going via our API (which is more robust, but definitely going to be slower, I think), how about having a first "fast" option that just runs `du -s` and parses the output? But be careful about units! E.g., it uses "blocks", which on some systems are 512 or 2048 bytes! And if it fails, fall back to your solution?
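The unit caveat can be illustrated with a small parsing helper. This is only a sketch: `parse_du_output` and the example paths are hypothetical. The one fact it relies on is that POSIX `du` accepts `-k` to report sizes in 1024-byte units, which removes the dependence on the system's default block size.

```python
def parse_du_output(stdout: str, block_size: int = 1024) -> int:
    """Convert the first field of ``du -s`` output into a size in bytes.

    ``block_size`` must match the unit ``du`` was asked to report in:
    1024 for ``du -sk``, or the system default for a plain ``du -s``.
    """
    first_line = stdout.strip().splitlines()[0]
    blocks = int(first_line.split()[0])
    return blocks * block_size


# ``du -sk`` reports 1024-byte units regardless of the system default.
print(parse_du_output('12\t/scratch/job_1\n'))       # 12288
# The same directory as reported by a plain ``du -s`` using 512-byte blocks.
print(parse_du_output('24\t/scratch/job_1\n', 512))  # 12288
```

Empty or malformed output raises `IndexError` or `ValueError`, which a caller can treat as the signal to fall back to another method.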

GeigerJ2 · 2024-10-17T12:31:05Z

Note to self: run `du` via the `exec_command_wait` method of the transport.
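A rough sketch of what running `du` through a transport could look like. It assumes `exec_command_wait` takes a command string and returns a `(retval, stdout, stderr)` tuple; `FakeTransport` and `du_size_bytes` are hypothetical stand-ins so the example runs without AiiDA or a remote machine.

```python
import shlex
from typing import List, Protocol, Tuple


class SupportsExec(Protocol):
    """The only part of a transport this sketch relies on (assumed shape)."""

    def exec_command_wait(self, command: str) -> Tuple[int, str, str]: ...


class FakeTransport:
    """Hypothetical test double that returns canned command results."""

    def __init__(self, retval: int, stdout: str, stderr: str = '') -> None:
        self._result = (retval, stdout, stderr)
        self.commands: List[str] = []

    def exec_command_wait(self, command: str) -> Tuple[int, str, str]:
        self.commands.append(command)
        return self._result


def du_size_bytes(transport: SupportsExec, path: str) -> int:
    """Run ``du -sk`` on ``path`` via the transport and return bytes."""
    command = f'du -sk {shlex.quote(path)}'
    retval, stdout, stderr = transport.exec_command_wait(command)
    if retval != 0 or not stdout.strip():
        raise RuntimeError(f'`{command}` failed (exit code {retval}): {stderr.strip()}')
    return int(stdout.split()[0]) * 1024


transport = FakeTransport(0, '8\t/remote/dir\n')
print(du_size_bytes(transport, '/remote/dir'))  # 8192
```

Raising on a non-zero exit code gives the caller a single place to catch the failure and switch to a fallback.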

khsrali

Thanks @GeigerJ2 please go through more pains that I imposed 😈

src/aiida/common/utils.py

src/aiida/orm/nodes/data/remote/base.py

tests/orm/nodes/data/test_remote.py

GeigerJ2

Not sure why my replies via the files GH tab appear as a review, but whatever... ^^

Thanks for the review @khsrali. I implemented most of your proposed changes!

As to our in-person discussion on whether du or lstat is preferred, I think neither of the two is ideal... lstat gives the actual byte-sized content, which is neat and all, but won't correspond to the disk space that will actually be occupied locally by a file, because the file system allocates space in blocks. And du gives the actual occupied disk space on the remote, which, however, might differ from the local file system (due to having a different file system on the local machine, different formatting, different block size, etc.). Hence the big difference in the file size check in the test. For real-world use cases, with more and larger files, the difference is likely much smaller and won't matter too much, I think. I'll add a test for a larger file, and modify the message to state that the given size is just an estimate, so that users are aware they should take the value with a grain of salt.
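The gap between the two numbers can be seen on a single tiny file. The sketch below is Unix-only and uses a hypothetical helper name. It compares `st_size` (the byte length that an `lstat`-based approach sums up) with `st_blocks * 512` (the allocated space that disk-usage tools are based on). The allocated value depends on the file system, so no expected output is given.

```python
import os
import tempfile


def apparent_and_allocated(path: str) -> tuple:
    """Return (apparent size, allocated size) in bytes for a single path."""
    info = os.lstat(path)
    # st_size: byte length of the content.
    # st_blocks: number of 512-byte units allocated for the file (Unix only).
    return info.st_size, info.st_blocks * 512


with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(b'a')  # a one-byte file, as in a minimal test case
    tiny_file = handle.name

apparent, allocated = apparent_and_allocated(tiny_file)
print(f'apparent size: {apparent} B, allocated on disk: {allocated} B')
os.remove(tiny_file)
```

On many file systems the one-byte file occupies a full block, which is why the relative difference is large for tiny test files and shrinks for larger ones.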

Maybe also @agoscinski with his actual computer science background can weigh in 🫶

src/aiida/orm/nodes/data/remote/base.py

khsrali

Thanks a lot @GeigerJ2! Just a few minor comments..

src/aiida/common/utils.py

src/aiida/orm/nodes/data/remote/base.py

tests/orm/nodes/data/test_remote.py

src/aiida/orm/nodes/data/remote/base.py

agoscinski

Yes, it seems like we should split the function into two or provide optional arguments. These two concepts seem to be distinguished by the terms "disk usage" and "apparent size" (see https://stackoverflow.com/a/569485). I think the output of `du --apparent-size` is the same as with lstat, so you do not need to use lstat.
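One way to express the "optional argument" variant is a single command builder with a flag that switches between disk usage and apparent size. This is a sketch with a hypothetical function name; `--apparent-size` and `--block-size` are GNU coreutils options and may not exist in other `du` implementations.

```python
import shlex


def build_du_command(path: str, apparent_size: bool = False) -> str:
    """Build a GNU ``du`` call that reports a summarized size in bytes."""
    flags = ['-s', '--block-size=1']
    if apparent_size:
        # Report the byte length of the content instead of the allocated blocks.
        flags.append('--apparent-size')
    return ' '.join(['du', *flags, shlex.quote(path)])


print(build_du_command('/remote/workdir'))
# du -s --block-size=1 /remote/workdir
print(build_du_command('/remote/my dir', apparent_size=True))
# du -s --block-size=1 --apparent-size '/remote/my dir'
```

Keeping one builder with a boolean flag avoids two nearly identical functions, at the cost of tying the apparent-size mode to GNU `du`.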

src/aiida/orm/nodes/data/remote/base.py

src/aiida/cmdline/commands/cmd_data/cmd_remote.py

src/aiida/orm/nodes/data/remote/base.py

GeigerJ2 · 2024-12-10T15:44:03Z

Thanks for the review, @agoscinski! I'm currently still working on this, will ping you once it's again ready for review.

> I think the output of du --apparent-size is the same as with lstat, so you do not need to use lstat

The reason I'm providing lstat as a fallback option is to cover cases where du is not available (e.g., macOS, as you mentioned, didn't know that ^^), or where exec_command_wait isn't available, which will be the case for FirecREST.

GeigerJ2 · 2024-12-11T13:08:28Z

OK, this should be ready for a final review, @agoscinski and @khsrali. Also pinging @mikibonacci, in case you want to provide some feedback on the CLI/API for use in AiiDAlab.

khsrali

Thanks @GeigerJ2 ,
Please consider my review, I haven't checked the tests thoroughly. I leave that to @agoscinski

src/aiida/cmdline/commands/cmd_data/cmd_remote.py

src/aiida/orm/nodes/data/remote/base.py

tests/orm/nodes/data/test_remote.py

GeigerJ2 · 2024-12-12T11:45:37Z

Thanks again for the review, @khsrali, I implemented your proposed changes.

In addition, extend the tests for `RemoteData` in `test_remote.py` for the methods added in this PR, as well as parametrize them to run on a `RemoteData` via local and ssh transport.

This allows just defining one fixture and passing a mode parameter, to create different `RemoteData` instances, one with SSH and one without. In addition, it allows to pass different content directly to the fixture factory to create differently parametrized instances. Without a factory, this is not possible, as the instantiated `RemoteData` that is returned cannot be parametrized (by something like `(content=b'a')`), as it would be already the instantiated `RemoteData` object. In addition, this change also removes the need for the `request.getfixturevalue(fixture)` code, but allows just passing the fixture factory and the mode as a parameter.
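The factory idea can be sketched without any AiiDA objects. In the example below, the returned dictionary is a hypothetical stand-in for a `RemoteData` instance and `make_remote_factory` is an invented name. Under pytest, the outer function would be decorated with `@pytest.fixture` and receive `tmp_path`, and tests would be parametrized over `mode` with `@pytest.mark.parametrize`.

```python
import tempfile
from pathlib import Path
from typing import Callable


def make_remote_factory(base_dir: Path) -> Callable[..., dict]:
    """Return a factory; as a pytest fixture this function would be the fixture."""

    def _create(mode: str = 'local', content: bytes = b'some content') -> dict:
        if mode not in ('local', 'ssh'):
            raise ValueError(f'Unknown mode `{mode}`, expected `local` or `ssh`.')
        # Each call gets its own directory, so instances do not interfere.
        workdir = Path(tempfile.mkdtemp(prefix=f'{mode}_', dir=base_dir))
        (workdir / 'file.txt').write_bytes(content)
        # Stand-in for a RemoteData node bound to a local or SSH computer.
        return {'mode': mode, 'path': workdir}

    return _create


with tempfile.TemporaryDirectory() as tmp:
    factory = make_remote_factory(Path(tmp))
    small = factory(mode='ssh', content=b'a')
    large = factory(mode='local', content=b'a' * 4096)
    print(small['mode'], (small['path'] / 'file.txt').stat().st_size)  # ssh 1
    print(large['mode'], (large['path'] / 'file.txt').stat().st_size)  # local 4096
```

Because the test receives a callable rather than a finished object, each test can choose both the transport mode and the content at call time.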

khsrali

Thanks @GeigerJ2 , all good. Just two comments:

1. I find it confusing that you call it stat while it's actually doing an lstat.
2. Maybe consider raising ValueError instead of NotImplementedError.

tests/orm/nodes/data/test_remote.py

khsrali · 2024-12-18T12:58:02Z

src/aiida/orm/nodes/data/remote/base.py

+                raise NotImplementedError(
+                    f'Specified method `{method}` for evaluating the size on disk not implemented.'
+                )


I think it makes more sense for this to be a ValueError, otherwise it might be confusing when there's a typo, e.g. 'state' -> NotImplementedError (?)

If you have any specific method in mind (like what, actually? 🤔) that is legit but not implemented, you could single those out and raise NotImplementedError.
But in generic cases I'd raise ValueError, which is clearer.

Suggested change

-                raise NotImplementedError(
-                    f'Specified method `{method}` for evaluating the size on disk not implemented.'
-                )
+                raise ValueError(
+                    f'Specified method `{method}` is not a valid input. Please choose either `du` or `stat`.'
+                )

GeigerJ2 · 2024-12-19

Ah, I see your point. Yeah, the idea was indeed to hint at the possibility of more options for evaluating the disk usage being implemented in the future. Though, now that I think of it, I wouldn't know what those would be, honestly ^^ So I agree it's better to change it to a ValueError to catch typos, and I modified it as you suggested. Thanks!
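The outcome of this thread can be sketched as a small validation step. The helper name and the constant are hypothetical; only the two accepted values (`du` and `stat`) and the choice of `ValueError` come from the discussion.

```python
VALID_METHODS = ('du', 'stat')


def check_method(method: str) -> str:
    """Return ``method`` unchanged if it is supported, else raise ValueError."""
    if method not in VALID_METHODS:
        raise ValueError(
            f'Specified method `{method}` is not a valid input. '
            'Please choose either `du` or `stat`.'
        )
    return method


check_method('du')  # passes silently
try:
    check_method('state')  # a typo for 'stat'
except ValueError as exc:
    print(exc)
```

With `ValueError`, a typo is reported as bad input rather than as a feature that might arrive later.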

khsrali · 2024-12-18T12:59:02Z

src/aiida/orm/nodes/data/remote/base.py

+
+        :param relpath: File or directory path for which the total size should be returned, relative to
+            ``self.get_remote_path()``.
+        :param method: Method to be used to evaluate the directory/file size (either ``du`` or ``stat``).


Why did you change to stat, in the end?
What you are doing here is actually lstat.
Links are not followed.

GeigerJ2 · 2024-12-19

I changed it because I think people are more familiar with stat, as it is also a Linux command line utility, whereas lstat is not. Also, lstat is just the same as os.stat(follow_symlinks=False), and I don't think the implementation detail should make the user-facing option less understandable. Lastly, I also think it's self-explanatory that, to get the size of a directory on disk, symlinks that point to files/directories that live somewhere else should not be considered. That being said, we could also think about exposing the follow_symlinks option to the user, though that would also mean updating the implementation of listdir_withattributes of the Transport, so I wouldn't do that now. This PR has been dragging on for way too long anyway :D
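The equivalence mentioned here, and the effect of not following links, can be checked with the standard library alone. This sketch assumes a platform where creating symlinks is permitted; the file names are arbitrary.

```python
import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / 'data.bin'
    target.write_bytes(b'x' * 1000)
    link = Path(tmp) / 'link_to_data'
    link.symlink_to(target)

    # Following the link reports the size of the target file.
    followed = os.stat(link).st_size
    # Not following it reports the size of the link itself.
    not_followed = os.stat(link, follow_symlinks=False).st_size
    via_lstat = os.lstat(link).st_size

print(followed)                   # 1000
print(not_followed == via_lstat)  # True
```

So a size computed with lstat counts a symlink as the small link entry, not as the data it points to.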

GeigerJ2 · 2024-12-19T08:08:29Z

Thanks again for the review, @khsrali. I wrote down my reasoning for point 1 in my response to your comment in the code, and implemented point 2. Once CI passes here (hopefully), I'll squash-merge.

tests/orm/nodes/data/test_remote.py

GeigerJ2 marked this pull request as draft October 15, 2024 16:02

GeigerJ2 mentioned this pull request Oct 15, 2024

Add --also-remote option to verdi process dump #6578

Open

GeigerJ2 changed the title ~~Add get_total_size_on_disk method to RemoteData.~~ Add get_total_size_on_disk method to RemoteData Oct 17, 2024

mikibonacci mentioned this pull request Oct 17, 2024

On clean remote data button and checks aiidalab/aiidalab-qe#857

Open

3 tasks

GeigerJ2 requested a review from khsrali November 21, 2024 14:53

GeigerJ2 force-pushed the feature/remote-data-total-size branch from e801a78 to e0be575 Compare November 21, 2024 14:54

GeigerJ2 marked this pull request as ready for review November 21, 2024 14:54

GeigerJ2 changed the title ~~Add get_total_size_on_disk method to RemoteData~~ Add get_size_on_disk method to RemoteData Nov 21, 2024

GeigerJ2 force-pushed the feature/remote-data-total-size branch from e0be575 to b1a0560 Compare November 21, 2024 15:14

khsrali requested changes Nov 21, 2024

GeigerJ2 commented Nov 25, 2024

GeigerJ2 force-pushed the feature/remote-data-total-size branch from f2f90f9 to 4fcb1ba Compare November 25, 2024 15:41

khsrali requested changes Nov 25, 2024

GeigerJ2 force-pushed the feature/remote-data-total-size branch from f7dfe7e to ecfa53b Compare December 9, 2024 09:36

agoscinski reviewed Dec 10, 2024

GeigerJ2 force-pushed the feature/remote-data-total-size branch from 58b3248 to 8bcf0f8 Compare December 11, 2024 10:59

GeigerJ2 changed the title ~~Add get_size_on_disk method to RemoteData~~ ORM: Add get_size_on_disk method to RemoteData Dec 11, 2024

khsrali requested changes Dec 12, 2024

GeigerJ2 force-pushed the feature/remote-data-total-size branch from f0bb1f6 to 192b0fa Compare December 12, 2024 11:44

GeigerJ2 mentioned this pull request Dec 12, 2024

Adding get_size_on_remote to the Transport interface #6665

Open

GeigerJ2 and others added 4 commits December 17, 2024 14:42

Add get_total_size_on_disk method to RemoteData.

7a9466c

Implement function with du and lstat fallback

577342c

In addition, extend the tests for `RemoteData` in `test_remote.py` for the methods added in this PR, as well as parametrize them to run on a `RemoteData` via local and ssh transport.

Implement function with du and lstat fallback

010d850

In addition, extend the tests for `RemoteData` in `test_remote.py` for the methods added in this PR, as well as parametrize them to run on a `RemoteData` via local and ssh transport.

Apply changes from code review by @khsrali

2091fab

pre-commit-ci bot and others added 16 commits December 17, 2024 14:42

[pre-commit.ci] auto fixes from pre-commit.com hooks

a70da2c

for more information, see https://pre-commit.ci

Add tests for different file sizes.

960fc6e

Backup to check out other branch.

cbeb8ec

Final code cleanup

b68dd3e

[pre-commit.ci] auto fixes from pre-commit.com hooks

48ea748

for more information, see https://pre-commit.ci

Address final minor review comments

c3fb4e6

Expand CLI, method argument, fix recursive bug, expand tests also for…

342fe03

… nested directory

[pre-commit.ci] auto fixes from pre-commit.com hooks

6accea9

for more information, see https://pre-commit.ci

Final polishing.

44b1889

Fix docstrings for RTD build.

5d69999

Fix docstrings for RTD build... second try.

2454488

Apply changes from code review.

37d6cdb

Minor changes

589dbf7

[pre-commit.ci] auto fixes from pre-commit.com hooks

a542b1b

for more information, see https://pre-commit.ci

Remove num_char in favor of content.

71f5268

[pre-commit.ci] auto fixes from pre-commit.com hooks

870263a

for more information, see https://pre-commit.ci

GeigerJ2 force-pushed the feature/remote-data-total-size branch from 09c30d0 to 870263a Compare December 17, 2024 13:42

GeigerJ2 and others added 2 commits December 17, 2024 16:00

[pre-commit.ci] auto fixes from pre-commit.com hooks

a41f501

for more information, see https://pre-commit.ci

khsrali approved these changes Dec 18, 2024

Apply changes from review of @khsrali.

659811c

khsrali reviewed Dec 19, 2024

tests/orm/nodes/data/test_remote.py

Update test to capture correct exception.

9c3d2ba

GeigerJ2 merged commit 02cbe0c into aiidateam:main Dec 19, 2024
9 checks passed

GeigerJ2 deleted the feature/remote-data-total-size branch December 19, 2024 09:20
