
Better hardlink handling #5

Open
AI0867 opened this issue Mar 29, 2024 · 2 comments

AI0867 commented Mar 29, 2024

I have a project where there is lots of shared state between various versions, which is handled using hardlinks.

Spaceman and du give different results.

Example on a directory of about 1 GB with 14 subdirectories sharing a lot of data:
du -hs . ; du -hs *; du -hs 2

963M    .
694M    1
20M     2
21M     3
21M     4
21M     5
21M     6
21M     7
22M     8
21M     9
22M     10
21M     11
21M     12
21M     13
21M     14
694M    2

As you can see, du assigns each inode's size only to the first directory in which it encounters it, avoiding double-counting as long as everything happens within a single run.

However, spaceman says that every folder is about 690 MB, and gives a total of 9.2 GB. This is very misleading when it comes to finding stuff to delete.

I'm not sure what a good approach would be, but this makes spaceman quite a bit less useful for me.

salihgerdan (Owner) commented Mar 29, 2024

I see, thank you for reporting. Initially it didn't occur to me that something could be done about hardlinks, as they're quite good at hiding their linked status.

The simplest (perhaps naive?) way to do this would be to keep a HashSet of all the inodes, and assign 0 bytes to any duplicate inode occurrence. Another possibility might be to use inodes as indices for the tree structure, though that would be an involved change. We'd also need to come up with an equivalent implementation for Windows (or we could ignore it, as hard links are not that common in Windows land).
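For what it's worth, a minimal sketch of that HashSet idea on Unix might look like the following (function name and structure are illustrative, not spaceman's actual code; it relies on the Unix-only `MetadataExt` for `dev()`/`ino()`):

```rust
use std::collections::HashSet;
use std::fs;
use std::io;
use std::os::unix::fs::MetadataExt; // Unix-only: exposes dev() and ino()
use std::path::Path;

/// Recursively sum apparent file sizes, charging each (device, inode)
/// pair only once so hard-linked files are not double-counted.
fn dedup_size(path: &Path, seen: &mut HashSet<(u64, u64)>) -> io::Result<u64> {
    let meta = fs::symlink_metadata(path)?;
    if meta.is_dir() {
        let mut total = 0;
        for entry in fs::read_dir(path)? {
            total += dedup_size(&entry?.path(), seen)?;
        }
        Ok(total)
    } else if seen.insert((meta.dev(), meta.ino())) {
        Ok(meta.len()) // first time this inode is seen: count it
    } else {
        Ok(0) // duplicate hard link: already counted
    }
}
```

Note the key is `(dev, ino)` rather than the inode number alone, since inode numbers are only unique per filesystem.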

I also intend to support "size on disk" detection; I'm using a compressed filesystem myself, and the advertised sizes don't quite match reality. The hard-link elimination you propose should be a good addition as well.

I can't promise a timeline for completion, as I'm currently busy with university. This is still my favorite project, however; thank you for reminding me to work on it. :^)

AI0867 (Author) commented Mar 29, 2024

The solution that simply tracks encountered inodes sounds like it would result in the same output as du. It's not perfect, but it's a big improvement over the current situation.

It still doesn't give an ideal overview of what to delete and whether deleting something will help (deleting directory 1 in my example will free ~20 MB, not ~700 MB), but fixing that is a whole UI design thing. I don't have a good solution ready for you there.
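To make the "will deleting this help?" question concrete: the bytes actually reclaimed by deleting a directory are those of files whose every hard link lives inside that subtree. A rough sketch of that calculation (illustrative names, Unix-only `MetadataExt`, not spaceman code) could compare the links found inside the subtree against each inode's total `nlink` count:

```rust
use std::collections::HashMap;
use std::fs;
use std::io;
use std::os::unix::fs::MetadataExt; // Unix-only: dev(), ino(), nlink()
use std::path::Path;

/// Bytes the filesystem would actually reclaim if `root` were deleted:
/// a file's size counts only if *all* of its hard links are inside `root`.
fn freed_by_delete(root: &Path) -> io::Result<u64> {
    // (dev, ino) -> (links seen inside root, total link count, size)
    let mut inodes: HashMap<(u64, u64), (u64, u64, u64)> = HashMap::new();
    walk(root, &mut inodes)?;
    Ok(inodes
        .values()
        .filter(|&&(inside, nlink, _)| inside == nlink)
        .map(|&(_, _, size)| size)
        .sum())
}

fn walk(path: &Path, inodes: &mut HashMap<(u64, u64), (u64, u64, u64)>) -> io::Result<()> {
    let meta = fs::symlink_metadata(path)?;
    if meta.is_dir() {
        for entry in fs::read_dir(path)? {
            walk(&entry?.path(), inodes)?;
        }
    } else {
        let e = inodes
            .entry((meta.dev(), meta.ino()))
            .or_insert((0, meta.nlink(), meta.len()));
        e.0 += 1; // one more link to this inode found inside the subtree
    }
    Ok(())
}
```

In the example above, this would report ~20 MB for directory 1 rather than ~700 MB, since the shared data survives via links in the other directories. How to present both numbers (total size vs. reclaimable size) is, as you say, a UI design question.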

Re Windows: I'm not surprised that NTFS supports hardlinks. I'm somewhat surprised the Windows API actually supports creating them. A cursory look indicates that if GetFileInformationByHandle()'s nNumberOfLinks > 1, then you can use nFileIndexHigh/nFileIndexLow as a unique (per dwVolumeSerialNumber) identifier akin to an inode.
