[Tracking Issue] Deduplicate blob files #6265
Comments
Maybe instead of |
One of the reasons #5495 exists is that it preserves original file names so that they are displayed as expected in external programs, which avoids copying files. However, this requires that the Delta Chat blobdir be traversable by that program, which isn't true on all supported platforms.
i closed #5495 for now, we can cherry-pick or re-open as needed, but it does not make much sense to get that in beforehand and without considering this issue first. i also have the gut feeling that it is better to leave things in a flat structure. #5495 would also only prevent copying if, at the same time, things are set read-only, which is not as easy as it sounds iirc. also, the copying is not that much of an issue as it affects exporting files only - not showing images or playing audio/videos inside delta chat. exporting is not done that often and only on direct user action, which takes a moment anyway
We would like to eventually deduplicate blob files.
This supersedes #5495 and #4309. We may be able to revert #5778 afterwards.
Motivation
Especially with Webxdc, there are a lot of duplicate files in the blobs directory, because a file that is received twice is also saved twice.
Also, it would be nice to use random filenames because it may happen that the SQL database references a file that doesn't exist anymore, and if the user sends or receives a file with this filename then this new file will accidentally be shown in the place of the removed file.
Prerequisites
- `dc_msg_get_filename()` (C-FFI) or `MessageObject.file_name` (JsonRPC) needs to be used
- `Param::Filename` is set to the actual original filename
- [...] `set_file()`, and `set_file()` doesn't have an `original_name` parameter
- `set_file_and_deduplicate(&mut self, path: &str, original_name: &str, mime: Option<&str>)` that is similar to `set_file()` but you can specify the original file name
- [...] `set_file()` is doing). It should be made to only work on files that already are in the blobs directory, in order to avoid accidentally moving a file that was still needed. Also, it should be allowed to immediately move the file (as opposed to `set_file()`, which will only rename the file when sending).
Current plan
TL;DR: Save all files as `<hash>`, without any extension.
When inserting a file into the blobdir:
- [...] `blake3` and `iroh-blake3` dependencies anyway and iroh devs really like it. It is supposed to be much faster than other cryptographic hashes: https://peergos.org/posts/blake3
- [...] `<hash>` already exists; if yes: use the existing file (and to be safe, check that the content is still correct and overwrite it otherwise). Only if it doesn't exist yet, create it.
Existing files will be kept as they are. Also, the existing `set_file()` function still won't deduplicate; deduplication only happens in the new `set_file_and_deduplicate()` and when receiving messages.
Alternatives
- `guess_msgtype_from_suffix()` uses the actual filename on the disk to guess the MIME type; this means that we need to be careful if we deduplicate files that have different extensions.
Open questions
- Should `set_file_and_deduplicate()` rename the file immediately before returning, asynchronously in the background, or when sending out the message?
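The insertion steps from the current plan (hash the content, reuse `<hash>` if it already exists and is intact, otherwise write it) could look roughly like the sketch below. This is only an illustration under stated assumptions, not the project's implementation: `insert_deduplicated` and `placeholder_hash` are hypothetical names, and 64-bit FNV-1a stands in for BLAKE3 purely so that the example needs nothing beyond the standard library. A real implementation would need a cryptographic hash.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Placeholder content hash (64-bit FNV-1a), an assumption for this sketch.
/// It is NOT collision-resistant; the plan above proposes BLAKE3 instead.
fn placeholder_hash(data: &[u8]) -> String {
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in data {
        h ^= u64::from(b);
        h = h.wrapping_mul(0x100000001b3);
    }
    format!("{:016x}", h)
}

/// Hypothetical helper: store `data` in `blobdir` as `<hash>`, without any
/// extension. Identical content always maps to the same path.
fn insert_deduplicated(blobdir: &Path, data: &[u8]) -> io::Result<PathBuf> {
    let path = blobdir.join(placeholder_hash(data));

    // Reuse an existing file only if its content is still correct.
    let up_to_date = match fs::read(&path) {
        Ok(existing) => existing == data,
        Err(e) if e.kind() == io::ErrorKind::NotFound => false,
        Err(e) => return Err(e),
    };

    // Create the file if it is missing, or overwrite it if it was damaged.
    if !up_to_date {
        fs::write(&path, data)?;
    }
    Ok(path)
}
```

Calling `insert_deduplicated()` twice with the same bytes returns the same path and leaves a single file on disk. Because the on-disk name carries no extension, the original file name has to be tracked separately (for example in `Param::Filename`, as listed in the prerequisites).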