Added ED2K hash and BitTorrent pieces hash (#197) #375
Conversation
Thanks for submitting. Left some questions, mostly about how things are being used, since we have plenty of codes in the table that were never used but were added because they felt like a fit. Very curious to know about anything interesting going on with bittorrent-v2 hashes.
Additionally, in places where there were many repetitions with different sizes, I made consolidation recommendations similar to the ones here #342 (comment).
table.csv
@@ -159,6 +159,56 @@ ripemd-128, multihash, 0x1052, draft,
ripemd-160, multihash, 0x1053, draft,
ripemd-256, multihash, 0x1054, draft,
ripemd-320, multihash, 0x1055, draft,
ed2k, multihash, 0x107a, draft, eDonkey2000 hash.
Seems reasonable if anyone is planning to use it. Are you planning to use this for anything or is it just historical? We have a history of some codes being added where nobody ever really used them or fleshed out if they were sufficient so it'd be good to know.
Given that md4 is broken I'm not sure it makes a ton of sense in the modern era, but I get that some systems have a difficult time upgrading.
Seems reasonable if anyone is planning to use it. Are you planning to use this for anything or is it just historical? We have a history of some codes being added where nobody ever really used them or fleshed out if they were sufficient so it'd be good to know.
I'm writing a local file retrieval application. It calculates the digests of all files on the local hard drive (e.g. md5, sha1, sha256, ed2k, pieces-root).
So when I get a digest (e.g. an MD5, torrent, or ED2K hash), I can quickly know whether the file already exists locally and find it, to avoid duplicate downloads.
The current method I use is to store multihashes. That is, for each file, a multihash is generated for each fnCode, and the results are concatenated and stored as a binary file.
So I would like the fnCode table of Multihash to contain as many hash methods as possible (even if it is unsafe/outdated) so that files can be better located.
Adding a hashing method to Multihash does not mean that IPFS must support it. So security is not that important. Unsafe/outdated methods can still be used for integrity verification in a secure environment, such as checking for file corruption caused by bad sectors/blocks on a hard drive.
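The concatenated-multihash storage described above can be sketched as follows. This is a hedged illustration, not the author's actual code: the varint encoding follows the multiformats unsigned-varint spec, and the codes 0x12 (sha2-256) and 0xd5 (md5) are taken from the multicodec table.

```python
import hashlib

def varint(n: int) -> bytes:
    # Unsigned LEB128 varint, as used by multiformats.
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        out.append(b | (0x80 if n else 0))
        if not n:
            return bytes(out)

def multihash(code: int, digest: bytes) -> bytes:
    # <fnCode varint><digest-length varint><digest bytes>
    return varint(code) + varint(len(digest)) + digest

# Concatenate several multihashes of one file into a single record.
data = b"hello"
record = (multihash(0x12, hashlib.sha256(data).digest())
          + multihash(0xD5, hashlib.md5(data).digest()))
```

Decoding the record back just walks the varints, so any mix of fnCodes can share one file.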
Adding a hashing method to Multihash does not mean that IPFS must support it.
Of course not.
So security is not that important.
Yeah, the guideline on multihash (https://github.com/multiformats/multihash) is "well-established cryptographic hash functions", so generally unsafe functions get some pushback on the tag name. md4 is already in the table, so no objection from me.
table.csv
@@ -159,6 +159,56 @@ ripemd-128, multihash, 0x1052, draft,
ripemd-160, multihash, 0x1053, draft,
ripemd-256, multihash, 0x1054, draft,
ripemd-320, multihash, 0x1055, draft,
ed2k, multihash, 0x107a, draft, eDonkey2000 hash.
bittorrent-pieces-root, multihash, 0x107b, draft, BitTorrent v2 pieces root hash.
Having a bittorrent-v2 file hash seems very reasonable to me. However, we already have two codes that afaict were never used in a production environment (the bittorrent and bencode codes) and barely used (I'm one of the examples) in an experimental capacity, so it'd be good to know if this is actually going into a system where it'll be fleshed out whether this is good enough or not.
Side note: there is already a standard to do btmh://<sha256 of bittorrent-v2 infodict>. Is the idea to also enable something like btmh://<sha256 of bittorrent-v2 file>? If so, you probably need a way to denote padding / file length. For an example of that see: https://github.com/filecoin-project/FIPs/blob/81027798b7e50d482c15f5665df1c952aab348ca/FRCs/frc-0069.md?#fr32-sha2-256-trunc254-padded-binary-tree-multihash which is related to #331.
table.csv
@@ -159,6 +159,56 @@ ripemd-128, multihash, 0x1052, draft,
ripemd-160, multihash, 0x1053, draft,
ripemd-256, multihash, 0x1054, draft,
ripemd-320, multihash, 0x1055, draft,
ed2k, multihash, 0x107a, draft, eDonkey2000 hash.
bittorrent-pieces-root, multihash, 0x107b, draft, BitTorrent v2 pieces root hash.
bittorrent-pieces-16k, multihash, 0x1080, draft, BitTorrent pieces with 16KiB piece length.
The number of entries here seems excessive. Instead of taking up 16 slots for some of the sizes you could just grab a single slot and make the first byte in the "digest" the tree depth. This approach was also taken in #331.
The number of entries here seems excessive. Instead of taking up 16 slots for some of the sizes you could just grab a single slot and make the first byte in the "digest" the tree depth. This approach was also taken in #331.
But this approach will result in the same file, with the same fnCode, having multiple different digests.
This seems to violate the definition of hashing.
This seems to violate the definition of hashing.
You're still hashing; it perhaps stretches the definition of multihash by making the effective function code one byte longer by bleeding into the "digest", but the alternative is multiple garbage entries likely no one will use, like we have with blake2b, skein, etc.
However, you still retain a number of the major properties:
- The multihash still tells you how it was derived
- It therefore also still tells you how to validate a given pile of bytes matches it
- It's still reasonably compact
The last time this came up we tried to put together some guidelines for avoiding this kind of table explosion #342 (comment).
However, if it's quite burdensome or introduces some other problem it'd be interesting to understand why in the event we'd like to revise the guidelines here (cc @rvagg).
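The single-slot, depth-prefixed layout suggested in this thread might look roughly like the sketch below. The `depth_prefixed` function and its exact layout are hypothetical, not anything currently in the table.

```python
def depth_prefixed(piece_length: int, pieces: bytes) -> bytes:
    # Hypothetical single-code layout: one leading "digest" byte holds
    # log2(piece_length / 16 KiB), followed by the raw pieces hashes,
    # so 16k..512m piece lengths share a single table entry.
    base = 16 * 1024
    depth = (piece_length // base).bit_length() - 1
    if depth < 0 or base << depth != piece_length:
        raise ValueError("piece length must be a power-of-two multiple of 16 KiB")
    return bytes([depth]) + pieces
```

A decoder reads the first byte to recover the piece length, then treats the rest as the ordinary concatenated digest.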
table.csv
bittorrent-pieces-128m, multihash, 0x108d, draft, BitTorrent pieces with 128MiB piece length.
bittorrent-pieces-256m, multihash, 0x108e, draft, BitTorrent pieces with 256MiB piece length.
bittorrent-pieces-512m, multihash, 0x108f, draft, BitTorrent pieces with 512MiB piece length.
bittorrent-pieces-16k-padded, hash, 0x1090, draft, BitTorrent pieces with 16KiB piece length and padding.
Independent from the issue where instead of multiple slots where you could add an extra byte to the "digest", what is this for?
If you're hoping to be able to disambiguate between a hash of 16k bytes and 15k bytes + 1k of zeros this doesn't look like enough since you can't distinguish between whether the padding is 1k of zeros or 2k of zeros. If that's what you're looking for then I'd consider looking at something like #331 as mentioned above. This would likely also let you combine this entry with the unpadded version.
This is the padding method used in BTv1. For example, for a 500KB file, bittorrent-pieces-256k will directly divide it into pieces and then calculate the hashes. However, bittorrent-pieces-256k-padded will first fill it with empty bytes to an integer multiple of the piece length (i.e. 512KB), and then divide it into pieces and calculate the hashes. The hashes of the two differ only in the last 20 bytes.
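The padded-vs-unpadded difference described above can be checked with a short sketch. The `pieces` helper is hypothetical, approximating the BTv1 scheme with SHA-1 per piece.

```python
import hashlib

def pieces(data: bytes, piece_length: int, padded: bool) -> bytes:
    # Hypothetical BTv1-style pieces hash: optionally zero-pad to a
    # multiple of the piece length, then concatenate SHA-1 per piece.
    if padded and len(data) % piece_length:
        data += b"\x00" * (piece_length - len(data) % piece_length)
    return b"".join(hashlib.sha1(data[i:i + piece_length]).digest()
                    for i in range(0, len(data), piece_length))

PIECE = 256 * 1024
data = b"\xab" * (500 * 1024)               # a 500 KiB file
plain = pieces(data, PIECE, padded=False)   # 2 pieces -> 40 bytes
pad = pieces(data, PIECE, padded=True)      # padded to 512 KiB first
assert plain[:20] == pad[:20]               # first piece identical
assert plain[-20:] != pad[-20:]             # only the last SHA-1 differs
```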
A hash function generally goes hash(bytes) -> output, which can then be verified by checking hash(bytes) == expectedOutput.
It looks to me like with padding this way you've opened up pre-image attacks. Say you have some data D that is 500KB, and || means concatenation: bittorrent-pieces-16k-padded(D || 0) == bittorrent-pieces-16k-padded(D || 0 || 0) == bittorrent-pieces-16k-padded(D || 12KB of 0s). It can also be made equal to D with the last byte chopped off, if that byte happens to be a 0.
Is allowing this behavior intentional?
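A minimal sketch of the pre-image issue raised here, assuming a hypothetical `pieces_16k_padded` that zero-pads before chunking:

```python
import hashlib

PIECE = 16 * 1024

def pieces_16k_padded(data: bytes) -> bytes:
    # Hypothetical bittorrent-pieces-16k-padded: zero-pad to a piece
    # boundary, then concatenate the SHA-1 digest of every piece.
    if len(data) % PIECE:
        data += b"\x00" * (PIECE - len(data) % PIECE)
    return b"".join(hashlib.sha1(data[i:i + PIECE]).digest()
                    for i in range(0, len(data), PIECE))

D = b"\x01" * (500 * 1024)  # 500 KiB of nonzero data
# Trailing zeros are absorbed by the padding, so all three collide:
assert pieces_16k_padded(D) == pieces_16k_padded(D + b"\x00")
assert pieces_16k_padded(D) == pieces_16k_padded(D + b"\x00" * (12 * 1024))
```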
Is allowing this behavior intentional?
In BitTorrent, we also need to provide the size of the file.
After getting the file from the BT network, it will be trimmed according to the file size, and the extra empty bytes at the end will be discarded.
In the case of unknown file size, it does collide.
If you also need the size anyhow then wouldn't it be safer to embed the size (or size of the padding) in the digest (most other hash functions embed the size in the digest in some way anyhow)? Seems like in your case if collisions don't matter here then noting that the data was padded vs unpadded doesn't really matter either, right?
That being said, idk if I have a strong objection to this if it's only one slot instead of many. Interested in what others (e.g. @rvagg and @vmx) think.
If you also need the size anyhow then wouldn't it be safer to embed the size (or size of the padding) in the digest (most other hash functions embed the size in the digest in some way anyhow)? Seems like in your case if collisions don't matter here then noting that the data was padded vs unpadded doesn't really matter either, right?
Yes.
In my app, I already store the file size separately, so I don't need to store it in the digest. But for other uses, the lack of a file size could be annoying.
Maybe using the private use area would be a better choice for my app.
table.csv
bittorrent-pieces-128m-padded, hash, 0x109d, draft, BitTorrent pieces with 128MiB piece length and padding.
bittorrent-pieces-256m-padded, hash, 0x109e, draft, BitTorrent pieces with 256MiB piece length and padding.
bittorrent-pieces-512m-padded, hash, 0x109f, draft, BitTorrent pieces with 512MiB piece length and padding.
bittorrent-pieces-16k-v2, multihash, 0x10a0, draft, BitTorrent v2 pieces with 16KiB piece length.
Independent from the issue where instead of multiple slots where you could add an extra byte to the "digest", what is this for?
IIUC, for a 256k file the 16k, 32k, ... 256k multihashes would all have the same digest, since the bittorrent-v2 hash is defined as a binary tree whose leaves are 16k blocks. Implementations may, for efficiency, send larger packages (i.e. send 32k instead of 16k chunks). Is the idea here to be able to represent the length field in the infodict for a similar kind of signaling (since the hashes will all be the same)?
The controversial items have been removed, leaving only ED2K and PiecesRoot.
… it cannot uniquely identify a file when the file size is unknown.
@zhangzih4n apologies for the delay, checking to make sure you're still waiting on the merge here now that I have permissions. @vmx @rvagg unless I hear any objections I'll merge tomorrow.
No objections. I didn't dig in deep enough, but it's just two codes, in a high enough range and reviewed by someone who knows those things, so all good.
Added ED2K hash (#197)
It is mainly used in the eDonkey network. Nowadays it is also used as a file identification code in some netdisk software (such as 115 Netdisk).
It is obtained by dividing the file into chunks, calculating the MD4 hash of each chunk, concatenating the chunk hashes, and then calculating MD4 again over the concatenation.
It always generates the same hash result for the same bytes. It is a fixed-length MD4 hash that can be used to uniquely identify the file.
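A rough sketch of the ED2K construction described above. Hedged: MD4 availability in `hashlib` depends on the OpenSSL build, so the hash constructor is injectable here, and clients have historically disagreed on files whose size is an exact multiple of the chunk size.

```python
import hashlib

ED2K_CHUNK = 9728000  # 9500 KiB, the eDonkey2000 chunk size

def ed2k_hash(data: bytes, md4=lambda b: hashlib.new("md4", b)) -> bytes:
    # Single-chunk files are hashed directly; larger files hash each
    # chunk with MD4 and then hash the concatenated chunk hashes.
    # NOTE: clients disagree on files whose size is an exact multiple
    # of the chunk size (whether to append an empty-chunk hash); this
    # sketch takes the simpler branch and does not.
    if len(data) <= ED2K_CHUNK:
        return md4(data).digest()
    chunk_hashes = b"".join(md4(data[i:i + ED2K_CHUNK]).digest()
                            for i in range(0, len(data), ED2K_CHUNK))
    return md4(chunk_hashes).digest()
```

On builds where `hashlib.new("md4")` raises (OpenSSL 3 moved MD4 to the legacy provider), pass in an MD4 implementation from elsewhere.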
Added BitTorrent pieces hash.
Because arbitrary key-value pairs can be added to a torrent, countless InfoHashes can be generated for the same file. We need to download the InfoHash's metadata to know whether it corresponds to a given file, and when the InfoHash's seeders are missing from the BT network, we cannot access the metadata. So it is not actually suitable for constructing a CID.
BitTorrent pieces and pieces root depend only on the file content, and we can easily extract them from the torrent file to construct a CID.
BitTorrent V1
BTv1 pieces is obtained by padding the file (optional), dividing it into pieces, calculating the SHA1 hash of each piece, and concatenating the hashes.
When the piece length is determined and there is no padding, it always generates the same hash result for the same bytes and can uniquely identify the file.
When there is padding, hash collisions occur because the padding happens before chunking; we additionally need to know the file size to uniquely identify the file.
BitTorrent V2
BTv2 pieces is obtained by calculating the SHA2-256 hash of each block after dividing, and then merging them as a Merkle tree.
When the piece length is determined, it always generates the same hash result for the same bytes and can uniquely identify the file.
BTv2 pieces root is the final result after merging, which is a fixed-length SHA2-256 hash that can be used to uniquely identify the file.
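A simplified sketch of the BTv2 pieces-root construction described above, assuming 16 KiB leaf blocks and zero-valued padding leaves per BEP 52, and non-empty input:

```python
import hashlib

BLOCK = 16 * 1024  # BEP 52 leaf block size

def btv2_pieces_root(data: bytes) -> bytes:
    # Leaf hashes are SHA2-256 of each 16 KiB block (non-empty input).
    leaves = [hashlib.sha256(data[i:i + BLOCK]).digest()
              for i in range(0, len(data), BLOCK)]
    # Pad the leaf layer to a power of two with 32 zero bytes per
    # missing leaf, as BEP 52 does for blocks past the end of the file.
    while len(leaves) & (len(leaves) - 1):
        leaves.append(b"\x00" * 32)
    # Merge pairwise with SHA2-256 until a single root remains.
    while len(leaves) > 1:
        leaves = [hashlib.sha256(leaves[i] + leaves[i + 1]).digest()
                  for i in range(0, len(leaves), 2)]
    return leaves[0]
```

For a single-block file the root is just the SHA2-256 of that block, which is why the pieces root is a fixed 32 bytes regardless of file size.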
In the eDonkey and BitTorrent networks, these algorithms are used to uniquely identify files, and MD4 and SHA256 are both multihashes. So these algorithms are also "well-established cryptographic hash functions" and should also be multihashes.