Added ED2K hash and BitTorrent pieces hash (#197) #375

Merged · 8 commits into multiformats:master on Jun 23, 2025

Conversation

zhangzih4n
Contributor

@zhangzih4n zhangzih4n commented Apr 14, 2025

Added ED2K hash (#197)

It is mainly used in the eDonkey network, and is now also used as a file identification code by some netdisk software (such as 115 Netdisk).

It is computed by dividing the file into chunks, calculating the MD4 hash of each chunk, concatenating those hashes, and then calculating the MD4 hash of the concatenation.

The same bytes always produce the same result: a fixed-length MD4 hash that can be used to uniquely identify the file.
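The scheme above can be sketched in Python. This is a hedged illustration: the 9,728,000-byte (9500 KiB) chunk size and the single-chunk shortcut follow common eDonkey implementations, implementations disagree about files that are exact multiples of the chunk size, and MD4 is often missing from modern OpenSSL builds, so the hash function is injectable here.

```python
import hashlib

CHUNK = 9_728_000  # 9500 KiB, the classic eDonkey chunk size

def ed2k(data: bytes, md4=lambda b: hashlib.new("md4", b).digest()) -> bytes:
    # Files smaller than one chunk are hashed directly.
    if len(data) < CHUNK:
        return md4(data)
    # Otherwise: MD4 per chunk, concatenate, MD4 the concatenation.
    chunk_hashes = b"".join(md4(data[i:i + CHUNK])
                            for i in range(0, len(data), CHUNK))
    return md4(chunk_hashes)
```

Where `hashlib.new("md4")` is unavailable, a third-party MD4 implementation can be passed in instead.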

Added BitTorrent pieces hash.

Because arbitrary key-value pairs can be added to a torrent, countless InfoHashes can be generated for the same file. To know whether an InfoHash corresponds to a given file we must download its metadata, and when an InfoHash has no seeders left in the BitTorrent network, that metadata becomes unreachable. So the InfoHash is not actually suitable for constructing a CID.

BitTorrent pieces and pieces root, by contrast, depend only on the file content, and we can easily extract them from the torrent file to construct a CID.

BitTorrent V1

The BTv1 pieces value is obtained by optionally padding the file, dividing it into pieces, calculating the SHA-1 hash of each piece, and concatenating the hashes.

When the piece length is fixed and there is no padding, the same bytes always produce the same hash result, which can uniquely identify the file.

When there is padding, hash collisions are possible because the padding occurs before chunking, so the file size is also needed to uniquely identify the file.
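A minimal sketch of this construction (assuming zero-byte padding and SHA-1 as described above; `btv1_pieces` is an illustrative name, not a BitTorrent API):

```python
import hashlib

def btv1_pieces(data: bytes, piece_length: int, pad: bool = False) -> bytes:
    """Concatenated SHA-1 digests of each piece (the BTv1 `pieces` field)."""
    if pad and len(data) % piece_length:
        # Zero-pad to a whole number of pieces *before* chunking,
        # which is why the file size must be kept separately.
        data += b"\x00" * (piece_length - len(data) % piece_length)
    return b"".join(hashlib.sha1(data[i:i + piece_length]).digest()
                    for i in range(0, len(data), piece_length))
```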

BitTorrent V2

The BTv2 pieces value is obtained by dividing the file into blocks, calculating the SHA2-256 hash of each block, and merging the hashes as a Merkle tree.

When the piece length is fixed, the same bytes always produce the same hash result, which can uniquely identify the file.

The BTv2 pieces root is the final result of that merge: a fixed-length SHA2-256 hash that can be used to uniquely identify the file.
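A simplified sketch of the Merkle construction (hedged: this follows one reading of BEP 52, with 16 KiB leaf blocks and leaf hashes beyond the end of the file set to 32 zero bytes; the real format also defines per-piece layers, omitted here, and `btv2_pieces_root` is an illustrative name):

```python
import hashlib

BLOCK = 16 * 1024  # 16 KiB leaf blocks

def btv2_pieces_root(data: bytes) -> bytes:
    # SHA2-256 of each 16 KiB block is a leaf of the tree.
    leaves = [hashlib.sha256(data[i:i + BLOCK]).digest()
              for i in range(0, len(data), BLOCK)] or [hashlib.sha256(b"").digest()]
    # Balance the tree to a power of two with zeroed leaf hashes.
    while len(leaves) & (len(leaves) - 1):
        leaves.append(b"\x00" * 32)
    # Merge pairwise until only the root remains.
    while len(leaves) > 1:
        leaves = [hashlib.sha256(leaves[i] + leaves[i + 1]).digest()
                  for i in range(0, len(leaves), 2)]
    return leaves[0]
```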


In the eDonkey and BitTorrent networks these algorithms are used to uniquely identify files, and MD4 and SHA2-256 are both already in the Multihash table. These algorithms are built on well-established cryptographic hash functions and should be Multihash entries as well.

Contributor

@aschmahmann aschmahmann left a comment


Thanks for submitting. Left some questions, mostly about how things are being used, since we have plenty of codes in the table that were never used but were added because they felt like a fit. Very curious to hear about anything interesting going on with bittorrent-v2 hashes.

Additionally, in places where there were many repetitions but with different sizes I made consolidation recommendations similar to the ones here #342 (comment).

table.csv Outdated
@@ -159,6 +159,56 @@ ripemd-128, multihash, 0x1052, draft,
ripemd-160, multihash, 0x1053, draft,
ripemd-256, multihash, 0x1054, draft,
ripemd-320, multihash, 0x1055, draft,
ed2k, multihash, 0x107a, draft, eDonkey2000 hash.
Contributor


Seems reasonable if anyone is planning to use it. Are you planning to use this for anything or is it just historical? We have a history of some codes being added where nobody ever really used them or fleshed out if they were sufficient so it'd be good to know.

Given that md4 is broken I'm not sure it makes a ton of sense in the modern era, but I get that some systems have a difficult time upgrading.

Contributor Author

@zhangzih4n zhangzih4n Apr 29, 2025


Seems reasonable if anyone is planning to use it. Are you planning to use this for anything or is it just historical? We have a history of some codes being added where nobody ever really used them or fleshed out if they were sufficient so it'd be good to know.

I'm writing a local file retrieval application. It calculates the digests of all files on the local hard drive (e.g. md5, sha1, sha256, ed2k, pieces-root).

So when I get a digest (e.g. hash, Torrent, Ed2k), I can quickly know if this file already exists locally and find it to avoid duplicate downloads.

The current method I use is to store Multihash. That is, for each file, a Multihash is generated for each fnCode, and then it is concatenated and stored as a binary file.

So I would like the fnCode table of Multihash to contain as many hash methods as possible (even if it is unsafe/outdated) so that files can be better located.

Adding a hashing method to Multihash does not mean that IPFS must support it. So security is not that important. Unsafe/outdated methods can still be used for integrity verification in a secure environment, such as checking for file corruption caused by bad sectors/blocks on a hard drive.
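The record layout described here can be sketched as follows (a hedged illustration of the standard multihash wire format, `<varint fnCode><varint digest length><digest>`; the helper names are hypothetical):

```python
import hashlib

def varint(n: int) -> bytes:
    # Unsigned LEB128 varint, as used throughout multiformats.
    out = bytearray()
    while True:
        out.append((n & 0x7F) | (0x80 if n > 0x7F else 0))
        n >>= 7
        if not n:
            return bytes(out)

def multihash(code: int, digest: bytes) -> bytes:
    return varint(code) + varint(len(digest)) + digest

# One record per file: several multihashes simply concatenated.
# sha1 is 0x11 and sha2-256 is 0x12; ed2k (0x107a) would slot in the same way.
data = b"example"
record = (multihash(0x11, hashlib.sha1(data).digest())
          + multihash(0x12, hashlib.sha256(data).digest()))
```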

Contributor


Adding a hashing method to Multihash does not mean that IPFS must support it.

Of course not.

So security is not that important.

Yeah, per the guidelines at https://github.com/multiformats/multihash, multihash entries should be well-established cryptographic hash functions, so unsafe functions generally get some pushback under that tag name. md4 is already in the table, though, so no objection from me.

table.csv Outdated
@@ -159,6 +159,56 @@ ripemd-128, multihash, 0x1052, draft,
ripemd-160, multihash, 0x1053, draft,
ripemd-256, multihash, 0x1054, draft,
ripemd-320, multihash, 0x1055, draft,
ed2k, multihash, 0x107a, draft, eDonkey2000 hash.
bittorrent-pieces-root, multihash, 0x107b, draft, BitTorrent v2 pieces root hash.
Contributor


Having a bittorrent-v2 file hash seems very reasonable to me. However, we already have two codes (the bittorrent and bencode codes) that afaict were never used in a production environment and only barely in an experimental capacity (I'm one of the examples), so it'd be good to know if this is actually going into a system where it'll be fleshed out whether this is good enough or not.

Side note: there is already a standard for btmh://<sha256 of bittorrent-v2 infodict>; is the idea to also enable something like btmh://<sha256 of bittorrent-v2 file>? If so you probably need a way to denote padding / file length. For an example of that, see: https://github.com/filecoin-project/FIPs/blob/81027798b7e50d482c15f5665df1c952aab348ca/FRCs/frc-0069.md?#fr32-sha2-256-trunc254-padded-binary-tree-multihash which is related to #331.

table.csv Outdated
@@ -159,6 +159,56 @@ ripemd-128, multihash, 0x1052, draft,
ripemd-160, multihash, 0x1053, draft,
ripemd-256, multihash, 0x1054, draft,
ripemd-320, multihash, 0x1055, draft,
ed2k, multihash, 0x107a, draft, eDonkey2000 hash.
bittorrent-pieces-root, multihash, 0x107b, draft, BitTorrent v2 pieces root hash.
bittorrent-pieces-16k, multihash, 0x1080, draft, BitTorrent pieces with 16KiB piece length.
Contributor


The number of entries here seems excessive. Instead of taking up 16 slots for the various sizes you could just grab a single slot and make the first byte in the "digest" the tree depth. This approach was also taken in #331.
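The consolidation could look something like this (a hypothetical sketch, not a spec: one code whose first "digest" byte is log2 of the piece length, standing in for 16 separate slots):

```python
import hashlib

def pieces_digest(data: bytes, piece_length: int) -> bytes:
    # Encode the piece length as a single exponent byte: 16 -> 4, 16 KiB -> 14, etc.
    exp = piece_length.bit_length() - 1
    assert 1 << exp == piece_length, "piece length must be a power of two"
    hashes = b"".join(hashlib.sha1(data[i:i + piece_length]).digest()
                      for i in range(0, len(data), piece_length))
    return bytes([exp]) + hashes
```

A reader can recover the piece length from the first byte and validate the rest against the file, so the multihash still says how it was derived.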

Contributor Author


The number of entries here seems excessive. Instead of taking up 16 slots for some of the sizes you could just grab a single slot and make the first byte in the "digest" the tree depth. This approach was also taken in #331

But with this approach, the same file and the same fnCode can produce multiple different digests.

That seems to violate the definition of hashing.

Contributor


This seems to violate the definition of hashing.

You're still hashing. It perhaps stretches the definition of multihash by bleeding one byte of the effective function code into the "digest", but the alternative is multiple garbage entries that likely no one will ever use, like we have with blake2b, skein, etc.

However, you still retain a number of the major properties:

  • The multihash still tells you how it was derived
  • It therefore also still tells you how to validate a given pile of bytes matches it
  • It's still reasonably compact

The last time this came up we tried to put together some guidelines for avoiding this kind of table explosion #342 (comment).

However, if it's quite burdensome or introduces some other problem it'd be interesting to understand why in the event we'd like to revise the guidelines here (cc @rvagg).

table.csv Outdated
bittorrent-pieces-128m, multihash, 0x108d, draft, BitTorrent pieces with 128MiB piece length.
bittorrent-pieces-256m, multihash, 0x108e, draft, BitTorrent pieces with 256MiB piece length.
bittorrent-pieces-512m, multihash, 0x108f, draft, BitTorrent pieces with 512MiB piece length.
bittorrent-pieces-16k-padded, hash, 0x1090, draft, BitTorrent pieces with 16KiB piece length and padding.
Contributor


Independent from the issue of taking multiple slots where you could instead add an extra byte to the "digest": what is this for?

If you're hoping to be able to disambiguate between a hash of 16k bytes and 15k bytes + 1k of zeros this doesn't look like enough since you can't distinguish between whether the padding is 1k of zeros or 2k of zeros. If that's what you're looking for then I'd consider looking at something like #331 as mentioned above. This would likely also let you combine this entry with the unpadded version.

Contributor Author

@zhangzih4n zhangzih4n Apr 29, 2025


This is the padding method used in BTv1. For example, for a 500KB file, bittorrent-pieces-256k will directly divide it into pieces and then calculate the hashes. bittorrent-pieces-256k-padded, however, will first pad it with zero bytes to an integer multiple of the piece length (i.e. 512KB), and then divide and hash.

The hashes of the two differ only in the last 20 bytes.

Contributor


A hash function generally goes hash(bytes) -> output which can then be verified by checking hash(bytes) == expectedOutput.

It looks to me like with padding this way you've opened up pre-image attacks in that say you have some data D that is 500KB and || means concatenation: bittorrent-pieces-16k-padded(D || 0) == bittorrent-pieces-16k-padded(D || 0 || 0) == bittorrent-pieces-16k-padded(D || 12KB of 0s). It can also be made equal to D with the last byte chopped off if the last byte happens to be a 0.

Is allowing this behavior intentional?
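The collision is easy to demonstrate concretely (a sketch assuming the pad-then-chunk construction discussed in this thread, with SHA-1 pieces; `padded_pieces` is an illustrative name):

```python
import hashlib

def padded_pieces(data: bytes, piece_length: int) -> bytes:
    # Zero-pad to a whole number of pieces, then concatenate per-piece SHA-1.
    if len(data) % piece_length:
        data += b"\x00" * (piece_length - len(data) % piece_length)
    return b"".join(hashlib.sha1(data[i:i + piece_length]).digest()
                    for i in range(0, len(data), piece_length))

D = b"\x01" * 10
# Trailing zeros up to the piece boundary never change the digest...
assert padded_pieces(D, 16) == padded_pieces(D + b"\x00", 16)
assert padded_pieces(D, 16) == padded_pieces(D + b"\x00" * 6, 16)
# ...so the file size has to travel alongside the hash.
```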

Contributor Author


Is allowing this behavior intentional?

In BitTorrent, we also need to provide the size of the file.

After the file is fetched from the BT network, it is trimmed according to the declared file size, and the extra zero bytes at the end are discarded.

When the file size is unknown, it does collide.

Contributor

@aschmahmann aschmahmann Apr 29, 2025


If you also need the size anyhow then wouldn't it be safer to embed the size (or size of the padding) in the digest (most other hash functions embed the size in the digest in some way anyhow)? Seems like in your case if collisions don't matter here then noting that the data was padded vs unpadded doesn't really matter either, right?

That being said idk if I have a strong objection to this if it's only one instead of many slots. Interested what others (e.g. @rvagg and @vmx think)

Contributor Author

@zhangzih4n zhangzih4n Apr 29, 2025


If you also need the size anyhow then wouldn't it be safer to embed the size (or size of the padding) in the digest (most other hash functions embed the size in the digest in some way anyhow)? Seems like in your case if collisions don't matter here then noting that the data was padded vs unpadded doesn't really matter either, right?

Yes.

In my app I already store the file size as an extra field, so I don't need to store it in the digest. But for other uses, the lack of a file size could be annoying.

Maybe it would be a better choice for my app to use the private use area.

table.csv Outdated
bittorrent-pieces-128m-padded, hash, 0x109d, draft, BitTorrent pieces with 128MiB piece length and padding.
bittorrent-pieces-256m-padded, hash, 0x109e, draft, BitTorrent pieces with 256MiB piece length and padding.
bittorrent-pieces-512m-padded, hash, 0x109f, draft, BitTorrent pieces with 512MiB piece length and padding.
bittorrent-pieces-16k-v2, multihash, 0x10a0, draft, BitTorrent v2 pieces with 16KiB piece length.
Contributor


Independent from the issue of taking multiple slots where you could instead add an extra byte to the "digest": what is this for?

IIUC, for a 256k file the 16k, 32k, ... 256k multihashes would all have the same digest, since the bittorrent-v2 hash is defined as a binary tree whose leaves are 16k blocks. Implementations may for efficiency send larger packages (i.e. 32k instead of 16k chunks). Is the idea here to be able to represent the length field in the infodict for a similar kind of signaling (since the hashes will all be the same)?

@zhangzih4n
Contributor Author

The controversial items have been removed, leaving only ED2K and PiecesRoot.

Contributor

@aschmahmann aschmahmann left a comment


These seem reasonable to me (the others are potentially viable too; pushback is generally a function of the number of records requested).

@rvagg @vmx any thoughts / objections before merge (idk if I even have permissions anymore to do it myself)

@aschmahmann
Contributor

@zhangzih4n apologies for the delay, checking to make sure you're still waiting on the merge here now that I have permissions. @vmx @rvagg unless I hear any objections I'll merge tomorrow.

@vmx
Member

vmx commented Jun 4, 2025

No objections. I didn't dig in deep enough, but it's just two codes, in a high enough range and reviewed by someone who knows those things, so all good.

@aschmahmann aschmahmann merged commit d60499c into multiformats:master Jun 23, 2025
1 check passed