Added ED2K hash and BitTorrent pieces hash (#197) #375
Conversation
Thanks for submitting. Left some questions, mostly about how things are being used, since we have plenty of codes in the table that were never used but were added because they felt like a fit. Very curious to know about anything interesting going on with bittorrent-v2 hashes.
Additionally, in places where there were many repetitions with different sizes, I made consolidation recommendations similar to the ones here #342 (comment).
table.csv
@@ -159,6 +159,56 @@ ripemd-128, multihash, 0x1052, draft,
ripemd-160, multihash, 0x1053, draft,
ripemd-256, multihash, 0x1054, draft,
ripemd-320, multihash, 0x1055, draft,
ed2k, multihash, 0x107a, draft, eDonkey2000 hash.
Seems reasonable if anyone is planning to use it. Are you planning to use this for anything or is it just historical? We have a history of some codes being added where nobody ever really used them or fleshed out if they were sufficient so it'd be good to know.
Given that md4 is broken I'm not sure it makes a ton of sense in the modern era, but I get that some systems have a difficult time upgrading.
Seems reasonable if anyone is planning to use it. Are you planning to use this for anything or is it just historical? We have a history of some codes being added where nobody ever really used them or fleshed out if they were sufficient so it'd be good to know.
I'm writing a local file retrieval application. It calculates the digests of all files on the local hard drive (e.g. md5, sha1, sha256, ed2k, pieces-root).
So when I get a digest (e.g. an MD5, torrent, or ED2K hash), I can quickly know whether the file already exists locally and find it, to avoid duplicate downloads.
The current method I use is to store multihashes. That is, for each file, a multihash is generated for each fnCode, and the results are concatenated and stored as a binary file.
So I would like the fnCode table of Multihash to contain as many hash methods as possible (even if it is unsafe/outdated) so that files can be better located.
Adding a hashing method to Multihash does not mean that IPFS must support it. So security is not that important. Unsafe/outdated methods can still be used for integrity verification in a secure environment, such as checking for file corruption caused by bad sectors/blocks on a hard drive.
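The concatenated-multihash storage described above can be sketched as follows. This is a hedged illustration, not the author's actual code: the varint encoding follows the multiformats unsigned-varint spec, and the codes 0x12 (sha2-256) and 0xd5 (md5) are taken from the multicodec table.

```python
import hashlib

def varint(n: int) -> bytes:
    # Unsigned LEB128 varint, as used by multiformats.
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        out.append(b | (0x80 if n else 0))
        if not n:
            return bytes(out)

def multihash(code: int, digest: bytes) -> bytes:
    # <fnCode varint><digest-length varint><digest bytes>
    return varint(code) + varint(len(digest)) + digest

# Concatenate several multihashes of one file into a single record.
data = b"hello"
record = (multihash(0x12, hashlib.sha256(data).digest())
          + multihash(0xD5, hashlib.md5(data).digest()))
```

Decoding the record back just walks the varints, so any mix of fnCodes can share one file.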
Adding a hashing method to Multihash does not mean that IPFS must support it.
Of course not.
So security is not that important.
Yeah, the guideline on multihash (https://github.com/multiformats/multihash) is "well-established cryptographic hash functions", so generally unsafe functions get some pushback on the tag name. md4 is already in the table, so no objection from me.
table.csv
@@ -159,6 +159,56 @@ ripemd-128, multihash, 0x1052, draft,
ripemd-160, multihash, 0x1053, draft,
ripemd-256, multihash, 0x1054, draft,
ripemd-320, multihash, 0x1055, draft,
ed2k, multihash, 0x107a, draft, eDonkey2000 hash.
bittorrent-pieces-root, multihash, 0x107b, draft, BitTorrent v2 pieces root hash.
Having a bittorrent-v2 file hash seems very reasonable to me. However, we already have two codes that afaict were never used in a production environment (the bittorrent and bencode codes) and barely used (I'm one of the examples) in an experimental capacity, so it'd be good to know if this is actually going into a system where it'll be fleshed out whether this is good enough or not.
Side note: there is already a standard to do btmh://<sha256 of bittorrent-v2 infodict>. Is the idea to also enable something like btmh://<sha256 of bittorrent-v2 file>? If so, you probably need a way to denote padding / file length. For an example of that see: https://github.com/filecoin-project/FIPs/blob/81027798b7e50d482c15f5665df1c952aab348ca/FRCs/frc-0069.md?#fr32-sha2-256-trunc254-padded-binary-tree-multihash which is related to #331.
table.csv
@@ -159,6 +159,56 @@ ripemd-128, multihash, 0x1052, draft,
ripemd-160, multihash, 0x1053, draft,
ripemd-256, multihash, 0x1054, draft,
ripemd-320, multihash, 0x1055, draft,
ed2k, multihash, 0x107a, draft, eDonkey2000 hash.
bittorrent-pieces-root, multihash, 0x107b, draft, BitTorrent v2 pieces root hash.
bittorrent-pieces-16k, multihash, 0x1080, draft, BitTorrent pieces with 16KiB piece length.
The number of entries here seems excessive. Instead of taking up 16 slots for some of the sizes you could just grab a single slot and make the first byte in the "digest" the tree depth. This approach was also taken in #331.
The number of entries here seems excessive. Instead of taking up 16 slots for some of the sizes you could just grab a single slot and make the first byte in the "digest" the tree depth. This approach was also taken in #331.
But this approach will result in the same file, with the same fnCode, having multiple different digests.
This seems to violate the definition of hashing.
This seems to violate the definition of hashing.
You're still hashing; it perhaps stretches the definition of multihash by making the effective function code one byte longer by bleeding into the "digest", but the alternative is multiple garbage entries likely no one will use, like we have with blake2b, skein, etc.
However, you still retain a number of the major properties:
- The multihash still tells you how it was derived
- It therefore also still tells you how to validate a given pile of bytes matches it
- It's still reasonably compact
The last time this came up we tried to put together some guidelines for avoiding this kind of table explosion #342 (comment).
However, if it's quite burdensome or introduces some other problem it'd be interesting to understand why in the event we'd like to revise the guidelines here (cc @rvagg).
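The single-slot, depth-prefixed layout suggested in this thread might look roughly like the sketch below. The `depth_prefixed` function and its exact layout are hypothetical, not anything currently in the table.

```python
def depth_prefixed(piece_length: int, pieces: bytes) -> bytes:
    # Hypothetical single-code layout: one leading "digest" byte holds
    # log2(piece_length / 16 KiB), followed by the raw pieces hashes,
    # so 16k..512m piece lengths share a single table entry.
    base = 16 * 1024
    depth = (piece_length // base).bit_length() - 1
    if depth < 0 or base << depth != piece_length:
        raise ValueError("piece length must be a power-of-two multiple of 16 KiB")
    return bytes([depth]) + pieces
```

A decoder reads the first byte to recover the piece length, then treats the rest as the ordinary concatenated digest.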
table.csv
bittorrent-pieces-128m, multihash, 0x108d, draft, BitTorrent pieces with 128MiB piece length.
bittorrent-pieces-256m, multihash, 0x108e, draft, BitTorrent pieces with 256MiB piece length.
bittorrent-pieces-512m, multihash, 0x108f, draft, BitTorrent pieces with 512MiB piece length.
bittorrent-pieces-16k-padded, hash, 0x1090, draft, BitTorrent pieces with 16KiB piece length and padding.
Independent from the issue where instead of multiple slots where you could add an extra byte to the "digest", what is this for?
If you're hoping to be able to disambiguate between a hash of 16k bytes and 15k bytes + 1k of zeros this doesn't look like enough since you can't distinguish between whether the padding is 1k of zeros or 2k of zeros. If that's what you're looking for then I'd consider looking at something like #331 as mentioned above. This would likely also let you combine this entry with the unpadded version.
This is the padding method used in BTv1. For example, for a 500KB file, bittorrent-pieces-256k will directly divide it into pieces and then calculate the hashes. However, bittorrent-pieces-256k-padded will first fill it with empty bytes to an integer multiple of the piece length (i.e. 512KB), and then divide it into pieces and calculate the hashes. The hashes of the two differ only in the last 20 bytes.
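The padded-vs-unpadded difference described above can be checked with a short sketch. The `pieces` helper is hypothetical, approximating the BTv1 scheme with SHA-1 per piece.

```python
import hashlib

def pieces(data: bytes, piece_length: int, padded: bool) -> bytes:
    # Hypothetical BTv1-style pieces hash: optionally zero-pad to a
    # multiple of the piece length, then concatenate SHA-1 per piece.
    if padded and len(data) % piece_length:
        data += b"\x00" * (piece_length - len(data) % piece_length)
    return b"".join(hashlib.sha1(data[i:i + piece_length]).digest()
                    for i in range(0, len(data), piece_length))

PIECE = 256 * 1024
data = b"\xab" * (500 * 1024)               # a 500 KiB file
plain = pieces(data, PIECE, padded=False)   # 2 pieces -> 40 bytes
pad = pieces(data, PIECE, padded=True)      # padded to 512 KiB first
assert plain[:20] == pad[:20]               # first piece identical
assert plain[-20:] != pad[-20:]             # only the last SHA-1 differs
```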
A hash function generally goes hash(bytes) -> output, which can then be verified by checking hash(bytes) == expectedOutput.
It looks to me like with padding this way you've opened up pre-image attacks. Say you have some data D that is 500KB, and || means concatenation: bittorrent-pieces-16k-padded(D || 0) == bittorrent-pieces-16k-padded(D || 0 || 0) == bittorrent-pieces-16k-padded(D || 12KB of 0s). It can also be made equal to D with the last byte chopped off, if that byte happens to be a 0.
Is allowing this behavior intentional?
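A minimal sketch of the pre-image issue raised here, assuming a hypothetical `pieces_16k_padded` that zero-pads before chunking:

```python
import hashlib

PIECE = 16 * 1024

def pieces_16k_padded(data: bytes) -> bytes:
    # Hypothetical bittorrent-pieces-16k-padded: zero-pad to a piece
    # boundary, then concatenate the SHA-1 digest of every piece.
    if len(data) % PIECE:
        data += b"\x00" * (PIECE - len(data) % PIECE)
    return b"".join(hashlib.sha1(data[i:i + PIECE]).digest()
                    for i in range(0, len(data), PIECE))

D = b"\x01" * (500 * 1024)  # 500 KiB of nonzero data
# Trailing zeros are absorbed by the padding, so all three collide:
assert pieces_16k_padded(D) == pieces_16k_padded(D + b"\x00")
assert pieces_16k_padded(D) == pieces_16k_padded(D + b"\x00" * (12 * 1024))
```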
Is allowing this behavior intentional?
In BitTorrent, we also need to provide the size of the file.
After getting the file from the BT network, it will be trimmed according to the file size, and the extra empty bytes at the end will be discarded.
In the case of unknown file size, it does collide.
If you also need the size anyhow then wouldn't it be safer to embed the size (or size of the padding) in the digest (most other hash functions embed the size in the digest in some way anyhow)? Seems like in your case if collisions don't matter here then noting that the data was padded vs unpadded doesn't really matter either, right?
That being said, idk if I have a strong objection to this if it's only one slot instead of many. Interested in what others (e.g. @rvagg and @vmx) think.
If you also need the size anyhow then wouldn't it be safer to embed the size (or size of the padding) in the digest (most other hash functions embed the size in the digest in some way anyhow)? Seems like in your case if collisions don't matter here then noting that the data was padded vs unpadded doesn't really matter either, right?
Yes.
In my app, I already store the file size separately, so I don't need to store it in the digest. But for other uses, the lack of a file size could be annoying.
Maybe using the private use area would be a better choice for my app.
table.csv
bittorrent-pieces-128m-padded, hash, 0x109d, draft, BitTorrent pieces with 128MiB piece length and padding.
bittorrent-pieces-256m-padded, hash, 0x109e, draft, BitTorrent pieces with 256MiB piece length and padding.
bittorrent-pieces-512m-padded, hash, 0x109f, draft, BitTorrent pieces with 512MiB piece length and padding.
bittorrent-pieces-16k-v2, multihash, 0x10a0, draft, BitTorrent v2 pieces with 16KiB piece length.
Independent from the issue where instead of multiple slots where you could add an extra byte to the "digest", what is this for?
IIUC, for a 256k file the 16k, 32k, ... 256k multihashes would all have the same digest, since the bittorrent-v2 hash is defined as a binary tree whose leaves are 16k blocks. Implementations may, for efficiency, send larger packages (i.e. send 32k instead of 16k chunks). Is the idea here to be able to represent the length field in the infodict for a similar kind of signaling (since the hashes will all be the same)?
The controversial items have been removed, leaving only ED2K and PiecesRoot.
… it cannot uniquely identify a file when the file size is unknown.
@zhangzih4n apologies for the delay, checking to make sure you're still waiting on the merge here now that I have permissions. @vmx @rvagg unless I hear any objections I'll merge tomorrow.
No objections. I didn't dig in deep enough, but it's just two codes, in a high enough range and reviewed by someone who knows those things, so all good.
Added ED2K hash (#197)
It is mainly used in the eDonkey network. Nowadays it is also used as a file identification code in some netdisk software (such as 115 Netdisk).
It is obtained by dividing the file into chunks, calculating the MD4 hash of each chunk, concatenating the chunk hashes, and then calculating MD4 again over the concatenation.
It always generates the same hash result for the same bytes. It is a fixed-length MD4 hash that can be used to uniquely identify the file.
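A rough sketch of the ED2K construction described above. Hedged: MD4 availability in `hashlib` depends on the OpenSSL build, so the hash constructor is injectable here, and clients have historically disagreed on files whose size is an exact multiple of the chunk size.

```python
import hashlib

ED2K_CHUNK = 9728000  # 9500 KiB, the eDonkey2000 chunk size

def ed2k_hash(data: bytes, md4=lambda b: hashlib.new("md4", b)) -> bytes:
    # Single-chunk files are hashed directly; larger files hash each
    # chunk with MD4 and then hash the concatenated chunk hashes.
    # NOTE: clients disagree on files whose size is an exact multiple
    # of the chunk size (whether to append an empty-chunk hash); this
    # sketch takes the simpler branch and does not.
    if len(data) <= ED2K_CHUNK:
        return md4(data).digest()
    chunk_hashes = b"".join(md4(data[i:i + ED2K_CHUNK]).digest()
                            for i in range(0, len(data), ED2K_CHUNK))
    return md4(chunk_hashes).digest()
```

On builds where `hashlib.new("md4")` raises (OpenSSL 3 moved MD4 to the legacy provider), pass in an MD4 implementation from elsewhere.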
Added BitTorrent pieces hash.
Because arbitrary key-value pairs can be added to a torrent, countless InfoHashes can be generated for the same file. We need to download the InfoHash's metadata to know whether it corresponds to a given file, and when the InfoHash's seeders are missing from the BT network, we cannot access the metadata. So it is not actually suitable for constructing a CID.
BitTorrent pieces and pieces root depend only on the file content, and we can easily extract them from the torrent file to construct a CID.
BitTorrent V1
BTv1 pieces is obtained by padding the file (optional), dividing it into pieces, calculating the SHA1 hash of each piece, and concatenating the hashes.
When the piece length is determined and there is no padding, it always generates the same hash result for the same bytes and can uniquely identify the file.
When there is padding, hash collisions occur because the padding happens before chunking; we additionally need to know the file size to uniquely identify the file.
BitTorrent V2
BTv2 pieces is obtained by calculating the SHA2-256 hash of each block after dividing, and then merging them as a Merkle tree.
When the piece length is determined, it always generates the same hash result for the same bytes and can uniquely identify the file.
BTv2 pieces root is the final result after merging, which is a fixed-length SHA2-256 hash that can be used to uniquely identify the file.
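A simplified sketch of the BTv2 pieces-root construction described above, assuming 16 KiB leaf blocks and zero-valued padding leaves per BEP 52, and non-empty input:

```python
import hashlib

BLOCK = 16 * 1024  # BEP 52 leaf block size

def btv2_pieces_root(data: bytes) -> bytes:
    # Leaf hashes are SHA2-256 of each 16 KiB block (non-empty input).
    leaves = [hashlib.sha256(data[i:i + BLOCK]).digest()
              for i in range(0, len(data), BLOCK)]
    # Pad the leaf layer to a power of two with 32 zero bytes per
    # missing leaf, as BEP 52 does for blocks past the end of the file.
    while len(leaves) & (len(leaves) - 1):
        leaves.append(b"\x00" * 32)
    # Merge pairwise with SHA2-256 until a single root remains.
    while len(leaves) > 1:
        leaves = [hashlib.sha256(leaves[i] + leaves[i + 1]).digest()
                  for i in range(0, len(leaves), 2)]
    return leaves[0]
```

For a single-block file the root is just the SHA2-256 of that block, which is why the pieces root is a fixed 32 bytes regardless of file size.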
In the eDonkey and BitTorrent networks, these algorithms are used to uniquely identify files, and MD4 and SHA256 are both multihashes. So these algorithms are also "well-established cryptographic hash functions" and should also be multihashes.