[Xet] Basic shard creation #1633

Open
wants to merge 11 commits into
base: main

coyotte508
Member

@coyotte508 coyotte508 commented Jul 16, 2025

cc @Kakulukian @assafvayner for viz, follow up to #1616

Based on https://github.com/huggingface/xet-core/blob/7e41fb0dd7cfb276222b9668d0b97a984647721e/spec/shard.md

Need to handle:

  • split into multiple shards when xorb or file info grows too big
  • uploading xorbs & shards (and we need to upload xorbs before shards referencing them)
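For the first item, one possible approach is a greedy split by byte budget. This is only a sketch: `splitByByteBudget`, `sizeOf`, and `maxBytes` are hypothetical names, and the actual size limit and per-entry sizes would have to come from the shard spec.

```typescript
// Hypothetical helper: greedily groups entries so that each group stays within
// a byte budget. `sizeOf` and `maxBytes` are assumptions, not values from the spec.
function splitByByteBudget<T>(
  items: T[],
  sizeOf: (item: T) => number,
  maxBytes: number
): T[][] {
  const batches: T[][] = [];
  let current: T[] = [];
  let currentBytes = 0;

  for (const item of items) {
    const size = sizeOf(item);
    if (size > maxBytes) {
      throw new RangeError("a single entry exceeds the shard budget");
    }
    // Close the current batch when adding this entry would exceed the budget.
    if (current.length > 0 && currentBytes + size > maxBytes) {
      batches.push(current);
      current = [];
      currentBytes = 0;
    }
    current.push(item);
    currentBytes += size;
  }

  if (current.length > 0) {
    batches.push(current);
  }
  return batches;
}
```

Each resulting group would become one shard. The ordering constraint from the second item still applies: the xorbs referenced by a shard have to be uploaded before that shard.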

Comment on lines 4 to 5
export function compute_range_verification_hash(chunkHashes: string[]): string;
export function compute_file_hash(chunks_array: Array<{ hash: string; length: number }>): string;

Member Author

@assafvayner need those two functions from the wasm :)

(also, versions of those two, or at least the last one, with .update would be nice eventually)

Collaborator

what do you mean by .update for those functions?

Member Author

where you can feed it data progressively before calling finalize() to get the hash.
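To make the request concrete, here is a sketch of what such an interface could look like. `IncrementalFileHasher` and `createBufferedFileHasher` are hypothetical names, not part of the wasm bindings. The adapter below does not hash incrementally: it buffers the chunk metadata and defers to a one-shot function with the same signature as `compute_file_hash`, so it shows only the call pattern.

```typescript
interface Chunk {
  hash: string;
  length: number;
}

// Hypothetical interface: feed chunks progressively, then ask for the hash.
interface IncrementalFileHasher {
  update(chunk: Chunk): void;
  finalize(): string;
}

// Stop-gap adapter (an assumption, not xet-core's API): collects chunk
// metadata and calls a one-shot hash function when finalize() is called.
function createBufferedFileHasher(
  computeFileHash: (chunks: Chunk[]) => string
): IncrementalFileHasher {
  const chunks: Chunk[] = [];
  let finalized = false;

  return {
    update(chunk: Chunk): void {
      if (finalized) throw new Error("hasher already finalized");
      chunks.push(chunk);
    },
    finalize(): string {
      if (finalized) throw new Error("hasher already finalized");
      finalized = true;
      return computeFileHash(chunks);
    },
  };
}
```

The adapter holds only one `{ hash, length }` record per chunk, not the chunk data itself. A true streaming implementation would avoid keeping even that list.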

Collaborator

@assafvayner assafvayner Jul 17, 2025

let's keep the xorb and range hash computation simple and take just an array of items, since those have a roughly reasonable limit of ~1K items

the file hash I can see the value, but we don't have this feature implemented in xet-core yet, and it might be a while (it's not simple). For now there's just a compute_file_hash function that takes all the chunks at once, but we may be able to update that later

Member Author

@coyotte508 coyotte508 Jul 17, 2025

Hmm I don't see the difference between range hash & file hash, they both have all the chunk hashes for a file no? (the only diff is that file hash has chunk lengths too)

> the file hash I can see the value but we don't have this feature implemented in xet-core yet, and might be a while (it's not simple)

yes no problem

Collaborator

a range hash covers at most 1 xorb's worth of chunk hashes (this is a bit odd to explain, that's why we need to write the whole spec).

let's say a file has the following structure:

xorb A chunks 0-1024 (out of 1024)
xorb B chunks 0-500 (out of 1024)
xorb A chunks 1-44

Then the range hashes for the verification section of the shard containing this file info will need to have:

range_hash(xorb_A.chunks_hashes.slice(0, 1025))
range_hash(xorb_B.chunks_hashes.slice(0, 501))
range_hash(xorb_A.chunks_hashes.slice(1, 45))

notice that every reasonable input to the range_hash function contains at most as many chunk hashes as there are chunks in a xorb
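The per-segment computation described above can be sketched as follows. `Segment` and `rangeHashesForFile` are hypothetical names. The range hash function is injected and is assumed to have the signature of `compute_range_verification_hash`. The example uses small illustrative ranges, and `end` is the exclusive bound passed to `slice`. How that maps to the range stored in the shard should be checked against the spec.

```typescript
// Hypothetical description of one contiguous run of chunks taken from a xorb.
interface Segment {
  xorbChunkHashes: string[]; // all chunk hashes of the referenced xorb, in order
  start: number; // first chunk index, inclusive
  end: number; // slice end, exclusive
}

// Produces one range hash per segment, in file order. Each call only ever
// sees hashes from a single xorb, so its input is bounded by the xorb size.
function rangeHashesForFile(
  segments: Segment[],
  computeRangeHash: (chunkHashes: string[]) => string
): string[] {
  return segments.map(({ xorbChunkHashes, start, end }) => {
    if (start < 0 || start >= end || end > xorbChunkHashes.length) {
      throw new RangeError(`invalid chunk range [${start}, ${end})`);
    }
    return computeRangeHash(xorbChunkHashes.slice(start, end));
  });
}

// Usage with a stand-in hash function (a real caller would pass the wasm export).
const xorbA = ["a0", "a1", "a2", "a3"];
const xorbB = ["b0", "b1", "b2"];
const exampleRangeHashes = rangeHashesForFile(
  [
    { xorbChunkHashes: xorbA, start: 0, end: 4 },
    { xorbChunkHashes: xorbB, start: 0, end: 2 },
    { xorbChunkHashes: xorbA, start: 1, end: 3 },
  ],
  (hashes) => hashes.join("+")
);
```

The same xorb can appear in more than one segment, as xorb A does here, and each occurrence gets its own range hash.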

Member Author

@coyotte508 coyotte508 Jul 21, 2025

So there's 1 FileVerificationEntry for each FileDataSequenceEntry?

And it's like this?

FileDataSequenceEntry A
FileDataSequenceEntry B
FileDataSequenceEntry C
FileDataSequenceEntry D
FileVerificationEntry A (for FileDataSequenceEntry A)
FileVerificationEntry B
FileVerificationEntry C
FileVerificationEntry D
FileMetadataExt

?

Collaborator

yes; I will need to revise the spec to make this clear
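The ordering confirmed here can be written down as a small sketch. `FileInfoEntry` and `fileInfoLayout` are hypothetical names. The sketch covers only the records named in this thread, so any header or other record the spec requires is left out, and no byte layout is implied.

```typescript
// Hypothetical labels for the records discussed in this thread.
type FileInfoEntry =
  | { kind: "FileDataSequenceEntry"; segment: number }
  | { kind: "FileVerificationEntry"; segment: number }
  | { kind: "FileMetadataExt" };

// Emits all data sequence entries first, then one verification entry per
// data sequence entry in the same order, then the metadata extension.
function fileInfoLayout(segmentCount: number): FileInfoEntry[] {
  const entries: FileInfoEntry[] = [];
  for (let i = 0; i < segmentCount; i++) {
    entries.push({ kind: "FileDataSequenceEntry", segment: i });
  }
  for (let i = 0; i < segmentCount; i++) {
    entries.push({ kind: "FileVerificationEntry", segment: i });
  }
  entries.push({ kind: "FileMetadataExt" });
  return entries;
}
```

For a file with four segments this yields the nine records listed in the question: four data sequence entries, four verification entries, and one metadata extension.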

2 participants