How to compute hash prefixes #10

Open
rdonuk opened this issue Jan 6, 2020 · 14 comments

Comments

@rdonuk

rdonuk commented Jan 6, 2020

I read the docs but still don't understand how to compute hash prefixes. They say a prefix should be between 4 and 32 bytes long, but how do we decide the size?

@victorberestian

Hi @rdonuk, did you receive an answer to this question?

@rdonuk
Author

rdonuk commented Jan 16, 2020

@victorberestian nope, not at all.

@victorberestian

I have an issue with the local DB myself, but maybe I can help with your problem. What are you trying to compute the hash for?

@rdonuk
Author

rdonuk commented Jan 16, 2020

Because it is required for the Update API.

@victorberestian

victorberestian commented Jan 16, 2020

I haven't gotten that far myself, but as I understand it, when you fetch your local DB you get sets of PrefixSize and RawHashes. You split RawHashes into chunks of PrefixSize bytes and store those in the DB. When searching, you take the first N bytes of the hash computed from the URL and look for that prefix in the DB.
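The splitting step described above might look roughly like the sketch below. It assumes the RawHashes payload has already been decoded into raw bytes; `split_raw_hashes`, `raw_hashes`, and `prefix_size` are hypothetical names chosen for illustration, not part of any Web Risk client library.

```python
from typing import List


def split_raw_hashes(raw_hashes: bytes, prefix_size: int) -> List[bytes]:
    """Split a concatenated blob of hash prefixes into individual prefixes.

    Assumption: raw_hashes is already decoded to raw bytes and contains
    whole prefixes of exactly prefix_size bytes each.
    """
    if not 4 <= prefix_size <= 32:
        raise ValueError("prefix_size must be between 4 and 32 bytes")
    if len(raw_hashes) % prefix_size != 0:
        raise ValueError("raw_hashes length is not a multiple of prefix_size")
    # Walk the blob in steps of prefix_size and cut out one prefix per step.
    return [
        raw_hashes[i:i + prefix_size]
        for i in range(0, len(raw_hashes), prefix_size)
    ]


# An 8-byte blob with a prefix size of 4 yields two 4-byte prefixes.
prefixes = split_raw_hashes(b"abcdefgh", 4)
```

Rejecting a blob whose length is not a multiple of the prefix size is a design choice of this sketch; it surfaces a corrupted or mis-decoded update early instead of storing a truncated prefix.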

@rdonuk
Author

rdonuk commented Jan 16, 2020

Yes, but I didn't understand what that N should be. It will be between 4 and 32 for sure, but how can I know the exact length? Maybe I need to create prefixes of every length and search for each of them one by one, but I am not sure.

@victorberestian

Yes, it will be 4-32, but you don't need to compute prefixes for sizes that don't appear in your local DB. For now, all hash prefixes in WebRisk are 4 bytes long.

@rdonuk
Author

rdonuk commented Jan 16, 2020

OK, so you're saying that if I have prefixes of sizes 4, 5, and 6 in my local DB, I need to compute prefixes for all of those sizes (4, 5, 6)? Am I correct?

@victorberestian

Yes, and then search for them in the DB. I'm not 100% sure that's correct, though; it's just my understanding 🧐
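A minimal sketch of that idea, assuming the local DB is kept as a mapping from prefix size to the set of prefixes of that size. The `find_match` function and the `local_db` layout are hypothetical and only illustrate "compute prefixes for the sizes you actually have".

```python
from typing import Dict, Optional, Set


def find_match(full_hash: bytes, local_db: Dict[int, Set[bytes]]) -> Optional[bytes]:
    """Return the first stored prefix that matches full_hash, or None.

    Assumption: local_db maps a prefix size in bytes to the set of
    prefixes of that size, so only sizes present in the DB are tried.
    """
    for size in sorted(local_db):
        prefix = full_hash[:size]  # leading `size` bytes of the full hash
        if prefix in local_db[size]:
            return prefix
    return None


# Stand-in for a real 32-byte SHA256 digest.
full_hash = bytes(range(32))
local_db = {
    4: {b"\xff\xff\xff\xff"},  # no match at size 4
    5: {full_hash[:5]},        # match at size 5
}
match = find_match(full_hash, local_db)
```

Here `match` is the 5-byte prefix, because the size-4 bucket holds a different value and the size-5 bucket holds the leading 5 bytes of `full_hash`.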

@rdonuk
Author

rdonuk commented Jan 16, 2020

OK. That makes sense, actually, because the docs say nothing about picking the length. Thanks for the comments. I will leave this issue open; maybe we will get an official response.

@victorberestian

No problem :) Good luck!

@georgestephanis

I just use 32 bits (4 bytes, or 8 characters in hex) for all hash prefixes. A better local cache means fewer remote lookups.

I'm working on a PHP implementation that's nearly done. There are still a couple of implementation-specific artifacts of ours in the code that I'll be cleaning up in the next couple of days/weeks (hopefully), but if any of it is useful to y'all, enjoy!

https://github.com/Automattic/php-webrisk/blob/master/webrisk.class.php#L344-L566

@blackmad

might be helpful: https://www.npmjs.com/package/webrisk-hash

@bsurmanski
Collaborator

bsurmanski commented Jan 3, 2022

Hashes are calculated as the SHA256 of the canonicalized URL. A hash prefix is a leading substring of that hash: its first few bytes.
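As a rough illustration of the two sentences above, the sketch below hashes a string with SHA256 and keeps the leading bytes. It assumes the input has already been canonicalized as the docs describe; `hash_prefix` is a hypothetical helper name, and the example expression is made up.

```python
import hashlib


def hash_prefix(expression: str, size: int = 4) -> bytes:
    """Return the first `size` bytes of the SHA256 of `expression`.

    Assumption: `expression` is already canonicalized; this helper does
    no canonicalization of its own.
    """
    if not 4 <= size <= 32:
        raise ValueError("size must be between 4 and 32 bytes")
    full_hash = hashlib.sha256(expression.encode("utf-8")).digest()  # 32 bytes
    return full_hash[:size]


prefix = hash_prefix("example.test/")  # 4 raw bytes
prefix_hex = prefix.hex()              # the same prefix as 8 hex characters
```

With the default `size=4` the prefix is 32 bits long, which is why it prints as 8 characters in hex; passing `size=32` returns the full hash.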

When the docs say prefixes are 4-32 bytes, this means prefixes are almost always 4 bytes but may be longer to avoid collisions with popular sites (say, the SHA256 of badsite.example.com has the same first 4 bytes as the SHA256 of popular.goodsite.example.com).

For the lookup API, you may send 4-byte hash prefixes.

For the update API, I believe the database stores hash prefixes of any size between 4 and 32 bytes. To be correct, the local database should be checked at each prefix size from 4 to 32. Database::Lookup automatically slices the hash and checks each of the prefix sizes, so pass the full 32-byte hash to this function for completeness. (Note: this function does not make any network calls, so privacy of the full hash should not be a concern here.)
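The slicing behavior described for Database::Lookup could be approximated as follows. This is not the library's implementation, only a sketch of the idea under the assumption that all stored prefixes live in one flat set; `local_lookup` and `stored_prefixes` are hypothetical names.

```python
from typing import Optional, Set

MIN_PREFIX_LEN = 4   # shortest allowed hash prefix, in bytes
MAX_PREFIX_LEN = 32  # a full SHA256 hash


def local_lookup(full_hash: bytes, stored_prefixes: Set[bytes]) -> Optional[bytes]:
    """Check every prefix length from 4 to 32 bytes against a local set.

    Purely local: nothing is sent over the network, so passing the full
    hash does not reveal it to anyone.
    """
    if len(full_hash) != MAX_PREFIX_LEN:
        raise ValueError("expected a full 32-byte SHA256 hash")
    for size in range(MIN_PREFIX_LEN, MAX_PREFIX_LEN + 1):
        candidate = full_hash[:size]
        if candidate in stored_prefixes:
            return candidate  # shortest stored prefix that matches
    return None


# Stand-in for a real digest; the stored prefix happens to be 7 bytes long.
full_hash = bytes(range(32))
hit = local_lookup(full_hash, {full_hash[:7]})
```

Requiring the full 32-byte hash as input is what lets the function find a stored prefix of any length; a caller that passed only 4 bytes would miss the 7-byte entry in this example.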

Canonicalization and hashing are explained in more detail at: https://cloud.google.com/web-risk/docs/urls-hashing
