How to compute hash prefixes #10
Comments
Hi @rdonuk, did you get an answer to this question?
@victorberestian nope, not at all.
I have an issue with the local DB myself, but maybe I can help with your problem. What are you trying to compute the hash for?
Because it is required for the Update API.
I haven't gotten that far, but as I understand it, when you fetch your local DB you get sets of PrefixSize and RawHashes. You split the raw hashes by prefix size and store them in the DB. When searching, you take the first N characters of the hash computed for the URL and look them up in the DB.
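The splitting step described above could be sketched as follows. This assumes the Update API returns each RawHashes entry as a prefix size plus one base64-encoded string of concatenated, equal-length prefixes (as it does for JSON clients); the function name `split_raw_hashes` is hypothetical and not part of any library.

```python
import base64
from typing import List


def split_raw_hashes(prefix_size: int, raw_hashes_b64: str) -> List[bytes]:
    """Split a base64-encoded blob of concatenated prefixes into individual prefixes."""
    raw = base64.b64decode(raw_hashes_b64)
    if len(raw) % prefix_size != 0:
        # Every prefix in one entry is assumed to have the same length.
        raise ValueError("raw hashes length is not a multiple of the prefix size")
    return [raw[i:i + prefix_size] for i in range(0, len(raw), prefix_size)]


# Made-up payload for illustration: three 4-byte prefixes.
payload = base64.b64encode(b"aaaabbbbcccc").decode("ascii")
print(split_raw_hashes(4, payload))  # [b'aaaa', b'bbbb', b'cccc']
```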
Yes, but I didn't understand what that N should be. It will be between 4 and 32 for sure, but how can I know the exact length? Maybe I need to create prefixes of every length and search for each one in turn, but I am not sure.
Yes, it will be 4-32, but you don't need to compute prefixes for sizes that are not in your local DB. For now, all hashes in WebRisk have a size of 4.
OK, so you're saying that if I have hashes of size 4, 5 and 6 in my local DB, I need to compute prefixes for all of those sizes (4, 5, 6)? Am I correct?
Yes, and then search for them in the DB. But I'm not 100% sure this is correct, it's just my view 🧐
OK. It makes sense, actually, because there is nothing in the docs about picking the length. Thanks for the comments. I will leave this issue open; maybe we will get an official response.
No problem) Good luck
I just do 32 bits, or 8 characters in hex, for all hex prefixes. Better local cache means fewer remote lookups. I'm working on a PHP implementation that's nearly done -- there are a couple of artifacts specific to our implementation still in the code that I'll be cleaning up in the next couple of days/weeks (hopefully), but if any of it is useful to y'all, enjoy! https://github.com/Automattic/php-webrisk/blob/master/webrisk.class.php#L344-L566
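To make the "32 bits, or 8 characters in hex" equivalence concrete, here is a minimal Python sketch. It assumes the input string is already canonicalized; `hex_prefix` is a hypothetical helper name, not something taken from the PHP code linked above.

```python
import hashlib


def hex_prefix(canonical_expression: str, size_bytes: int = 4) -> str:
    """Return the first size_bytes bytes of the SHA256 digest as a hex string."""
    digest = hashlib.sha256(canonical_expression.encode("utf-8")).digest()
    return digest[:size_bytes].hex()


prefix = hex_prefix("example.com/")
print(len(prefix))  # 8 hex characters = 4 bytes = 32 bits
```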
This might be helpful: https://www.npmjs.com/package/webrisk-hash
Hashes are calculated as SHA256 over the canonicalized URL. A hash prefix is the leading substring of the hash, i.e. its first few bytes. The docs say prefixes are 4-32 bytes; they are almost always 4 bytes, but may be longer to avoid collisions with popular sites (say, the SHA256 of badsite.example.com has the same first 4 bytes as the SHA256 of popular.goodsite.example.com). For the lookup API, you may send 4-byte hash prefixes. For the update API, I believe the database stores hashes of any size between 4 and 32 bytes, so to be correct the local database should be checked for each prefix size from 4 to 32. Database::Lookup automatically slices the hash and checks each of the prefix sizes, so a full 32-byte hash should be passed to this function for completeness. (Note that this function does not make any network calls, so passing the full hash should not be a privacy concern here.) Canonicalization and hashing are described in: https://cloud.google.com/web-risk/docs/urls-hashing
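The approach above (hash once, then slice the full hash to each prefix size in the local database) might look roughly like this in Python. This is only a sketch: it skips canonicalization and assumes the input is already a canonical URL expression, the layout of `local_db` (prefix size in bytes mapped to a set of prefixes) is an assumption, and `full_hash` and `lookup_local` are hypothetical names, not the library's Database::Lookup.

```python
import hashlib
from typing import Dict, Set


def full_hash(canonical_expression: str) -> bytes:
    """Return the full 32-byte SHA256 digest of an already-canonicalized expression."""
    return hashlib.sha256(canonical_expression.encode("utf-8")).digest()


def lookup_local(digest: bytes, local_db: Dict[int, Set[bytes]]) -> bool:
    """Slice the digest to every prefix size present in the local DB and check each one."""
    for size, prefixes in local_db.items():
        if digest[:size] in prefixes:
            return True
    return False


# Hypothetical local database: prefix size in bytes -> set of stored prefixes.
bad = full_hash("badsite.example.com/")
local_db = {4: {bad[:4]}, 6: set()}

print(lookup_local(bad, local_db))  # True: the 4-byte prefix is stored
```

Only the sizes actually present in the local database are checked, so there is no need to guess a prefix length up front.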
I read the docs but really didn't understand how to compute hash prefixes. It says it should be between 4-32 bytes but how can we decide the size?