Add MLBF validation logic and corresponding management command #22983
- Implemented a `validate` method in the `MLBF` class to check for duplicate items in the cache.json, ensuring each item occurs only once and in a single data type.
- Added a new management command `validate_mlbf` to facilitate validation of MLBF directories from the command line.
- Created unit tests to verify the validation logic, ensuring it raises errors for duplicate items within a single data type and across multiple data types.
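As a rough illustration of the duplicate check described in the commit message, here is a minimal standalone sketch. It is not the code from this PR: the function names `find_duplicates` and `validate_cache_file` are hypothetical, and it assumes cache.json is a JSON object mapping each data type name to a list of item strings.

```python
import json
from collections import Counter
from pathlib import Path


def find_duplicates(cache):
    """Return a dict describing duplicate items; empty if the cache is clean.

    Assumption: `cache` maps a data type name to a list of item strings.
    """
    problems = {}
    seen_in = {}  # item -> list of data types it appears in
    for data_type, items in cache.items():
        counts = Counter(items)
        # Items listed more than once inside the same data type.
        repeated = sorted(item for item, n in counts.items() if n > 1)
        if repeated:
            problems[f"duplicates within {data_type}"] = repeated
        for item in counts:
            seen_in.setdefault(item, []).append(data_type)
    # Items that show up in more than one data type.
    shared = sorted(item for item, types in seen_in.items() if len(types) > 1)
    if shared:
        problems["items in multiple data types"] = shared
    return problems


def validate_cache_file(path):
    """Load a cache.json-style file and raise ValueError on any duplicates."""
    cache = json.loads(Path(path).read_text())
    problems = find_duplicates(cache)
    if problems:
        raise ValueError(f"Invalid MLBF cache: {problems}")
```

Returning the offending items rather than raising on the first hit makes it easier to see the full extent of the duplication when running against a large production data set.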
Verified on production data that we do in fact have duplicates and that the validation command catches them. Now to find out how that happened in the first place and how to prevent it. FYI @Rob--W your theory seems to be correct.
Note: although in my local testing that led to mozilla/addons#15261 (comment), I generated a new MLBF solely based on the
I'm not sure I follow this logic.
What do you mean by that? like what code did you actually run?
That makes sense, assuming the way you generated the filter is the same as the way production generated it. My understanding of that depends on the answer to the first question. But even if that were the case, if you did not de-duplicate the file and still generated a smaller filter, how did that happen? Isn't it even more odd if we get different-sized filters from the same data set in different environments? I must be missing something from your comment.
The production file size is
This is the code using I've attached the command output to https://mozilla-hub.atlassian.net/browse/AMOENG-1332?focusedCommentId=988314 |
Thanks for putting this fix up so quickly!
The fix of the primary issue looks good to me.
I'll defer to AMO engineers for the approving sign-off, in case they have thoughts on the tests or implementation details.
f381c9a to 7e9ff26
just some code nits. I'll defer to Rob/Will's opinions for whether this is doing the "right" thing.
Fixes: mozilla/addons#15281
Description
- Implemented a `validate` method in the `MLBF` class to check for duplicate items in the cache.json, ensuring each item occurs only once and in a single data type.
- Added a new management command `validate_mlbf` to facilitate validation of MLBF directories from the command line.
Context
We have discovered that MLBFs created since November 2024 are larger than expected. This validation is one step to ensure we do not see unchecked growth in the mlbf by verifying the underlying data set used to produce the filter.
Testing
- [...] `mlbf` directory into your "storage" directory in the repository
- [...] Where `id` is the timestamp directory name `./mlbf/<id>/`.
Checklist
- [...] `#ISSUENUM` at the top of your PR to an existing open issue in the mozilla/addons repository.