Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add MLBF validation logic and corresponding management command #22983

Merged
merged 9 commits into from
Jan 22, 2025

Conversation

KevinMind
Copy link
Contributor

@KevinMind KevinMind commented Jan 13, 2025

Fixes: mozilla/addons#15281

Description

  • Implemented a validate method in the MLBF class to check for duplicate items in the cache.json, ensuring each item occurs only once and in a single data type.
  • Added a new management command validate_mlbf to facilitate validation of MLBF directories from the command line.
  • Created unit tests to verify the validation logic, ensuring it raises errors for duplicate items within a single data type and across multiple data types.

Context

We have discovered that MLBFs created since November 2024 are larger than expected. This validation is one step to ensure we do not see unchecked growth in the mlbf by verifying the underlying data set used to produce the filter.

Testing

  • Download the cache.json from the (privately) linked g-drive URL
  • move the mlbf directory into your "storage" directory in the repository
  • Run the below command to validate the data
./manage.py validate_mlbf.py <id>

Where id is the timestamp directory name ./mlbf/<id>/.

Checklist

  • Add #ISSUENUM at the top of your PR to an existing open issue in the mozilla/addons repository.
  • Successfully verified the change locally.
  • The change is covered by automated tests, or otherwise indicated why doing so is unnecessary/impossible.
  • Add before and after screenshots (Only for changes that impact the UI).
  • Add or update relevant docs reflecting the changes made.

- Implemented a `validate` method in the `MLBF` class to check for duplicate items in the cache.json, ensuring each item occurs only once and in a single data type.
- Added a new management command `validate_mlbf` to facilitate validation of MLBF directories from the command line.
- Created unit tests to verify the validation logic, ensuring it raises errors for duplicate items within a single data type and across multiple data types.
@KevinMind
Copy link
Contributor Author

KevinMind commented Jan 13, 2025

Verified on production data that we do in fact have duplicates and the verification command catches them.

image

Now to find out how that happened in the first place and how to prevent.

FYI @Rob--W you're theory seems to be correct.

@Rob--W
Copy link
Member

Rob--W commented Jan 13, 2025

Now to find out how that happened in the first place and how to prevent.

FYI @Rob--W you're theory seems to be correct.

Note: although cache.json here does indeed have duplicates (which I had observed myself as well), it is not necessarily the root cause for the issues observed in #15261.

In my local testing that lead to mozilla/addons#15261 (comment), I generated a new MLBF solely based on the cache.json file, without de-duplication. This means that if there were any duplicates in cache.json, that it would directly contribute to the generated MLBF. The fact that my MLBF is still smaller than what was observed in production is a strong signal that there is additional duplicate data that is not stored in cache.json.

@KevinMind KevinMind requested review from a team and eviljeff and removed request for a team January 13, 2025 15:03
@KevinMind
Copy link
Contributor Author

KevinMind commented Jan 13, 2025

I generated a new MLBF solely based on the cache.json file, without de-duplication. This means that if there were any duplicates in cache.json, that it would directly contribute to the generated MLBF. The fact that my MLBF is still smaller than what was observed in production is a strong signal that there is additional duplicate data that is not stored in cache.json.

I'm not sure I follow this logic.

I generated a new MLBF solely based on the cache.json file, without de-duplication.

What do you mean by that? like what code did you actually run?

The fact that my MLBF is still smaller than what was observed in production is a strong signal that there is additional duplicate data that is not stored in cache.json.

That makes sense assuming the way you generated the filter is the same as the way production generated it. My understanding of that is based on the answer to the first question. But even if that were the case, if you did not de-duplicate the file and generated a filter that is smaller.. how did that happen? Isn't it even more odd if we get different sized filters on the same data set in different environments? I must be missing something from your comment.

@Rob--W
Copy link
Member

Rob--W commented Jan 13, 2025

I generated a new MLBF solely based on the cache.json file, without de-duplication. This means that if there were any duplicates in cache.json, that it would directly contribute to the generated MLBF. The fact that my MLBF is still smaller than what was observed in production is a strong signal that there is additional duplicate data that is not stored in cache.json.

I'm not sure I follow this logic.

The production file size is 931343 and linked from https://bugzilla.mozilla.org/show_bug.cgi?id=1938506#c5.
I generated a new MLBF from cache.json and got a smaller file. Conclusion: the actual MLBF generation on AMO prod includes more data than what is written to cache.json.

I generated a new MLBF solely based on the cache.json file, without de-duplication.

What do you mean by that? like what code did you actually run?

This is the code using filtercascade, similar to what generate_mlbf in mlbf.py does: https://gist.github.com/Rob--W/39940bf6266d5e5b2c3b7e65da67696b

I've attached the command output to https://mozilla-hub.atlassian.net/browse/AMOENG-1332?focusedCommentId=988314

@KevinMind KevinMind requested a review from Rob--W January 13, 2025 19:39
Copy link
Member

@Rob--W Rob--W left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for putting this fix up so quickly!

The fix of the primary issue looks good to me.

I'll defer to AMO engineers for the approving sign-off, in case they have thoughts on the tests or implementation details.

src/olympia/blocklist/tests/test_mlbf.py Outdated Show resolved Hide resolved
src/olympia/blocklist/mlbf.py Show resolved Hide resolved
src/olympia/blocklist/mlbf.py Show resolved Hide resolved
@KevinMind KevinMind marked this pull request as ready for review January 14, 2025 13:41
Copy link
Member

@eviljeff eviljeff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just some code nits. I'll defer to Rob/Will's opinions for whether this is doing the "right" thing.

src/olympia/blocklist/management/commands/validate_mlbf.py Outdated Show resolved Hide resolved
src/olympia/blocklist/mlbf.py Outdated Show resolved Hide resolved
src/olympia/blocklist/mlbf.py Outdated Show resolved Hide resolved
src/olympia/blocklist/mlbf.py Outdated Show resolved Hide resolved
@willdurand willdurand removed their request for review January 22, 2025 14:15
@KevinMind KevinMind merged commit 21c80ad into master Jan 22, 2025
49 checks passed
@KevinMind KevinMind deleted the dedupe-mlbf branch January 22, 2025 14:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add ability to verify no duplicate entries on MLBF cache.json
4 participants