-
Notifications
You must be signed in to change notification settings - Fork 6.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reduce db stress noise #13447
Reduce db stress noise #13447
Conversation
@mszeszko-meta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change lgtm - if the intense rehearsal CI shows to be stable, feel free to merge this
Few of the scheduled jobs failed [1]. It might be a bit of whack-a-mole game to catch all the relevant callsites. I'll continue iterating on the fix-test until satisfaction is reached OR we derive a conclusion that ROI doesn't justify the effort : ). [1] |
@mszeszko-meta has updated the pull request. You must reimport the pull request before landing. |
@mszeszko-meta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@mszeszko-meta has updated the pull request. You must reimport the pull request before landing. |
Stress test jobs [1] for the 2nd ("patched") run indicate that despite calling to ignore errors in [1] https://www.internalfb.com/sandcastle/workflow/477381560507616525
https://www.internalfb.com/sandcastle/workflow/3152519739165702497
https://www.internalfb.com/sandcastle/workflow/1585267068840779345
https://www.internalfb.com/sandcastle/workflow/2837267765249794731 |
@mszeszko-meta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@mszeszko-meta has updated the pull request. You must reimport the pull request before landing. |
@mszeszko-meta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
a37471b
to
7f4303a
Compare
@mszeszko-meta has updated the pull request. You must reimport the pull request before landing. |
@mszeszko-meta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
7f4303a
to
7202178
Compare
@mszeszko-meta has updated the pull request. You must reimport the pull request before landing. |
7202178
to
a19a972
Compare
@mszeszko-meta has updated the pull request. You must reimport the pull request before landing. |
@mszeszko-meta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
a19a972
to
818b919
Compare
@mszeszko-meta has updated the pull request. You must reimport the pull request before landing. |
818b919
to
3d10bc7
Compare
@mszeszko-meta has updated the pull request. You must reimport the pull request before landing. |
3d10bc7
to
0057b30
Compare
@mszeszko-meta has updated the pull request. You must reimport the pull request before landing. |
@mszeszko-meta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@mszeszko-meta has updated the pull request. You must reimport the pull request before landing. |
@mszeszko-meta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@mszeszko-meta has updated the pull request. You must reimport the pull request before landing. |
@mszeszko-meta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@mszeszko-meta Are you able to repro these locally fast by increasing the frequency of doing TestPrefixScan + refresh with snapshot?? This should speed up your debugging cycle. One of the last command of these failed runs seem to consistently repro within a decent amount of time < 1min
Can we do the error ignore in ~DBIter() instead of lower-level places like how we deal with ThreadStatus::OperationType? |
@mszeszko-meta from my debugger on the repro command above, it seems that IGNORE_STATUS_IF_ERROR_DEBUG() was never called into the callback we want when the code got to IGNORE_STATUS_IF_ERROR_DEBUG() after GetFileModificationTime(). This is weird - not sure if it's some library or sync point set up problem. I changed IgnoreReadErrorCallback_DEBUG as below so not to confuse with the another printing statement of "DBStress thread id:". This "IgnoreReadErrorCallback_DEBUG DBStress thread id" was never get printed.
Found it - the sync point set up only happen
|
Good catch, @hx235 ! Yes, that'd explain it. I've been looking at the recent failed runs and observed the same. The confusing part was the message
Do you suggest something like [1] |
metadata read injection was added later by me and I wasn't aware of the fact that this was missed in setting up the sync point so the reason is I made a mistake ..... lol lol lol
Yeah - that was my suggestion before thinking we were in a whack-a-mole situation cuz there were too many places to patch. But if we fix it with sync point set up, I'm good with the current PR. |
I like your approach. It has its pros (avoid whack-a-mole now) and cons (codepaths that don't originate in ~DBIter() [but invoke its' respective downstream callbacks] will encounter the same problems so someone eventually will either have to whack-a-mole on their own OR do the top level patching like we do). I'll try to find the right balance. |
049be10
to
3f4e1c6
Compare
@mszeszko-meta has updated the pull request. You must reimport the pull request before landing. |
@mszeszko-meta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@mszeszko-meta merged this pull request in 8e16f8f. |
This PR is a followup to #13408. Thick bandaid of ignoring all injected read errors in context of periodic iterator auto refreshes in db stress proved to be 'effective'. We confirmed our theory that errors are not really a consequence / defect related to this new feature but rather due to subtle ways in which downstream code paths handle their respective non-critical IO failures. In this change we're replacing a thick 'ignore all IO read errors' bandaid in
no_batched_ops_stress
with a much smaller, targeted patches in obsolete files purge / delete codepaths, table block cache reader and table cache lookup to make sure we don't miss signal and ensure there's a single mechanism for ignoring error injection in db stress tests.Test Plan
[WIP] Expect all manually triggered sandcastle runs containing this change to succeed.