[move tables] cutover integration tests#1723
Conversation
| @@ -0,0 +1,106 @@ | |||
|
|
|||
There was a problem hiding this comment.
[Reviewer Note] focus on the test.sh files for each test case, the create/table files are identical.
| # See https://github.com/github/gh-ost/tree/doc/local-tests.md | ||
| # | ||
|
|
||
| # Usage: localtests/test/sh [filter] |
There was a problem hiding this comment.
[Reviewer Note] hide whitespace is helpful for this one
| # expected $1 to be a comma-separated list of tables to move | ||
|
|
||
| # build comma-separated list of tables to move | ||
| move_tables_arg=$(IFS=, ; echo "${tables_to_migrate[*]}") |
There was a problem hiding this comment.
[reviewer note] moved this into build_ghost_command so custom test logic (i.e. test.sh) don't have to repeat it.
| } | ||
|
|
||
| build_binary() { | ||
| enable_failpoint |
There was a problem hiding this comment.
[reviewer note] only has to be enabled while the code is built. It actually modifies the code on disk, so we need to make sure to disable it to keep git history clean.
There was a problem hiding this comment.
Pull request overview
This PR expands --move-tables integration coverage around cutover crash-safety and resumability by adding failpoint-driven fault injection, plus new/updated move-tables localtests that validate intermediate cutover state and successful --resume completion.
Changes:
- Vendor and integrate
github.com/pingcap/failpoint, adding an--unsafe-fail-points-enabledguard and aMigrationContext.NewFailPoint(...)helper for panic-based fault injection. - Add new move-tables localtests for crash-at-specific-cutover-phases and resumability validation; increase DML throughput for the existing concurrent-writes test.
- Update the move-tables test harness to build a failpoint-enabled binary for these scenarios and to tune checkpointing for faster integration runs.
Show a summary per file
| File | Description |
|---|---|
| .gitignore | Ignore generated failpoint binding/stash files and the tools/ dir used by the test harness. |
| go.mod | Add github.com/pingcap/failpoint dependency. |
| go.sum | Add checksums for github.com/pingcap/failpoint. |
| go/cmd/gh-ost/main.go | Add --unsafe-fail-points-enabled flag and relax checkpoint-seconds minimum in unsafe mode. |
| go/base/context.go | Add UnsafeFailPointsEnabled and MigrationContext.NewFailPoint helper for injected panics. |
| go/logic/migrator.go | Add failpoint injection sites around move-tables cutover stages and after row-copy. |
| localtests/move-tables-test.sh | Extend harness to support failpoint-enabled builds, table list handling, and faster checkpoints. |
| localtests/move-tables/single-concurrent-writes/on_test.sh | Increase concurrent write throughput to better exercise queue drain. |
| script/move-tables/insert-source-primary-loop | Fix DATABASE variable assignment quoting. |
| localtests/move-tables/resume-panic-on-row-copy/test.sh | New test: panic during row-copy and validate --resume completion. |
| localtests/move-tables/resume-panic-on-row-copy/tables.txt | Tables list for the new row-copy panic test. |
| localtests/move-tables/resume-panic-on-row-copy/create.sql | Test schema/data for the new row-copy panic test. |
| localtests/move-tables/resume-panic-before-drain-complete/test.sh | New test: panic after source rename and before drain completion, then resume. |
| localtests/move-tables/resume-panic-before-drain-complete/tables.txt | Tables list for the new pre-drain panic test. |
| localtests/move-tables/resume-panic-before-drain-complete/create.sql | Test schema/data for the new pre-drain panic test. |
| localtests/move-tables/resume-panic-before-on-success-hook/test.sh | New test: panic after drain completion and before on-success hook, then resume. |
| localtests/move-tables/resume-panic-before-on-success-hook/tables.txt | Tables list for the new pre-on-success panic test. |
| localtests/move-tables/resume-panic-before-on-success-hook/create.sql | Test schema/data for the new pre-on-success panic test. |
| vendor/modules.txt | Add vendoring metadata entry for github.com/pingcap/failpoint. |
| vendor/github.com/pingcap/failpoint/terms.go | Vendored failpoint implementation. |
| vendor/github.com/pingcap/failpoint/README.md | Vendored failpoint documentation. |
| vendor/github.com/pingcap/failpoint/marker.go | Vendored failpoint marker stubs used by the rewriter. |
| vendor/github.com/pingcap/failpoint/Makefile | Vendored tool build scripts. |
| vendor/github.com/pingcap/failpoint/MAINTAINERS.md | Vendored project metadata. |
| vendor/github.com/pingcap/failpoint/LICENSE | Vendored license. |
| vendor/github.com/pingcap/failpoint/http.go | Vendored optional failpoint HTTP control handler. |
| vendor/github.com/pingcap/failpoint/failpoints.go | Vendored failpoint registry/eval logic. |
| vendor/github.com/pingcap/failpoint/failpoint.go | Vendored failpoint core types and execution logic. |
| vendor/github.com/pingcap/failpoint/CONTRIBUTING.md | Vendored contributing guide. |
| vendor/github.com/pingcap/failpoint/.gitignore | Vendored gitignore. |
| vendor/github.com/pingcap/failpoint/.codecov.yml | Vendored Codecov config. |
Copilot's findings
Tip
Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comments suppressed due to low confidence (1)
localtests/move-tables-test.sh:500
build_binaryenables failpoint rewriting before checking for an existing binary, but returns early without disabling it. This can leave the working tree in a rewritten state (__failpoint_binding__/stash files), affecting subsequent builds/tests.
build_binary() {
enable_failpoint
echo "Building"
rm -f $default_ghost_binary
[ "$ghost_binary" == "" ] && ghost_binary="$default_ghost_binary"
if [ -f "$ghost_binary" ]; then
echo "Using binary: $ghost_binary"
return 0
fi
- Files reviewed: 16/31 changed files
- Comments generated: 8
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
zacharysierakowski
left a comment
There was a problem hiding this comment.
Love how easy it was to read through the flow of each test! Left some comments about potential assertions we could add for each test. But overall this looks pretty solid
| return 1 | ||
| fi | ||
|
|
||
| echo "✅ Validated checkpointed state on unexpected exit..." |
There was a problem hiding this comment.
should we query the checkpoint table above this too to sanity check its state? Or do we think that's getting too fine grained for this level of test
There was a problem hiding this comment.
also, i noticed there's nothing asserting above that the drain only partially completed (i.e. the source has more rows than the target). Did you exclude that intentionally b/c it might be flaky?
There was a problem hiding this comment.
Did you exclude that intentionally b/c it might be flaky?
Yeah, I worried that would be inherently flakey without a mechanism to programmatically stop the DML event stream from processing in gh-ost. I think I found a happy medium by looking at the latest checkpoint. (c7423cd)
Closes https://github.com/github/database-infrastructure/issues/8260
Follow up to #1714
A Pull Request should be associated with an Issue.
Related issue (internal): https://github.com/github/database-infrastructure/issues/8260
Related issue (public): #1681
Description
This PR extends move-tables integration test coverage introduced in #1714 to cover more in depth scenarios related to cutover semantics:
single-with-concurrent- modified the existing test to drive higher DML write throughput to more robustly exercise DML queue drain semanticsresume-panic-on-row-copy- simulates crash during row copy events prior to DML streaming.panic-on-row-copyfailpoint was added at the end of the row copy loop to ensure a single row copy happens before the panic.resume-panic-before-queue-drain- simulates crash after table renamed (T1) and before queue drain (T3)resume-panic-before-on-success- simulates crash after queue drain but before invocation of theon-successhookAll
resume-panic-*tests execute gh-ost once with the fault injected, validate the intermediate state, and then run gh-ost with--resumeand no failpoint to test resumability of the cutover.Failpoints
To support those scenarios, this PR integrates failpoint to enable arbitrary fault injection. As is, you can induce a runtime panic by:
base.AddNewFailPointwith an appropriate failpoint nameGO_FAILPOINTSenvironment variable--unsafe-fail-points-enabledtools/bin/failpoint-ctl enableactive -- this is handled by thelocalhost/move-tablesframework