Skip to content

[move tables] cutover integration tests#1723

Merged
chriskirkland merged 14 commits into
feature-move-tablesfrom
chriskirkland/issues-8260-pt2
Jun 23, 2026
Merged

[move tables] cutover integration tests#1723
chriskirkland merged 14 commits into
feature-move-tablesfrom
chriskirkland/issues-8260-pt2

Conversation

@chriskirkland

@chriskirkland chriskirkland commented Jun 22, 2026

Copy link
Copy Markdown

Closes https://github.com/github/database-infrastructure/issues/8260
Follow up to #1714

A Pull Request should be associated with an Issue.

We wish to have discussions in Issues. A single issue may be targeted by multiple PRs.
If you're offering a new feature or fixing anything, we'd like to know beforehand in Issues,
and potentially we'll be able to point development in a particular direction.

Related issue (internal): https://github.com/github/database-infrastructure/issues/8260
Related issue (public): #1681

Further notes in https://github.com/github/gh-ost/blob/master/.github/CONTRIBUTING.md
Thank you! We are open to PRs, but please understand if for technical reasons we are unable to accept each and any PR

Description

This PR extends move-tables integration test coverage introduced in #1714 to cover more in depth scenarios related to cutover semantics:

  • single-with-concurrent - modified the existing test to drive higher DML write throughput to more robustly exercise DML queue drain semantics
  • resume-panic-on-row-copy - simulates crash during row copy events prior to DML streaming.
    • The panic-on-row-copy failpoint was added at the end of the row copy loop to ensure a single row copy happens before the panic.
  • resume-panic-before-queue-drain - simulates crash after table renamed (T1) and before queue drain (T3)
  • resume-panic-before-on-success - simulates crash after queue drain but before invocation of the on-success hook

All resume-panic-* tests execute gh-ost once with the fault injected, validate the intermediate state, and then run gh-ost with --resume and no failpoint to test resumability of the cutover.

Failpoints

To support those scenarios, this PR integrates failpoint to enable arbitrary fault injection. As is, you can induce a runtime panic by:

  • adding an inline call to base.AddNewFailPoint with an appropriate failpoint name
  • passing that failpoint to gh-ost via the GO_FAILPOINTS environment variable
  • passing --unsafe-fail-points-enabled
    • all failpoints are disabled by default unless this flag is passed
  • compiling the gh-ost binary with tools/bin/failpoint-ctl enable active -- this is handled by the localhost/move-tables framework

Copilot AI review requested due to automatic review settings June 22, 2026 22:32
@chriskirkland chriskirkland changed the base branch from master to feature-move-tables June 22, 2026 22:32
@@ -0,0 +1,106 @@

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Reviewer Note] focus on the test.sh files for each test case, the create/table files are identical.

# See https://github.com/github/gh-ost/tree/doc/local-tests.md
#

# Usage: localtests/test/sh [filter]

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Reviewer Note] hide whitespace is helpful for this one

# expected $1 to be a comma-separated list of tables to move

# build comma-separated list of tables to move
move_tables_arg=$(IFS=, ; echo "${tables_to_migrate[*]}")

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[reviewer note] moved this into build_ghost_command so custom test logic (i.e. test.sh) don't have to repeat it.

}

build_binary() {
enable_failpoint

@chriskirkland chriskirkland Jun 22, 2026

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[reviewer note] only has to be enabled while the code is built. It actually modifies the code on disk, so we need to make sure to disable it to keep git history clean.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR expands --move-tables integration coverage around cutover crash-safety and resumability by adding failpoint-driven fault injection, plus new/updated move-tables localtests that validate intermediate cutover state and successful --resume completion.

Changes:

  • Vendor and integrate github.com/pingcap/failpoint, adding an --unsafe-fail-points-enabled guard and a MigrationContext.NewFailPoint(...) helper for panic-based fault injection.
  • Add new move-tables localtests for crash-at-specific-cutover-phases and resumability validation; increase DML throughput for the existing concurrent-writes test.
  • Update the move-tables test harness to build a failpoint-enabled binary for these scenarios and to tune checkpointing for faster integration runs.
Show a summary per file
File Description
.gitignore Ignore generated failpoint binding/stash files and the tools/ dir used by the test harness.
go.mod Add github.com/pingcap/failpoint dependency.
go.sum Add checksums for github.com/pingcap/failpoint.
go/cmd/gh-ost/main.go Add --unsafe-fail-points-enabled flag and relax checkpoint-seconds minimum in unsafe mode.
go/base/context.go Add UnsafeFailPointsEnabled and MigrationContext.NewFailPoint helper for injected panics.
go/logic/migrator.go Add failpoint injection sites around move-tables cutover stages and after row-copy.
localtests/move-tables-test.sh Extend harness to support failpoint-enabled builds, table list handling, and faster checkpoints.
localtests/move-tables/single-concurrent-writes/on_test.sh Increase concurrent write throughput to better exercise queue drain.
script/move-tables/insert-source-primary-loop Fix DATABASE variable assignment quoting.
localtests/move-tables/resume-panic-on-row-copy/test.sh New test: panic during row-copy and validate --resume completion.
localtests/move-tables/resume-panic-on-row-copy/tables.txt Tables list for the new row-copy panic test.
localtests/move-tables/resume-panic-on-row-copy/create.sql Test schema/data for the new row-copy panic test.
localtests/move-tables/resume-panic-before-drain-complete/test.sh New test: panic after source rename and before drain completion, then resume.
localtests/move-tables/resume-panic-before-drain-complete/tables.txt Tables list for the new pre-drain panic test.
localtests/move-tables/resume-panic-before-drain-complete/create.sql Test schema/data for the new pre-drain panic test.
localtests/move-tables/resume-panic-before-on-success-hook/test.sh New test: panic after drain completion and before on-success hook, then resume.
localtests/move-tables/resume-panic-before-on-success-hook/tables.txt Tables list for the new pre-on-success panic test.
localtests/move-tables/resume-panic-before-on-success-hook/create.sql Test schema/data for the new pre-on-success panic test.
vendor/modules.txt Add vendoring metadata entry for github.com/pingcap/failpoint.
vendor/github.com/pingcap/failpoint/terms.go Vendored failpoint implementation.
vendor/github.com/pingcap/failpoint/README.md Vendored failpoint documentation.
vendor/github.com/pingcap/failpoint/marker.go Vendored failpoint marker stubs used by the rewriter.
vendor/github.com/pingcap/failpoint/Makefile Vendored tool build scripts.
vendor/github.com/pingcap/failpoint/MAINTAINERS.md Vendored project metadata.
vendor/github.com/pingcap/failpoint/LICENSE Vendored license.
vendor/github.com/pingcap/failpoint/http.go Vendored optional failpoint HTTP control handler.
vendor/github.com/pingcap/failpoint/failpoints.go Vendored failpoint registry/eval logic.
vendor/github.com/pingcap/failpoint/failpoint.go Vendored failpoint core types and execution logic.
vendor/github.com/pingcap/failpoint/CONTRIBUTING.md Vendored contributing guide.
vendor/github.com/pingcap/failpoint/.gitignore Vendored gitignore.
vendor/github.com/pingcap/failpoint/.codecov.yml Vendored Codecov config.

Copilot's findings

Tip

Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comments suppressed due to low confidence (1)

localtests/move-tables-test.sh:500

  • build_binary enables failpoint rewriting before checking for an existing binary, but returns early without disabling it. This can leave the working tree in a rewritten state (__failpoint_binding__/stash files), affecting subsequent builds/tests.
build_binary() {
    enable_failpoint

    echo "Building"
    rm -f $default_ghost_binary
    [ "$ghost_binary" == "" ] && ghost_binary="$default_ghost_binary"
    if [ -f "$ghost_binary" ]; then
        echo "Using binary: $ghost_binary"
        return 0
    fi
  • Files reviewed: 16/31 changed files
  • Comments generated: 8

Comment thread localtests/move-tables-test.sh Outdated
Comment thread localtests/move-tables-test.sh
Comment thread go/base/context.go
Comment thread localtests/move-tables/resume-panic-before-drain-complete/test.sh Outdated
Comment thread localtests/move-tables/resume-panic-before-drain-complete/test.sh Outdated
Comment thread localtests/move-tables/resume-panic-before-on-success-hook/test.sh Outdated
Comment thread localtests/move-tables/resume-panic-before-on-success-hook/test.sh Outdated
Comment thread localtests/move-tables/resume-panic-on-row-copy/test.sh Outdated
Comment thread script/move-tables/insert-source-primary-loop Outdated
chriskirkland and others added 3 commits June 22, 2026 19:38
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

@zacharysierakowski zacharysierakowski left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love how easy it was to read through the flow of each test! Left some comments about potential assertions we could add for each test. But overall this looks pretty solid

Comment thread go/logic/migrator.go
Comment thread script/move-tables/insert-source-primary-loop Outdated
Comment thread localtests/move-tables/resume-panic-on-row-copy/test.sh Outdated
Comment thread localtests/move-tables/resume-panic-on-row-copy/test.sh Outdated
Comment thread localtests/move-tables/resume-panic-on-row-copy/test.sh Outdated
Comment thread localtests/move-tables/resume-panic-on-row-copy/test.sh
return 1
fi

echo "✅ Validated checkpointed state on unexpected exit..."

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we query the checkpoint table above this too to sanity check its state? Or do we think that's getting too fine grained for this level of test

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, i noticed there's nothing asserting above that the drain only partially completed (i.e. the source has more rows than the target). Did you exclude that intentionally b/c it might be flaky?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you exclude that intentionally b/c it might be flaky?

Yeah, I worried that would be inherently flakey without a mechanism to programmatically stop the DML event stream from processing in gh-ost. I think I found a happy medium by looking at the latest checkpoint. (c7423cd)

Comment thread localtests/move-tables/resume-panic-before-drain-complete/test.sh
Comment thread localtests/move-tables/resume-panic-before-on-success-hook/test.sh
Comment thread localtests/move-tables/resume-panic-before-on-success-hook/test.sh
@zacharysierakowski zacharysierakowski added the feature-move-tables PRs that are associated with the new move-tables feature label Jun 23, 2026
Comment thread localtests/move-tables/single-with-hooks/test.sh Outdated
Comment thread localtests/move-tables/resume-panic-on-row-copy/test.sh Outdated

@zacharysierakowski zacharysierakowski left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀 🚀 🚀 🚀

@chriskirkland chriskirkland merged commit a0f8722 into feature-move-tables Jun 23, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feature-move-tables PRs that are associated with the new move-tables feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants