Extend the unicast based recovery algorithm to do replication policy check #11996

Open
wants to merge 1 commit into
base: main
from

sbodagala
Contributor

Extend the version vector (unicast) based recovery algorithm to perform the replication policy check when deciding whether a version can be recovered from the set of available log servers. This makes the algorithm compatible with the non-unicast ("main") algorithm in how it handles non-reporting log servers during recovery.

Test that exposed this issue:

build_output/bin/fdbserver -r simulation --crash -f /root/src/foundationdb/tests/slow/RyowCorrectness.toml -b off -s 29779152

A "getRange()" call was blocked because recovery was not completing: "replication_factor" log servers were not reporting during recovery. However, this set of non-reporting log servers did not satisfy the replication policy, so extending the recovery algorithm to do the replication policy check allowed recovery to progress and the test to succeed.

Note that this extension lets recovery make progress only in cases where the non-reporting log servers do not satisfy the replication policy. It does, however, make the algorithm compatible with "main" in such scenarios.
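To illustrate the idea (this is not the code in this PR), here is a minimal Python sketch of the check. It assumes a simplified policy of "replicas across `replication_factor` distinct zones"; the names `satisfies_policy`, `can_recover_version`, and `ZONE_OF` are hypothetical, and the actual replication policies in the codebase are more general.

```python
from typing import Dict, Set

# Hypothetical placement of four log servers across three zones.
ZONE_OF: Dict[str, str] = {"A": "z1", "B": "z1", "C": "z2", "D": "z3"}


def satisfies_policy(servers: Set[str], zone_of: Dict[str, str],
                     replication_factor: int) -> bool:
    """Assumed policy: the servers must span at least
    `replication_factor` distinct zones."""
    zones = {zone_of[s] for s in servers}
    return len(zones) >= replication_factor


def can_recover_version(version_servers: Set[str], reporting: Set[str],
                        zone_of: Dict[str, str],
                        replication_factor: int) -> bool:
    """Decide whether recovery may proceed for a version that was sent
    to `version_servers`, given which servers reported."""
    non_reporting = version_servers - reporting
    # Count-only check: fewer than `replication_factor` servers are missing.
    if len(non_reporting) < replication_factor:
        return True
    # Extended check: enough servers are missing by count, but if they
    # cannot satisfy the policy on their own, recovery can still proceed.
    return not satisfies_policy(non_reporting, zone_of, replication_factor)


all_servers = {"A", "B", "C", "D"}
# A and B are missing but share zone z1, so the policy is not met.
print(can_recover_version(all_servers, {"C", "D"}, ZONE_OF, 2))
# A and C are missing and span z1 and z2, so the policy is met.
print(can_recover_version(all_servers, {"B", "D"}, ZONE_OF, 2))
```

In this sketch, the first call corresponds to the scenario described above: "replication_factor" log servers are not reporting, yet they do not satisfy the policy, so recovery is allowed to progress.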

Testing:

Id (with version vector disabled): 20250305-205711-sre-b53cba5eecb4dadb (started).

Code-Reviewer Section

The general pull request guidelines can be found here.

Please check each of the following things and check all boxes before accepting a PR.

  • The PR has a description, explaining both the problem and the solution.
  • The description mentions which forms of testing were done and the testing seems reasonable.
  • Every function/class/actor that was touched is reasonably well documented.

For Release-Branches

If this PR is made against a release-branch, please also check the following:

  • This change/bugfix is a cherry-pick from the next younger branch (younger release-branch or main if this is the youngest branch)
  • There is a good reason why this PR needs to go into a release branch and this reason is documented (either in the description above or in a linked GitHub issue)

@sbodagala sbodagala requested review from dlambrig and jzhou77 March 5, 2025 21:01
@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-ide on Linux CentOS 7

  • Commit ID: 6851e8f
  • Duration 0:22:18
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: 6851e8f
  • Duration 0:48:18
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang on Linux CentOS 7

  • Commit ID: 6851e8f
  • Duration 0:50:54
  • Result: ❌ FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

  • Commit ID: 6851e8f
  • Duration 0:55:48
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr on Linux CentOS 7

  • Commit ID: 6851e8f
  • Duration 0:56:28
  • Result: ❌ FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

2 participants