Fix missing cases of corruption retries #13122

Closed
anand1976
Contributor

This PR fixes a few cases where RocksDB was not retrying file reads that hit a checksum failure or corruption with the verify_and_reconstruct_read IO option. After fixing these cases, we can almost always successfully open the DB and execute reads even if we see transient corruptions, provided the FileSystem supports the verify_and_reconstruct_read option. The specific cases fixed in this PR are:

  1. CURRENT file
  2. IDENTITY file
  3. OPTIONS file
  4. SST footer

Test plan:
Unit test in db_io_failure_test.cc that injects corruption at various stages of DB open and reads
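
The retry behavior described above can be sketched roughly as follows. This is a self-contained illustration, not RocksDB code: `ReadStatus`, `ReadOptionsSketch`, `ReadFn`, and `ReadWithCorruptionRetry` are hypothetical names. Only the idea of retrying once with `verify_and_reconstruct_read` set, when the file system supports it, comes from the PR description.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical stand-ins for illustration; these are not RocksDB types.
enum class ReadStatus { kOk, kCorruption, kIOError };

struct ReadOptionsSketch {
  // Models the verify_and_reconstruct_read IO option named in the PR.
  bool verify_and_reconstruct_read = false;
};

// A reader callback: fills `out` and reports whether the data verified.
using ReadFn =
    std::function<ReadStatus(const ReadOptionsSketch&, std::string* out)>;

// Read once; on corruption, retry a single time with the stronger option,
// but only if the (hypothetical) file system reports that it supports it.
inline ReadStatus ReadWithCorruptionRetry(const ReadFn& read,
                                          bool fs_supports_reconstruct,
                                          std::string* out) {
  assert(out != nullptr);
  ReadOptionsSketch opts;
  ReadStatus s = read(opts, out);
  if (s == ReadStatus::kCorruption && fs_supports_reconstruct) {
    opts.verify_and_reconstruct_read = true;
    s = read(opts, out);
  }
  return s;
}
```

A transient corruption is then invisible to the caller as long as the second read verifies; if the file system lacks support, the original corruption status is returned unchanged.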

// is allocated with max_open_files - 10 as capacity. So override
// max_open_files to 11 so table cache capacity will become 1. This will
// prevent file open during DB open and force the file to be opened
// during MultiGet
Contributor

This is a nice trick!
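
To make the arithmetic in the quoted comment concrete: assuming, as that comment states, that the table cache capacity is `max_open_files - 10`, setting `max_open_files` to 11 leaves room for a single table reader. `TableCacheCapacitySketch` and `kReservedFilesSketch` below are hypothetical names, and the sketch ignores any special values or clamping RocksDB may apply.

```cpp
#include <cassert>

// Hypothetical helper mirroring the comment: capacity = max_open_files - 10.
// Any special-casing RocksDB may do (e.g. for unlimited open files) is not
// modeled here.
constexpr int kReservedFilesSketch = 10;

constexpr int TableCacheCapacitySketch(int max_open_files) {
  return max_open_files - kReservedFilesSketch;
}

static_assert(TableCacheCapacitySketch(11) == 1,
              "11 open files leaves one table cache slot");
```

With a capacity of 1, at most one SST reader stays cached, so files generally have to be opened on the read path, which is where the test wants the corruption to be hit.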

}
if (s.ok()) {
  s = ValidityCheck();
}
Contributor

@jaykorean Nov 7, 2024

nit: if (s.ok()) { return s; } after this validity check would be more readable to me than setting retry = false in the else on line 349. Or you could explicitly return on line 349.

  s = ValidityCheck();
}
if (!s.ok()) {
  if ((s.IsCorruption() || s.IsInvalidArgument()) && !retry &&
Contributor

Question for my own learning: in what scenario would we get s.IsInvalidArgument() here such that a retry with kVerifyAndReconstructRead would succeed?

Contributor Author

Any syntax error during parsing is reported as Status::InvalidArgument(). The syntax error could itself be caused by corruption, in which case retrying can potentially correct it.
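
A small standalone sketch of why a corrupted read can surface as a parse error rather than a checksum failure. The parser, the `name=value` line format, and the names `ParseStatus`, `ParseOptionLine`, and `FlipBit` are assumptions made for illustration; none of this is RocksDB's actual options parser or `Status` class.

```cpp
#include <cassert>
#include <string>

// Hypothetical status codes for illustration; not RocksDB's Status class.
enum class ParseStatus { kOk, kInvalidArgument };

// Minimal "name=value" parser: a line without '=' (or with an empty name)
// is treated as a syntax error.
inline ParseStatus ParseOptionLine(const std::string& line, std::string* name,
                                   std::string* value) {
  assert(name != nullptr && value != nullptr);
  const std::string::size_type eq = line.find('=');
  if (eq == std::string::npos || eq == 0) {
    return ParseStatus::kInvalidArgument;
  }
  *name = line.substr(0, eq);
  *value = line.substr(eq + 1);
  return ParseStatus::kOk;
}

// Flip the lowest bit of one byte to simulate a transient read corruption.
inline std::string FlipBit(std::string data, std::string::size_type pos) {
  assert(pos < data.size());
  data[pos] = static_cast<char>(data[pos] ^ 0x01);
  return data;
}
```

Flipping a single bit in the `=` separator turns a valid line into one the parser rejects, so a caller that only retried on corruption statuses would miss this case. Re-reading the file and parsing again is what gives the retry a chance to succeed.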

return s;
}

// This means the next read after injecting corruption was not
Contributor

Wondering if this comment was cut off.

    ss << std::setw(3) << 100 * sst + key;
    ASSERT_OK(Put("key" + ss.str(), "val" + ss.str()));
  }
  Flush();
Contributor

ASSERT_OK(Flush());

}
Flush();
}
Close();
Contributor

ASSERT_OK(Close());

Contributor Author

Looks like DBTestBase::Close() does not return a Status.

@facebook-github-bot
Contributor

@anand1976 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Contributor

@jaykorean left a comment

👍

@facebook-github-bot
Contributor

@anand1976 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Contributor

@anand1976 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@anand1976 merged this pull request in ee25861.

anand1976 added a commit that referenced this pull request Nov 12, 2024
Summary:
This PR fixes a few cases where RocksDB was not retrying checksum failure/corruption of file reads with the `verify_and_reconstruct_read` IO option. After fixing these cases, we can almost always successfully open the DB and execute reads even if we see transient corruptions, provided the `FileSystem` supports the `verify_and_reconstruct_read` option. The specific cases fixed in this PR are -
1. CURRENT file
2. IDENTITY file
3. OPTIONS file
4. SST footer

Pull Request resolved: #13122

Test Plan: Unit test in `db_io_failure_test.cc` that injects corruption at various stages of DB open and reads

Reviewed By: jaykorean

Differential Revision: D65617982

Pulled By: anand1976

fbshipit-source-id: 4324b88cc7eee5501ab5df20ef7a95bb12ed3ea7