Potential duplication of shared backup files with ingestion of DB-generated SST files #12979

Open
pdillinger opened this issue Aug 28, 2024 · 0 comments
Labels
performance Issues related to performance that may or may not be bugs

Comments

@pdillinger
Contributor

The thread of work including #12750 and #12959 is good, but it introduces a potential efficiency issue in backups. In the simplest case, suppose we ingest SSTs from one CF into another CF in the same DB. The ingested SST files get new file numbers, and the file number is a critical part of the key for de-duplicating SST files in backups. Thus, a backup of the DB taken after ingestion cannot share a backed-up SST file with a backup that holds the same file in a different CF, because the file numbers differ. See ShareFilesNaming for more background.
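
To make the de-duplication problem concrete, here is a minimal sketch under stated assumptions. `SstFile`, `SharedName`, and `SharesBackupFile` are hypothetical names for illustration, not RocksDB APIs, and the name layout is a simplified stand-in for the session-id based scheme described under ShareFilesNaming (no zero padding, file size always included).

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical, simplified view of an SST file as seen by the backup engine.
struct SstFile {
  uint64_t file_number;       // number assigned by the DB that holds the file
  std::string db_session_id;  // recorded in the file's table properties
  uint64_t file_size;         // size in bytes
};

// Assumed simplified shared-file name:
//   <file_number>_s<db_session_id>_<file_size>.sst
std::string SharedName(const SstFile& f) {
  assert(!f.db_session_id.empty());
  return std::to_string(f.file_number) + "_s" + f.db_session_id + "_" +
         std::to_string(f.file_size) + ".sst";
}

// Two files can share one backed-up copy only if their shared names match.
bool SharesBackupFile(const SstFile& a, const SstFile& b) {
  return SharedName(a) == SharedName(b);
}
```

In this model, a file with number 12 and its ingested copy with number 57 carry identical contents and table properties, yet `SharesBackupFile` returns false for the pair, so the backup stores the data twice.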

At first glance, the best solution might seem to be a new naming scheme based only on SST unique IDs. The problem is that on restore we derive the destination file name from the shared file name (by removing the parts added for uniqueness). Without the file number in the shared file name, we don't know what file name to restore to; with it, we can't maximize sharing.

I propose that we name the file in the backup shared directory based on the orig_file_number in the table properties (when != 0). If the file number in the DB doesn't match that, we add a field (perhaps just "num") to the backup manifest entry indicating which file number the file should be restored to. This is a major schema change, so it would require backup schema_version=3, because ignoring the new field on restore would result in a corrupt DB. Importantly, it allows a graceful upgrade path to schema_version=3: the shared file name of an SST file changes only if the file was ingested from a DB file, so there is no hiccup that breaks incrementality of backups.
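
A rough sketch of how the proposal could work, assuming C++17. `SstInfo`, `ManifestEntry`, `MakeEntry`, and `RestoreFileNumber` are hypothetical names, the name layout is the same simplified stand-in for the real scheme, and the manifest is modeled as a struct rather than the actual on-disk format.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical per-file inputs to the backup engine.
struct SstInfo {
  uint64_t db_file_number;    // file number in the DB being backed up
  uint64_t orig_file_number;  // from table properties; 0 means not recorded
  std::string db_session_id;
  uint64_t file_size;
};

// Hypothetical manifest entry; `num` models the proposed extra field.
struct ManifestEntry {
  std::string shared_name;
  std::optional<uint64_t> num;  // set only when the restore number differs
};

ManifestEntry MakeEntry(const SstInfo& f) {
  assert(!f.db_session_id.empty());
  // Name by orig_file_number when available, else by the DB file number.
  const uint64_t naming_number =
      f.orig_file_number != 0 ? f.orig_file_number : f.db_file_number;
  ManifestEntry e;
  e.shared_name = std::to_string(naming_number) + "_s" + f.db_session_id +
                  "_" + std::to_string(f.file_size) + ".sst";
  if (naming_number != f.db_file_number) {
    e.num = f.db_file_number;  // restore must not use the number in the name
  }
  return e;
}

uint64_t RestoreFileNumber(const ManifestEntry& e) {
  if (e.num.has_value()) return *e.num;
  // std::stoull parses the leading digits and stops at the first '_'.
  return std::stoull(e.shared_name);
}
```

Under these assumptions, a file and its ingested copy map to the same shared name, and only the ingested copy's entry carries `num`. Files that were never ingested keep their current shared names, which is what preserves incrementality across the upgrade. A reader that ignored `num` would restore the ingested copy under the wrong file number, which is why the field forces a schema version bump.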

@pdillinger pdillinger added the performance Issues related to performance that may or may not be bugs label Aug 28, 2024