Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HDDS-11946. Require all ozone repair commands to support a --dry-run option #7682

Merged
merged 9 commits into from
Jan 23, 2025

Conversation

adoroszlai
Copy link
Contributor

What changes were proposed in this pull request?

  1. Add --dry-run option to all ozone repair subcommands that modify state. The new option currently defaults to false, so commands perform repair by default. (Otherwise we'd need to use it as --dry-run=false.)
  2. Implement dry-run mode where needed.

https://issues.apache.org/jira/browse/HDDS-11946

How was this patch tested?

Enforce --dry-run option

Added unit test to verify that each subcommand under ozone repair satisfies one of the following:

  • has subcommands
  • has --dry-run option
  • implements ReadOnlySubcommand (currently only QuotaStatus)

Test failure before implementing ReadOnlyCommand in QuotaStatus:

[ERROR] Tests run: 3, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.261 s <<< FAILURE! - in org.apache.hadoop.ozone.repair.TestOzoneRepair
[ERROR] org.apache.hadoop.ozone.repair.TestOzoneRepair.subcommandsSupportDryRun  Time elapsed: 0.211 s  <<< FAILURE!
java.lang.AssertionError: 
['ozone repair quota status' defined by class org.apache.hadoop.ozone.repair.quota.QuotaStatus should support --dry-run or implement interface org.apache.hadoop.ozone.repair.ReadOnlyCommand] 
Expecting UnmodifiableSet:
  ["--service-id",
    "--om-service-id",
    "--service-host",
    "-h",
    "--help",
    "-V",
    "--version"]
to contain:
  ["--dry-run"]
but could not find the following element(s):
  ["--dry-run"]

	at org.apache.hadoop.ozone.repair.TestOzoneRepair.assertSubcommandOptionRecursively(TestOzoneRepair.java:84)
	at org.apache.hadoop.ozone.repair.TestOzoneRepair.assertSubcommandOptionRecursively(TestOzoneRepair.java:89)
	at org.apache.hadoop.ozone.repair.TestOzoneRepair.assertSubcommandOptionRecursively(TestOzoneRepair.java:89)
	at org.apache.hadoop.ozone.repair.TestOzoneRepair.subcommandsSupportDryRun(TestOzoneRepair.java:71)

Tool operation

Sample output of ozone repair om fso-tree from integration tests:

[dry run] No running OM service detected. Proceeding with repair.
[dry run] Creating database of reachable directories at hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-430e6515-4d93-4d63-99d9-2d0e1801aa90/ozone-meta/reachable.db
[dry run] Processing volume: /vol2
[dry run] Processing bucket: vol2/bucket1
[dry run] Deleting unreferenced directory /-9223372036854768896/-9223372036854768384/-9223372036854767871/dir2
[dry run] Deleting unreferenced file /-9223372036854768896/-9223372036854768384/-9223372036854767359/file3
[dry run] Deleting unreferenced file /-9223372036854768896/-9223372036854768384/-9223372036854767871/file1
[dry run] Deleting unreferenced file /-9223372036854768896/-9223372036854768384/-9223372036854767871/file2
[dry run] Processing bucket: vol2/bucket2
2025-01-10 16:46:50,087 [main] INFO  rocksdiff.RocksDBCheckpointDiffer (RocksDBCheckpointDiffer.java:close(307)) - Shutting down CompactionDagPruningService.
[dry run] 
Reachable:
	Directories: 4
	Files: 5
	Bytes: 50
Unreachable:
	Directories: 0
	Files: 0
	Bytes: 0
Unreferenced:
	Directories: 1
	Files: 3
	Bytes: 30

Manual test of ozone repair om update-transaction in compose environment:

$ ozone repair om update-transaction --db=/data/metadata/om.db --term 2 --index 3 --dry-run
[dry run] No running OM service detected. Proceeding with repair.
[dry run] The original highest transaction Info was (t:1, i:14)
[dry run] Updating transaction info to (t:2, i:3)

$ ozone repair om update-transaction --db=/data/metadata/om.db --term 2 --index 3 
ATTENTION: Running as user hadoop. Make sure this is the same user used to run the Ozone process. Are you sure you want to continue (y/N)? y
Run as user: hadoop
No running OM service detected. Proceeding with repair.
The original highest transaction Info was (t:1, i:14)
Updating transaction info to (t:2, i:3)
The highest transaction info has been updated to: (t:2, i:3)

$ ozone repair om update-transaction --db=/data/metadata/om.db --term 2 --index 3 --dry-run
[dry run] No running OM service detected. Proceeding with repair.
[dry run] The original highest transaction Info was (t:2, i:3)
[dry run] Updating transaction info to (t:2, i:3)

CI:
https://github.com/adoroszlai/ozone/actions/runs/12714010228

@adoroszlai adoroszlai added the tools Tools that helps with debugging label Jan 10, 2025
@adoroszlai adoroszlai self-assigned this Jan 10, 2025
@adoroszlai adoroszlai requested a review from errose28 January 10, 2025 18:39
@adoroszlai
Copy link
Contributor Author

@sarvekshayr @Tejaskriya please review

@sarvekshayr
Copy link
Contributor

In my opinion, it would be better to make the --dry-run option the default behaviour. This allows users to see what changes would be performed before executing the actual repair, reducing the risk of accidental modifications to the database. While users can explicitly specify the --dry-run option to preview changes, making it the default ensures it won't be overlooked.

Since the primary focus of this PR is to add the --dry-run option for all repair commands (and any commands to be added in the future), we can consider making --dry-run the default behaviour in a follow-up JIRA.

@adoroszlai
Copy link
Contributor Author

we can consider making --dry-run the default behaviour in a follow-up JIRA.

I think we can decide and implement the default behavior here, that's why I initially created the PR as draft.

I have no objection against making dry run the default, but I find --dry-run=false cumbersome. IMO we need a different name for the option in this case. --repair is already used in fso-tree, so that would be my primary choice.

@sarvekshayr
Copy link
Contributor

I have no objection against making dry run the default, but I find --dry-run=false cumbersome. IMO we need a different name for the option in this case. --repair is already used in fso-tree, so that would be my primary choice.

We can go ahead and use the --repair flag.

@adoroszlai adoroszlai marked this pull request as ready for review January 13, 2025 18:15
@adoroszlai adoroszlai changed the title HDDS-11946. Require all ozone repair commands to support a --dry-run option HDDS-11946. Let ozone repair run dry by default, require --repair option Jan 13, 2025
@adoroszlai
Copy link
Contributor Author

@errose28 could you please review?

@errose28
Copy link
Contributor

errose28 commented Jan 14, 2025

I actually prefer --dry-run as an option and the default to be doing the repair. The user has already entered ozone repair and acknowledged the warning prompt, so after all this not doing the repair seems counter-intuitive. Not to mention the repetitiveness of ozone repair ... --repair just to make the command do what was intended. FSO repair was originally made before there as a dedicated ozone repair command with its own safeguards in place hence dry run was the default in that implementation, but we have a better framework in place now.

@adoroszlai adoroszlai changed the title HDDS-11946. Let ozone repair run dry by default, require --repair option HDDS-11946. Require all ozone repair commands to support a --dry-run option Jan 14, 2025
Copy link
Contributor

@sarvekshayr sarvekshayr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@Tejaskriya Tejaskriya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the patch @adoroszlai, LGTM

@adoroszlai
Copy link
Contributor Author

I actually prefer --dry-run as an option and the default to be doing the repair.

Thanks @errose28, patch is updated (reverted to previous commit).

Copy link
Contributor

@errose28 errose28 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this @adoroszlai, looks very nicely implemented. Can we add tests for repair commands that they in fact do not make changes accidentally when --dry-run is given? We might have to do this on a per-command basis, but I'm imagining a test like this:

  • Write permissions are removed on the directory with the content to be repaired
  • Command is run with --dry-run and it should still succeed without raising errors about being unable to modify content

Alternatively something like diff -rq could be used on the before and after content to check for differences.

@adoroszlai
Copy link
Contributor Author

Can we add tests for repair commands that they in fact do not make changes accidentally

  • Tests for FSORepairTool already cover both modes.
  • I can update tests for QuotaTrigger and TransactionInfoRepair. The latter is also being changed in HDDS-12050. Implement TransactionInfoRepair command for SCM #7689, and I want to avoid conflicts.
  • RecoverSCMCertificate and SnapshotChainRepair have no tests at all. Created HDDS-12121 and HDDS-12122 for addressing that. I think it's out of scope in this PR to add tests from scratch, since we need to repro the broken state that's being repaired.

We can put this PR on hold until #7689 is merged, and HDDS-12121 and HDDS-12122 are implemented.

@errose28
Copy link
Contributor

It looks like we will need to add tests individually, so I'm ok with just adding tests for QuotaTrigger dry run in this PR and merging it. We can use the other Jiras for adding dry run tests to each respective command since it looks like this PR is on track to get merged first.

Copy link
Contributor

@nandakumar131 nandakumar131 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, The change looks good.
It would be nice if we can have an abstract prepare() and execute() method in RepairTool class, we can only call prepare for dry-run and skip the execute part.
It will avoid if checks in the actual Repair Command implementations.

This will be hard to do for commands like FSORepairTool. We can merge this PR now and revisit later to see if we can simplify.

@adoroszlai adoroszlai merged commit 518338f into apache:master Jan 23, 2025
42 checks passed
@adoroszlai adoroszlai deleted the HDDS-11946 branch January 23, 2025 09:34
@adoroszlai
Copy link
Contributor Author

Thanks @errose28, @nandakumar131, @sarvekshayr, @Tejaskriya for the review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tools Tools that helps with debugging
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants