bumping max se versions and age #321

Closed

@cbb330 (Collaborator) commented May 14, 2025

Summary

Problem:
The current CDC (Change Data Capture) benchmark process requires streaming data into OpenHouse for periods up to 24 hours. After the stream is manually stopped (sometime after the 24-hour mark), operators need to perform two key actions:

  1. Roll back the table to specific time-window checkpoints (e.g., 24hr, 12hr, 6hr).
  2. Perform further rollbacks to these checkpoints after executing additional operations, such as compaction, on the windowed dataset.

The primary challenge is that routine snapshot expiration can delete the historical data versions required for these rollbacks. If snapshots expire too soon, rolling back becomes impossible. This necessitates a complete re-ingestion of the data, which can delay testing by up to 24 hours.

Solution:
To ensure that the necessary snapshots are retained for the duration of the CDC benchmark and subsequent testing, we are raising the maximum number of snapshot versions that can be retained (the "ceiling", MAX_VERSIONS in the code) to 900. This new value is derived as follows:

  • The CDC benchmark streams commits at a frequency of one commit every 5 minutes.
  • We need to ensure data is available for up to 3 days to allow for the manual stopping of the stream and the completion of all rollback and testing procedures.

Calculation:
(24 hours/day * 60 minutes/hour / 5 minutes/commit) * 3 days = 288 commits/day * 3 days = 864 commits
Setting the ceiling to 900 provides a sufficient buffer above the calculated 864 commits. This prevents premature snapshot expiration, allowing operators to reliably perform rollbacks as required by the benchmark, thus avoiding lengthy data re-ingestion delays.
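
The arithmetic above can be checked with a short Java sketch. The class and method names here (SnapshotBudget, commitsForWindow) are hypothetical and are not part of this PR; only the 5-minute commit interval, the 3-day window, and the ceiling of 900 come from the description.

```java
// Hypothetical helper for illustration only; not part of the OpenHouse codebase.
final class SnapshotBudget {
    private static final int MINUTES_PER_DAY = 24 * 60;

    /** Number of commits produced over windowDays at a fixed commit interval. */
    static int commitsForWindow(int commitIntervalMinutes, int windowDays) {
        if (commitIntervalMinutes <= 0 || windowDays < 0) {
            throw new IllegalArgumentException(
                "interval must be positive and window must be non-negative");
        }
        int commitsPerDay = MINUTES_PER_DAY / commitIntervalMinutes; // 288 for 5 minutes
        return commitsPerDay * windowDays;
    }

    public static void main(String[] args) {
        int maxVersions = 900;                 // ceiling proposed in this PR
        int required = commitsForWindow(5, 3); // one commit every 5 minutes, 3 days
        // prints "required=864, headroom=36"
        System.out.println("required=" + required + ", headroom=" + (maxVersions - required));
    }
}
```

The headroom of 36 commits corresponds to 3 hours of extra streaming at one commit every 5 minutes.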

In addition, max_age takes precedence over the maximum version count, so 900 versions would still be limited to 3 days. We must therefore also bump max_age to 15 days.
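
To illustrate why both limits have to move together, here is a minimal Java sketch. It assumes that the age limit and the version limit are both applied and that the stricter one wins, which is one reading of "max_age takes precedence"; the actual expiration logic in OpenHouse may differ. RetentionLimits and its methods are hypothetical names, and the 3-day previous age limit is inferred from the description.

```java
// Hypothetical model of two retention caps; assumes the stricter cap wins.
final class RetentionLimits {
    static int commitsPerDay(int commitIntervalMinutes) {
        return (24 * 60) / commitIntervalMinutes;
    }

    /** Snapshots that fall inside the age window at a fixed commit interval. */
    static long snapshotsWithinAge(int maxAgeDays, int commitIntervalMinutes) {
        return (long) maxAgeDays * commitsPerDay(commitIntervalMinutes);
    }

    /** Snapshots that survive when both caps apply (assumption: minimum of the two). */
    static long retainedSnapshots(int maxVersions, int maxAgeDays, int commitIntervalMinutes) {
        return Math.min(maxVersions, snapshotsWithinAge(maxAgeDays, commitIntervalMinutes));
    }

    /** True when the age cap, not the version cap, decides what is kept. */
    static boolean ageIsBinding(int maxVersions, int maxAgeDays, int commitIntervalMinutes) {
        return snapshotsWithinAge(maxAgeDays, commitIntervalMinutes) < maxVersions;
    }

    public static void main(String[] args) {
        // Age cap of 3 days: only 864 snapshots fit, so the age cap is the binding limit.
        System.out.println(retainedSnapshots(900, 3, 5) + " " + ageIsBinding(900, 3, 5));
        // Age cap of 15 days: 4320 snapshots fit, so the 900-version ceiling applies.
        System.out.println(retainedSnapshots(900, 15, 5) + " " + ageIsBinding(900, 15, 5));
    }
}
```

Under these assumptions, a 3-day age limit leaves no slack beyond the 864 commits computed above, while a 15-day age limit lets the full 900-version ceiling take effect.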

As an aside, this PR also makes these parameters static constants for clarity.

Changes

  • Client-facing API Changes
  • Internal API Changes
  • Bug Fixes
  • New Features
  • Performance Improvements
  • Code Style
  • Refactoring
  • Documentation
  • Tests

Testing Done

  • Manually Tested on local docker setup. Please include commands ran, and their output.
  • Added new tests for the changes made.
  • Updated existing tests to reflect the changes made.
  • No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
  • Some other form of testing like staging or soak time in production. Please explain.

Relying on existing unit tests.

Additional Information

  • Breaking Changes
  • Deprecations
  • Large PR broken into smaller PRs, and PR plan linked in the description.

teamurko previously approved these changes May 14, 2025
Will-Lo previously approved these changes May 14, 2025
@cbb330 dismissed stale reviews from Will-Lo and teamurko via bd42a6f May 14, 2025 16:56
@cbb330 changed the title from "bumping max se versions" to "bumping max se versions and age" May 14, 2025
private static final int MIN_VERSIONS = 2;
private static final int MAX_VERSIONS = 900;
private static final int MIN_DAYS_RETENTION = 1;
private static final int MAX_DAYS_RETENTION = 15;
Collaborator

Why is 15 chosen?

SE is just one step in removing data from the data lake. But from a compliance perspective, the e2e turnaround time matters.

Collaborator

Can these values be config-driven?

5 participants