Capture number of snapshots created per day as a metric #149
c3c15c5 to ffd3afc
A couple of nits and a question about storing large maps in the table.
apps/spark/src/main/java/com/linkedin/openhouse/jobs/util/TableStatsCollectorUtil.java
/** Get snapshot distribution for a given table by date. */
private static Map<String, Long> getSnapShotDistributionPerDay(
    Table table, SparkSession spark, MetadataTableType metadataTableType) {
  Dataset<Row> snapShotDistribution =
- Dataset<Row> snapShotDistribution =
+ Dataset<Row> snapshotDistribution =
@@ -35,4 +36,6 @@ public class IcebergTableStats extends BaseTableMetadata {
  private Long numReferencedManifestFiles;

  private Long numReferencedManifestLists;

  private Map<String, Long> snapshotCountByDay;
Would this be a large map if there is a key for every day?
If SE is functioning correctly, this should only have a bounded number of days, right?
That is right. Ideally it should hold only 3 days' worth of data. We could also consider collecting only the past 2 days, since we should already have the older data from previous runs.
private static Map<String, Long> getSnapShotDistributionPerDay(
    Table table, SparkSession spark, MetadataTableType metadataTableType) {
  Dataset<Row> snapShotDistribution =
      SparkTableUtil.loadMetadataTable(spark, table, metadataTableType)
Will this get all snapshots committed since the beginning of the table?
If yes, should it also have a filter criterion so that it only counts snapshots from the last X days?
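For illustration, the "last X days" filter asked about above could be sketched in plain Java as follows. This is only a sketch under assumptions: `RecentSnapshotFilter` and `committedWithinLastDays` are hypothetical names that do not appear in the PR, the commit timestamps are assumed to be available as `java.time.Instant` values, and in the actual job the same cutoff would more likely be applied on the Spark side before rows are collected.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper for illustration only; not part of the PR.
final class RecentSnapshotFilter {

  private RecentSnapshotFilter() {}

  /**
   * Keeps only the commit timestamps that fall within the last {@code days} days,
   * measured back from {@code now}. A timestamp exactly at the cutoff is kept.
   */
  static List<Instant> committedWithinLastDays(List<Instant> committedAt, int days, Instant now) {
    Instant cutoff = now.minus(Duration.ofDays(days));
    return committedAt.stream()
        .filter(ts -> !ts.isBefore(cutoff))
        .collect(Collectors.toList());
  }
}
```

Passing `now` in explicitly, rather than calling `Instant.now()` inside the helper, keeps the cutoff deterministic in a unit test.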
…eStatsCollectorUtil.java Co-authored-by: Sumedh Sakdeo <[email protected]>
Collectors.toMap(
    row -> {
      SimpleDateFormat formatter = new SimpleDateFormat("yyyy-MM-dd");
      return formatter.format(new Date(row.getTimestamp(1).getTime()));
    },
    row -> 1L,
    Long::sum,
    LinkedHashMap::new));
Would it be better to implement this before collectAsList?
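Wherever the aggregation ends up relative to collectAsList, the per-day bucketing itself can be sketched with the JDK alone. The following is a minimal sketch, not the PR's implementation: `SnapshotCountByDaySketch` and `countByDay` are hypothetical names, the input is assumed to be a list of commit timestamps as `Instant` values, and days are bucketed in UTC (an assumption; the `SimpleDateFormat` in the diff uses the JVM default time zone). It reuses one immutable `DateTimeFormatter` instead of creating a formatter per row.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch for illustration only; not part of the PR.
final class SnapshotCountByDaySketch {

  // DateTimeFormatter is immutable and thread-safe, so one shared instance is enough.
  // Assumption: days are bucketed in UTC.
  private static final DateTimeFormatter DAY =
      DateTimeFormatter.ofPattern("yyyy-MM-dd").withZone(ZoneOffset.UTC);

  private SnapshotCountByDaySketch() {}

  /** Counts commit timestamps per calendar day, keeping first-seen day order. */
  static Map<String, Long> countByDay(List<Instant> committedAt) {
    return committedAt.stream()
        .collect(
            Collectors.toMap(
                ts -> DAY.format(ts), // key: day string such as 2024-07-09
                ts -> 1L,             // each snapshot contributes one to its day
                Long::sum,            // merge counts for the same day
                LinkedHashMap::new)); // preserve encounter order of days
  }
}
```

The merge function and `LinkedHashMap` mirror the collector in the diff; only the date handling differs.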
Summary
Adds a metric that captures the number of snapshots created per day. This helps detect anomalous behavior in jobs that run on OpenHouse tables, and it can also give us a count of unexpired snapshots from past days.
Changes
Testing Done
Added a unit test that checks the new metric.
Additional Information
A minor non-breaking change.