HIVE-28655: Implement HMS Related Drop Stats Changes (Part2) COL_STATS_ACCURATE related changes#6198
Conversation
|
6bed69f to
61539a6
Compare
soumyakanti3578
left a comment
There was a problem hiding this comment.
Added a few comments here, but I also noticed that I had reviewed the earlier version of this PR: #5790
Please go through the review comments there too.
Also it would be nice to go through the sonar report and fix the ones that make sense. For example, there are many reports on code conventions that you should incorporate to improve maintainability and consistency.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 19 out of 19 changed files in this pull request and generated 7 comments.
Comments suppressed due to low confidence (2)
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/directsql/MetastoreDirectSqlUtils.java:588
- Indentation for these method declarations is inconsistent with the surrounding code (one less leading space than other members). This can trip checkstyle and makes diffs noisier; please align indentation with the rest of the class.
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/directsql/DirectSqlUpdatePart.java:675 - The query
select DISTINCT "CD_ID" ...returns a single column per row, butsqlResultis declared asList<Object[]>. This will cause aClassCastExceptionduring iteration (the iterator will attempt to cast each element toObject[]). UseList<Object>(orList<?>) for this result and pass each element toextractSqlLong, or adjust the query/result handling so the element type matches the actual result.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| String table_column_stats_accurate = tableParams.get(StatsSetupConst.COLUMN_STATS_ACCURATE); | ||
| assertTrue("parameter COLUMN_STATS_ACCURATE is not accurate in " + tblName, table_column_stats_accurate == null || | ||
| (!table_column_stats_accurate.contains(colName[0]) && !table_column_stats_accurate.contains(colName[1]))); |
There was a problem hiding this comment.
COLUMN_STATS_ACCURATE is a JSON payload; checking it via String.contains(...) is brittle (substring collisions, formatting changes, etc.). Prefer parsing via StatsSetupConst.getColumnsHavingStats(tableParams) and asserting the returned list does/doesn’t contain the expected column names.
| String table_column_stats_accurate = tableParams.get(StatsSetupConst.COLUMN_STATS_ACCURATE); | |
| assertTrue("parameter COLUMN_STATS_ACCURATE is not accurate in " + tblName, table_column_stats_accurate == null || | |
| (!table_column_stats_accurate.contains(colName[0]) && !table_column_stats_accurate.contains(colName[1]))); | |
| List<String> colsWithStats = StatsSetupConst.getColumnsHavingStats(tableParams); | |
| assertTrue("parameter COLUMN_STATS_ACCURATE is not accurate in " + tblName, | |
| colsWithStats == null || (!colsWithStats.contains(colName[0]) && !colsWithStats.contains(colName[1]))); |
| assertTrue("parameter COLUMN_STATS_ACCURATE is not accurate in " + tblName, table_column_stats_accurate == null || | ||
| (!table_column_stats_accurate.contains(colName[0]) && !table_column_stats_accurate.contains(colName[1]))); |
There was a problem hiding this comment.
Same issue as above: validating COLUMN_STATS_ACCURATE by substring search is brittle. Consider using StatsSetupConst.getColumnsHavingStats(...) and asserting the list doesn’t contain the deleted column names.
| assertTrue("parameter COLUMN_STATS_ACCURATE is not accurate in " + tblName, table_column_stats_accurate == null || | |
| (!table_column_stats_accurate.contains(colName[0]) && !table_column_stats_accurate.contains(colName[1]))); | |
| List<String> colsWithStats = StatsSetupConst.getColumnsHavingStats(tableParams); | |
| assertTrue("parameter COLUMN_STATS_ACCURATE is not accurate in " + tblName, | |
| colsWithStats == null || | |
| (!colsWithStats.contains(colName[0]) && !colsWithStats.contains(colName[1]))); |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 20 out of 20 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/directsql/DirectSqlUpdatePart.java:676
- In the DISTINCT CD_ID query,
sqlResultis declared asList<Object[]>but then iterated asfor (Object cdId : sqlResult)and passed toextractSqlLong(cdId). This will likely fail at runtime (elements areObject[], notNumber). UseList<Object>for single-column results, or iterateObject[] rowand extractrow[0]consistently.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 20 out of 20 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/directsql/DirectSqlUpdatePart.java:676
- In this query result handling,
executeWithArrayforselect DISTINCT "CD_ID"returns a single-column result list (elements are typicallyNumber/Long). Declaring it asList<Object[]>will cause aClassCastExceptionwhen iterating because the enhanced-for will cast each element toObject[]. UseList<Object>(as before) or map toList<Long>and callextractSqlLongon each element.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 20 out of 20 changed files in this pull request and generated 8 comments.
Comments suppressed due to low confidence (1)
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/directsql/DirectSqlUpdatePart.java:673
- This query selects a single column (
SELECT DISTINCT "CD_ID" ...), but the result is stored in aList<Object[]>. That’s misleading and makes it easy to accidentally treat each row as an array later. PreferList<Object>(or explicitly convert toList<Long>) for single-column selects to reflect the actual result shape.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| for (int i = 0; i < numUpdates.length; i++) { | ||
| if (numUpdates[i] != 1) { | ||
| throw new MetaException("Invalid state of PARTITION_PARAMS (" | ||
| + StatsSetupConst.COLUMN_STATS_ACCURATE + ") for PART_ID " + partIds.get(i)); | ||
| } |
There was a problem hiding this comment.
verifyUpdates treats any batch update count other than 1 as failure. JDBC allows executeBatch() to return Statement.SUCCESS_NO_INFO for successful updates, which would make this throw even though the update succeeded. Consider treating SUCCESS_NO_INFO as success (while still flagging 0 updates).
| String deleteSql = "delete from \"PART_COL_STATS\" where \"PART_ID\" in ( " + getIdListForIn(partitionIds) + ")"; | ||
| List<Object> params = new ArrayList<>(colNames == null ? 1 : colNames.size() + 1); | ||
|
|
||
| if (colNames != null && !colNames.isEmpty()) { | ||
| deleteSql += " and \"COLUMN_NAME\" in (" + makeParams(colNames.size()) + ")"; |
There was a problem hiding this comment.
This class hard-codes metastore table names in SQL strings (e.g., "PART_COL_STATS", "TAB_COL_STATS", "PARTITION_PARAMS"). MetaStoreDirectSql schema-qualifies these identifiers via getFullyQualifiedName(schema, ...) based on javax.jdo.mapping.Schema. If a schema is configured, these statements will hit the wrong table names. Consider obtaining the qualified table identifiers from MetaStoreDirectSql (or passing schema in and using getFullyQualifiedName) instead of embedding raw names.
|
|
@dengzhhu653 @soumyakanti3578 |



What changes were proposed in this pull request?
We need a way to drop the stats associated with the table/partition and its columns.
This can help a lot in migration or replication where the stats data take huge time to copy.
Particularly when the table is partitioned, we have stats rows for each table, partition, column combination, which can get huge when the number of partitions is huge.
This is the HMS side changes.
This is the part2 with the parameter COL_STATS_ACCURATE related changes.
Why are the changes needed?
This can give the users potential options to clear the unnecessary stats.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Manual tests and unit tests.