HIVE-29133: Support Z-order indexing for Iceberg tables via CREATE TABLE DDL #6138
base: master
…BLE DDL
- Supported WRITE [LOCALLY] ORDERED BY ZORDER (col1, col2, ...) syntax in CREATE TABLE to enable Z-order.
- Made the LOCALLY keyword optional in the WRITE ORDERED BY syntax.
- Persisted Z-order information in HMS for Iceberg tables using sort.order and sort.columns.
- Implemented GenericUDFIcebergZorder UDF (iceberg_zorder) to compute Z-order values and sort data in ascending order.
- Ensured INSERT operations respect Z-order sorting.
@deniskuzZ @zhangbutao
private boolean isZOrderJSON(String jsonString) {
  try {
    JsonNode node = JSON_OBJECT_MAPPER.readTree(jsonString);
    return node.has("zorderFields");
Why not move this to constants?
.map(ZOrderFieldDesc::getColumnName)
.collect(Collectors.toList());

LOG.info("Setting Z-order sort order for columns: {}", columnNames);
Maybe LOG.debug("Applying Z-ordering to columns: {}", columnNames)?
LOG.info("Setting Z-order sort order for columns: {}", columnNames);

properties.put(SORT_ORDER, "ZORDER");
could you please introduce the enum:
enum SortType {
LEXICAL = 0,
ZORDER = 1
}
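The snippet above is shorthand; a Java enum cannot assign numeric values with `=`. A minimal sketch of what such an enum could look like in Java follows. The `fromProperty` helper is hypothetical, added only to illustrate the case-insensitive lookup of the stored property value; it is not part of this PR.

```java
import java.util.Locale;
import java.util.Optional;

/** Hypothetical sort-type enum following the review suggestion; not the merged code. */
enum SortType {
  LEXICAL, // ordinal 0
  ZORDER;  // ordinal 1

  /**
   * Parses a table-property value case-insensitively.
   * Returns an empty Optional when the value is null, blank, or unknown.
   */
  static Optional<SortType> fromProperty(String value) {
    if (value == null || value.trim().isEmpty()) {
      return Optional.empty();
    }
    try {
      return Optional.of(SortType.valueOf(value.trim().toUpperCase(Locale.ROOT)));
    } catch (IllegalArgumentException e) {
      // Unknown value: treat as "no sort type configured".
      return Optional.empty();
    }
  }
}
```

With something like this, call sites could compare against `SortType.ZORDER` and write `SortType.ZORDER.name()` into the properties instead of repeating the `"ZORDER"` literal.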
properties.put(SORT_ORDER, "ZORDER");
properties.put(SORT_COLUMNS, String.join(",", columnNames));

LOG.info("Z-order sort order configured for Iceberg table with columns: {}", columnNames);
this is redundant
public static final String CATALOG_CONFIG_PREFIX = "iceberg.catalog.";

public static final String SORT_ORDER = "sort.order";
sortOrder is ASC or DESC, should we rename to SORT_TYPE?
// Even if table has no explicit sort order, honor z-order if configured
Map<String, String> props = table.properties();
if ("ZORDER".equalsIgnoreCase(props.getOrDefault(SORT_ORDER, ""))) {
please use enum
 * - Configures a single ASC sort key with NULLS FIRST and injects a custom key expression for
 * Z-order
 */
private void createZOrderCustomSort(Map<String, String> props, DynamicPartitionCtx dpCtx, Table table,
maybe addZOrderExpr ?
    org.apache.hadoop.hive.ql.metadata.Table hmsTable, Operation writeOperation) {
  String colsProp = props.get(SORT_COLUMNS);
  if (StringUtils.isNotBlank(colsProp)) {
    List<String> zCols = Arrays.stream(colsProp.split(",")).map(String::trim)
List<String> zCols = Arrays.stream(colsProp.split(","))
    .map(String::trim)
    .filter(Predicate.not(String::isEmpty))
    .toList();
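For reference, the suggested parsing can be tried standalone. The sketch below is an illustration, not code from the PR: the `SortColumns` class and `parse` method are hypothetical names. It uses `Predicate.not` (available since Java 11) but keeps `Collectors.toList()`, because `Stream.toList()` needs Java 16 and returns an unmodifiable list, which matters if callers mutate the result.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

/** Hypothetical helper that parses a comma-separated sort.columns value. */
final class SortColumns {
  private SortColumns() {
  }

  /** Splits on commas, trims each entry, and drops empty entries. */
  static List<String> parse(String colsProp) {
    if (colsProp == null) {
      return Collections.emptyList();
    }
    return Arrays.stream(colsProp.split(","))
        .map(String::trim)
        .filter(Predicate.not(String::isEmpty))
        .collect(Collectors.toList());
  }
}
```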
List<String> zCols = Arrays.stream(colsProp.split(",")).map(String::trim)
    .filter(s -> !s.isEmpty()).collect(Collectors.toList());

Map<String, Integer> fieldOrderMap = Maps.newHashMap();
Map<String, Integer> fieldOrderMap = new HashMap<>(fields.size());
  Integer base = fieldOrderMap.get(col);
  Preconditions.checkArgument(base != null, "Z-order column not found in schema: %s", col);
  return base + offset;
}).collect(Collectors.toList());
.toList()
  return base + offset;
}).collect(Collectors.toList());

dpCtx.setCustomSortOrder(Lists.newArrayList(Collections.singletonList(1)));
do we even need to set CustomSortOrder and CustomSortNullOrder for z-order?
dpCtx.setCustomSortNullOrder(Lists.newArrayList(Collections.singletonList(NullOrdering.NULLS_FIRST.getCode())));

dpCtx.addCustomSortExpressions(Collections.singletonList(allCols -> {
  List<ExprNodeDesc> args = Lists.newArrayListWithExpectedSize(zIndices.size());
List<ExprNodeDesc> args = zIndices.stream()
    .map(allCols::get)
    .toList();
}
try {
  GenericUDF udf = new GenericUDFIcebergZorder();
  return ExprNodeGenericFuncDesc.newInstance(udf, "iceberg_zorder", args);
name is redundant?
return ExprNodeGenericFuncDesc.newInstance(new GenericUDFIcebergZorder(), args)
set hive.fetch.task.conversion=more;
select * from ice_orc_sorted;

-- Validates syntax without LOCALLY keyword
Not sure this should be done at the qtest level. Maybe keep the LOCALLY syntax here and drop it in the zorder q file.
    (Class<? extends GenericUDF>) Class.forName("org.apache.iceberg.mr.hive.udf.GenericUDFIcebergDay"));
system.registerGenericUDF("iceberg_hour",
    (Class<? extends GenericUDF>) Class.forName("org.apache.iceberg.mr.hive.udf.GenericUDFIcebergHour"));
system.registerGenericUDF("iceberg_zorder",
do we need to register it when you create an explicit expression?
return ExprNodeGenericFuncDesc.newInstance(new GenericUDFIcebergZorder(), args)
What changes were proposed in this pull request?
Added Z-order support for Iceberg tables via CREATE TABLE DDL
Why are the changes needed?
To support Z-order indexing, which improves data clustering and query performance on Iceberg tables.
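As background on why Z-ordering clusters data, here is a minimal sketch of the underlying idea: interleaving the bits of several column values into one sortable key, so that rows close in all columns tend to end up close in the sort order. This is not the logic of GenericUDFIcebergZorder. The `ZOrderSketch` class is hypothetical, handles only two columns, and treats the inputs as unsigned 32-bit integers. Signed numbers, strings, and other types would first need an order-preserving encoding.

```java
/** Hypothetical illustration of bit interleaving for two int columns. */
final class ZOrderSketch {
  private ZOrderSketch() {
  }

  /**
   * Interleaves the bits of x and y into a 64-bit key.
   * Bit i of x lands at position 2i + 1, bit i of y at position 2i.
   * Inputs are treated as unsigned 32-bit values.
   */
  static long interleave(int x, int y) {
    long z = 0L;
    for (int i = 0; i < 32; i++) {
      long xBit = (x >>> i) & 1L;
      long yBit = (y >>> i) & 1L;
      z |= xBit << (2 * i + 1);
      z |= yBit << (2 * i);
    }
    return z;
  }
}
```

Sorting rows by such a key in ascending order is the general technique; how the PR's UDF encodes each column type is defined by its own implementation.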
Does this PR introduce any user-facing change?
Yes, new syntax support:
CREATE TABLE test_zorder (
id int,
text string)
WRITE [LOCALLY] ORDERED BY ZORDER (id, text)
STORED BY iceberg
STORED AS orc;
How was this patch tested?
qtest