Draft: Hive iceberg materialized view#6393
Conversation
61ae0b8 to
27a0ddf
Compare
d6dd937 to
d3644f2
Compare
2b9fe54 to
9e34f3f
Compare
|
|
| return; | ||
| } | ||
|
|
||
| TableType tableType = Enum.valueOf(TableType.class, tableTypeValue); |
There was a problem hiding this comment.
Should the exception handled when tableTypeValue contains an unknown enum constant ?
There was a problem hiding this comment.
It seems it is a leftover method. I removed it.
| if (!currentMetadata.uuid().equals(newMetadata.uuid())) { | ||
| throw new UnsupportedOperationException( | ||
| String.format("Cannot change iceberg table %s metadata location pointing to another table's metadata %s", | ||
| icebergTable.name(), newMetadataLocation) | ||
| ); | ||
| } | ||
|
|
||
| String currentMetadataLocation = currentMetadata.metadataFileLocation(); | ||
| if (!currentMetadataLocation.equals(newMetadataLocation) && | ||
| !context.getProperties().containsKey(MANUAL_ICEBERG_METADATA_LOCATION_CHANGE)) { | ||
| // If we got here then this is an alter table operation where the table to be changed had an Iceberg commit | ||
| // meanwhile. The base metadata locations differ, while we know that this wasn't an intentional, manual | ||
| // metadata_location set by a user. To protect the interim commit we need to refresh the metadata file location. | ||
| tblParams.put(BaseMetastoreTableOperations.METADATA_LOCATION_PROP, currentMetadataLocation); | ||
| LOG.warn("Detected an alter table operation attempting to do alterations on an Iceberg table with a stale " + | ||
| "metadata_location. Considered base metadata_location: {}, Actual metadata_location: {}. Will override " + | ||
| "this request with the refreshed metadata_location in order to preserve the concurrent commit.", | ||
| newMetadataLocation, currentMetadataLocation); | ||
| } | ||
| } |
There was a problem hiding this comment.
Can these lines be extracted? They appear in assertNotCrossTableMetadataLocationChangeForTable too.
There was a problem hiding this comment.
Sure. Extracted.
| import org.jetbrains.annotations.NotNull; | ||
| import org.jetbrains.annotations.Nullable; |
| HiveConf.getVar(configuration, HiveConf.ConfVars.HIVE_ICEBERG_MATERIALIZEDVIEW_METADATA_LOCATION) | ||
| ) && | ||
| ( | ||
| TableType.MATERIALIZED_VIEW.name().equalsIgnoreCase(hmsTable.getTableType()) || |
There was a problem hiding this comment.
Can we have managed materialized views with iceberg metadata location?
There was a problem hiding this comment.
I don't think so. Let's see the tests after I remove it.
| public static final String MATERIALIZED_VIEW_BASE_SNAPSHOT_PROPERTY_KEY_PREFIX = | ||
| "iceberg.base.snapshot."; | ||
| public static final String MATERIALIZED_VIEW_VERSION_PROPERTY_KEY = | ||
| "iceberg.materialized.view.version"; |
There was a problem hiding this comment.
These are not used. Should we remove them?
There was a problem hiding this comment.
Sure. Thank you for spotting it.
| "If this is set to true the URI for auth will have the default location masked with DEFAULT_TABLE_LOCATION"), | ||
| HIVE_ICEBERG_ALLOW_DATAFILES_IN_TABLE_LOCATION_ONLY("hive.iceberg.allow.datafiles.in.table.location.only", false, | ||
| "If this is set to true, then all the data files being read should be withing the table location"), | ||
| HIVE_ICEBERG_MATERIALIZEDVIEW_METADATA_LOCATION("hive.iceberg.materializedview.metadata.location", "metastore", |
There was a problem hiding this comment.
I'm thinking about dropping this config since we introduce external materialized views. We can extend the grammar to allow the external keyword in create materialized view statements.
create external materialized view mat_native stored by iceberg stored as orc tblproperties ('format-version'='1') as
select b, c from tbl_ice_native where c > 52;
For external MVs, store the MV metadata via the storage handler API, and keep the existing HMS storage logic for non-external MVs.
There was a problem hiding this comment.
Honestly, I'm not sure what is the proper approach: As HIVE-29578: Iceberg: add support for logical views is merged, it changes the whole picture:
This implementation uses a HiveConf setting.
That implementation uses a table property, 'view-format'='iceberg'.
My problem is if we introduce the external keyword here and don't change the existing view-format solution, we provide an inconsistent interface for the users.
But I also don't think if the view-format table property is a proper approach for materialized views: it works great until we stay within the Hive engine. But what happens if we not?
How we will handle the case when the user creates a materialized view (or a simple logical view) in Spark and we want to use it from Hive?
Should create external create the view if it doesn't exist and just fill the Hive side of metadata if the view physically exists but Hive doesn't know about it yet? Like how we do it with external tables?
Tagging @difin (as the author for Iceberg views) and @zabetak (as you are on PTO now) here to get their opinion to get a consistent Hive behavior.
| private static @NotNull TableType getViewType(String storageHandler) { | ||
| if ("org.apache.iceberg.mr.hive.HiveIcebergStorageHandler".equals(storageHandler)) { | ||
| return TableType.EXTERNAL_MATERIALIZED_VIEW; | ||
| } | ||
|
|
||
| return TableType.MATERIALIZED_VIEW; | ||
| } |
There was a problem hiding this comment.
Hardcoding HiveIcebergStorageHandler is not a good approach. I think the presence of a storage handler should indicate that an external MV needs to be created. Alternatively, extending the grammar to allow the EXTERNAL keyword in CREATE MATERIALIZED VIEW statements would also be a good option.
| public static final String ICEBERG_STORAGE_HANDLER = "org.apache.iceberg.mr.hive.HiveIcebergStorageHandler"; | ||
| public static final String STORAGE_HANDLER_KEY = "storage_handler"; |
There was a problem hiding this comment.
These are not referenced, please remove them.
| import com.google.common.base.Joiner; | ||
| import com.google.common.collect.Lists; | ||
| import org.apache.commons.lang3.StringUtils; | ||
| import org.apache.hadoop.conf.Configuration; | ||
| import org.apache.hadoop.fs.Path; | ||
| import org.apache.hadoop.hive.common.StatsSetupConst; | ||
| import org.apache.hadoop.hive.common.TableName; | ||
| import org.apache.hadoop.hive.common.repl.ReplConst; | ||
| import org.apache.hadoop.hive.metastore.ColumnType; | ||
| import org.apache.hadoop.hive.metastore.HiveMetaHook; | ||
| import org.apache.hadoop.hive.metastore.TableType; | ||
| import org.apache.hadoop.hive.metastore.Warehouse; | ||
| import org.apache.hadoop.hive.metastore.api.Database; | ||
| import org.apache.hadoop.hive.metastore.api.DatabaseType; | ||
| import org.apache.hadoop.hive.metastore.api.FieldSchema; | ||
| import org.apache.hadoop.hive.metastore.api.GetPartitionsByNamesRequest; | ||
| import org.apache.hadoop.hive.metastore.api.MetaException; | ||
| import org.apache.hadoop.hive.metastore.api.Partition; | ||
| import org.apache.hadoop.hive.metastore.api.PartitionSpec; | ||
| import org.apache.hadoop.hive.metastore.api.PartitionsSpecByExprResult; | ||
| import org.apache.hadoop.hive.metastore.api.StorageDescriptor; | ||
| import org.apache.hadoop.hive.metastore.api.Table; | ||
| import org.apache.hadoop.hive.metastore.api.WMPoolSchedulingPolicy; | ||
| import org.apache.hadoop.hive.metastore.api.hive_metastoreConstants; | ||
| import org.apache.hadoop.hive.metastore.conf.MetastoreConf; | ||
| import org.slf4j.Logger; | ||
| import org.slf4j.LoggerFactory; | ||
|
|
||
| import javax.annotation.Nullable; |
There was a problem hiding this comment.
Is this change intentional?
| StringUtils.isNotBlank(storageHandler) && | ||
| TableType.EXTERNAL_MATERIALIZED_VIEW.equals(table.getTableType()) && | ||
| "org.apache.iceberg.mr.hive.HiveIcebergStorageHandler".equals(storageHandler) | ||
| ) { |
There was a problem hiding this comment.
Can this be simplified to
if (TableType.EXTERNAL_MATERIALIZED_VIEW.toString().equals(table.getTableType())
?
In the current form it is always false: comparing an enum with a string
TableType.EXTERNAL_MATERIALIZED_VIEW.equals(table.getTableType())
|



What changes were proposed in this pull request?
Why are the changes needed?
Does this PR introduce any user-facing change?
How was this patch tested?