
Conversation

imjalpreet
Contributor

@imjalpreet imjalpreet commented Sep 13, 2024

This PR implements support for Iceberg table insertions in Velox, enabling write operations to Iceberg tables through the Hive connector.

Changes

  • Implemented IcebergDataSink class that extends HiveDataSink to handle Iceberg-specific write operations.
  • Added IcebergInsertTableHandle to manage Iceberg table insertion metadata.
  • Added IcebergPartitionField and IcebergPartitionSpec to support Iceberg's partition transform specification; this PR only supports the identity transform for now.
  • Implemented JSON serialization for partition values to support Iceberg metadata requirements.

Design Doc

Implementing_Iceberg_Insertion_Design.md

Implementation Details

The implementation follows Iceberg's table format specification, particularly for handling partitioning and metadata. Key components include:

  1. IcebergPartitionSpec: Manages partition specifications with support for different transform types.
  2. IcebergDataSink: Extends HiveDataSink to handle Iceberg-specific write operations.
  3. Added an extra dataChannels parameter to HiveDataSink.
  4. Adjusted the visibility of HiveDataSink data members and methods so that IcebergDataSink can use them.

The PR also includes test infrastructure for validating Iceberg insertions with various partition strategies.

Testing

Added unit tests that verify:

  • Basic table writes to non-partitioned tables
  • Partitioned table writes with single column partition and multiple column partition

Limitation

This PR only supports the Iceberg identity partition transform, and only primitive column types as partition columns; nested column types such as struct are not supported.

All tests pass on the current codebase.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Sep 13, 2024

netlify bot commented Sep 13, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 0833b10
🔍 Latest deploy log https://app.netlify.com/projects/meta-velox/deploys/68a5a5c5c67e29000883396b

@Yuhta Yuhta requested a review from xiaoxmeng September 16, 2024 18:07
@imjalpreet imjalpreet force-pushed the icebergWriter branch 2 times, most recently from f3565a6 to e3aa199 Compare November 14, 2024 23:18
@zhouyuan
Collaborator

FWIW, there's a new project to add C++ support for iceberg:
apache/iceberg-cpp#2

@imjalpreet imjalpreet marked this pull request as ready for review January 2, 2025 15:37
@imjalpreet imjalpreet marked this pull request as draft January 2, 2025 15:42
@imjalpreet imjalpreet changed the title Add support for writing iceberg tables feat(iceberg): Add support for writing iceberg tables Jan 2, 2025
@prestodb-ci

@majetideepak imported this issue into IBM GitHub Enterprise

Collaborator

@yingsu00 yingsu00 left a comment


@imjalpreet Did you remove the tests?

@imjalpreet imjalpreet force-pushed the icebergWriter branch 3 times, most recently from 2c9e448 to 2f1d227 Compare February 14, 2025 02:37
Collaborator

@yingsu00 yingsu00 left a comment


@imjalpreet Can you please make the test file format agnostic? You can use ReaderFactory::createReader() to create the readers instead of directly calling into the Parquet reader constructor.

Collaborator

@PingLiuPing PingLiuPing left a comment


When I create an iceberg table based on your code, I hit the following error. Is the time data type not supported?

presto:iceberg> insert into partition_t2 values (TIMESTAMP '2025-02-28 14:00:02', 11, DATE '2024-02-28', TIME '14:00:33', 128);
Query 20250303_133535_00004_gh5dp failed: inferredType Failed to parse type [time]. Type not registered.

My create table DDL is:

create table partition_t2 (c_timestamp timestamp, c_int int, c_date date, c_time time, c_bigint bigint) with (format='PARQUET', partitioning=ARRAY['year(c_date)']);

@imjalpreet
Contributor Author

When I create an iceberg table based on your code, I hit the following error. Is the time data type not supported?

@PingLiuPing Velox does not yet support the Time data type, so we must first add support for it.

Types supported in Velox: https://facebookincubator.github.io/velox/develop/types.html

@PingLiuPing
Collaborator

PingLiuPing commented Jul 17, 2025

Looks like Velox does not support ORC writes, so we only support the Parquet format; please describe this limitation in the PR description.

Yes, Velox ORC does not support insertion yet.
I will add the limitation to the description.

@PingLiuPing
Collaborator

Yes, both data/ and data are valid inputs, so the downstream logic should handle them correctly. @PingLiuPing

Ok, I will handle this in Velox.

@PingLiuPing PingLiuPing force-pushed the icebergWriter branch 2 times, most recently from a3930d4 to aeed2b0 Compare July 17, 2025 09:05
@PingLiuPing
Collaborator

@jinchengchenghh I fixed the // (double slash) issue. This is the only change in this commit.

@jinchengchenghh
Collaborator

CREATE TABLE %s (id INT, dep STRING) USING iceberg PARTITIONED BY (dep)
sql(
        "INSERT INTO TABLE %s VALUES (0, null), (1, 'hr'), (2, 'hardware'), (4, 'hr')",
        tableName);
assertEquals(
        "Should have expected rows",
        ImmutableList.of(row(0, null), row(1, "hr"), row(2, "hardware"), row(4, "hr")),
        sql("SELECT * FROM %s ORDER BY id ASC NULLS LAST", selectTarget()));

The partition column is null, but the returned "{"partitionValues":["null"]}" contains the string "null", which will cause a result mismatch.

{
  "partitionDataJson" : "{\"partitionValues\":[\"null\"]}",
  "content" : "DATA",
  "fileFormat" : "PARQUET",
  "partitionSpecJson" : 0,
  "metrics" : {
    "recordCount" : 1
  },
  "fileSizeInBytes" : 523,
  "path" : "file:/var/folders/63/845y6pk53dx_83hpw8ztdchw0000gn/T/hive14011000783980279215/table/data/dep=null/3ed08e85-ca93-45ba-b61a-ad03a696fc90.parquet"
}

If I replace the "null" string with an actual null, the test passes. But the returned JSON should be correct in the first place; the patch below is just for debugging. @PingLiuPing

      case STRING:
        if (partitionValue.asText().equalsIgnoreCase("null")) {
          return null;
        }
        return partitionValue.asText();

@jinchengchenghh
Collaborator

And please add a test covering both a "null" string and an actual null as the partition value.

@PingLiuPing
Collaborator

And please add a test covering both a "null" string and an actual null as the partition value.

There is a test case covering the "null" string and a null partition column, but it doesn't test the commit metadata.
I will add a case to cover that.

@jinchengchenghh
Collaborator

Please fix apache/incubator-gluten#9481

  • test read iceberg with special characters in column name *** FAILED ***
    null did not equal "test_data" (IcebergSuite.scala:664)
 test("test read iceberg with special characters in column name") {
    val testTable = "test_table_with_special_characters"
    withTable(testTable) {
      spark.sql(s"""
                   |CREATE TABLE $testTable (id INT, `my/data` STRING)
                   |USING iceberg
                   |""".stripMargin)
      spark.sql(s"""
                   |INSERT INTO $testTable VALUES
                   |(1, 'test_data');
                   |""".stripMargin)
      val resultDf = spark.sql(s"SELECT id, `my/data` FROM $testTable")
      val result = resultDf.collect()

      assert(result.length == 1)
      assert(result.head.getString(1) == "test_data")
    }
  }

@PingLiuPing
Collaborator

@jinchengchenghh Do you mean sanitizing column names in Velox? It looks like we should do this in the upper layer.

@PingLiuPing
Collaborator

I have verified with both Presto and Prestissimo that data written by Velox can be successfully queried by Prestissimo and Presto, and vice versa.

@jinchengchenghh
Collaborator

It should be done downstream. If you have the restriction, you need to verify it and describe what input is valid in the API comments.

jinchengchenghh added a commit to apache/incubator-gluten that referenced this pull request Jul 23, 2025
Based on PR facebookincubator/velox#10996, which is merged into ibm/velox but lacks the metadata, so read performance is not as expected. Use the flag --enable_enhanced_features to enable this feature; it is disabled by default.
Use the org.apache.gluten.tags.EnhancedFeaturesTest test tag on the enhanced-features tests so they can be excluded; they are excluded by default via the exclude-tests profile. We cannot use a JNI call to decide whether to run the tests, because the library is not loaded when the tests are listed.

Only Spark 3.4 is supported; Spark 3.5 with Iceberg version 1.5.0 is not supported.

Only the Parquet format is supported, because Avro and ORC writes are not supported in Velox.

Complex data type writes fall back because the metrics do not support them.
@PingLiuPing
Collaborator

It should be done downstream. If you have the restriction, you need to verify it and describe what input is valid in the API comments.

Let's clarify: did the insertion fail, or the read? If the insertion failed, can you please check whether the data and column names are passed into Velox correctly, e.g. in IcebergDataSink::appendData? If the read failed, are you able to read the same data using Spark?

@jinchengchenghh
Collaborator

The read failed. It is a unit test in Gluten; before native write, it passed.

@PingLiuPing
Collaborator

The read failed. It is a unit test in Gluten; before native write, it passed.

Can you query the data written by Gluten from Spark?
I cannot reproduce this error from Prestissimo.

@jinchengchenghh
Collaborator

Yes, Gluten without native write can query it.
I have made this case fall back; let us focus on statistics collection first.

@PingLiuPing
Collaborator

Yes, Gluten without native write can query it. I have made this case fall back; let us focus on statistics collection first.

Ok, what I mean is: use C++ to insert data, which generates a data file. Then, are you able to query that data file from Java (Spark)?
I just want to figure out which component causes this error.

@PingLiuPing
Collaborator

@Yuhta can you help review again? Thank you very much in advance.


void TearDown() override;

std::vector<RowVectorPtr> createTestData(
Collaborator


Please refactor createTestData and listFiles and move them to the superclass.

@PingLiuPing
Collaborator

The code change in Prestissimo will not be able to get merged until the Velox PR is merged. To prevent build breaks in other CI pipelines, revert the cmake target name change in hive/iceberg/CMakeLists.txt first.

@jinchengchenghh
Collaborator

Hi, @Yuhta @mbasmanova Could you help review this PR? It has been integrated with Gluten and Presto and passes several unit tests back-ported from Apache Iceberg (apache/incubator-gluten#9397). I made the partitioned-table tests fall back because this PR only supports identity partitioning and lacks the metadata, which caused some tests to fail. Follow-up PRs will support all partition transforms, functions, and metadata; after that, I will enable all the tests.

Looking forward to your reply, many thanks!
