
Conversation

sapienza88 (Contributor)

Important Read

  • Please ensure the GitHub issue is mentioned at the beginning of the PR

What is the purpose of the pull request

(For example: This pull request implements the sync for delta format.)

Brief change log

(for example:)

  • Fixed JSON parsing error when persisting state
  • Added unit tests for schema evolution

Verify this pull request

(Please pick one of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added TestConversionController to verify the change.
  • Manually verified the change by running a job locally.

sapienza88 changed the title from "Parquet Incremental Sync: Given a parquet file return data from a certain modification time" to "Parquet Incremental Sync" on Dec 10, 2025
rahil-c (Contributor) commented Dec 15, 2025

I can do first review for this @the-other-tim-brown @vinishjail97

vinishjail97 self-requested a review on December 16, 2025 08:31
Comment on lines +245 to +259
try (ParquetWriter<Group> writer =
new ParquetWriter<Group>(
outputFile,
new GroupWriteSupport(),
parquetFileConfig.getCodec(),
(int) parquetFileConfig.getRowGroupSize(),
pageSize,
pageSize, // dictionaryPageSize
true, // enableDictionary
false, // enableValidation
ParquetWriter.DEFAULT_WRITER_VERSION,
conf)) {
Group currentGroup = null;
while ((currentGroup = (Group) reader.read()) != null) {
writer.write(currentGroup);
Contributor
Why are we writing new parquet files again like this through the writer? I think there's some misunderstanding of the parquet incremental sync feature here.

Parquet Incremental Sync Requirements.

  1. You have a target table where parquet files [p1/f1.parquet, p1/f2.parquet, p2/f1.parquet] have been synced to Hudi, Iceberg, and Delta, for example.
  2. In the source, some changes have been made: a new file in partition p1 was added and p2's file was deleted. The incremental sync should now sync only these new changes incrementally.

@sapienza88 It's better to align on the approach here first before we push PRs. Can you add the approach for parquet incremental sync to the PR description or a Google doc if possible?
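
A minimal sketch of the diff-based detection these requirements describe (illustrative only; the class and method names below are hypothetical and not part of this PR):

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class ParquetSyncDiff {

  public static class Diff {
    public final Set<Path> added = new HashSet<>();
    public final Set<Path> removed = new HashSet<>();
  }

  /** Diffs the current listing of the source table against the set of paths recorded at the last sync. */
  public static Diff diffSinceLastSync(
      Configuration conf, Path tableBasePath, Set<Path> previouslySyncedFiles) throws IOException {
    FileSystem fs = tableBasePath.getFileSystem(conf);

    // Collect the parquet files currently present under the table base path.
    Set<Path> currentFiles = new HashSet<>();
    RemoteIterator<LocatedFileStatus> it = fs.listFiles(tableBasePath, true);
    while (it.hasNext()) {
      LocatedFileStatus status = it.next();
      if (status.isFile() && status.getPath().getName().endsWith(".parquet")) {
        currentFiles.add(status.getPath());
      }
    }

    Diff diff = new Diff();
    for (Path p : currentFiles) {
      if (!previouslySyncedFiles.contains(p)) {
        diff.added.add(p); // e.g. a new file in partition p1
      }
    }
    for (Path p : previouslySyncedFiles) {
      if (!currentFiles.contains(p)) {
        diff.removed.add(p); // e.g. p2's file that was deleted
      }
    }
    return diff;
  }
}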

sapienza88 (Contributor, Author) commented Dec 17, 2025

@vinishjail97 we simply want to append the file to where it belongs in the table (under the right partition), so we need to find the partition path within the table where the file should be injected (we do this through path construction). As far as I know, the only way to write the file is through the ParquetWriter. After doing so, the Source can filter the files based on their modification dates.
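
A minimal sketch of the modification-time filtering described above (illustrative only; the helper class, method name, and parameters are hypothetical and not from this PR):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class ModificationTimeFilter {

  /** Returns parquet files under basePath whose modification time is later than lastSyncMillis. */
  public static List<LocatedFileStatus> filesModifiedAfter(
      Configuration conf, Path basePath, long lastSyncMillis) throws IOException {
    FileSystem fs = basePath.getFileSystem(conf);
    List<LocatedFileStatus> result = new ArrayList<>();
    RemoteIterator<LocatedFileStatus> it = fs.listFiles(basePath, true);
    while (it.hasNext()) {
      LocatedFileStatus status = it.next();
      // Keep only parquet files touched after the last recorded sync time.
      if (status.isFile()
          && status.getPath().getName().endsWith(".parquet")
          && status.getModificationTime() > lastSyncMillis) {
        result.add(status);
      }
    }
    return result;
  }
}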

sapienza88 (Contributor, Author)

@vinishjail97 I added some comments on the functions so that the approach is clearer. All above suggestions were also taken into account in my last commit.
