feat: basic table scan planning #112

gty404 · 2025-05-27T06:46:04Z

Introduces a basic interface for scanning table data.

src/iceberg/result.h

src/iceberg/table_scan.h

src/iceberg/table_scan.cc

src/iceberg/type_fwd.h

Co-authored-by: Gang Wu <[email protected]>

src/iceberg/snapshot.h

src/iceberg/table_scan.h

src/iceberg/manifest_reader.h

wgtmac · 2025-06-29T07:39:12Z

src/iceberg/manifest_reader.h

+/// \param file_path Path to the manifest list file.
+/// \return A Result containing the reader or an error.
+Result<std::unique_ptr<ManifestListReader>> CreateManifestListReader(
+    const std::string& file_path) {


I think this is not enough. At least we need extra parameters like table_format_version and file_io.

BTW, should we use std::string_view for path?

Do you need to provide a ManifestListReaderBuilder/ManifestReaderBuilder?

I don't know yet. It depends on how we will use them. BTW @dongxiao1198 will work on manifest reading.
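To make the suggestion above concrete, here is a minimal sketch of a factory that takes the path as `std::string_view` together with a format version and a file IO handle. `FileIO`, this `ManifestListReader` class, the parameter names, and the argument checks are hypothetical placeholders, not the project's actual API. For brevity a null pointer stands in for the error branch of `Result<>`.

```cpp
#include <memory>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical stand-in for the project's file IO abstraction.
class FileIO {};

// Hypothetical reader that only remembers how it was configured.
class ManifestListReader {
 public:
  ManifestListReader(std::string path, int format_version,
                     std::shared_ptr<FileIO> io)
      : path_(std::move(path)),
        format_version_(format_version),
        io_(std::move(io)) {}

  const std::string& path() const { return path_; }
  int format_version() const { return format_version_; }
  const std::shared_ptr<FileIO>& io() const { return io_; }

 private:
  std::string path_;
  int format_version_;
  std::shared_ptr<FileIO> io_;
};

// Sketch only: returns nullptr for invalid arguments, where the real
// function would return an error through Result<>.
std::unique_ptr<ManifestListReader> CreateManifestListReader(
    std::string_view file_path, int format_version,
    std::shared_ptr<FileIO> io) {
  if (file_path.empty() || io == nullptr) {
    return nullptr;
  }
  // The view is copied into an owned string because the reader outlives it.
  return std::make_unique<ManifestListReader>(std::string(file_path),
                                              format_version, std::move(io));
}
```

Taking `std::string_view` lets callers pass string literals or `std::string` without an extra copy at the call site; the single copy happens where ownership is needed.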

src/iceberg/snapshot.h

src/iceberg/table_scan.h

src/iceberg/table_scan.cc

wgtmac · 2025-06-30T14:44:08Z

src/iceberg/table_scan.h

wgtmac · 2025-07-01T10:02:25Z

src/iceberg/table_scan.h

+};
+
+/// \brief Represents a task to scan a portion of a data file.
+class ICEBERG_EXPORT FileScanTask : public ScanTask {


Some thoughts about FileScanTask:

1. Should we remove the ScanTask abstraction above? Without it, we can create a task directly with aggregate initialization. Otherwise we may need to expand the constructor every time a new parameter is required.

2. If we do (1), could we also make it a simple struct by removing all member functions, since they are all trivial accessors?

3. Should we add the fields from Java's PartitionScanTask (i.e. spec and partition_value) to support partitioning? We can add them later, but a TODO comment would be desirable.

4. Should we combine start and length and wrap them in std::optional? I believe they are not always required.

I initially expected it to be just a struct, but since earlier comments suggested an abstraction, I followed the design in iceberg-java/iceberg-python.

The partition spec and value can be obtained from DataFile and Snapshot, and we can add these interfaces in a subsequent PR when they are needed.

Sure, I will change it to optional, thanks.
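The struct-based alternative discussed in this thread could look roughly like the sketch below. It is only an illustration under assumptions: `DataFile` is reduced to two fields, and `ByteRange` and `ScanLength` are hypothetical names introduced here, not part of the PR.

```cpp
#include <cstdint>
#include <memory>
#include <optional>
#include <string>
#include <vector>

// Hypothetical, heavily simplified stand-in for the real DataFile.
struct DataFile {
  std::string file_path;
  int64_t file_size_in_bytes = 0;
};

// Start offset and length always travel together, so they form one unit.
struct ByteRange {
  int64_t start = 0;
  int64_t length = 0;
};

// Plain aggregate: no base class, no constructor, no accessors.
struct FileScanTask {
  std::shared_ptr<DataFile> data_file;
  std::vector<std::shared_ptr<DataFile>> delete_files;
  // Unset means "scan the whole file".
  std::optional<ByteRange> range;
  // TODO: add partition spec and partition value (see Java PartitionScanTask).
};

// Number of bytes the task reads from the data file itself.
int64_t ScanLength(const FileScanTask& task) {
  return task.range ? task.range->length : task.data_file->file_size_in_bytes;
}
```

With an aggregate, a new field is appended to the struct and existing call sites such as `FileScanTask{file, {}, std::nullopt}` keep compiling, which is the point made about not having to widen a constructor.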

wgtmac · 2025-07-01T12:03:14Z

src/iceberg/table_scan.h

+  /// \brief Sets the schema to use for the scan.
+  /// \param schema The schema to use.
+  /// \return Reference to the builder.
+  TableScanBuilder& WithSchema(std::shared_ptr<Schema> schema);


I think we don't need this. We just need schema of a specific snapshot id which can be obtained via table_metadata. Did I miss something?

This is used to specify the projected schema directly, without listing column names. I have renamed it to WithProjectedSchema.

wgtmac · 2025-07-01T13:46:16Z

src/iceberg/table_scan.h

+  /// \brief snapshot ID to scan, if specified.
+  std::optional<int64_t> snapshot_id_;
+  /// \brief Context for the scan, including snapshot, schema, and filter.
+  TableScanContext context_;


In the Java version of TableScanContext, column_names_ and snapshot_id_ are also stored in the context. Should we follow the same pattern? If we do, TableScanBuilder is effectively a TableScanContextBuilder.

I originally intended TableScanContext to hold only the context that remains after the input parameters have been converted, dropping anything the subsequent file scanning no longer needs.
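For comparison, the Java-like layout raised in this thread could be sketched as follows, with the builder keeping no state of its own. The field names, `WithSnapshotId`, `WithColumnNames`, and a `Build()` that returns the context are assumptions made for illustration; the PR's builder produces a TableScan instead.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical context that holds every scan option, inputs included.
struct TableScanContext {
  std::optional<int64_t> snapshot_id;
  std::vector<std::string> column_names;
  bool case_sensitive = true;
};

// The builder owns nothing but the context; each setter writes into it.
class TableScanBuilder {
 public:
  TableScanBuilder& WithSnapshotId(int64_t snapshot_id) {
    context_.snapshot_id = snapshot_id;
    return *this;
  }

  TableScanBuilder& WithColumnNames(std::vector<std::string> column_names) {
    context_.column_names = std::move(column_names);
    return *this;
  }

  // A real Build() would resolve the snapshot and schema and return a scan;
  // this sketch just hands back a copy of the accumulated context.
  TableScanContext Build() const { return context_; }

 private:
  TableScanContext context_;
};
```

The trade-off is the one described in the reply: this layout carries raw inputs such as column names into the context, whereas the author's design keeps only what remains after those inputs have been converted.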

wgtmac · 2025-07-01T14:07:43Z

src/iceberg/table_scan.cc

+        data_entry.sequence_number.value_or(TableMetadata::kInitialSequenceNumber);
+    for (auto it = sequence_index.lower_bound(data_sequence_number);
+         it != sequence_index.end(); ++it) {
+      // Additional filtering logic here


What is the additional filtering logic? Did you mean to further check whether the delete files can be filtered out?

Does each DataFile only need to retain the DeleteFiles whose sequence number is greater than its own?
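The loop in the diff can be illustrated in isolation with the sketch below. The index type (a `std::multimap` from sequence number to a file path) and the function name `CandidateDeleteFiles` are assumptions made for the example. Note that `lower_bound` also keeps delete files whose sequence number equals the data file's, while a strict "greater than" would use `upper_bound`; which one is right may depend on the kind of delete file.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical index: sequence number -> delete file path.
using DeleteFileIndex = std::multimap<int64_t, std::string>;

// Returns the delete files whose sequence number is >= the data file's.
// lower_bound finds the first such entry; because the map is ordered,
// every entry after it qualifies as well.
std::vector<std::string> CandidateDeleteFiles(const DeleteFileIndex& index,
                                              int64_t data_sequence_number) {
  std::vector<std::string> result;
  for (auto it = index.lower_bound(data_sequence_number); it != index.end();
       ++it) {
    result.push_back(it->second);
  }
  return result;
}
```

Any further pruning, for example by partition or column statistics, would go inside the loop where the diff has its "Additional filtering logic here" placeholder.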

src/iceberg/table_scan.cc

wgtmac · 2025-07-01T14:14:31Z

src/iceberg/table_scan.cc

+
+int64_t FileScanTask::length() const { return length_; }
+
+int64_t FileScanTask::size_bytes() const {


Suggested change

-int64_t FileScanTask::size_bytes() const {
+int64_t FileScanTask::SizeBytes() const {

This is not a trivial accessor.

wgtmac · 2025-07-01T14:14:47Z

src/iceberg/table_scan.cc

+  return static_cast<int32_t>(delete_files_.size() + 1);
+}
+
+int64_t FileScanTask::estimated_row_count() const {


wgtmac · 2025-07-01T14:15:16Z

src/iceberg/table_scan.cc

+  return sizeInBytes;
+}
+
+int32_t FileScanTask::files_count() const {


I'm not sure whether we need to rename it to FilesCount(). @lidavidm, any suggestions?
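The naming question in these comments can be shown side by side. The sketch assumes the convention the review implies: trivial accessors keep snake_case, while methods that compute something use PascalCase. `FileInfo` is a hypothetical stand-in for DataFile, and the body of `SizeBytes()` (data file size plus delete file sizes) is an assumption, since the diff does not show it.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for DataFile.
struct FileInfo {
  int64_t file_size_in_bytes = 0;
};

class FileScanTask {
 public:
  FileScanTask(FileInfo data_file, std::vector<FileInfo> delete_files)
      : data_file_(data_file), delete_files_(std::move(delete_files)) {}

  // Trivial accessor: returns a member unchanged, so it keeps snake_case.
  const FileInfo& data_file() const { return data_file_; }

  // Computed value (assumed: data file size plus all delete file sizes).
  int64_t SizeBytes() const {
    int64_t size = data_file_.file_size_in_bytes;
    for (const auto& file : delete_files_) {
      size += file.file_size_in_bytes;
    }
    return size;
  }

  // Computed value: the data file itself plus its delete files.
  int32_t FilesCount() const {
    return static_cast<int32_t>(delete_files_.size() + 1);
  }

 private:
  FileInfo data_file_;
  std::vector<FileInfo> delete_files_;
};
```

Under this reading `FilesCount()` falls on the PascalCase side as well, because it derives its result rather than returning a stored member.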

gty404 added 4 commits May 27, 2025 14:43

feat: basic table scan planning

e971cc4

fix cpp lint

5fc6971

fix build fail on windows

6a2cb74

fix lint

d71c26a

lidavidm reviewed May 27, 2025

src/iceberg/result.h (outdated, resolved)

src/iceberg/table_scan.h (outdated, resolved)

gty404 added 2 commits May 27, 2025 16:18

fix some comments

c6c1a1f

fix clang format

cd07a0c

lidavidm reviewed May 28, 2025

src/iceberg/table_scan.h (outdated, resolved)

src/iceberg/table_scan.h (outdated, resolved)

src/iceberg/table_scan.cc (outdated, resolved)

src/iceberg/table_scan.cc (outdated, resolved)

wgtmac reviewed May 28, 2025

gty404 added 2 commits May 29, 2025 10:07

fix some comments

b7becc2

Merge branch 'main' into table-scan

abfdfcd

wgtmac reviewed May 30, 2025

yingcai-cy reviewed Jun 5, 2025

src/iceberg/table_scan.h (outdated, resolved)

src/iceberg/table_scan.cc (outdated, resolved)

src/iceberg/table_scan.cc (outdated, resolved)

gty404 and others added 3 commits June 14, 2025 14:08

Update src/iceberg/table_scan.h

28043b1

Co-authored-by: Gang Wu <[email protected]>

Update src/iceberg/table_scan.h

fa25891

Co-authored-by: Gang Wu <[email protected]>

Merge branch 'main' into table-scan

428651f

gty404 force-pushed the table-scan branch from 6cbd651 to 428651f on June 14, 2025 06:42

gty404 added 6 commits June 14, 2025 14:49

fix comments

812a545

Merge branch 'main' into table-scan

0f79c7c

Abstract TableScan and ScanTask

85802e9

fix lint

c7621b3

fix lint

e1267fc

fix lint

5248e22

zhjwpku reviewed Jun 28, 2025

src/iceberg/snapshot.h (outdated, resolved)

src/iceberg/table_scan.h (outdated, resolved)

wgtmac reviewed Jun 29, 2025

lishuxu reviewed Jun 29, 2025

src/iceberg/table_scan.h (resolved)

Merge branch 'main' into table-scan

368e268

lishuxu reviewed Jun 30, 2025

src/iceberg/table_scan.cc (outdated, resolved)

gty404 added 2 commits June 30, 2025 10:26

resolve some comments

29e8865

remove Snapshot::kInitialSequenceNumber

ae560f3

wgtmac reviewed Jul 1, 2025

resolve some comments

0ff952b

