feat: basic table scan planning #112

Status: Open (21 commits to be merged into base: main)

gty404 (Contributor) commented May 27, 2025

Introduces a basic interface for scanning table data.

/// \param file_path Path to the manifest list file.
/// \return A Result containing the reader or an error.
Result<std::unique_ptr<ManifestListReader>> CreateManifestListReader(
const std::string& file_path) {
Member

I think this is not enough. At least we need extra parameters like table_format_version and file_io.

Member

BTW, should we use std::string_view for path?

Contributor Author

Do you need to provide a ManifestListReaderBuilder/ManifestReaderBuilder?

Member

I don't know yet. It depends on how we will use them. BTW @dongxiao1198 will work on manifest reading.

};

/// \brief Represents a task to scan a portion of a data file.
class ICEBERG_EXPORT FileScanTask : public ScanTask {
Member

Some thoughts about FileScanTask:

  1. Should we remove ScanTask abstraction above? If we remove the abstraction, we can directly use aggregate initialization to create a task. Otherwise we may need to expand the constructor every time a new parameter is required.
  2. If we do (1) above, is it also possible to make it a simple struct by removing all functions (as they are all trivial accessors)?
  3. Should we add the fields from Java PartitionScanTask (namely spec and partition_value) to support partitioning? We can add them later, but a TODO comment is desirable.
  4. Should we combine start and length, and wrap them by std::optional? I believe they are not required at all times.
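
To make points (1), (2), and (4) concrete, here is a minimal sketch of what a struct-based task could look like. It is not the PR's code: `DataFile` is a hypothetical stand-in with only two fields, and `SplitRange` and `ScanLength` are names invented for illustration. The idea is that `start` and `length` travel together inside one `std::optional`, and the task stays a plain aggregate.

```cpp
#include <cstdint>
#include <memory>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for the library's DataFile (assumption: only these fields).
struct DataFile {
  std::string file_path;
  int64_t file_size_in_bytes = 0;
};

// Hypothetical helper type: start and length are only meaningful together.
struct SplitRange {
  int64_t start = 0;
  int64_t length = 0;
};

// Plain aggregate: no base class, no user-declared constructors, public members.
struct FileScanTask {
  std::shared_ptr<DataFile> data_file;
  std::vector<std::shared_ptr<DataFile>> delete_files;
  std::optional<SplitRange> range;  // empty means "scan the whole file"
  // TODO: add partition spec and partition value (cf. Java PartitionScanTask).
};

// Bytes of the data file covered by the task: the split length if a range
// is set, otherwise the full file size.
inline int64_t ScanLength(const FileScanTask& task) {
  return task.range ? task.range->length : task.data_file->file_size_in_bytes;
}
```

With this shape a task can be created with aggregate initialization, for example `FileScanTask task{file, {}, SplitRange{0, 1024}};`, and a newly added member does not require touching a constructor.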

Contributor Author

  1. I initially expected it to just be a struct, but since the previous comments suggested doing an abstraction, I referred to the design in iceberg-java/iceberg-python.
  2. The partition spec and value can be obtained from DataFile and Snapshot, and we can add these interfaces in a subsequent PR when they are needed.
  3. Sure, I will modify it to optional, thanks.

/// \brief Sets the schema to use for the scan.
/// \param schema The schema to use.
/// \return Reference to the builder.
TableScanBuilder& WithSchema(std::shared_ptr<Schema> schema);
Member

I think we don't need this. We just need schema of a specific snapshot id which can be obtained via table_metadata. Did I miss something?

Contributor Author

This is used to specify the projected schema directly, without listing column names. I have renamed it to WithProjectedSchema.
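
As a rough illustration of the renamed setter, the sketch below shows a chaining builder that stores a projected schema. Only `WithProjectedSchema` and `snapshot_id_` appear in this thread; the `Schema` struct, `WithSnapshotId`, and the accessors are assumptions made so the example compiles on its own.

```cpp
#include <cstdint>
#include <memory>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the library's Schema type.
struct Schema {
  std::vector<std::string> column_names;
};

class TableScanBuilder {
 public:
  // Assumed setter name; the thread only shows the snapshot_id_ member.
  TableScanBuilder& WithSnapshotId(int64_t snapshot_id) {
    snapshot_id_ = snapshot_id;
    return *this;
  }

  // Takes the projection as a whole schema instead of a list of column names.
  TableScanBuilder& WithProjectedSchema(std::shared_ptr<Schema> schema) {
    projected_schema_ = std::move(schema);
    return *this;
  }

  // Accessors added only so the sketch can be inspected.
  const std::optional<int64_t>& snapshot_id() const { return snapshot_id_; }
  const std::shared_ptr<Schema>& projected_schema() const {
    return projected_schema_;
  }

 private:
  std::optional<int64_t> snapshot_id_;
  std::shared_ptr<Schema> projected_schema_;
};
```

Each setter returns a reference to the builder, so calls can be chained: `builder.WithSnapshotId(42).WithProjectedSchema(schema);`.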

/// \brief snapshot ID to scan, if specified.
std::optional<int64_t> snapshot_id_;
/// \brief Context for the scan, including snapshot, schema, and filter.
TableScanContext context_;
Member

In the Java version of TableScanContext, column_names_ and snapshot_id_ are also stored in the context. Should we follow the same pattern? If we do, TableScanBuilder is effectively a TableScanContextBuilder.

Contributor Author

I originally intended TableScanContext to hold only the context that remains after the various input parameters have been converted, dropping anything that is no longer needed during the subsequent file scan.

data_entry.sequence_number.value_or(TableMetadata::kInitialSequenceNumber);
for (auto it = sequence_index.lower_bound(data_sequence_number);
it != sequence_index.end(); ++it) {
// Additional filtering logic here
Member

What is the additional filtering logic? Did you mean to further check if the delete files can be filtered?

Contributor Author

Each data file only needs to retain the delete files whose sequence number is greater than its own?
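
For reference, a small sketch of the lookup under discussion. The `SequenceIndex` alias, the `CandidateDeletes` function, and the use of file paths as values are assumptions for illustration, not the PR's types. Note that `lower_bound` is inclusive (sequence number greater than or equal), while `upper_bound` gives the strictly-greater behavior described above. As far as I understand the Iceberg spec, position deletes use the inclusive comparison and equality deletes the strict one, so the sketch exposes both.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical index: delete file sequence number -> delete file path.
using SequenceIndex = std::multimap<int64_t, std::string>;

// Returns the delete files that may apply to a data file with the given
// sequence number. strict == false keeps entries with sequence >= data
// sequence (lower_bound); strict == true keeps only entries with
// sequence > data sequence (upper_bound).
inline std::vector<std::string> CandidateDeletes(const SequenceIndex& index,
                                                 int64_t data_sequence_number,
                                                 bool strict) {
  auto it = strict ? index.upper_bound(data_sequence_number)
                   : index.lower_bound(data_sequence_number);
  std::vector<std::string> result;
  for (; it != index.end(); ++it) {
    result.push_back(it->second);
  }
  return result;
}
```

Any further filtering (for example by partition or by column statistics) would then run over this candidate list rather than over every delete file.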


int64_t FileScanTask::length() const { return length_; }

int64_t FileScanTask::size_bytes() const {
Member

Suggested change:
- int64_t FileScanTask::size_bytes() const {
+ int64_t FileScanTask::SizeBytes() const {

This is not trivial.

return static_cast<int32_t>(delete_files_.size() + 1);
}

int64_t FileScanTask::estimated_row_count() const {
Member

ditto

return sizeInBytes;
}

int32_t FileScanTask::files_count() const {
Member

I'm not sure if we need to rename it to FilesCount(). @lidavidm suggestion?
