feat: basic table scan planning #112
Co-authored-by: Gang Wu <[email protected]>
/// \param file_path Path to the manifest list file.
/// \return A Result containing the reader or an error.
Result<std::unique_ptr<ManifestListReader>> CreateManifestListReader(
    const std::string& file_path) {
I think this is not enough. At least we need extra parameters like table_format_version and file_io.
BTW, should we use std::string_view for path?
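To make the two suggestions above concrete, here is a minimal sketch of what a factory taking a std::string_view path plus a format version and a file IO handle might look like. Everything in it is hypothetical: FileIO and ManifestListReader are local stand-ins, MakeManifestListReaderSketch is an invented name, and it returns nullptr on bad input where the PR's code returns a Result.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical stand-ins for illustration; not the project's real types.
struct FileIO {};

class ManifestListReader {
 public:
  ManifestListReader(std::string path, int format_version, std::shared_ptr<FileIO> io)
      : path_(std::move(path)), format_version_(format_version), io_(std::move(io)) {}

  const std::string& path() const { return path_; }
  int format_version() const { return format_version_; }

 private:
  std::string path_;
  int format_version_;
  std::shared_ptr<FileIO> io_;
};

// Assumed signature: path by string_view, plus the extra parameters from the review.
// Returns nullptr on invalid input (the PR itself uses Result<...> for errors).
std::unique_ptr<ManifestListReader> MakeManifestListReaderSketch(
    std::string_view file_path, int format_version, std::shared_ptr<FileIO> file_io) {
  if (file_path.empty() || file_io == nullptr) {
    return nullptr;
  }
  // string_view does not own its characters, so copy them before storing.
  return std::make_unique<ManifestListReader>(std::string(file_path), format_version,
                                              std::move(file_io));
}
```

One design point the sketch surfaces: std::string_view avoids a copy at the call site, but the reader still has to copy the path if it outlives the call.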
Do you need to provide a ManifestListReaderBuilder/ManifestReaderBuilder?
I don't know yet. It depends on how we will use them. BTW @dongxiao1198 will work on manifest reading.
};

/// \brief Represents a task to scan a portion of a data file.
class ICEBERG_EXPORT FileScanTask : public ScanTask {
Some thoughts about FileScanTask:
- Should we remove the ScanTask abstraction above? If we remove the abstraction, we can directly use aggregate initialization to create a task. Otherwise we may need to expand the constructor every time a new parameter is required.
- If we do (1) above, is it possible also to make it a simple struct by removing all functions (as they are all trivial accessors)?
- Should we add fields (a.k.a. spec and partition_value) from Java PartitionScanTask to support partitioning? We can add them later but a TODO comment is desirable.
- Should we combine start and length, and wrap them by std::optional? I believe they are not required at all times.
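For reference, the plain-struct variant discussed in these points could look roughly like the sketch below. It is only an illustration under assumptions: DataFile is a local stand-in with two made-up fields, and ByteRange, FileScanTaskSketch, and BytesToScan are hypothetical names that do not appear in the PR.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for the project's DataFile.
struct DataFile {
  std::string file_path;
  int64_t file_size_in_bytes = 0;
};

// start and length combined, so they are either both present or both absent.
struct ByteRange {
  int64_t start = 0;
  int64_t length = 0;
};

// Plain aggregate: no base class, no constructor, no accessors.
struct FileScanTaskSketch {
  std::shared_ptr<DataFile> data_file;
  std::vector<std::shared_ptr<DataFile>> delete_files;
  std::optional<ByteRange> range;  // std::nullopt means "scan the whole file"
  // TODO: add partition spec and partition value to support partitioning.
};

// Non-trivial logic stays in a free function instead of a member.
inline int64_t BytesToScan(const FileScanTaskSketch& task) {
  return task.range ? task.range->length : task.data_file->file_size_in_bytes;
}
```

With this shape a task is created by aggregate initialization, for example `FileScanTaskSketch task{file, {}, std::nullopt};`, and a new field only requires touching the call sites that want to set it.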
- I initially expected it to just be a struct, but since the previous comments suggested doing an abstraction, I referred to the design in iceberg-java/iceberg-python.
- Partition spec and value can be obtained from DataFile and Snapshot, and we can add these interfaces in a subsequent PR when needed.
- Sure, I will modify it to optional, thanks.
src/iceberg/table_scan.h
Outdated
/// \brief Sets the schema to use for the scan.
/// \param schema The schema to use.
/// \return Reference to the builder.
TableScanBuilder& WithSchema(std::shared_ptr<Schema> schema);
I think we don't need this. We just need the schema of a specific snapshot id, which can be obtained via table_metadata. Did I miss something?
This is used to specify the projected schema without specifying the column name, and I have modified it to WithProjectedSchema.
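A rough sketch of how a builder could accept either column names or a projected schema is shown below. The Schema stand-in, the TableScanBuilderSketch class, the WithColumnNames and ResolveColumns names, and the precedence rule (projected schema first, then column names, then the full table schema) are all assumptions for illustration; only WithProjectedSchema is a name taken from this thread.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in: a schema reduced to its column names.
struct Schema {
  std::vector<std::string> column_names;
};

class TableScanBuilderSketch {
 public:
  // Select columns by name.
  TableScanBuilderSketch& WithColumnNames(std::vector<std::string> names) {
    column_names_ = std::move(names);
    return *this;
  }

  // Select columns by handing over an already projected schema.
  TableScanBuilderSketch& WithProjectedSchema(std::shared_ptr<Schema> schema) {
    projected_schema_ = std::move(schema);
    return *this;
  }

  // Assumed precedence: projected schema, then column names, then all columns.
  std::vector<std::string> ResolveColumns(const Schema& table_schema) const {
    if (projected_schema_ != nullptr) return projected_schema_->column_names;
    if (!column_names_.empty()) return column_names_;
    return table_schema.column_names;
  }

 private:
  std::vector<std::string> column_names_;
  std::shared_ptr<Schema> projected_schema_;
};
```

Returning a reference from each setter keeps calls chainable, which matches the "Reference to the builder" wording in the doc comment.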
/// \brief snapshot ID to scan, if specified.
std::optional<int64_t> snapshot_id_;
/// \brief Context for the scan, including snapshot, schema, and filter.
TableScanContext context_;
In the Java version of TableScanContext, column_names_ and snapshot_id_ are also stored in it. Should we follow the same pattern? If we do this, it seems that TableScanBuilder is indeed a TableScanContextBuilder.
I originally expected that TableScanContext would be context information retained after converting various input parameters, and that what was no longer needed in the subsequent file scanning process would be removed.
    data_entry.sequence_number.value_or(TableMetadata::kInitialSequenceNumber);
for (auto it = sequence_index.lower_bound(data_sequence_number);
     it != sequence_index.end(); ++it) {
  // Additional filtering logic here
What is the additional filtering logic? Did you mean to further check if the delete files can be filtered?
DataFiles only need to retain DeleteFiles with a sequence greater than their own?
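To illustrate the sequence-number lookup being discussed, here is a small sketch assuming the index is a std::multimap from sequence number to delete file name. SequenceIndex and CandidateDeletes are hypothetical names, and the container type is a guess, since the diff does not show how sequence_index is declared. The point it shows is a property of the standard library: lower_bound starts at the first key greater than or equal to the data file's sequence number, while upper_bound starts at the first key strictly greater.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Assumed shape of the index: delete files keyed by their sequence number.
using SequenceIndex = std::multimap<int64_t, std::string>;

// Collects delete files whose sequence number is >= data_seq (lower_bound)
// or > data_seq (upper_bound), depending on strictly_greater.
std::vector<std::string> CandidateDeletes(const SequenceIndex& index, int64_t data_seq,
                                          bool strictly_greater) {
  auto it = strictly_greater ? index.upper_bound(data_seq) : index.lower_bound(data_seq);
  std::vector<std::string> result;
  for (; it != index.end(); ++it) {
    result.push_back(it->second);
  }
  return result;
}
```

So the loop in the diff, which uses lower_bound, also visits delete files with a sequence number equal to the data file's own; whether those should apply is the kind of check the "additional filtering logic" would have to make.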
src/iceberg/table_scan.cc
Outdated

int64_t FileScanTask::length() const { return length_; }

int64_t FileScanTask::size_bytes() const {
Suggested change:
- int64_t FileScanTask::size_bytes() const {
+ int64_t FileScanTask::SizeBytes() const {
This is not trivial.
src/iceberg/table_scan.cc
Outdated
  return static_cast<int32_t>(delete_files_.size() + 1);
}

int64_t FileScanTask::estimated_row_count() const {
ditto
src/iceberg/table_scan.cc
Outdated
  return sizeInBytes;
}

int32_t FileScanTask::files_count() const {
I'm not sure if we need to rename it to FilesCount(). @lidavidm, any suggestion?
Introducing basic scan table data interface