Skip to content

Conversation

@ebyhr
Copy link
Member

@ebyhr ebyhr commented Oct 19, 2025

Description

Fixes #26983

Release notes

## Iceberg
* Collect NDV stats on all columns when replacing tables. ({issue}`26983`)

Summary by Sourcery

Enable collection of distinct value statistics on all columns when replacing Iceberg tables by extending the statistics collection API with a replace flag and updating the Iceberg connector to use it.

New Features:

  • Introduce a replace flag in the statistics collection SPI and propagate it through core, planner, tracing, and connector layers to distinguish replacement writes

Enhancements:

  • Deprecate the old getStatisticsCollectionMetadataForWrite method and add a new overload accepting the replace parameter

Tests:

  • Add a TestIcebergStatistics test to verify SHOW STATS FOR after CREATE OR REPLACE TABLE includes NDV stats for all columns

@cla-bot cla-bot bot added the cla-signed label Oct 19, 2025
@sourcery-ai
Copy link

sourcery-ai bot commented Oct 19, 2025

Reviewer's Guide

Propagate a replace flag in statistics collection metadata APIs to enable collecting NDV stats on all columns during table replace operations, updating core SPI, metadata propagation, planners, tracing and wrapper layers, Iceberg connector implementation, and adding a test to validate SHOW STATS after REPLACE TABLE.

Sequence diagram for statistics collection during table replace operation

sequenceDiagram
    participant Planner
    participant MetadataManager
    participant ConnectorMetadata
    participant IcebergMetadata
    Planner->>MetadataManager: getStatisticsCollectionMetadataForWrite(..., replace=true)
    MetadataManager->>ConnectorMetadata: getStatisticsCollectionMetadataForWrite(..., replace=true)
    ConnectorMetadata->>IcebergMetadata: getStatisticsCollectionMetadataForWrite(..., replace=true)
    IcebergMetadata-->>ConnectorMetadata: TableStatisticsMetadata (NDV stats on all columns)
    ConnectorMetadata-->>MetadataManager: TableStatisticsMetadata
    MetadataManager-->>Planner: TableStatisticsMetadata
Loading

Class diagram for updated statistics collection metadata APIs

classDiagram
    class Metadata {
        +getStatisticsCollectionMetadataForWrite(Session session, CatalogHandle catalogHandle, ConnectorTableMetadata tableMetadata, boolean replace): TableStatisticsMetadata
    }
    class MetadataManager {
        +getStatisticsCollectionMetadataForWrite(Session session, CatalogHandle catalogHandle, ConnectorTableMetadata tableMetadata, boolean replace): TableStatisticsMetadata
    }
    class TracingMetadata {
        +getStatisticsCollectionMetadataForWrite(Session session, CatalogHandle catalogHandle, ConnectorTableMetadata tableMetadata, boolean replace): TableStatisticsMetadata
    }
    class ConnectorMetadata {
        +getStatisticsCollectionMetadataForWrite(ConnectorSession session, ConnectorTableMetadata tableMetadata, boolean replace): TableStatisticsMetadata
        +getStatisticsCollectionMetadataForWrite(ConnectorSession session, ConnectorTableMetadata tableMetadata): TableStatisticsMetadata (deprecated)
    }
    class TracingConnectorMetadata {
        +getStatisticsCollectionMetadataForWrite(ConnectorSession session, ConnectorTableMetadata tableMetadata, boolean replace): TableStatisticsMetadata
    }
    class ClassLoaderSafeConnectorMetadata {
        +getStatisticsCollectionMetadataForWrite(ConnectorSession session, ConnectorTableMetadata tableMetadata, boolean replace): TableStatisticsMetadata
    }
    class IcebergMetadata {
        +getStatisticsCollectionMetadataForWrite(ConnectorSession session, ConnectorTableMetadata tableMetadata, boolean replace): TableStatisticsMetadata
    }
    class LakehouseMetadata {
        +getStatisticsCollectionMetadataForWrite(ConnectorSession session, ConnectorTableMetadata tableMetadata, boolean replace): TableStatisticsMetadata
    }
    MetadataManager --> Metadata
    TracingMetadata --> MetadataManager
    TracingConnectorMetadata --> ConnectorMetadata
    ClassLoaderSafeConnectorMetadata --> ConnectorMetadata
    IcebergMetadata --> ConnectorMetadata
    LakehouseMetadata --> ConnectorMetadata
Loading

File-Level Changes

Change Details Files
Introduce a boolean 'replace' parameter in getStatisticsCollectionMetadataForWrite SPI.
  • Add new default method overload in ConnectorMetadata with (session, tableMetadata, replace)
  • Deprecate the old single-arg getStatisticsCollectionMetadataForWrite method
core/trino-spi/src/main/java/io/trino/spi/connector/ConnectorMetadata.java
Propagate the 'replace' flag through core metadata API and planner.
  • Update MetadataManager and Metadata interface to include replace parameter
  • Modify LogicalPlanner to pass create.isReplace() or false when requesting stats metadata
core/trino-main/src/main/java/io/trino/metadata/MetadataManager.java
core/trino-main/src/main/java/io/trino/sql/planner/LogicalPlanner.java
core/trino-main/src/main/java/io/trino/metadata/Metadata.java
Update tracing and classloader-safe metadata wrappers to forward the replace flag.
  • Add new overrides in TracingConnectorMetadata and TracingMetadata to accept replace
  • Update ClassLoaderSafeConnectorMetadata to delegate replace parameter
  • Extend LakehouseMetadata and AbstractMockMetadata to support replace flag
core/trino-main/src/main/java/io/trino/tracing/TracingConnectorMetadata.java
core/trino-main/src/main/java/io/trino/tracing/TracingMetadata.java
lib/trino-plugin-toolkit/src/main/java/io/trino/plugin/base/classloader/ClassLoaderSafeConnectorMetadata.java
plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehouseMetadata.java
core/trino-main/src/test/java/io/trino/metadata/AbstractMockMetadata.java
Enhance Iceberg connector to collect NDV stats on all columns when replacing tables.
  • Add branch in IcebergMetadata.getStatisticsCollectionMetadataForWrite to handle replace=true
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java
Add integration test for SHOW STATS after CREATE OR REPLACE TABLE in Iceberg.
  • Introduce testShowStatsReplaceTable in TestIcebergStatistics verifying stats after replace
plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/TestIcebergStatistics.java

Assessment against linked issues

Issue Objective Addressed Explanation
#26983 Ensure that extended table statistics, including NDV (number of distinct values), are collected and populated correctly when using CREATE OR REPLACE TABLE (CTAS) to replace an Iceberg table.

Possibly linked issues


Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

@github-actions github-actions bot added iceberg Iceberg connector lakehouse labels Oct 19, 2025
Copy link

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey there - I've reviewed your changes and they look great!

Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location> `core/trino-main/src/main/java/io/trino/tracing/TracingMetadata.java:633` </location>
<code_context>
     }

     @Override
-    public TableStatisticsMetadata getStatisticsCollectionMetadataForWrite(Session session, CatalogHandle catalogHandle, ConnectorTableMetadata tableMetadata)
+    public TableStatisticsMetadata getStatisticsCollectionMetadataForWrite(Session session, CatalogHandle catalogHandle, ConnectorTableMetadata tableMetadata, boolean replace)
     {
         CatalogMetadata catalogMetadata = getCatalogMetadataForWrite(session, catalogHandle);
</code_context>

<issue_to_address>
**suggestion:** Tracing wrapper passes 'replace' but does not log or trace its value.

If 'replace' impacts statistics collection, add its value to the tracing span to improve observability.
</issue_to_address>

### Comment 2
<location> `lib/trino-plugin-toolkit/src/main/java/io/trino/plugin/base/classloader/ClassLoaderSafeConnectorMetadata.java:158` </location>
<code_context>
     }

     @Override
-    public TableStatisticsMetadata getStatisticsCollectionMetadataForWrite(ConnectorSession session, ConnectorTableMetadata tableMetadata)
+    public TableStatisticsMetadata getStatisticsCollectionMetadataForWrite(ConnectorSession session, ConnectorTableMetadata tableMetadata, boolean replace)
     {
         Span span = startSpan("getStatisticsCollectionMetadataForWrite", tableMetadata.getTable());
</code_context>

<issue_to_address>
**issue:** ClassLoaderSafeConnectorMetadata now passes 'replace', but does not handle backward compatibility.

If the delegate lacks the updated method, runtime errors may occur. Please add a fallback to the deprecated method to maintain compatibility.
</issue_to_address>

### Comment 3
<location> `plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/TestIcebergStatistics.java:976-974` </location>
<code_context>
+    @Test
</code_context>

<issue_to_address>
**suggestion (testing):** Consider adding tests for error conditions and edge cases in table replacement.

Please add tests for cases like replacing tables with no columns, columns of varying types, duplicate column names, replacement failures, and when extended statistics are disabled.

Suggested implementation:

```java
    @Test
    public void testShowStatsReplaceTable()
    {
        try (TestTable table = newTrinoTable("show_stats_after_replace_table_", "AS SELECT 1 a, 2 b")) {
            assertThat(query("SHOW STATS FOR " + table.getName()))
                    .skippingTypesCheck()
                    .matches("""
                        VALUES
                        ('a', null, 1e0, 0e0, NULL, '1', '1'),
                        ('b', null, 1e0, 0e0, NULL, '2', '2'),
                        (NULL, NULL, NULL, NULL, 1e0, NULL, NULL)
                    """);
        }
    }

    @Test
    public void testReplaceTableWithNoColumns()
    {
        try (TestTable table = newTrinoTable("replace_table_no_columns", "(a INTEGER)")) {
            assertUpdate("REPLACE TABLE " + table.getName() + " AS SELECT"); // No columns
            assertThat(query("SHOW STATS FOR " + table.getName()))
                    .skippingTypesCheck()
                    .matches("""
                        VALUES
                        (NULL, NULL, NULL, NULL, 0e0, NULL, NULL)
                    """);
        }
    }

    @Test
    public void testReplaceTableWithVaryingColumnTypes()
    {
        try (TestTable table = newTrinoTable("replace_table_varying_types", "(a INTEGER, b VARCHAR, c DOUBLE)")) {
            assertUpdate("REPLACE TABLE " + table.getName() + " AS SELECT 1 a, 'x' b, 3.14 c");
            assertThat(query("SHOW STATS FOR " + table.getName()))
                    .skippingTypesCheck()
                    .matches("""
                        VALUES
                        ('a', null, 1e0, 0e0, NULL, '1', '1'),
                        ('b', null, 1e0, 0e0, NULL, 'x', 'x'),
                        ('c', null, 1e0, 0e0, NULL, '3.14', '3.14'),
                        (NULL, NULL, NULL, NULL, 1e0, NULL, NULL)
                    """);
        }
    }

    @Test
    public void testReplaceTableWithDuplicateColumnNames()
    {
        try (TestTable table = newTrinoTable("replace_table_duplicate_columns", "(a INTEGER, a INTEGER)")) {
            // This should fail due to duplicate column names
            assertQueryFails("REPLACE TABLE " + table.getName() + " AS SELECT 1 a, 2 a", "Duplicate column name: a");
        }
    }

    @Test
    public void testReplaceTableFailure()
    {
        try (TestTable table = newTrinoTable("replace_table_failure", "(a INTEGER)")) {
            // Invalid SQL
            assertQueryFails("REPLACE TABLE " + table.getName() + " AS SELECT invalid_column", "Column 'invalid_column' cannot be resolved");
        }
    }

    @Test
    public void testReplaceTableWithExtendedStatisticsDisabled()
    {
        // Assuming there is a session property or config to disable extended statistics
        Session session = Session.builder(getSession())
                .setCatalogSessionProperty("iceberg", "collect_extended_statistics_enabled", "false")
                .build();
        try (TestTable table = newTrinoTable("replace_table_ext_stats_disabled", "(a INTEGER)")) {
            getQueryRunner().execute(session, "REPLACE TABLE " + table.getName() + " AS SELECT 1 a");
            assertThat(query("SHOW STATS FOR " + table.getName()))
                    .skippingTypesCheck()
                    .matches("""
                        VALUES
                        ('a', null, 1e0, 0e0, NULL, '1', '1'),
                        (NULL, NULL, NULL, NULL, 1e0, NULL, NULL)
                    """);
        }
    }

```

- If your test infrastructure does not support `assertQueryFails`, you may need to use the appropriate method for asserting query failures.
- Adjust the session property name for disabling extended statistics if it differs in your codebase.
- Ensure that `newTrinoTable` and `getQueryRunner()` are available and correctly configured for these tests.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

@ebyhr ebyhr force-pushed the ebi/iceberg-stats-replace-table branch from d5ddb2c to 29536fe Compare October 19, 2025 06:03
@ebyhr
Copy link
Member Author

ebyhr commented Oct 21, 2025

CI failure will be fixed by #27019

@ebyhr ebyhr requested a review from findepi October 21, 2025 03:40
}

if (replace) {
return getStatisticsCollectionMetadata(tableMetadata, Optional.empty(), _ -> {});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

keep it consistent with the 3 other occurrences of return getStatisticsCollectionMetadata(tableMetadata, Optional.empty(), availableColumnNames -> {}); below

Suggested change
return getStatisticsCollectionMetadata(tableMetadata, Optional.empty(), _ -> {});
return getStatisticsCollectionMetadata(tableMetadata, Optional.empty(), availableColumnNames -> {});


@Override
public TableStatisticsMetadata getStatisticsCollectionMetadataForWrite(ConnectorSession session, ConnectorTableMetadata tableMetadata)
public TableStatisticsMetadata getStatisticsCollectionMetadataForWrite(ConnectorSession session, ConnectorTableMetadata tableMetadata, boolean replace)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it allow removal of exclusion in TestLakehouseMetadata ?


@Override
public TableStatisticsMetadata getStatisticsCollectionMetadataForWrite(ConnectorSession session, ConnectorTableMetadata tableMetadata)
public TableStatisticsMetadata getStatisticsCollectionMetadataForWrite(ConnectorSession session, ConnectorTableMetadata tableMetadata, boolean replace)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add

@Override
    public TableStatisticsMetadata getStatisticsCollectionMetadataForWrite(ConnectorSession session, ConnectorTableMetadata tableMetadata)
    {
        throw new UnsupportedOperationException("This variant of getStatisticsCollectionMetadataForWrite is unsupported");
    }

/**
* Describes statistics that must be collected during a write.
*/
default TableStatisticsMetadata getStatisticsCollectionMetadataForWrite(ConnectorSession session, ConnectorTableMetadata tableMetadata, boolean replace)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

replace -> tableReplace
and add @param javadoc

in the context of this method it's quite not obvious what "replace" could mean
eg is it "replace old statistics" maybe?

@ebyhr ebyhr force-pushed the ebi/iceberg-stats-replace-table branch from 29536fe to b486387 Compare October 22, 2025 08:22
@ebyhr
Copy link
Member Author

ebyhr commented Oct 22, 2025

CI hit #23596

@ebyhr ebyhr merged commit f1a73d3 into trinodb:master Oct 22, 2025
96 of 98 checks passed
@ebyhr ebyhr deleted the ebi/iceberg-stats-replace-table branch October 22, 2025 09:24
@github-actions github-actions bot added this to the 478 milestone Oct 22, 2025
/**
* Describes statistics that must be collected during a write.
*
* @param tableReplace if true, replace old statistics
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"indicates whether this is for table replace operation and statistics of an existing table (if any) should be ignored"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Development

Successfully merging this pull request may close these issues.

[Iceberg] Extended table statistics are not collected when using CTAS to replace table

2 participants