
Conversation

@dingo4dev (Contributor)

Rationale for this change

This pull request introduces a Trino service to the Docker Compose development environment, addressing the need for broader integration testing capabilities as discussed in issue #2219.

Are these changes tested?

Are there any user-facing changes?

@dingo4dev force-pushed the test-trino-integration branch 4 times, most recently from 04d1f61 to a0e1a11 on July 22, 2025, 19:01
@dingo4dev marked this pull request as ready for review on July 22, 2025, 19:02
Copilot AI review was requested due to automatic review settings on July 22, 2025, 19:02
@dingo4dev changed the title from "[WIP] Trino: Add Trino Docker Compose for integration testing" to "Trino: Add Trino Docker Compose for integration testing" on Jul 22, 2025
Copilot AI (Contributor) left a comment:

Pull Request Overview

This PR introduces Trino integration testing capabilities to the PyIceberg project by adding Docker Compose configuration and test fixtures for Trino services. The changes enable testing of Iceberg operations through Trino's SQL interface to ensure compatibility and proper integration between PyIceberg and Trino.

Key changes include:

  • Added Trino Docker services configuration with both REST and Hive catalog support
  • Introduced new test fixtures for Trino database connections (a sketch follows this list)
  • Created integration tests that verify PyIceberg operations work correctly with Trino
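As an aside, here is a minimal sketch of what such a Trino connection fixture could look like, assuming the SQLAlchemy dialect that ships with the trino Python package and a Trino server on localhost:8080; the URI, catalog name, and fixture scope are illustrative rather than the exact values added in this PR:

    # Sketch only: assumes the `trino` package's SQLAlchemy dialect and a
    # Trino server reachable on localhost:8080 with a catalog named "iceberg".
    import pytest
    from sqlalchemy import create_engine


    @pytest.fixture(scope="session")
    def trino_conn():
        engine = create_engine("trino://admin@localhost:8080/iceberg")
        with engine.connect() as connection:
            yield connection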

Reviewed Changes

Copilot reviewed 11 out of 12 changed files in this pull request and generated 1 comment.

Summary per file:

  • tests/integration/test_writes/test_writes.py: Added a new Trino-specific UUID partitioning test and skipped the existing Spark test
  • tests/integration/test_rest_catalog.py: Added a test to verify that an Iceberg namespace appears as a Trino schema
  • tests/integration/test_register_table.py: Added a test to verify that table registration works with Trino
  • tests/conftest.py: Added Trino connection fixtures and command-line options
  • pyproject.toml: Added the Trino dependency and an integration test marker
  • dev/trino/catalog/*.properties: Trino catalog configuration files for the REST and Hive catalogs
  • dev/run-trino.sh: Shell script to manage the Trino Docker services
  • dev/docker-compose-trino.yml: Standalone Trino Docker Compose configuration
  • dev/docker-compose-integration.yml: Added the Trino service to the main integration environment
  • Makefile: Added a target for running Trino integration tests

Comments suppressed due to low confidence (2)

tests/integration/test_register_table.py:109

  • [nitpick] The test doesn't handle the case where the Trino connection might fail or time out. Consider adding error handling or retries for database connectivity issues that could make the test flaky.
    assert table_name in inspect(trino_conn).get_table_names(schema=namespace)
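One possible way to address this, as a sketch under the assumption that the Trino SQLAlchemy dialect surfaces connection failures as OperationalError (the helper name and retry parameters are illustrative, not part of this PR):

    import time

    from sqlalchemy import inspect
    from sqlalchemy.exc import OperationalError


    def trino_table_names_with_retry(trino_conn, namespace, attempts=5, delay=3.0):
        """Inspect Trino's table names for a schema, retrying transient connection errors."""
        for attempt in range(attempts):
            try:
                return inspect(trino_conn).get_table_names(schema=namespace)
            except OperationalError:
                # Give a slow-starting Trino container a chance to come up.
                if attempt == attempts - 1:
                    raise
                time.sleep(delay)


    # The assertion in the test could then become:
    # assert table_name in trino_table_names_with_retry(trino_conn, namespace)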

@dingo4dev force-pushed the test-trino-integration branch from df48ce0 to f46d70f on August 28, 2025, 19:02
Adds a Docker Compose configuration for setting up Trino with Iceberg, including support for Hive Metastore and REST catalog types. This allows for easier testing and development with Trino and Iceberg.

closes apache#2219
Add Trino as an alternative tool for testing UUID partitions, as this is not yet supported in Spark with Java Iceberg 1.9.2.
<!--related to apache#2002-->
@dingo4dev force-pushed the test-trino-integration branch from f46d70f to 06f7419 on November 12, 2025, 17:11
@raulcd (Member) left a comment:

Thanks for the PR!

assert tbl.scan().to_arrow() == arrow_table


@pytest.mark.skip("UUID BucketTransform is not supported in Spark Iceberg 1.9.2 yet")
@raulcd (Member):

is this test skip related to the trino work?
I am curious why this wasn't skipped before

@dingo4dev (Contributor, PR author):

@raulcd, Thanks for your reviews.

Before this PR, the bucket transform was not tested in this test case because we found an issue in Iceberg Java.

@pytest.mark.parametrize(
    "transform",
    [
        IdentityTransform(),
        # Bucket is disabled because of an issue in Iceberg Java:
        # https://github.com/apache/iceberg/pull/13324
        # BucketTransform(32)
    ],
)
def test_uuid_partitioning(session_catalog: Catalog, spark: SparkSession, transform: Transform) -> None:  # type: ignore

My thought is to use Trino to test this UUID partition test case for both the Identity and Bucket transforms, instead of Spark Iceberg, because the UUID BucketTransform is not supported in Spark but works in Trino.

To isolate the current issue in Java Iceberg, I added a new test case for it in: https://github.com/dingo4dev/iceberg-python/blob/3eef2566e02342d65a2c2f650bfb6c62d0221cf3/tests/integration/test_writes/test_writes.py#L2161
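For context, a rough sketch (not the actual test added in this PR) of how a Trino-backed UUID bucket-partitioning check can be structured; the table identifier, the Trino catalog name ("rest"), and the session_catalog/trino_conn fixtures are assumptions for illustration:

    import uuid

    import pyarrow as pa
    from sqlalchemy import text

    from pyiceberg.partitioning import PartitionField, PartitionSpec
    from pyiceberg.schema import Schema
    from pyiceberg.transforms import BucketTransform
    from pyiceberg.types import NestedField, UUIDType


    def test_uuid_bucket_partitioning_with_trino(session_catalog, trino_conn):
        identifier = "default.test_uuid_bucket_trino"  # illustrative table name
        schema = Schema(NestedField(field_id=1, name="uuid_col", field_type=UUIDType(), required=False))
        spec = PartitionSpec(
            PartitionField(source_id=1, field_id=1000, transform=BucketTransform(32), name="uuid_bucket")
        )
        tbl = session_catalog.create_table(identifier, schema=schema, partition_spec=spec)

        # UUIDs are written from the Arrow side as 16-byte fixed-size binary values.
        arrow_table = pa.table({"uuid_col": pa.array([uuid.uuid4().bytes for _ in range(5)], pa.binary(16))})
        tbl.append(arrow_table)

        # Read the same table back through Trino's Iceberg connector; assumes
        # trino_conn is a SQLAlchemy Connection and "rest" is the catalog name.
        count = trino_conn.execute(text("SELECT count(*) FROM rest.default.test_uuid_bucket_trino")).scalar()
        assert count == 5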

@raulcd (Member):

I see, thanks for the comment. I understand you are adding a new test that covers both the BucketTransform and the IdentityTransform in test_uuid_partitioning_with_trino, but I don't think adding the conditional skip on this specific test is necessary. Either we leave it as is (the test will still run on a non-Trino setup), or we remove it entirely since the transform is covered on the Trino setup. I would rather just keep it, to be honest, because if the Iceberg Java issue is fixed in the future we can also add the BucketTransform to run unconditionally.

@dingo4dev (Contributor, PR author):

@raulcd You're right! I just found that there is a way to add the skip marker to a specific parameter, so we can keep the partition test running on Spark while skipping only the BucketTransform, and it is easier to track than a comment. How does that sound?
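Something along these lines, as a sketch of the approach just described (imports and fixture names mirror the snippet quoted earlier; the skip reason wording is illustrative):

    import pytest

    from pyiceberg.transforms import BucketTransform, IdentityTransform


    @pytest.mark.parametrize(
        "transform",
        [
            IdentityTransform(),
            pytest.param(
                BucketTransform(32),
                # Skips only this parameter; the Identity case keeps running on Spark.
                marks=pytest.mark.skip(reason="UUID BucketTransform is not supported in Spark Iceberg 1.9.2 yet"),
            ),
        ],
    )
    def test_uuid_partitioning(session_catalog, spark, transform):
        ...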

@raulcd (Member):

Sounds good in principle; if you can push or show the diff I might understand better what you mean :)
Thanks for looking at this!

@dingo4dev (Contributor, PR author):

Sure! I just updated it. Please have a look ;)

@raulcd (Member):

The change lgtm! Thanks, I'll see if I have some time to go over the changes in detail to properly review but we will still need a committer review :)

@@ -0,0 +1,23 @@
# Licensed to the Apache Software Foundation (ASF) under one
@ebyhr (Contributor):

Why do we want to add the config.properties file?

@dingo4dev (Contributor, PR author) commented on Dec 3, 2025:

Hi @ebyhr, thanks for your review.
My initial thought was to set catalog.management to dynamic, which helps with adding and testing the Glue and DynamoDB catalogs without restarting the container. I then thought it would also help the development experience: you can modify the server configuration for your own resources without updating the Dockerfile yourself.

@ebyhr (Contributor):

Thank you, I understand the motivation now. The next question is why enabling the CATALOG_MANAGEMENT environment variable is insufficient.

@dingo4dev (Contributor, PR author):

I added catalog.management=${ENV:CATALOG_MANAGEMENT} to keep it consistent in case someone updates config.properties explicitly. Or we can just use catalog.management=dynamic and remove the env var.

Comment on lines +68 to +69
environment:
- CATALOG_MANAGEMENT=dynamic
Contributor:

This environment variable looks redundant since we're adding catalogs with properties files instead of the CREATE CATALOG statement.

s3.aws-access-key=admin
s3.aws-secret-key=password
s3.endpoint=http://minio:9000
s3.path-style-access=false
Contributor:

nit: We could remove this s3.path-style-access=false. It's disabled by default.

- rest
- hive
volumes:
- ./trino/catalog:/etc/trino/catalog
@ebyhr (Contributor) commented on Dec 3, 2025:

nit: How about specifying the full paths of the properties files? That way we can still use existing catalogs such as the memory and TPCH catalogs; they are helpful during development.

@dingo4dev (Contributor, PR author):

Sounds great!

@kevinjqliu (Contributor) left a comment:

Thanks for working on this! It's great to see another engine integration.

A couple of general comments:

  • Make the Trino infra and integration tests optional for now and hook them up to the CI later
  • Consolidate the 3 different ways to spin up the Trino infra. My preference would be to create a new docker-compose-trino file with just the Trino infra and spin it up alongside the original dev/docker-compose-integration.yml.
  • Group all the infra-related changes under dev/trino/, similar to what we do for Spark
  • Create a separate test file for the Trino tests

Let me know what you think!
