[WIP] Add Iceberg TAG management support #26050

Open
wants to merge 1 commit into
base: master

elmesaoudee

Description

This PR implements TAG management support for Iceberg tables in Trino. It adds new DDL syntax to create, replace, and drop tags, which are named references to specific table snapshots. Tags provide semantic checkpoints that enable consistent reads and rollback scenarios by letting users reference a specific version of an Iceberg table by name rather than by timestamp or snapshot ID.

The implementation adds three new ALTER TABLE statements following Trino DDL conventions:

  • CREATE [OR REPLACE] TAG [IF NOT EXISTS] <name> [FOR VERSION AS OF <snapshot>] [RETAIN <n> DAYS]
  • REPLACE TAG <name> [FOR VERSION AS OF <snapshot>] [RETAIN <n> DAYS]
  • DROP TAG [IF EXISTS] <name>

Usage Examples:

-- Create a tag for the current table state
ALTER TABLE iceberg.schema.orders CREATE TAG quarterly_snapshot;

-- Create a tag pointing to a specific snapshot with retention
ALTER TABLE iceberg.schema.orders CREATE TAG end_of_month FOR VERSION AS OF 12345 RETAIN 90 DAYS;

-- Replace an existing tag
ALTER TABLE iceberg.schema.orders REPLACE TAG quarterly_snapshot FOR VERSION AS OF 67890;

-- Drop a tag
ALTER TABLE iceberg.schema.orders DROP TAG old_snapshot;

-- Query using the tag for time travel
SELECT * FROM iceberg.schema.orders FOR VERSION AS OF 'quarterly_snapshot';

Additional context and related issues

The implementation spans multiple modules of the Trino project:

  • [trino-grammar]
  • [trino-parser]
  • [trino-main]
  • [trino-iceberg]

Grammar and Parser Changes:

  • Extended SqlBase.g4 with new tokens (TAG, DAYS, RETAIN) and grammar rules for tag operations
  • Added comprehensive parsing support for all tag DDL variants, including optional clauses
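
To make the optional clauses concrete, here is a minimal sketch of a recognizer for the CREATE TAG clause. It is not the ANTLR grammar from this PR: `CreateTagClauseSketch` and its fields are hypothetical names, it uses a plain regular expression, and it assumes the snapshot is given as a numeric ID.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical regex-based recognizer for the CREATE TAG clause.
// The PR uses ANTLR rules in SqlBase.g4; this only illustrates the optional parts.
final class CreateTagClauseSketch
{
    private static final Pattern CLAUSE = Pattern.compile(
            "CREATE\\s+(OR\\s+REPLACE\\s+)?TAG\\s+(IF\\s+NOT\\s+EXISTS\\s+)?(\\w+)"
                    + "(?:\\s+FOR\\s+VERSION\\s+AS\\s+OF\\s+(\\d+))?"
                    + "(?:\\s+RETAIN\\s+(\\d+)\\s+DAYS)?",
            Pattern.CASE_INSENSITIVE);

    final boolean orReplace;
    final boolean ifNotExists;
    final String tagName;
    final Long snapshotId;    // null when FOR VERSION AS OF is omitted
    final Integer retainDays; // null when RETAIN is omitted

    private CreateTagClauseSketch(Matcher m)
    {
        orReplace = m.group(1) != null;
        ifNotExists = m.group(2) != null;
        tagName = m.group(3);
        snapshotId = m.group(4) == null ? null : Long.valueOf(m.group(4));
        retainDays = m.group(5) == null ? null : Integer.valueOf(m.group(5));
    }

    // Parses the part of the statement that follows "ALTER TABLE <table>".
    static CreateTagClauseSketch parse(String clause)
    {
        Matcher m = CLAUSE.matcher(clause.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a CREATE TAG clause: " + clause);
        }
        return new CreateTagClauseSketch(m);
    }
}
```

For example, `CreateTagClauseSketch.parse("CREATE TAG end_of_month FOR VERSION AS OF 12345 RETAIN 90 DAYS")` yields a tag name, a snapshot ID, and a retention period, while `parse("CREATE TAG quarterly_snapshot")` leaves both optional values null.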

AST Implementation:

  • Created new AST node classes (CreateTag, ReplaceTag, DropTag) with proper visitor pattern support
  • Implemented equals/hashCode methods and integrated with existing AST infrastructure
  • Added corresponding visitor methods in AstVisitor and AstBuilder

Iceberg Connector Integration:

  • Added createTag, replaceTag, dropTag methods to IcebergMetadata
  • Integrated with Iceberg's native TAG API
  • Supported snapshot targeting and retention configuration
  • Handled error cases for duplicate/missing tags
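
The duplicate/missing-tag handling can be sketched with a small in-memory model. This is not the connector code: `TagStoreSketch`, `TagRef`, and the method signatures are hypothetical, and the rule that OR REPLACE takes precedence over IF NOT EXISTS is an assumption. In the real connector these operations would presumably delegate to Iceberg's `ManageSnapshots` API (`createTag`, `replaceTag`, `removeTag`, `setMaxRefAgeMs`) rather than a map.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

// Hypothetical in-memory model of tag handling; it is not Trino or Iceberg code.
final class TagStoreSketch
{
    // A tag is a named pointer to a snapshot ID, optionally with a maximum age.
    static final class TagRef
    {
        final long snapshotId;
        final Long maxRefAgeMs; // null when RETAIN is omitted

        TagRef(long snapshotId, Long maxRefAgeMs)
        {
            this.snapshotId = snapshotId;
            this.maxRefAgeMs = maxRefAgeMs;
        }
    }

    private final Map<String, TagRef> tags = new HashMap<>();

    void createTag(String name, long snapshotId, Integer retainDays, boolean orReplace, boolean ifNotExists)
    {
        if (tags.containsKey(name) && !orReplace) {
            if (ifNotExists) {
                return; // IF NOT EXISTS: leave the existing tag untouched
            }
            throw new IllegalStateException("Tag already exists: " + name);
        }
        tags.put(name, new TagRef(snapshotId, toMillis(retainDays)));
    }

    void replaceTag(String name, long snapshotId, Integer retainDays)
    {
        if (!tags.containsKey(name)) {
            throw new IllegalStateException("Tag does not exist: " + name);
        }
        tags.put(name, new TagRef(snapshotId, toMillis(retainDays)));
    }

    void dropTag(String name, boolean ifExists)
    {
        if (tags.remove(name) == null && !ifExists) {
            throw new IllegalStateException("Tag does not exist: " + name);
        }
    }

    TagRef get(String name)
    {
        return tags.get(name);
    }

    // RETAIN <n> DAYS is stored in milliseconds, matching Iceberg's max-ref-age unit.
    private static Long toMillis(Integer retainDays)
    {
        return retainDays == null ? null : TimeUnit.DAYS.toMillis(retainDays);
    }
}
```

With this model, `RETAIN 90 DAYS` is stored as 7,776,000,000 ms, a second plain CREATE TAG with the same name fails, and DROP TAG IF EXISTS on a missing tag is a no-op.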

Testing:

  • Unit tests for new AST nodes and parser
  • Integration tests in TestIcebergTagManagement covering all DDL variants
  • Error condition testing for proper exception handling
  • Time-travel query validation using created tags

The implementation follows Iceberg's tag semantics while maintaining consistency with Trino's existing DDL patterns.

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

## Iceberg connector
* Add support for managing Iceberg table tags with `CREATE TAG`, `REPLACE TAG`, and `DROP TAG` statements. Tags provide named references to specific table snapshots and support optional retention policies. ({issue}`issuenumber`)

Implement CREATE/REPLACE/DROP TAG operations for Iceberg tables following
Trino DDL syntax. Tags provide named references to specific table
snapshots with optional retention policies.

Grammar changes:
- Add new tokens (TAG, DAYS, RETAIN) to SqlBase.g4
- Add three new ALTER TABLE statements for tag operations:
  * CREATE [OR REPLACE] TAG [IF NOT EXISTS] <name> [FOR VERSION AS OF <snapshot>] [RETAIN <n> DAYS]
  * REPLACE TAG <name> [FOR VERSION AS OF <snapshot>] [RETAIN <n> DAYS]
  * DROP TAG [IF EXISTS] <name>

AST and parser implementation:
- Create CreateTag, ReplaceTag, DropTag AST node classes
- Add visitor methods in AstVisitor and AstBuilder
- Implement proper equals/hashCode and visitor patterns

Iceberg connector implementation:
- Add createTag, replaceTag, dropTag methods to IcebergMetadata
- Integrate with Iceberg's native TAG API
- Support snapshot targeting and retention configuration
- Handle error cases for duplicate/missing tags

Test coverage:
- Add unit tests for the new AST nodes and parser
- Add comprehensive integration tests in TestIcebergTagManagement
- Test all DDL variants and error conditions
- Verify tag creation, replacement, deletion, and time-travel queries
- Validate retention policy configuration

This makes it possible to create semantic checkpoints in Iceberg tables
for consistent reads, rollback scenarios, and data governance workflows.

cla-bot bot commented Jun 23, 2025

Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla

@github-actions github-actions bot added the iceberg Iceberg connector label Jun 23, 2025
@martint
Member

martint commented Jun 24, 2025

There's a PR in progress to add support for branching: #25751

Also, please take a look at the document describing how we're approaching building this feature: https://docs.google.com/document/d/1jEF4IkWu-2Gzk5ii2Nb0exuEnAUeo98UbiM3i0xtgWQ/edit?tab=t.0#heading=h.dglxb51zw9m2

In particular, we're not planning to add direct support for tags in the engine in the foreseeable future, as those can be modeled as branches with no UPDATE or INSERT permissions.

@elmesaoudee
Author

Hello @martint, thanks for getting back to me, and for all the great work on branching and the detailed design doc you shared. I'd like to kick off a quick discussion around “tags” as a lightweight, immutable version of branches, and get your thoughts on it.

I’ve been working on this PR because my team increasingly needs an easy way to reference snapshots without fetching the snapshot ID every time. Some of our Iceberg tables ingest a 'live' version of the data every day and keep the history as snapshots. Because we frequently need to query a table in an earlier state, we use Spark to create tags (named after the ingestion date) that reference these snapshots. This is very convenient: tags are just aliases for snapshot IDs, so we don't have to look up the refs before querying a snapshot.

As I see it, tags are:

  • Immutable bookmarks: once you point a tag at a snapshot, it never moves (unlike branches, which you can fast-forward or reset).
  • Retention-aware: RETAIN <n> DAYS lets a tag, and the snapshot it keeps alive, age out automatically, which feels like a natural fit for audit checks or backfills.
  • Minimal API surface: only touches grammar, parser, connector metadata, and Iceberg’s tag API. No session-level state or new permission bits.

In other words, tags are 'almost' already there by design in Trino for Iceberg, without needing the full machinery of “branch” semantics. They also make intent crystal-clear: “I just want a fixed pointer to that exact table state,” instead of “I’m opening a living branch where I might write data.” I totally get that branches can model tags by simply never writing to them, but that blurs the mental model: tags are about historical snapshots, while branches are about development lines.

I would love to hear:

  • Your thoughts on whether an Iceberg-specific “tags” primitive like this feels like a worthwhile addition.
  • Any gaps I’m missing where branches already cover these scenarios cleanly.
  • Suggestions on how to document or flag this as “experimental” or "ephemeral", so that people can try it out without committing core Trino to this kind of tagging forever.

Looking forward to feedback! 😊
