[WIP] Add Iceberg TAG management support #26050
base: master
Implement CREATE/REPLACE/DROP TAG operations for Iceberg tables following Trino DDL syntax. Tags provide named references to specific table snapshots with optional retention policies.

Grammar changes:
- Add new tokens (TAG, DAYS, RETAIN) to SqlBase.g4
- Add three new ALTER TABLE statements for tag operations:
  * CREATE [OR REPLACE] TAG [IF NOT EXISTS] <name> [FOR VERSION AS OF <snapshot>] [RETAIN <n> DAYS]
  * REPLACE TAG <name> [FOR VERSION AS OF <snapshot>] [RETAIN <n> DAYS]
  * DROP TAG [IF EXISTS] <name>

AST and parser implementation:
- Create CreateTag, ReplaceTag, DropTag AST node classes
- Add visitor methods in AstVisitor and AstBuilder
- Implement proper equals/hashCode and visitor patterns

Iceberg connector implementation:
- Add createTag, replaceTag, dropTag methods to IcebergMetadata
- Integrate with Iceberg's native TAG API
- Support snapshot targeting and retention configuration
- Handle error cases for duplicate/missing tags

Test coverage:
- Add unit tests for the new AST nodes and parser
- Add comprehensive integration tests in TestIcebergTagManagement
- Test all DDL variants and error conditions
- Verify tag creation, replacement, deletion, and time-travel queries
- Validate retention policy configuration

This makes it possible to create semantic checkpoints in Iceberg tables for consistent reads, rollback scenarios, and data governance workflows.
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla
There's a PR in progress to add support for branching: #25751

Also, please take a look at the document describing how we're approaching building this feature: https://docs.google.com/document/d/1jEF4IkWu-2Gzk5ii2Nb0exuEnAUeo98UbiM3i0xtgWQ/edit?tab=t.0#heading=h.dglxb51zw9m2

In particular, we're not planning to add direct support for tags in the engine in the foreseeable future, as those can be modeled as branches with no UPDATE or INSERT permissions.
Hello @martint, thanks for getting back to me and for all the great work on branching and the detailed design doc you shared with me.

I wanted to kick off a quick discussion around “tags” as a lightweight, immutable version of branches, and pick your brains about it.

I’ve been working on this PR because my team increasingly needs an easy way to reference snapshots without fetching the snapshot ID every time. Some of our Iceberg tables ingest a 'live' version of the data every day and keep its history as snapshots. Spark lets us create tags (named after the date of ingestion) to reference these snapshots, because we frequently need to query the table in an earlier state. This is very convenient: tags are just aliases for snapshot IDs, so we don't have to fetch the refs before querying a snapshot.

My point is that tags are:
My point is tags are 'almost' already there by design in Trino for Iceberg, without needing the full machinery of “branch” semantics. It also makes intent crystal-clear: “I just want a fixed pointer to that exact table state,” instead of “I’m opening a living branch where I might write data.”

I totally get that branches can model tags by simply never writing to them, but it blurs the mental model in the sense that tags are about historical snapshots and branches are about development lines.

I would love to hear:

Looking forward to feedback! 😊
Description
This PR implements TAG management support for Iceberg tables in Trino. It adds new DDL syntax to create, replace, and drop named references to specific table snapshots. Tags provide semantic checkpoints that enable consistent reads and rollback scenarios by letting users reference a specific version of an Iceberg table by name rather than by timestamp or snapshot ID.
The implementation adds three new ALTER TABLE statements following Trino DDL conventions:
- CREATE [OR REPLACE] TAG [IF NOT EXISTS] <name> [FOR VERSION AS OF <snapshot>] [RETAIN <n> DAYS]
- REPLACE TAG <name> [FOR VERSION AS OF <snapshot>] [RETAIN <n> DAYS]
- DROP TAG [IF EXISTS] <name>
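To make the intended behavior of the three statements concrete, here is a minimal in-memory sketch in Java. It is not code from this PR and does not use Trino or Iceberg classes: `TagRegistry`, `Tag`, and every method name are hypothetical. It also assumes two things the description does not state: that combining OR REPLACE with IF NOT EXISTS is rejected, and that RETAIN <n> DAYS is stored as a maximum reference age in milliseconds.

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.OptionalLong;

// Hypothetical in-memory model of the tag DDL semantics; not Trino or Iceberg code.
final class TagRegistry
{
    // A tag binds a name to one snapshot ID, with an optional maximum age.
    static final class Tag
    {
        final long snapshotId;
        final OptionalLong maxRefAgeMs;

        Tag(long snapshotId, OptionalLong maxRefAgeMs)
        {
            this.snapshotId = snapshotId;
            this.maxRefAgeMs = maxRefAgeMs;
        }
    }

    private final Map<String, Tag> tags = new HashMap<>();

    // CREATE [OR REPLACE] TAG [IF NOT EXISTS] <name> ...
    void createTag(String name, long snapshotId, OptionalLong retainDays, boolean orReplace, boolean ifNotExists)
    {
        if (orReplace && ifNotExists) {
            // Assumption: the two clauses are mutually exclusive.
            throw new IllegalArgumentException("OR REPLACE and IF NOT EXISTS cannot be combined");
        }
        if (tags.containsKey(name) && !orReplace) {
            if (ifNotExists) {
                return; // the existing tag is left untouched
            }
            throw new IllegalStateException("Tag already exists: " + name);
        }
        tags.put(name, new Tag(snapshotId, toMillis(retainDays)));
    }

    // REPLACE TAG <name> ...: the tag must already exist.
    void replaceTag(String name, long snapshotId, OptionalLong retainDays)
    {
        if (!tags.containsKey(name)) {
            throw new IllegalStateException("Tag does not exist: " + name);
        }
        tags.put(name, new Tag(snapshotId, toMillis(retainDays)));
    }

    // DROP TAG [IF EXISTS] <name>
    void dropTag(String name, boolean ifExists)
    {
        if (tags.remove(name) == null && !ifExists) {
            throw new IllegalStateException("Tag does not exist: " + name);
        }
    }

    Optional<Tag> get(String name)
    {
        return Optional.ofNullable(tags.get(name));
    }

    // RETAIN <n> DAYS expressed in milliseconds (assumed storage unit).
    private static OptionalLong toMillis(OptionalLong days)
    {
        return days.isPresent()
                ? OptionalLong.of(Duration.ofDays(days.getAsLong()).toMillis())
                : OptionalLong.empty();
    }
}
```

In this model, creating a tag that already exists fails unless IF NOT EXISTS or OR REPLACE is given, REPLACE TAG fails for a missing tag, and DROP TAG fails for a missing tag unless IF EXISTS is given.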
Usage Examples:
Additional context and related issues
The implementation spans multiple modules of the Trino project:

Grammar and Parser Changes:
- SqlBase.g4 with new tokens (TAG, DAYS, RETAIN) and grammar rules for tag operations

AST Implementation:
- AST node classes (CreateTag, ReplaceTag, DropTag) with proper visitor pattern support
- Visitor methods in AstVisitor and AstBuilder

Iceberg Connector Integration:

Testing:
- Integration tests in TestIcebergTagManagement covering all DDL variants

The implementation follows Iceberg's tag semantics while maintaining consistency with Trino's existing DDL patterns.
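The AST work described above (node classes with equals/hashCode plus visitor dispatch) can be sketched roughly as follows. This is an illustration only: `TagStatement`, `TagStatementVisitor`, `CreateTagStatement`, `DropTagStatement`, and `TagSqlFormatter` are hypothetical, simplified stand-ins, not the PR's CreateTag/DropTag classes or Trino's actual Node and AstVisitor types.

```java
import java.util.Objects;
import java.util.OptionalLong;

// Hypothetical, simplified stand-ins for AST nodes; the real Trino types differ.
abstract class TagStatement
{
    abstract <R, C> R accept(TagStatementVisitor<R, C> visitor, C context);
}

interface TagStatementVisitor<R, C>
{
    R visitCreateTag(CreateTagStatement node, C context);

    R visitDropTag(DropTagStatement node, C context);
}

final class CreateTagStatement extends TagStatement
{
    final String name;
    final OptionalLong snapshotId;
    final OptionalLong retainDays;

    CreateTagStatement(String name, OptionalLong snapshotId, OptionalLong retainDays)
    {
        this.name = Objects.requireNonNull(name, "name is null");
        this.snapshotId = Objects.requireNonNull(snapshotId, "snapshotId is null");
        this.retainDays = Objects.requireNonNull(retainDays, "retainDays is null");
    }

    @Override
    <R, C> R accept(TagStatementVisitor<R, C> visitor, C context)
    {
        return visitor.visitCreateTag(this, context);
    }

    // Value equality, so parser tests can compare expected and actual trees.
    @Override
    public boolean equals(Object obj)
    {
        if (this == obj) {
            return true;
        }
        if (!(obj instanceof CreateTagStatement)) {
            return false;
        }
        CreateTagStatement other = (CreateTagStatement) obj;
        return name.equals(other.name)
                && snapshotId.equals(other.snapshotId)
                && retainDays.equals(other.retainDays);
    }

    @Override
    public int hashCode()
    {
        return Objects.hash(name, snapshotId, retainDays);
    }
}

final class DropTagStatement extends TagStatement
{
    final String name;
    final boolean ifExists;

    DropTagStatement(String name, boolean ifExists)
    {
        this.name = Objects.requireNonNull(name, "name is null");
        this.ifExists = ifExists;
    }

    @Override
    <R, C> R accept(TagStatementVisitor<R, C> visitor, C context)
    {
        return visitor.visitDropTag(this, context);
    }
}

// Example visitor: renders a node back into the tag clause it represents.
final class TagSqlFormatter implements TagStatementVisitor<String, Void>
{
    @Override
    public String visitCreateTag(CreateTagStatement node, Void context)
    {
        StringBuilder sql = new StringBuilder("CREATE TAG ").append(node.name);
        node.snapshotId.ifPresent(id -> sql.append(" FOR VERSION AS OF ").append(id));
        node.retainDays.ifPresent(days -> sql.append(" RETAIN ").append(days).append(" DAYS"));
        return sql.toString();
    }

    @Override
    public String visitDropTag(DropTagStatement node, Void context)
    {
        return "DROP TAG " + (node.ifExists ? "IF EXISTS " : "") + node.name;
    }
}
```

Calling `node.accept(new TagSqlFormatter(), null)` dispatches to the method matching the node's concrete type, which is the same double-dispatch idea the PR relies on when it adds visitor methods for the new nodes.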
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text: