diff --git a/docs/data-modeling/projections/1_projections.md b/docs/data-modeling/projections/1_projections.md index 8a443057f77..0ac1fdcf3db 100644 --- a/docs/data-modeling/projections/1_projections.md +++ b/docs/data-modeling/projections/1_projections.md @@ -49,7 +49,26 @@ ClickHouse automatically samples the primary keys and chooses a table that can generate the same correct result, but requires the least amount of data to be read as shown in the figure below: -Projections in ClickHouse +Projections in ClickHouse + +### Smarter storage with `_part_offset` + +Since version 25.5, ClickHouse supports the virtual column `_part_offset` in +projections which offers a new way to define a projection. + +There are now two ways to define a projection: + +- **Store full columns (the original behavior)**: The projection contains full + data and can be read directly, offering faster performance when filters match + the projection’s sort order. + +- **Store only the sorting key + `_part_offset`**: The projection works like an index. + ClickHouse uses the projection’s primary index to locate matching rows, but reads the + actual data from the base table. This reduces storage overhead at the cost of + slightly more I/O at query time. + +The approaches above can also be mixed, storing some columns in the projection and +others indirectly via `_part_offset`. ## When to use Projections? {#when-to-use-projections} @@ -68,8 +87,6 @@ users should be aware of and thus should be deployed sparingly. - Projections don't allow using different TTL for the source table and the (hidden) target table, materialized views allow different TTLs. -- Projections don't currently support `optimize_read_in_order` for the (hidden) - target table. - Lightweight updates and deletes are not supported for tables with projections. - Materialized Views can be chained: the target table of one Materialized View can be the source table of another Materialized View, and so on. This is not @@ -85,7 +102,7 @@ We recommend using projections when: to exploit projections that use a simple reordering, i.e., `SELECT * ORDER BY x`. Users can select a subset of columns in this expression to reduce storage footprint. -- Users are comfortable with the associated increase in storage footprint and +- Users are comfortable with the potential associated increase in storage footprint and overhead of writing data twice. Test the impact on insertion speed and [evaluate the storage overhead](/data-compression/compression-in-clickhouse). @@ -290,7 +307,7 @@ becomes `AggregatingMergeTree`, and all aggregate functions are converted to The figure below is a visualization of the main table `uk_price_paid_with_projections` and its two projections: -Visualization of the main table uk_price_paid_with_projections and its two projections +Visualization of the main table uk_price_paid_with_projections and its two projections If we now run the query that lists the counties in London for the three highest paid prices again, we see an improvement in query performance: @@ -516,6 +533,125 @@ LIMIT 100 Again, the result is the same but notice the improvement in query performance for the 2nd query. +### Combining projections in one query {#combining-projections} + +Starting in version 25.6, building on the `_part_offset` support introduced in +the previous version, ClickHouse can now use multiple projections to accelerate +a single query with multiple filters. + +Importantly, ClickHouse still reads data from only one projection (or the base table), +but can use other projections' primary indexes to prune unnecessary parts before reading. +This is especially useful for queries that filter on multiple columns, each +potentially matching a different projection. + +> Currently, this mechanism only prunes entire parts. Granule-level pruning is + not yet supported. + +To demonstrate this, we define the table (with projections using `_part_offset` columns) +and insert five example rows matching the diagrams above. + +```sql +CREATE TABLE page_views +( + id UInt64, + event_date Date, + user_id UInt32, + url String, + region String, + PROJECTION region_proj + ( + SELECT _part_offset ORDER BY region + ), + PROJECTION user_id_proj + ( + SELECT _part_offset ORDER BY user_id + ) +) +ENGINE = MergeTree +ORDER BY (event_date, id); +SETTINGS + index_granularity = 1, -- one row per granule + max_bytes_to_merge_at_max_space_in_pool = 1; -- disable merge +``` + +Then we insert data into the table: + +```sql +INSERT INTO page_views VALUES ( +1, '2025-07-01', 101, 'https://example.com/page1', 'europe'); +INSERT INTO page_views VALUES ( +2, '2025-07-01', 102, 'https://example.com/page2', 'us_west'); +INSERT INTO page_views VALUES ( +3, '2025-07-02', 106, 'https://example.com/page3', 'us_west'); +INSERT INTO page_views VALUES ( +4, '2025-07-02', 107, 'https://example.com/page4', 'us_west'); +INSERT INTO page_views VALUES ( +5, '2025-07-03', 104, 'https://example.com/page5', 'asia'); +``` + +:::note +Note: The table uses custom settings for illustration, such as one-row granules +and disabled part merges, which are not recommended for production use. +::: + +This setup produces: +- Five separate parts (one per inserted row) +- One primary index entry per row (in the base table and each projection) +- Each part contains exactly one row + +With this setup, we run a query filtering on both `region` and `user_id`. +Since the base table’s primary index is built from `event_date` and `id`, it +is unhelpful here, ClickHouse therefore uses: + +- `region_proj` to prune parts by region +- `user_id_proj` to further prune by `user_id` + +This behavior is visible using `EXPLAIN projections = 1`, which shows how +ClickHouse selects and applies projections. + +```sql +EXPLAIN projections=1 +SELECT * FROM page_views WHERE region = 'us_west' AND user_id = 107; +``` + +```response + ┌─explain────────────────────────────────────────────────────────────────────────────────┐ + 1. │ Expression ((Project names + Projection)) │ + 2. │ Expression │ + 3. │ ReadFromMergeTree (default.page_views) │ + 4. │ Projections: │ + 5. │ Name: region_proj │ + 6. │ Description: Projection has been analyzed and is used for part-level filtering │ + 7. │ Condition: (region in ['us_west', 'us_west']) │ + 8. │ Search Algorithm: binary search │ + 9. │ Parts: 3 │ +10. │ Marks: 3 │ +11. │ Ranges: 3 │ +12. │ Rows: 3 │ +13. │ Filtered Parts: 2 │ +14. │ Name: user_id_proj │ +15. │ Description: Projection has been analyzed and is used for part-level filtering │ +16. │ Condition: (user_id in [107, 107]) │ +17. │ Search Algorithm: binary search │ +18. │ Parts: 1 │ +19. │ Marks: 1 │ +20. │ Ranges: 1 │ +21. │ Rows: 1 │ +22. │ Filtered Parts: 2 │ + └────────────────────────────────────────────────────────────────────────────────────────┘ +``` + +The `EXPLAIN` output (shown above) reveals the logical query plan, top to bottom: + +| Row number | Description | +|------------|----------------------------------------------------------------------------------------------------------| +| 3 | Plans to read from the `page_views` base table | +| 5-13 | Uses `region_proj` to identify 3 parts where region = 'us_west', pruning 2 of the 5 parts | +| 14-22 | Uses user`_id_proj` to identify 1 part where `user_id = 107`, further pruning 2 of the 3 remaining parts | + +In the end, just **1 out of 5 parts** is read from the base table. +By combining the index analysis of multiple projections, ClickHouse significantly reduces the amount of data scanned, +improving performance while keeping storage overhead low. ## Related content {#related-content} - [A Practical Introduction to Primary Indexes in ClickHouse](/guides/best-practices/sparse-primary-indexes#option-3-projections) diff --git a/docs/data-modeling/projections/2_materialized-views-versus-projections.md b/docs/data-modeling/projections/2_materialized-views-versus-projections.md index 04fed539292..9819dbf5278 100644 --- a/docs/data-modeling/projections/2_materialized-views-versus-projections.md +++ b/docs/data-modeling/projections/2_materialized-views-versus-projections.md @@ -14,21 +14,23 @@ may want to pick one over the other in certain scenarios. The table below summarizes the key differences between materialized views and projections for various aspects of consideration. -| Aspect | Materialized views | Projections | -|------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| Data storage and location | Store their results in a **separate, explicit target table**, acting as insert triggers, on insert to a source table. | Projections create optimized data layouts that are physically **stored alongside the main table data** and are invisible to the user. | -| Update mechanism | Operate **synchronously** on `INSERT` to the source table (for incremental materialized views). Note: they can also be **scheduled** using refreshable materialized views. | **Asynchronous** updates in the background upon `INSERT` to the main table. | -| Query interaction | Working with Materialized Views requires querying the **target table directly**, meaning that users need to be aware of the existence of materialized views when writing queries. | Projections are **automatically selected** by ClickHouse's query optimizer, and are transparent in the sense that the user does not have to modify their queries to the table with the projection in order to utilise it. From version 25.6 it is also possible to filter by more than one projection. | -| Handling `UPDATE` / `DELETE` | **Do not automatically react** to `UPDATE` or `DELETE` operations on the source table as materialized views have no knowledge of the source table, acting only as insert triggers _to_ a source table. This can lead to potential data staleness between source and target tables and requires workarounds or periodic full refresh. (via refreshable materialized view). | By default, are **incompatible with `DELETED` rows** (especially lightweight deletes). `lightweight_mutation_projection_mode` (v24.7+) can enable compatibility. | -| `JOIN` support | Yes. Refreshable materialized views can be used for complex denormalization. Incremental materialized views only trigger on left-most table inserts. | No. `JOIN` operations are not supported within projection definitions for filtering the materialised data. | -| `WHERE` clause in definition | Yes. `WHERE` clauses can be included to filter data before materialization. | No. `WHERE` clauses are not supported within projection definitions for filtering the materialized data. | -| Chaining capabilities | Yes, the target table of one materialized view can be the source for another materialized view, enabling multi-stage pipelines. | No. Projections cannot be chained. | -| Applicable table engines | Can be used with various source table engines, but target tables are usually of the `MergeTree` family. | **Only available** for `MergeTree` family table engines. | -| Failure handling | Failure during data insertion means that data is lost in the target table, leading to potential inconsistency. | Failures are handled **silently** in the background. Queries can seamlessly mix materialized and unmaterialized parts. | -| Operational overhead | Requires explicit target table creation and often manual backfilling. Managing consistency with `UPDATE`/`DELETE` increases complexity. | Projections are automatically maintained and kept-in-sync and generally have a lower operational burden. | -| `FINAL` query compatibility | Generally compatible, but often require `GROUP BY` on the target table. | **Do not work** with `FINAL` queries. | -| Lazy materialization | Yes. | Monitor for projection compatibility issues when using materialization features. You may need to set `query_plan_optimize_lazy_materialization = false` | -| Parallel replicas | Yes. | No. | +| Aspect | Materialized views | Projections | +|----------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| Data storage and location | Store their results in a **separate, explicit target table**, acting as insert triggers, on insert to a source table. | Projections create optimized data layouts that are physically **stored alongside the main table data** and are invisible to the user. | +| Update mechanism | Operate **synchronously** on `INSERT` to the source table (for incremental materialized views). Note: they can also be **scheduled** using refreshable materialized views. | **Asynchronous** updates in the background upon `INSERT` to the main table. | +| Query interaction | Working with Materialized Views requires querying the **target table directly**, meaning that users need to be aware of the existence of materialized views when writing queries. | Projections are **automatically selected** by ClickHouse's query optimizer, and are transparent in the sense that the user does not have to modify their queries to the table with the projection in order to utilise it. From version 25.6 it is also possible to filter by more than one projection. | +| Handling `UPDATE` / `DELETE` | **Do not automatically react** to `UPDATE` or `DELETE` operations on the source table as materialized views have no knowledge of the source table, acting only as insert triggers _to_ a source table. This can lead to potential data staleness between source and target tables and requires workarounds or periodic full refresh. (via refreshable materialized view). | By default, are **incompatible with `DELETED` rows** (especially lightweight deletes). `lightweight_mutation_projection_mode` (v24.7+) can enable compatibility. | +| `JOIN` support | Yes. Refreshable materialized views can be used for complex denormalization. Incremental materialized views only trigger on left-most table inserts. | No. `JOIN` operations are not supported within projection definitions for filtering the materialised data. | +| `WHERE` clause in definition | Yes. `WHERE` clauses can be included to filter data before materialization. | No. `WHERE` clauses are not supported within projection definitions for filtering the materialized data. | +| Chaining capabilities | Yes, the target table of one materialized view can be the source for another materialized view, enabling multi-stage pipelines. | No. Projections cannot be chained. | +| Applicable table engines | Can be used with various source table engines, but target tables are usually of the `MergeTree` family. | **Only available** for `MergeTree` family table engines. | +| Failure handling | Failure during data insertion means that data is lost in the target table, leading to potential inconsistency. | Failures are handled **silently** in the background. Queries can seamlessly mix materialized and unmaterialized parts. | +| Operational overhead | Requires explicit target table creation and often manual backfilling. Managing consistency with `UPDATE`/`DELETE` increases complexity. | Projections are automatically maintained and kept-in-sync and generally have a lower operational burden. | +| `FINAL` query compatibility | Generally compatible, but often require `GROUP BY` on the target table. | **Do not work** with `FINAL` queries. | +| Lazy materialization | Yes. | Monitor for projection compatibility issues when using materialization features. You may need to set `query_plan_optimize_lazy_materialization = false` | +| Parallel replicas | Yes. | No. | +| [`optimize_read_in_order`](/operations/settings/settings#optimize_read_in_order) | Yes. | Yes. | +| Lightweight updates and deletes | Yes. | No. | ## Comparing materialized views and projections {#choose-between} diff --git a/docs/migrations/bigquery/migrating-to-clickhouse-cloud.md b/docs/migrations/bigquery/migrating-to-clickhouse-cloud.md index 42203d859f3..a146df70e1d 100644 --- a/docs/migrations/bigquery/migrating-to-clickhouse-cloud.md +++ b/docs/migrations/bigquery/migrating-to-clickhouse-cloud.md @@ -248,7 +248,11 @@ Important: Ensure your partitioning key expression does not result in a high car ClickHouse's concept of projections allows users to specify multiple `ORDER BY` clauses for a table. -In [ClickHouse data modeling](/data-modeling/schema-design), we explore how materialized views can be used in ClickHouse to pre-compute aggregations, transform rows, and optimize queries for different access patterns. For the latter, we [provided an example](/materialized-view/incremental-materialized-view#lookup-table) where the materialized view sends rows to a target table with a different ordering key than the original table receiving inserts. +In [ClickHouse data modeling](/data-modeling/schema-design), we explore how materialized views can be used +in ClickHouse to pre-compute aggregations, transform rows, and optimize queries +for different access patterns. For the latter, we [provided an example](/materialized-view/incremental-materialized-view#lookup-table) where +the materialized view sends rows to a target table with a different ordering key +to the original table receiving inserts. For example, consider the following query: @@ -260,12 +264,15 @@ WHERE UserId = 8592047 ┌──────────avg(Score)─┐ │ 0.18181818181818182 │ └─────────────────────┘ - +--highlight-next-line 1 row in set. Elapsed: 0.040 sec. Processed 90.38 million rows, 361.59 MB (2.25 billion rows/s., 9.01 GB/s.) Peak memory usage: 201.93 MiB. ``` -This query requires all 90m rows to be scanned (admittedly quickly) as the `UserId` is not the ordering key. Previously, we solved this using a materialized view acting as a lookup for the `PostId`. The same problem can be solved with a projection. The command below adds a projection for the `ORDER BY user_id`. +This query requires all 90m rows to be scanned (albeit quickly) as the `UserId` +is not the ordering key. Previously, we solved this using a materialized view +acting as a lookup for the `PostId`. The same problem can be solved with a projection. +The command below adds a projection with `ORDER BY user_id`. ```sql ALTER TABLE comments ADD PROJECTION comments_user_id ( @@ -275,35 +282,42 @@ SELECT * ORDER BY UserId ALTER TABLE comments MATERIALIZE PROJECTION comments_user_id ``` -Note that we have to first create the projection and then materialize it. This latter command causes the data to be stored twice on disk in two different orders. The projection can also be defined when the data is created, as shown below, and will be automatically maintained as data is inserted. +Note that we have to first create the projection and then materialize it. +This latter command causes the data to be stored twice on disk in two different +orders. The projection can also be defined when the data is created, as shown below, +and will be automatically maintained as data is inserted. ```sql CREATE TABLE comments ( - `Id` UInt32, - `PostId` UInt32, - `Score` UInt16, - `Text` String, - `CreationDate` DateTime64(3, 'UTC'), - `UserId` Int32, - `UserDisplayName` LowCardinality(String), - PROJECTION comments_user_id - ( - SELECT * - ORDER BY UserId - ) + `Id` UInt32, + `PostId` UInt32, + `Score` UInt16, + `Text` String, + `CreationDate` DateTime64(3, 'UTC'), + `UserId` Int32, + `UserDisplayName` LowCardinality(String), + --highlight-begin + PROJECTION comments_user_id + ( + SELECT * + ORDER BY UserId + ) + --highlight-end ) ENGINE = MergeTree ORDER BY PostId ``` -If the projection is created via an `ALTER` command, the creation is asynchronous when the `MATERIALIZE PROJECTION` command is issued. Users can confirm the progress of this operation with the following query, waiting for `is_done=1`. +If the projection is created via an `ALTER` command, the creation is asynchronous +when the `MATERIALIZE PROJECTION` command is issued. Users can confirm the progress +of this operation with the following query, waiting for `is_done=1`. ```sql SELECT - parts_to_do, - is_done, - latest_fail_reason + parts_to_do, + is_done, + latest_fail_reason FROM system.mutations WHERE (`table` = 'comments') AND (command LIKE '%MATERIALIZE%') @@ -314,7 +328,8 @@ WHERE (`table` = 'comments') AND (command LIKE '%MATERIALIZE%') 1 row in set. Elapsed: 0.003 sec. ``` -If we repeat the above query, we can see performance has improved significantly at the expense of additional storage. +If we repeat the above query, we can see performance has improved significantly +at the expense of additional storage. ```sql SELECT avg(Score) @@ -324,7 +339,7 @@ WHERE UserId = 8592047 ┌──────────avg(Score)─┐ 1. │ 0.18181818181818182 │ └─────────────────────┘ - +--highlight-next-line 1 row in set. Elapsed: 0.008 sec. Processed 16.36 thousand rows, 98.17 KB (2.15 million rows/s., 12.92 MB/s.) Peak memory usage: 4.06 MiB. ``` @@ -356,20 +371,21 @@ WHERE UserId = 8592047 ### When to use projections {#when-to-use-projections} -Projections are an appealing feature for new users as they are automatically maintained as data is inserted. Furthermore, queries can just be sent to a single table where the projections are exploited where possible to speed up the response time. +Projections are an appealing feature for new users as they are automatically +maintained as data is inserted. Furthermore, queries can just be sent to a single +table where the projections are exploited where possible to speed up the response +time. Projections -This is in contrast to materialized views, where the user has to select the appropriate optimized target table or rewrite their query, depending on the filters. This places greater emphasis on user applications and increases client-side complexity. - -Despite these advantages, projections come with some inherent limitations which users should be aware of and thus should be deployed sparingly: +This is in contrast to materialized views, where the user has to select the +appropriate optimized target table or rewrite their query, depending on the filters. +This places greater emphasis on user applications and increases client-side +complexity. -- Projections don't allow using different TTL for the source table and the (hidden) - target table. Materialized views allow different TTLs. -- Projections [don't currently support `optimize_read_in_order`](https://clickhouse.com/blog/clickhouse-faster-queries-with-projections-and-primary-indexes) for the (hidden) target table. -- Lightweight updates and deletes are not supported for tables with projections. -- Materialized views can be chained: the target table of one materialized view can be the source table of another materialized view, and so on. This is not possible with projections. -- Projections don't support joins; materialized views do. -- Projections don't support filters (`WHERE` clause); materialized views do. +Despite these advantages, projections come with some inherent limitations which +users should be aware of and thus should be deployed sparingly. For further +details see ["materialized views versus projections"](/managing-data/materialized-views-versus-projections) We recommend using projections when: diff --git a/docs/migrations/postgres/data-modeling-techniques.md b/docs/migrations/postgres/data-modeling-techniques.md index 4a6d57a486d..3e34f294cf1 100644 --- a/docs/migrations/postgres/data-modeling-techniques.md +++ b/docs/migrations/postgres/data-modeling-techniques.md @@ -241,18 +241,37 @@ WHERE UserId = 8592047 ### When to use projections {#when-to-use-projections} -Projections are an appealing feature for new users as they are automatically maintained as data is inserted. Furthermore, queries can just be sent to a single table where the projections are exploited where possible to speed up the response time. +Projections are an appealing feature for new users as they are automatically +maintained as data is inserted. Furthermore, queries can just be sent to a single +table where the projections are exploited where possible to speed up the response +time. PostgreSQL projections in ClickHouse -This is in contrast to materialized views, where the user has to select the appropriate optimized target table or rewrite their query, depending on the filters. This places greater emphasis on user applications and increases client-side complexity. +This is in contrast to materialized views, where the user has to select the +appropriate optimized target table or rewrite their query, depending on the filters. +This places greater emphasis on user applications and increases client-side complexity. -Despite these advantages, projections come with some [inherent limitations](/data-modeling/projections#when-to-use-projections) which users should be aware of and thus should be deployed sparingly. +Despite these advantages, projections come with some [inherent limitations](/data-modeling/projections#when-to-use-projections) +which users should be aware of and thus should be deployed sparingly. We recommend using projections when: -- A complete reordering of the data is required. While the expression in the projection can, in theory, use a `GROUP BY,` materialized views are more effective for maintaining aggregates. The query optimizer is also more likely to exploit projections that use a simple reordering, i.e., `SELECT * ORDER BY x`. Users can select a subset of columns in this expression to reduce storage footprint. -- Users are comfortable with the associated increase in storage footprint and overhead of writing data twice. Test the impact on insertion speed and [evaluate the storage overhead](/data-compression/compression-in-clickhouse). +- A complete reordering of the data is required. While the expression in the + projection can, in theory, use a `GROUP BY,` materialized views are more + effective for maintaining aggregates. The query optimizer is also more likely + to exploit projections that use a simple reordering, i.e., `SELECT * ORDER BY x`. + Users can select a subset of columns in this expression to reduce storage footprint. +- Users are comfortable with the associated increase in storage footprint and + overhead of writing data twice. Test the impact on insertion speed and + [evaluate the storage overhead](/data-compression/compression-in-clickhouse). + +:::note +Since version 25.5, ClickHouse supports the virtual column `_part_offset` in +projections. This unlocks a more space-efficient way to store projections. + +For more details see ["Projections"](/data-modeling/projections) +::: ## Denormalization {#denormalization}