Best practices for memory-efficient deduplication of pre-sorted Parquet files #16776
Replies: 6 comments 6 replies
-
👋 Given your description, I am surprised that this query is using a HashAggregateStream -- the hash aggregate needs to buffer the entire dataset in RAM / spill it, which is likely why it is running out of memory. Given that the data is sorted by col_1 and col_2, I would expect this query to use the streaming aggregate operator (which should not need much memory at all).

What does the plan look like for this:

```sql
EXPLAIN SELECT
    col_1,
    col_2,
    first_value(col_3) AS col_3,
    first_value(col_4) AS col_4
FROM
    example
GROUP BY
    col_1, col_2
ORDER BY
    col_1, col_2
```

Do you get a different operator when you remove the first/last value aggregates?

```sql
EXPLAIN SELECT
    col_1,
    col_2 -- NOTE: the first_value / last_value aggregates are removed
FROM
    example
GROUP BY
    col_1, col_2
ORDER BY
    col_1, col_2
```
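If the plan does not already pick a streaming (sorted) aggregation, one planner knob that may help it exploit the existing file ordering is `datafusion.optimizer.prefer_existing_sort`. A minimal sketch, assuming a DataFusion / datafusion-cli version where this option is available:

```sql
-- Prefer plans that keep the input's existing sort order rather than
-- repartitioning it away; this can let the aggregate run in sorted
-- (streaming) mode at the cost of some parallelism. Defaults to false.
SET datafusion.optimizer.prefer_existing_sort = true;

-- Re-check which aggregate operator the planner now chooses.
EXPLAIN SELECT col_1, col_2
FROM example
GROUP BY col_1, col_2
ORDER BY col_1, col_2;
```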
-
Addressing Question 1. The query plan for the original query:

```sql
CREATE EXTERNAL TABLE example (
    col_1 VARCHAR(50) NOT NULL,
    col_2 BIGINT NOT NULL,
    col_3 VARCHAR(50),
    col_4 VARCHAR(50),
    col_5 VARCHAR(50),
    col_6 VARCHAR(100) NOT NULL,
    col_7 VARCHAR(50),
    col_8 DOUBLE
)
WITH ORDER (col_1 ASC, col_2 ASC)
STORED AS PARQUET
LOCATION '/tmp/redacted/*.parquet';

EXPLAIN COPY (
    SELECT
        col_1,
        col_2,
        col_3,
        col_4,
        col_5,
        col_6,
        first_value(col_7) AS col_7,
        first_value(col_8) AS col_8
    FROM
        example
    GROUP BY
        col_1, col_2, col_3, col_4, col_5, col_6
    ORDER BY
        col_1 ASC, col_2 ASC
)
TO '/tmp/result.parquet'
STORED AS PARQUET
OPTIONS (compression 'zstd(1)');
```

The resulting query plan:
-
Addressing Question 2. It's not possible to remove the `first_value()` aggregates while still returning col_7 and col_8. Instead, I removed those two columns entirely:

```sql
CREATE EXTERNAL TABLE example (
    col_1 VARCHAR(50) NOT NULL,
    col_2 BIGINT NOT NULL,
    col_3 VARCHAR(50),
    col_4 VARCHAR(50),
    col_5 VARCHAR(50),
    col_6 VARCHAR(100) NOT NULL,
    col_7 VARCHAR(50),
    col_8 DOUBLE
)
WITH ORDER (col_1 ASC, col_2 ASC)
STORED AS PARQUET
LOCATION '/tmp/redacted/*.parquet';

COPY (
    SELECT
        col_1,
        col_2,
        col_3,
        col_4,
        col_5,
        col_6
    FROM
        example
    GROUP BY
        col_1, col_2, col_3, col_4, col_5, col_6
    ORDER BY
        col_1 ASC, col_2 ASC
)
TO '/tmp/result_part2.parquet'
STORED AS PARQUET
OPTIONS (compression 'zstd(1)');
```

The resulting query plan:
Executing the query results in:
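Since this second query has no aggregate expressions at all, grouping by every selected column is semantically just row deduplication, so it can equivalently be written with SELECT DISTINCT. A minimal sketch of the equivalent form:

```sql
-- Equivalent to the GROUP BY form above: a GROUP BY over all selected
-- columns with no aggregates simply removes duplicate rows.
SELECT DISTINCT
    col_1, col_2, col_3, col_4, col_5, col_6
FROM example
ORDER BY col_1 ASC, col_2 ASC;
```

Whether the planner produces a cheaper plan for this form is version-dependent and worth checking with EXPLAIN.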
-
The above results were obtained with the following setup:
-
I created a public Gist, which you can find here: https://gist.github.com/zheniasigayev/2e5e471c9070cfa685d938bced47aa7f. I confirmed that the two queries I provided in the discussion above produced the same query plan, and the same memory consumers, when run against the generated Parquet files.
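For anyone comparing memory consumers the same way, DataFusion's `EXPLAIN ANALYZE` can also help: it executes the query (so it is subject to the same memory limits) and annotates each operator in the physical plan with runtime metrics. A minimal sketch:

```sql
-- Runs the query to completion and reports per-operator metrics
-- (e.g. output_rows, elapsed_compute) alongside the physical plan.
EXPLAIN ANALYZE
SELECT col_1, col_2
FROM example
GROUP BY col_1, col_2
ORDER BY col_1, col_2;
```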
-
I created a GitHub issue with the relevant details summarized. See:
-
Hi DataFusion folks!
I'm trying to deduplicate pre-sorted data stored in several Parquet files.
Asks:
Any suggestions or recommendations on changes I can make to the query or configuration that could make deduplicating and merging several pre-sorted Parquet files more performant in a memory-constrained environment?
Context:
- I am consistently running out of memory with the error:
- I currently do this with a `GROUP BY` over the duplicate columns, using the `first_value()` aggregation function to select only the first occurrence.
- I run: `datafusion-cli -m 8G -d 50G --top-memory-consumers 25`
- I have explored the various configuration settings found here: https://datafusion.apache.org/user-guide/configs.html (a few memory-related candidates are sketched after this list).
- I've seen some related GitHub discussions, but this appears to be a unique use case:
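A few memory-related settings from that configuration page that may be worth experimenting with. This is a sketch only; the names come from the documented configuration list, but defaults and effects should be verified against the DataFusion version in use:

```sql
-- Smaller batches lower each operator's memory high-water mark,
-- usually at some throughput cost (default: 8192 rows).
SET datafusion.execution.batch_size = 2048;

-- Fewer partitions means fewer concurrent operators sharing the
-- memory pool (default: number of CPU cores).
SET datafusion.execution.target_partitions = 4;

-- Memory reserved for each spillable sort's merge phase; relevant
-- once the query starts spilling to disk.
SET datafusion.execution.sort_spill_reservation_bytes = 16777216;
```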