## Introduction

AWS S3 supports the `SelectObjectContent` method, which runs SQL queries directly on S3 objects that contain data in CSV, JSON, or Parquet format.

Reference: https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-select-sql-reference-select.html

Unfortunately, the language is very primitive. It does not support ORDER BY or GROUP BY; it offers only filtering with WHERE, using a small set of functions, and the LIMIT clause.

That's why it cannot complete ClickBench.

The performance is atrocious, and the usability is dubious. It is pointless to use, even if you only want to pre-filter data by some conditions before further processing.

## Comparison

AWS S3 Select:

```
time aws s3api select-object-content --bucket clickhouse-public-datasets --key 'hits_compatible/hits.parquet' --expression "SELECT CounterID, SearchPhrase FROM S3Object WHERE SearchPhrase LIKE '%трешбоксарский%'" --expression-type SQL --input-serialization '{"Parquet": {}}' --output-serialization '{"CSV": {}}' /dev/stdout
1634,прировочный счёт трешбоксарский лабор для железневые в гаражных расписатель

real 0m33.796s
user 0m0.842s
sys 0m0.091s
```

ClickHouse:

```
time ch -q "SELECT CounterID, SearchPhrase FROM s3('s3://clickhouse-public-datasets/hits_compatible/hits.parquet') WHERE SearchPhrase LIKE '%трешбоксарский%'"
1634 прировочный счёт трешбоксарский лабор для железневые в гаражных расписатель

real 0m3.526s
user 0m7.248s
sys 0m1.314s
```

We can see that ClickHouse is almost ten times faster (about 3.5 s versus 33.8 s of wall-clock time), despite the need for client-side processing.
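
As a quick check of that ratio, the `real` (wall-clock) times reported above can be divided directly. This is only a sketch: the variable names are illustrative, and the two values are copied from the single runs shown in this section, so the result says nothing about run-to-run variance.

```shell
#!/bin/sh
# Wall-clock ("real") times in seconds, copied from the two runs above.
s3_select_real=33.796
clickhouse_real=3.526

# POSIX sh has no floating-point arithmetic, so delegate the division to awk.
speedup=$(awk -v a="$s3_select_real" -v b="$clickhouse_real" 'BEGIN { printf "%.1f", a / b }')

echo "ClickHouse finished about ${speedup}x faster than S3 Select"
```

Note that the `user` and `sys` figures go the other way: ClickHouse spends more local CPU time, which is expected because it does the filtering on the client.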

## Caveats

Some invalid queries just hang instead of returning an error:

```
aws s3api select-object-content --bucket clickhouse-public-datasets --key 'hits_compatible/hits.parquet' --expression "SELECT CounterID, count(*) FROM S3Object WHERE SearchPhrase LIKE '%test%'" --expression-type SQL --input-serialization '{"Parquet": {}}' --output-serialization '{"CSV": {}}' -
```

When a query does return an error, the message is unhelpful:

```
aws s3api select-object-content --bucket clickhouse-public-datasets --key 'hits_compatible/hits.parquet' --expression "SELECT CounterID, count(*) FROM S3Object GROUP BY CounterID ORDER BY count(*) DESC LIMIT 10" --expression-type SQL --input-serialization '{"Parquet": {}}' --output-serialization '{"CSV": {}}' -

An error occurred (ParseUnexpectedToken) when calling the SelectObjectContent operation: Unexpected token found KEYWORD:UNKNOWN at line 1, column 61.
```

## Alternatives

You can use ClickHouse in AWS Lambda: https://github.com/aws-samples/aws-lambda-clickhouse

This project was created by AWS engineers.