Commit 04d2283

Add S3 Select
1 parent 01ebb8b commit 04d2283

3 files changed, +62 -1 lines changed

README.md

+1-1
@@ -269,7 +269,7 @@ We also introduced the [Hardware Benchmark](https://benchmark.clickhouse.com/har
 - [ ] Apache Drill
 - [ ] Apache Kudu
 - [ ] Apache Kylin
-- [ ] S3 select command in AWS
+- [x] S3 select command in AWS
 - [x] Kinetica
 - [ ] YDB
 - [ ] OceanBase

brytlytdb/README.md

+2
@@ -3,3 +3,5 @@ An attempt to use their service resulted in a failure. It showed "Error: cannot
 (Update after 4 months) It did not happen.

 (Update after 8 months) It did not happen.
+
+(Update after 2 years) It did not happen.

s3select/README.md

+59
@@ -0,0 +1,59 @@
## Introduction

AWS S3 supports the `SelectObjectContent` method, which lets you run SQL queries directly on S3 objects that contain data in CSV, JSON, or Parquet format.

Reference: https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-select-sql-reference-select.html

Unfortunately, the language is very primitive. It does not support ORDER BY or GROUP BY; it only supports filtering with WHERE using a primitive set of functions, plus the LIMIT clause.

That's why it cannot complete ClickBench.

The performance is atrocious, and the usability is dubious. It is pointless to use even if you only want to pre-filter data by some conditions before further processing.

## Comparison

AWS S3 Select:

```
time aws s3api select-object-content --bucket clickhouse-public-datasets --key 'hits_compatible/hits.parquet' --expression "SELECT CounterID, SearchPhrase FROM S3Object WHERE SearchPhrase LIKE '%трешбоксарский%'" --expression-type SQL --input-serialization '{"Parquet": {}}' --output-serialization '{"CSV": {}}' /dev/stdout
1634,прировочный счёт трешбоксарский лабор для железневые в гаражных расписатель

real 0m33.796s
user 0m0.842s
sys 0m0.091s
```

ClickHouse:

```
time ch -q "SELECT CounterID, SearchPhrase FROM s3('s3://clickhouse-public-datasets/hits_compatible/hits.parquet') WHERE SearchPhrase LIKE '%трешбоксарский%'"
1634 прировочный счёт трешбоксарский лабор для железневые в гаражных расписатель

real 0m3.526s
user 0m7.248s
sys 0m1.314s
```

We can see that ClickHouse is about ten times faster (3.5 s versus 33.8 s of wall-clock time), even though it has to process the data on the client side.
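
As a rough check on the "ten times" figure, the wall-clock (`real`) times from the two runs above can be divided directly. This is only a minimal sketch: the variable names are illustrative, and the numbers are copied from the single timing shown for each tool, so it is not a rigorous benchmark.

```shell
# Wall-clock ("real") times in seconds, copied from the two runs above.
s3_select_seconds=33.796
clickhouse_seconds=3.526

# The shell has no floating-point arithmetic, so let awk do the division.
speedup=$(awk -v slow="$s3_select_seconds" -v fast="$clickhouse_seconds" \
    'BEGIN { printf "%.1f", slow / fast }')

echo "ClickHouse finished ${speedup}x faster"
```

With these two measurements the ratio comes out at roughly 9.6.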

## Caveats

Some invalid queries just hang instead of returning an error:

```
aws s3api select-object-content --bucket clickhouse-public-datasets --key 'hits_compatible/hits.parquet' --expression "SELECT CounterID, count(*) FROM S3Object WHERE SearchPhrase LIKE '%test%'" --expression-type SQL --input-serialization '{"Parquet": {}}' --output-serialization '{"CSV": {}}' -
```
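
One way to keep a hanging query from blocking a script is to put a time limit on the call. The sketch below assumes `timeout` from GNU coreutils is available. The `run_with_deadline` name is made up for illustration, and `sleep 5` stands in for a command that never returns, so nothing here talks to AWS.

```shell
# Hypothetical wrapper: run a command, but give up after a number of seconds.
# Relies on `timeout` from GNU coreutils, which exits with status 124 when the
# time limit is hit.
run_with_deadline() {
    deadline_seconds=$1
    shift
    timeout "$deadline_seconds" "$@"
    status=$?
    if [ "$status" -eq 124 ]; then
        echo "gave up after ${deadline_seconds}s" >&2
    fi
    return "$status"
}

# `sleep 5` stands in for a select-object-content call that hangs.
deadline_status=0
run_with_deadline 1 sleep 5 || deadline_status=$?
echo "exit status: $deadline_status"
```

The same wrapper could be placed in front of the `aws s3api select-object-content` command, with a limit chosen to be longer than a legitimate query is expected to take.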

When they do return an error, the error message is unhelpful:

```
aws s3api select-object-content --bucket clickhouse-public-datasets --key 'hits_compatible/hits.parquet' --expression "SELECT CounterID, count(*) FROM S3Object GROUP BY CounterID ORDER BY count(*) DESC LIMIT 10" --expression-type SQL --input-serialization '{"Parquet": {}}' --output-serialization '{"CSV": {}}' -

An error occurred (ParseUnexpectedToken) when calling the SelectObjectContent operation: Unexpected token found KEYWORD:UNKNOWN at line 1, column 61.
```
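
Since GROUP BY and ORDER BY are rejected, a query like the one above would have to be split: S3 Select returns only the selected rows, and the grouping and sorting happen on the client. The sketch below shows just the client-side half, using standard tools. The `top_counters` function name and the sample rows are made up for illustration, and it assumes CSV input whose first column is `CounterID`.

```shell
# Hypothetical helper: reads CSV rows on stdin, counts rows per first column
# (assumed to be CounterID), and prints the ten largest groups as "count,id".
top_counters() {
    awk -F, '{ counts[$1]++ } END { for (id in counts) print counts[id] "," id }' |
        sort -t, -k1,1nr -k2,2n |
        head -n 10
}

# Made-up sample rows standing in for the CSV output of a SELECT CounterID, ... query.
sample_rows='1634,first
62,second
1634,third
7,fourth
62,fifth
1634,sixth'

result=$(printf '%s\n' "$sample_rows" | top_counters)
echo "$result"
```

In practice the sample rows would be replaced by the CSV that `select-object-content` writes out, which means every selected row still has to be transferred to the client before it can be counted.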

## Alternatives

You can use ClickHouse in AWS Lambda: https://github.com/aws-samples/aws-lambda-clickhouse

This project was created by AWS engineers.
