
Commit cedce7b

[Doc] Add Schema Merge examples to Files (StarRocks#48904)
1 parent 82eba9d commit cedce7b

File tree

2 files changed: +266 -10 lines
  • docs
    • en/sql-reference/sql-functions/table-functions
    • zh/sql-reference/sql-functions/table-functions


Diff for: docs/en/sql-reference/sql-functions/table-functions/files.md

+133-5
@@ -161,8 +161,13 @@ After the sampling, StarRocks unionizes the columns from all the data files acco
 
 - For columns with different column names or indices, each column is identified as an individual column, and, eventually, the union of all individual columns is returned.
 - For columns with the same column name but different data types, they are identified as the same column but with a general data type at a relatively fine granularity level. For example, if the column `col1` in file A is INT but DECIMAL in file B, DOUBLE is used in the returned column.
+- All integer columns will be unionized as an integer type at a coarser granularity level.
+- Integer columns together with FLOAT type columns will be unionized as the DECIMAL type.
+- The STRING type is used to unionize columns of other data types.
 - Generally, the STRING type can be used to unionize all data types.
 
+You can refer to [Example 6](#example-6).
+
 If StarRocks fails to unionize all the columns, it generates a schema error report that includes the error information and all the file schemas.
 
 > **CAUTION**
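
If you want to inspect the unionized schema before creating any table, a quick sketch like the one below may help. It is not part of this commit; it assumes a StarRocks version in which DESC can be used with FILES(), and the path and credentials are placeholders reused from the examples further down.

```SQL
-- A minimal sketch, assuming your StarRocks version supports DESC with FILES().
-- The path and credentials below are placeholders, not real values.
DESC FILES(
    "path" = "s3://inserttest/parquet/*",
    "format" = "parquet",
    "aws.s3.access_key" = "AAAAAAAAAAAAAAAAAAAA",
    "aws.s3.secret_key" = "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB",
    "aws.s3.region" = "us-west-2"
);
-- Each column should appear once, with the general type chosen by the rules
-- above, for example, DECIMAL for a column that is INT in one file and FLOAT
-- in another.
```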
@@ -366,7 +371,9 @@ From v3.2 onwards, FILES() further supports complex data types including ARRAY,
 
 ## Examples
 
-Example 1: Query the data from the Parquet file **parquet/par-dup.parquet** within the AWS S3 bucket `inserttest`:
+#### Example 1
+
+Query the data from the Parquet file **parquet/par-dup.parquet** within the AWS S3 bucket `inserttest`:
 
 ```Plain
 MySQL > SELECT * FROM FILES(
@@ -385,7 +392,9 @@ MySQL > SELECT * FROM FILES(
 2 rows in set (22.335 sec)
 ```
 
-Example 2: Insert the data rows from the Parquet file **parquet/insert_wiki_edit_append.parquet** within the AWS S3 bucket `inserttest` into the table `insert_wiki_edit`:
+#### Example 2
+
+Insert the data rows from the Parquet file **parquet/insert_wiki_edit_append.parquet** within the AWS S3 bucket `inserttest` into the table `insert_wiki_edit`:
 
 ```Plain
 MySQL > INSERT INTO insert_wiki_edit
@@ -400,7 +409,9 @@ Query OK, 2 rows affected (23.03 sec)
 {'label':'insert_d8d4b2ee-ac5c-11ed-a2cf-4e1110a8f63b', 'status':'VISIBLE', 'txnId':'2440'}
 ```
 
-Example 3: Create a table named `ctas_wiki_edit` and insert the data rows from the Parquet file **parquet/insert_wiki_edit_append.parquet** within the AWS S3 bucket `inserttest` into the table:
+#### Example 3
+
+Create a table named `ctas_wiki_edit` and insert the data rows from the Parquet file **parquet/insert_wiki_edit_append.parquet** within the AWS S3 bucket `inserttest` into the table:
 
 ```Plain
 MySQL > CREATE TABLE ctas_wiki_edit AS
@@ -415,7 +426,9 @@ Query OK, 2 rows affected (22.09 sec)
 {'label':'insert_1a217d70-2f52-11ee-9e4a-7a563fb695da', 'status':'VISIBLE', 'txnId':'3248'}
 ```
 
-Example 4: Query the data from the Parquet file **/geo/country=US/city=LA/file1.parquet** (which only contains two columns - `id` and `user`), and extract the key/value information in its path as returned columns.
+#### Example 4
+
+Query the data from the Parquet file **/geo/country=US/city=LA/file1.parquet** (which only contains two columns - `id` and `user`), and extract the key/value information in its path as returned columns.
 
 ```Plain
 SELECT * FROM FILES(
@@ -435,7 +448,9 @@ SELECT * FROM FILES(
 2 rows in set (3.84 sec)
 ```
 
-Example 5: Unload all data rows in `sales_records` as multiple Parquet files under the path **/unload/partitioned/** in the HDFS cluster. These files are stored in different subpaths distinguished by the values in the column `sales_time`.
+#### Example 5
+
+Unload all data rows in `sales_records` as multiple Parquet files under the path **/unload/partitioned/** in the HDFS cluster. These files are stored in different subpaths distinguished by the values in the column `sales_time`.
 
 ```SQL
 INSERT INTO
@@ -450,3 +465,116 @@ FILES(
 )
 SELECT * FROM sales_records;
 ```
+
+#### Example 6
+
+Automatic schema detection and unionization.
+
+The following example is based on two Parquet files in the S3 bucket:
+
+- File 1 contains three columns - INT column `c1`, FLOAT column `c2`, and DATE column `c3`.
+
+```Plain
+c1,c2,c3
+1,0.71173,2017-11-20
+2,0.16145,2017-11-21
+3,0.80524,2017-11-22
+4,0.91852,2017-11-23
+5,0.37766,2017-11-24
+6,0.34413,2017-11-25
+7,0.40055,2017-11-26
+8,0.42437,2017-11-27
+9,0.67935,2017-11-27
+10,0.22783,2017-11-29
+```
+
+- File 2 contains three columns - INT column `c1`, INT column `c2`, and DATETIME column `c3`.
+
+```Plain
+c1,c2,c3
+101,9,2018-05-15T18:30:00
+102,3,2018-05-15T18:30:00
+103,2,2018-05-15T18:30:00
+104,3,2018-05-15T18:30:00
+105,6,2018-05-15T18:30:00
+106,1,2018-05-15T18:30:00
+107,8,2018-05-15T18:30:00
+108,5,2018-05-15T18:30:00
+109,6,2018-05-15T18:30:00
+110,8,2018-05-15T18:30:00
+```
+
+Use a CTAS statement to create a table named `test_ctas_parquet` and insert the data rows from the two Parquet files into the table:
+
+```SQL
+CREATE TABLE test_ctas_parquet AS
+SELECT * FROM FILES(
+    "path" = "s3://inserttest/parquet/*",
+    "format" = "parquet",
+    "aws.s3.access_key" = "AAAAAAAAAAAAAAAAAAAA",
+    "aws.s3.secret_key" = "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB",
+    "aws.s3.region" = "us-west-2"
+);
+```
+
+View the table schema of `test_ctas_parquet`:
+
+```SQL
+SHOW CREATE TABLE test_ctas_parquet\G
+```
+
+```Plain
+*************************** 1. row ***************************
+       Table: test_ctas_parquet
+Create Table: CREATE TABLE `test_ctas_parquet` (
+  `c1` bigint(20) NULL COMMENT "",
+  `c2` decimal(38, 9) NULL COMMENT "",
+  `c3` varchar(1048576) NULL COMMENT ""
+) ENGINE=OLAP
+DUPLICATE KEY(`c1`, `c2`)
+COMMENT "OLAP"
+DISTRIBUTED BY RANDOM
+PROPERTIES (
+"bucket_size" = "4294967296",
+"compression" = "LZ4",
+"replication_num" = "3"
+);
+```
+
+The result shows that the `c2` column, which contains both FLOAT and INT data, is merged as a DECIMAL column, and `c3`, which contains both DATE and DATETIME data, is merged as a VARCHAR column.
+
+The above result stays the same when the Parquet files are changed to CSV files that contain the same data:
+
+```Plain
+mysql> CREATE TABLE test_ctas_csv AS
+    -> SELECT * FROM FILES(
+    ->     "path" = "s3://inserttest/csv/*",
+    ->     "format" = "csv",
+    ->     "csv.column_separator"=",",
+    ->     "csv.row_delimiter"="\n",
+    ->     "csv.enclose"='"',
+    ->     "csv.skip_header"="1",
+    ->     "aws.s3.access_key" = "AAAAAAAAAAAAAAAAAAAA",
+    ->     "aws.s3.secret_key" = "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB",
+    ->     "aws.s3.region" = "us-west-2"
+    -> );
+Query OK, 0 rows affected (30.90 sec)
+
+mysql> SHOW CREATE TABLE test_ctas_csv\G
+*************************** 1. row ***************************
+       Table: test_ctas_csv
+Create Table: CREATE TABLE `test_ctas_csv` (
+  `c1` bigint(20) NULL COMMENT "",
+  `c2` decimal(38, 9) NULL COMMENT "",
+  `c3` varchar(1048576) NULL COMMENT ""
+) ENGINE=OLAP
+DUPLICATE KEY(`c1`, `c2`)
+COMMENT "OLAP"
+DISTRIBUTED BY RANDOM
+PROPERTIES (
+"bucket_size" = "4294967296",
+"compression" = "LZ4",
+"replication_num" = "3"
+);
+1 row in set (0.27 sec)
+```
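
If the merged VARCHAR type for `c3` is too loose for your use case, one option is to cast the column explicitly in the CTAS instead of accepting the detected type. The sketch below is not part of this commit; the table name `test_ctas_cast` is hypothetical, and it assumes that DATE values such as `2017-11-20` cast cleanly to DATETIME (midnight timestamps).

```SQL
-- A minimal sketch (not from the original examples): keep c1 and c2 as detected,
-- but cast c3 to DATETIME instead of loading it as VARCHAR.
-- The table name test_ctas_cast is hypothetical; the path and credentials are
-- the same placeholders used above.
CREATE TABLE test_ctas_cast AS
SELECT
    c1,
    c2,
    CAST(c3 AS DATETIME) AS c3  -- '2017-11-20' becomes '2017-11-20 00:00:00'
FROM FILES(
    "path" = "s3://inserttest/parquet/*",
    "format" = "parquet",
    "aws.s3.access_key" = "AAAAAAAAAAAAAAAAAAAA",
    "aws.s3.secret_key" = "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB",
    "aws.s3.region" = "us-west-2"
);
```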

Diff for: docs/zh/sql-reference/sql-functions/table-functions/files.md

+133-5
@@ -161,8 +161,13 @@ CSV format example:
 
 - For columns with different column names or indices, StarRocks identifies each of them as a separate column and eventually returns all of the separate columns.
 - For columns with the same column name but different data types, StarRocks identifies them as the same column and assigns it a general data type at a relatively fine granularity level. For example, if the column `col1` in file A is of the INT type and `col1` in file B is of the DECIMAL type, the DOUBLE data type is used in the returned column.
+- All integer columns are unionized as an integer type at a coarser granularity level.
+- Integer columns together with FLOAT columns are unionized as the DECIMAL type.
+- Columns of other types are unionized as a string type.
 - In general, the STRING type can be used to unionize all data types.
 
+You can refer to [Example 6](#示例六).
+
 If StarRocks cannot unionize all the columns, it generates an error report that includes the error information and the schemas of all the files.
 
 > **CAUTION**
@@ -366,7 +371,9 @@ unload_data_param::=
 
 ## Examples
 
-Example 1: Query the data in the Parquet file **parquet/par-dup.parquet** in the AWS S3 bucket `inserttest`:
+#### Example 1
+
+Query the data in the Parquet file **parquet/par-dup.parquet** in the AWS S3 bucket `inserttest`:
 
 ```Plain
 MySQL > SELECT * FROM FILES(
@@ -385,7 +392,9 @@ MySQL > SELECT * FROM FILES(
 2 rows in set (22.335 sec)
 ```
 
-Example 2: Insert the data in the Parquet file **parquet/insert_wiki_edit_append.parquet** in the AWS S3 bucket `inserttest` into the table `insert_wiki_edit`:
+#### Example 2
+
+Insert the data in the Parquet file **parquet/insert_wiki_edit_append.parquet** in the AWS S3 bucket `inserttest` into the table `insert_wiki_edit`:
 
 ```Plain
 MySQL > INSERT INTO insert_wiki_edit
@@ -400,7 +409,9 @@ Query OK, 2 rows affected (23.03 sec)
 {'label':'insert_d8d4b2ee-ac5c-11ed-a2cf-4e1110a8f63b', 'status':'VISIBLE', 'txnId':'2440'}
 ```
 
-Example 3: Create the table `ctas_wiki_edit` based on the data in the Parquet file **parquet/insert_wiki_edit_append.parquet** in the AWS S3 bucket `inserttest`:
+#### Example 3
+
+Create the table `ctas_wiki_edit` based on the data in the Parquet file **parquet/insert_wiki_edit_append.parquet** in the AWS S3 bucket `inserttest`:
 
 ```Plain
 MySQL > CREATE TABLE ctas_wiki_edit AS
@@ -415,7 +426,9 @@ Query OK, 2 rows affected (22.09 sec)
 {'label':'insert_1a217d70-2f52-11ee-9e4a-7a563fb695da', 'status':'VISIBLE', 'txnId':'3248'}
 ```
 
-Example 4: Query the data in the Parquet file **/geo/country=US/city=LA/file1.parquet** in the HDFS cluster (which contains only two columns - `id` and `user`), and extract the key/value information in its path as returned columns.
+#### Example 4
+
+Query the data in the Parquet file **/geo/country=US/city=LA/file1.parquet** in the HDFS cluster (which contains only two columns - `id` and `user`), and extract the key/value information in its path as returned columns.
 
 ```Plain
 SELECT * FROM FILES(
@@ -435,7 +448,9 @@ SELECT * FROM FILES(
 2 rows in set (3.84 sec)
 ```
 
-Example 5: Unload all data rows in `sales_records` as multiple Parquet files under the path **/unload/partitioned/** in the HDFS cluster. These files are stored in different subpaths that are distinguished by the values in the column `sales_time`.
+#### Example 5
+
+Unload all data rows in `sales_records` as multiple Parquet files under the path **/unload/partitioned/** in the HDFS cluster. These files are stored in different subpaths that are distinguished by the values in the column `sales_time`.
 
 ```SQL
 INSERT INTO
@@ -450,3 +465,116 @@ FILES(
 )
 SELECT * FROM sales_records;
 ```
+
+#### Example 6
+
+Automatic schema detection and unionization.
+
+The following example is based on two Parquet files, File 1 and File 2, in an S3 bucket:
+
+- File 1 contains three columns - INT column `c1`, FLOAT column `c2`, and DATE column `c3`.
+
+```Plain
+c1,c2,c3
+1,0.71173,2017-11-20
+2,0.16145,2017-11-21
+3,0.80524,2017-11-22
+4,0.91852,2017-11-23
+5,0.37766,2017-11-24
+6,0.34413,2017-11-25
+7,0.40055,2017-11-26
+8,0.42437,2017-11-27
+9,0.67935,2017-11-27
+10,0.22783,2017-11-29
+```
+
+- File 2 contains three columns - INT column `c1`, INT column `c2`, and DATETIME column `c3`.
+
+```Plain
+c1,c2,c3
+101,9,2018-05-15T18:30:00
+102,3,2018-05-15T18:30:00
+103,2,2018-05-15T18:30:00
+104,3,2018-05-15T18:30:00
+105,6,2018-05-15T18:30:00
+106,1,2018-05-15T18:30:00
+107,8,2018-05-15T18:30:00
+108,5,2018-05-15T18:30:00
+109,6,2018-05-15T18:30:00
+110,8,2018-05-15T18:30:00
+```
+
+Use a CTAS statement to create the table `test_ctas_parquet` and load the data from the two Parquet files into the table:
+
+```SQL
+CREATE TABLE test_ctas_parquet AS
+SELECT * FROM FILES(
+    "path" = "s3://inserttest/parquet/*",
+    "format" = "parquet",
+    "aws.s3.access_key" = "AAAAAAAAAAAAAAAAAAAA",
+    "aws.s3.secret_key" = "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB",
+    "aws.s3.region" = "us-west-2"
+);
+```
+
+View the schema of the table `test_ctas_parquet`:
+
+```SQL
+SHOW CREATE TABLE test_ctas_parquet\G
+```
+
+```Plain
+*************************** 1. row ***************************
+       Table: test_ctas_parquet
+Create Table: CREATE TABLE `test_ctas_parquet` (
+  `c1` bigint(20) NULL COMMENT "",
+  `c2` decimal(38, 9) NULL COMMENT "",
+  `c3` varchar(1048576) NULL COMMENT ""
+) ENGINE=OLAP
+DUPLICATE KEY(`c1`, `c2`)
+COMMENT "OLAP"
+DISTRIBUTED BY RANDOM
+PROPERTIES (
+"bucket_size" = "4294967296",
+"compression" = "LZ4",
+"replication_num" = "3"
+);
+```
+
+The result shows that the `c2` column is merged as a DECIMAL column because it contains both FLOAT and INT data, and the `c3` column is merged as a VARCHAR column because it contains both DATE and DATETIME data.
+
+The result above remains the same when the Parquet files are replaced with CSV files that contain the same data:
+
+```Plain
+mysql> CREATE TABLE test_ctas_csv AS
+    -> SELECT * FROM FILES(
+    ->     "path" = "s3://inserttest/csv/*",
+    ->     "format" = "csv",
+    ->     "csv.column_separator"=",",
+    ->     "csv.row_delimiter"="\n",
+    ->     "csv.enclose"='"',
+    ->     "csv.skip_header"="1",
+    ->     "aws.s3.access_key" = "AAAAAAAAAAAAAAAAAAAA",
+    ->     "aws.s3.secret_key" = "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB",
+    ->     "aws.s3.region" = "us-west-2"
+    -> );
+Query OK, 0 rows affected (30.90 sec)
+
+mysql> SHOW CREATE TABLE test_ctas_csv\G
+*************************** 1. row ***************************
+       Table: test_ctas_csv
+Create Table: CREATE TABLE `test_ctas_csv` (
+  `c1` bigint(20) NULL COMMENT "",
+  `c2` decimal(38, 9) NULL COMMENT "",
+  `c3` varchar(1048576) NULL COMMENT ""
+) ENGINE=OLAP
+DUPLICATE KEY(`c1`, `c2`)
+COMMENT "OLAP"
+DISTRIBUTED BY RANDOM
+PROPERTIES (
+"bucket_size" = "4294967296",
+"compression" = "LZ4",
+"replication_num" = "3"
+);
+1 row in set (0.27 sec)
+```
