docs/en/sql-reference/sql-functions/table-functions/files.md
After the sampling, StarRocks unionizes the columns from all the data files according to the following rules:

- For columns with different column names or indices, each column is identified as an individual column, and, eventually, the union of all individual columns is returned.
- For columns with the same column name but different data types, they are identified as the same column with a more general data type at a relatively fine granularity level. For example, if column `col1` in file A is INT but DECIMAL in file B, DOUBLE is used in the returned column.
  - All integer columns will be unionized as an integer type at an overall coarser granularity level.
  - Integer columns together with FLOAT-type columns will be unionized as the DECIMAL type.
  - String types are used for unionizing other types.
  - Generally, the STRING type can be used to unionize all data types.

You can refer to [Example 6](#example-6).
If StarRocks fails to unionize all the columns, it generates a schema error report that includes the error information and all the file schemas.
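For instance, before querying or loading, you can check the schema that FILES() has detected and unionized. The following is a minimal sketch, assuming a placeholder bucket, path, and credentials, and that your StarRocks version supports DESC with FILES():

```SQL
-- Minimal sketch: inspect the schema FILES() infers after sampling and unionizing
-- the matched Parquet files. The bucket, path, and credential values are placeholders,
-- and DESC support for FILES() depends on your StarRocks version.
DESC FILES(
    "path" = "s3://inserttest/parquet/*",
    "format" = "parquet",
    "aws.s3.access_key" = "<access_key>",
    "aws.s3.secret_key" = "<secret_key>",
    "aws.s3.region" = "us-west-2"
);
```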
> **CAUTION**
From v3.2 onwards, FILES() further supports complex data types including ARRAY, JSON, MAP, and STRUCT.

## Examples
#### Example 1
Query the data from the Parquet file **parquet/par-dup.parquet** within the AWS S3 bucket `inserttest`:
```Plain
MySQL > SELECT * FROM FILES(
...
2 rows in set (22.335 sec)
```
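For reference, a minimal sketch of such a query, assuming placeholder S3 credentials and region:

```SQL
-- Minimal sketch: read the Parquet file directly from S3 with FILES().
-- The access key, secret key, and region values are placeholders.
SELECT * FROM FILES(
    "path" = "s3://inserttest/parquet/par-dup.parquet",
    "format" = "parquet",
    "aws.s3.access_key" = "<access_key>",
    "aws.s3.secret_key" = "<secret_key>",
    "aws.s3.region" = "us-west-2"
);
```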
#### Example 2
Insert the data rows from the Parquet file **parquet/insert_wiki_edit_append.parquet** within the AWS S3 bucket `inserttest` into the table `insert_wiki_edit`:
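A minimal sketch of such an INSERT, assuming placeholder S3 credentials and region:

```SQL
-- Minimal sketch: load the file's rows into the existing table insert_wiki_edit.
-- Credential and region values are placeholders.
INSERT INTO insert_wiki_edit
SELECT * FROM FILES(
    "path" = "s3://inserttest/parquet/insert_wiki_edit_append.parquet",
    "format" = "parquet",
    "aws.s3.access_key" = "<access_key>",
    "aws.s3.secret_key" = "<secret_key>",
    "aws.s3.region" = "us-west-2"
);
```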
#### Example 3
Create a table named `ctas_wiki_edit` and insert the data rows from the Parquet file **parquet/insert_wiki_edit_append.parquet** within the AWS S3 bucket `inserttest` into the table:
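A minimal sketch of such a CTAS statement, assuming placeholder S3 credentials and region:

```SQL
-- Minimal sketch: create the table from the schema FILES() infers and load the rows in one step.
-- Credential and region values are placeholders.
CREATE TABLE ctas_wiki_edit AS
SELECT * FROM FILES(
    "path" = "s3://inserttest/parquet/insert_wiki_edit_append.parquet",
    "format" = "parquet",
    "aws.s3.access_key" = "<access_key>",
    "aws.s3.secret_key" = "<secret_key>",
    "aws.s3.region" = "us-west-2"
);
```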
#### Example 4
Query the data from the Parquet file **/geo/country=US/city=LA/file1.parquet** (which contains only two columns, `id` and `user`), and extract the key/value information in its path as returned columns.
```Plain
SELECT * FROM FILES(
...
2 rows in set (3.84 sec)
```
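A minimal sketch of such a query, assuming the `columns_from_path` property, a hypothetical bucket prefix, and placeholder credentials:

```SQL
-- Minimal sketch: return id and user from the file, plus country and city
-- extracted from the country=US/city=LA segments of its path.
-- The bucket prefix and credentials are placeholders, and the
-- "columns_from_path" property name should be verified for your version.
SELECT * FROM FILES(
    "path" = "s3://inserttest/geo/country=US/city=LA/file1.parquet",
    "format" = "parquet",
    "columns_from_path" = "country, city",
    "aws.s3.access_key" = "<access_key>",
    "aws.s3.secret_key" = "<secret_key>",
    "aws.s3.region" = "us-west-2"
);
```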
#### Example 5
Unload all data rows in `sales_records` as multiple Parquet files under the path **/unload/partitioned/** in the HDFS cluster. These files are stored in different subpaths distinguished by the values in the column `sales_time`.
```SQL
INSERT INTO
FILES(
    ...
)
SELECT * FROM sales_records;
```
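A minimal sketch of the full statement, assuming a placeholder NameNode address and credentials, and that `partition_by` is the property used to split the output files by column values:

```SQL
-- Minimal sketch: write sales_records as Parquet files under /unload/partitioned/ in HDFS,
-- with subpaths split by the values of the sales_time column.
-- The NameNode address and credentials are placeholders, and the "partition_by"
-- property name is an assumption to verify against the FILES() documentation.
INSERT INTO
FILES(
    "path" = "hdfs://<namenode_host>:9000/unload/partitioned/",
    "format" = "parquet",
    "hadoop.security.authentication" = "simple",
    "username" = "<hdfs_username>",
    "password" = "<hdfs_password>",
    "partition_by" = "sales_time"
)
SELECT * FROM sales_records;
```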
#### Example 6
Automatic schema detection and unionization.
The following example is based on two Parquet files in the S3 bucket:
- File 1 contains three columns - INT column `c1`, FLOAT column `c2`, and DATE column `c3`.
```Plain
c1,c2,c3
1,0.71173,2017-11-20
2,0.16145,2017-11-21
3,0.80524,2017-11-22
4,0.91852,2017-11-23
5,0.37766,2017-11-24
6,0.34413,2017-11-25
7,0.40055,2017-11-26
8,0.42437,2017-11-27
9,0.67935,2017-11-27
10,0.22783,2017-11-29
```
- File 2 contains three columns - INT column `c1`, INT column `c2`, and DATETIME column `c3`.
```Plain
c1,c2,c3
101,9,2018-05-15T18:30:00
102,3,2018-05-15T18:30:00
103,2,2018-05-15T18:30:00
104,3,2018-05-15T18:30:00
105,6,2018-05-15T18:30:00
106,1,2018-05-15T18:30:00
107,8,2018-05-15T18:30:00
108,5,2018-05-15T18:30:00
109,6,2018-05-15T18:30:00
110,8,2018-05-15T18:30:00
```
Use a CTAS statement to create a table named `test_ctas_parquet` and insert the data rows from the two Parquet files into the table:
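A minimal sketch of such a CTAS statement, assuming a placeholder path that matches both files and placeholder credentials, followed by a DESC of the resulting table:

```SQL
-- Minimal sketch: sample both Parquet files, unionize their schemas, and create
-- test_ctas_parquet from the merged schema. Path and credentials are placeholders.
CREATE TABLE test_ctas_parquet AS
SELECT * FROM FILES(
    "path" = "s3://inserttest/parquet/*",
    "format" = "parquet",
    "aws.s3.access_key" = "<access_key>",
    "aws.s3.secret_key" = "<secret_key>",
    "aws.s3.region" = "us-west-2"
);

-- Inspect the merged schema of the new table.
DESC test_ctas_parquet;
```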
The result shows that the `c2` column, which contains both FLOAT and INT data, is merged as a DECIMAL column, and `c3`, which contains both DATE and DATETIME data, is merged as a VARCHAR column.
The above result stays the same when the Parquet files are changed to CSV files that contain the same data:
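A minimal sketch of the CSV variant, assuming CSV support in FILES() in your version, a hypothetical table name `test_ctas_csv`, and placeholder path and credentials (the `csv.column_separator` property name is also an assumption to verify):

```SQL
-- Minimal sketch: the same unionization applies when reading equivalent CSV files.
-- The table name, path, credentials, and the "csv.column_separator" property
-- name are assumptions to verify against the FILES() documentation.
CREATE TABLE test_ctas_csv AS
SELECT * FROM FILES(
    "path" = "s3://inserttest/csv/*",
    "format" = "csv",
    "csv.column_separator" = ",",
    "aws.s3.access_key" = "<access_key>",
    "aws.s3.secret_key" = "<secret_key>",
    "aws.s3.region" = "us-west-2"
);

DESC test_ctas_csv;
```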