Merged
14 changes: 12 additions & 2 deletions go/docs/spark.md
@@ -120,17 +120,27 @@ These parameters can be specified in the URI as query parameters, or as connecti

## Limitations

Different backends have limitations; some limitations related to data type support are also noted further below.
Different backends and cluster configurations have limitations; some limitations related to data type support are also noted further below.

### HiveServer2/Thrift Protocol

- In Spark 3.x, binary data that does not happen to be valid UTF-8 will be corrupted.
- The client cannot tell whether a timestamp carries a time zone, so all timestamps are assumed to be in UTC.

### Livy
### Apache Livy

- Only the first 1000 rows of a result set can be fetched. This can be tuned by configuring Spark with `spark.sql.repl.eagerEval.maxNumRows`.
- In general, we have found that performance is worse than with Spark Connect or HiveServer2.
- Connecting to an Amazon EMR (Serverless) cluster via Livy requires setting the `emr-serverless.session.executionRoleArn` session config option to an appropriate role ARN.

### Spark Connect

- In our testing, connecting to an Amazon EMR (Serverless) cluster via Spark Connect does not work. We believe the cause is an incompatibility in the Spark Connect client library, and we plan to address it in a future version of the driver.

### Amazon EMR (Serverless)

- Bulk ingest with an AWS Glue catalog is not currently supported because there is no way to specify the `LOCATION` clause.
- Amazon EMR is not currently enabled in our automated integration testing.

## Feature & Type Support
