(AWS) Docs: List all AWS S3 properties from all language impl. #11383
Added Amazon MSK Connect as an option. Added HTTP client advice for high-throughput scenarios. Added specific configs for data prefetching on EMR 7.1.0.
For versions after 7.1.0 there is a specific config that can be used to enable the data prefetch optimization. You just need to add the following property to your Spark config.

```shell
spark.sql.iceberg.data-prefetch.enabled=true
```
I don't believe this is an Iceberg property. If this is specific to EMR, I don't believe it should be included here.
Yes, this is specific to EMR (its internal Iceberg runtime); however, we are on the "aws" docs page.
I think stating that you can add this parameter to improve the performance of Iceberg workloads on EMR is good to know?
**Note that for workloads with exceptionally high throughput against tables on S3, where you will likely need to increase retries, you will also want to increase the number of connections for the HTTP client.**

```shell
spark.sql.catalog.my_catalog.http-client.apache.max-connections=200
```
This doesn't look like an Iceberg setting from what I can tell. If this is EMR specific, it should not be included here.
It is a matter of the AWS SDK and Spark (not specific to EMR). If you use Spark on your laptop writing to S3 and you are in this high-throughput write scenario, you will likely tune the parameter.
Any Spark runtime will use this (maybe the Photon runtime uses another S3 client, but I don't have that info :) ).
On the previous case, I agree with you that it is super specific to EMR and it may or may not be added to the aws "docs".
I mean, we are talking about the AWS docs on this page (the parameter is quite specific to the S3 client of the AWS SDK).
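For illustration only, here is a minimal sketch of how the property from the diff above could be turned into `spark-submit` arguments for a local run. The helper name `spark_conf_args` is hypothetical; the catalog name `my_catalog` and the value `200` are copied from the diff and are not a tuning recommendation.

```python
# Hypothetical helper: render Spark properties as `--conf key=value` arguments.
def spark_conf_args(props: dict[str, str]) -> list[str]:
    args: list[str] = []
    for key, value in props.items():
        args += ["--conf", f"{key}={value}"]
    return args


# Catalog name and connection count are taken from the diff, as assumptions.
catalog = "my_catalog"
props = {
    f"spark.sql.catalog.{catalog}.http-client.apache.max-connections": "200",
}

args = spark_conf_args(props)
# The resulting list can be appended to a spark-submit command line.
print(" ".join(["spark-submit", *args]))
```

The same key/value pair could equally be set in `spark-defaults.conf`; the point is only that the property is scoped to the catalog name rather than to a specific runtime.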
@Neuw84 it looks like we're duplicating what should be EMR documentation here. We already link off to the EMR docs, so I don't feel this is the right place for putting specific configuration info.
@danielcweeks let me know your thoughts on the comments (I agree on the specific one about EMR, although for me it does not hurt as we are on the aws docs page). What are your thoughts about the S3 clients info?
As @hsiang-c made another pull request building a table here, I didn't want to collide.
Fixes List all AWS S3 properties in the docs #10674
Therefore, I added:
As a personal opinion, when using AWS SDKs most of the properties shouldn't be there (there is a standard way to configure them, prioritize them, etc.). However, it is clear that using 3rd-party libraries in different languages would require info like the tables @hsiang-c has built.
The problem with this is that different libraries will have different configs (even in the same language).
In my personal opinion, instead of dividing by language, maybe divide by library (but here maybe just adding a link to the corresponding doc page should be enough)? And have a separate section/table for the properties supported by the AWS SDKs (anything using official libraries will have the same config, no matter the language)?
Thanks!