(AWS) Docs: List all AWS S3 properties from all language impl. #11383

Open
wants to merge 1 commit into
base: main

@Neuw84 Neuw84 commented Oct 23, 2024

Since @hsiang-c opened another pull request that builds a table for this, I didn't want to collide with it.

Fixes List all AWS S3 properties in the docs #10674

Therefore, this PR adds:

  • Amazon MSK Connect as an option.
  • HTTP client advice for high-throughput scenarios (tune not just the retries but also the number of connections).
  • Specific configs for data prefetching on EMR 7.1.0.

In my personal opinion, when using the AWS SDKs most of these properties shouldn't be there (there is a standard way to configure them, prioritize them, etc.). However, it is clear that using third-party libraries in different languages requires information like the tables @hsiang-c has built.

The problem is that different libraries will have different configs, even within the same language.

Instead of dividing by language, maybe divide by library (though just adding a link to the corresponding doc page may be enough here)? And have a separate section/table for the properties supported by the AWS SDKs (anything using the official libraries will have the same config, no matter the language)?

Thanks!

@github-actions github-actions bot added the docs label Oct 23, 2024
@Neuw84 Neuw84 changed the title (AWS) Docs: List all AWS S3 properties from all language impl. #10674 (AWS) Docs: List all AWS S3 properties from all language impl. fixes #10674 Oct 23, 2024
@Neuw84 Neuw84 changed the title (AWS) Docs: List all AWS S3 properties from all language impl. fixes #10674 (AWS) Docs: List all AWS S3 properties from all language impl. Oct 23, 2024
For versions after 7.1.0 there is a specific config that can be used to enable the data prefetch optimization. You just need to add the following property to your Spark config.

```shell
spark.sql.iceberg.data-prefetch.enabled=true
```

Contributor

I don't believe this is an Iceberg property. If this is specific to EMR, I don't believe it should be included here.

Contributor Author

@Neuw84 Neuw84 Oct 31, 2024

Yes, this is specific to EMR (its internal Iceberg runtime); however, we are on the "aws" docs page.

Isn't it worth stating that you can add that parameter to improve the performance of Iceberg workloads on EMR?

**Note that for workloads with exceptionally high throughput against tables on S3, where you will likely increase the retries, you will likely also want to increase the number of connections for the HTTP client.**

```shell
spark.sql.catalog.my_catalog.http-client.apache.max-connections=200
```

Contributor

This doesn't look like an Iceberg setting from what I can tell. If this is EMR specific, it should not be included here.

Contributor Author

@Neuw84 Neuw84 Oct 31, 2024

It is an AWS SDK and Spark thing (not specific to EMR). If you use Spark on your laptop to write to S3 and you are in this high-throughput write scenario, you will likely tune the parameter.

Any Spark runtime will use this (maybe the Photon runtime uses another S3 client, but I don't have that info :) ).

On the previous case, I agree with you that it is very specific to EMR and may or may not belong in the AWS docs.

I mean, this page is the AWS docs (the parameter is quite specific to the S3 client of the AWS SDK).
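
As a rough illustration of the setting discussed above, here is a minimal Python sketch that renders catalog-scoped properties as `spark-submit --conf` arguments. The helper name `catalog_conf` is hypothetical; the catalog name `my_catalog` and the value `200` come from the snippet under review, and `http-client.type=apache` is included on the assumption that the Apache HTTP client is the one in use.

```python
def catalog_conf(catalog: str, props: dict) -> list:
    """Render catalog-scoped properties as spark-submit --conf arguments.

    Hypothetical helper for illustration only; it just builds strings and
    does not talk to Spark or Iceberg.
    """
    args = []
    for key, value in props.items():
        # Each property is namespaced under spark.sql.catalog.<catalog>.
        args += ["--conf", f"spark.sql.catalog.{catalog}.{key}={value}"]
    return args


conf_args = catalog_conf(
    "my_catalog",
    {
        # Assumption: the Apache HTTP client is selected for this catalog.
        "http-client.type": "apache",
        # Value taken from the snippet under review; tune it per workload.
        "http-client.apache.max-connections": "200",
    },
)
print(" ".join(conf_args))
```

The printed string can be appended to a `spark-submit` invocation; the right connection count depends on the workload and should be treated as a starting point, not a recommendation.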

@danielcweeks
Contributor

@Neuw84 it looks like we're duplicating what should be EMR documentation here. We already link off to the EMR docs, so I don't feel this is the right place for putting specific configuration info.

@Neuw84
Contributor Author

Neuw84 commented Oct 31, 2024

@danielcweeks let me know your thoughts on the comments (I agree on the EMR-specific one, although for me it does not hurt since we are on the AWS docs page).

What are your thoughts about the S3 clients info?

Successfully merging this pull request may close these issues.

List all AWS S3 properties in the docs
2 participants