Commit 5cfb527

Arrow: Suppress warning and cache bucket location (#1709)
Attempt to remove the unnecessary warning, and cache the location of the bucket independently of the FileIO. Fixes #1705. Fixes #1708.
1 parent 6388b4d commit 5cfb527

4 files changed: +45 -32 lines changed

mkdocs/docs/configuration.md

Lines changed: 17 additions & 16 deletions
@@ -108,22 +108,23 @@ For the FileIO there are several configuration options available:

 <!-- markdown-link-check-disable -->

-| Key | Example | Description |
-|----------------------|----------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| s3.endpoint | <https://10.0.19.25/> | Configure an alternative endpoint of the S3 service for the FileIO to access. This could be used to use S3FileIO with any s3-compatible object storage service that has a different endpoint, or access a private S3 endpoint in a virtual private cloud. |
-| s3.access-key-id | admin | Configure the static access key id used to access the FileIO. |
-| s3.secret-access-key | password | Configure the static secret access key used to access the FileIO. |
-| s3.session-token | AQoDYXdzEJr... | Configure the static session token used to access the FileIO. |
-| s3.role-session-name | session | An optional identifier for the assumed role session. |
-| s3.role-arn | arn:aws:... | AWS Role ARN. If provided instead of access_key and secret_key, temporary credentials will be fetched by assuming this role. |
-| s3.signer | bearer | Configure the signature version of the FileIO. |
-| s3.signer.uri | <http://my.signer:8080/s3> | Configure the remote signing uri if it differs from the catalog uri. Remote signing is only implemented for `FsspecFileIO`. The final request is sent to `<s3.signer.uri>/<s3.signer.endpoint>`. |
-| s3.signer.endpoint | v1/main/s3-sign | Configure the remote signing endpoint. Remote signing is only implemented for `FsspecFileIO`. The final request is sent to `<s3.signer.uri>/<s3.signer.endpoint>`. (default : v1/aws/s3/sign). |
-| s3.region | us-west-2 | Configure the default region used to initialize an `S3FileSystem`. `PyArrowFileIO` attempts to automatically resolve the region for each S3 bucket, falling back to this value if resolution fails. |
-| s3.proxy-uri | <http://my.proxy.com:8080> | Configure the proxy server to be used by the FileIO. |
-| s3.connect-timeout | 60.0 | Configure socket connection timeout, in seconds. |
-| s3.request-timeout | 60.0 | Configure socket read timeouts on Windows and macOS, in seconds. |
-| s3.force-virtual-addressing | False | Whether to use virtual addressing of buckets. If true, then virtual addressing is always enabled. If false, then virtual addressing is only enabled if endpoint_override is empty. This can be used for non-AWS backends that only support virtual hosted-style access. |
+| Key | Example | Description |
+|-----------------------------|----------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| s3.endpoint | <https://10.0.19.25/> | Configure an alternative endpoint of the S3 service for the FileIO to access. This could be used to use S3FileIO with any s3-compatible object storage service that has a different endpoint, or access a private S3 endpoint in a virtual private cloud. |
+| s3.access-key-id | admin | Configure the static access key id used to access the FileIO. |
+| s3.secret-access-key | password | Configure the static secret access key used to access the FileIO. |
+| s3.session-token | AQoDYXdzEJr... | Configure the static session token used to access the FileIO. |
+| s3.role-session-name | session | An optional identifier for the assumed role session. |
+| s3.role-arn | arn:aws:... | AWS Role ARN. If provided instead of access_key and secret_key, temporary credentials will be fetched by assuming this role. |
+| s3.signer | bearer | Configure the signature version of the FileIO. |
+| s3.signer.uri | <http://my.signer:8080/s3> | Configure the remote signing uri if it differs from the catalog uri. Remote signing is only implemented for `FsspecFileIO`. The final request is sent to `<s3.signer.uri>/<s3.signer.endpoint>`. |
+| s3.signer.endpoint | v1/main/s3-sign | Configure the remote signing endpoint. Remote signing is only implemented for `FsspecFileIO`. The final request is sent to `<s3.signer.uri>/<s3.signer.endpoint>`. (default : v1/aws/s3/sign). |
+| s3.region | us-west-2 | Configure the default region used to initialize an `S3FileSystem`. `PyArrowFileIO` attempts to automatically tries to resolve the region if this isn't set (only supported for AWS S3 Buckets). |
+| s3.resolve-region | False | Only supported for `PyArrowFileIO`, when enabled, it will always try to resolve the location of the bucket (only supported for AWS S3 Buckets). |
+| s3.proxy-uri | <http://my.proxy.com:8080> | Configure the proxy server to be used by the FileIO. |
+| s3.connect-timeout | 60.0 | Configure socket connection timeout, in seconds. |
+| s3.request-timeout | 60.0 | Configure socket read timeouts on Windows and macOS, in seconds. |
+| s3.force-virtual-addressing | False | Whether to use virtual addressing of buckets. If true, then virtual addressing is always enabled. If false, then virtual addressing is only enabled if endpoint_override is empty. This can be used for non-AWS backends that only support virtual hosted-style access. |

 <!-- markdown-link-check-enable-->

pyiceberg/io/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -59,6 +59,7 @@
 S3_SECRET_ACCESS_KEY = "s3.secret-access-key"
 S3_SESSION_TOKEN = "s3.session-token"
 S3_REGION = "s3.region"
+S3_RESOLVE_REGION = "s3.resolve-region"
 S3_PROXY_URI = "s3.proxy-uri"
 S3_CONNECT_TIMEOUT = "s3.connect-timeout"
 S3_REQUEST_TIMEOUT = "s3.request-timeout"

pyiceberg/io/pyarrow.py

Lines changed: 25 additions & 14 deletions
@@ -107,6 +107,7 @@
     S3_PROXY_URI,
     S3_REGION,
     S3_REQUEST_TIMEOUT,
+    S3_RESOLVE_REGION,
     S3_ROLE_ARN,
     S3_ROLE_SESSION_NAME,
     S3_SECRET_ACCESS_KEY,
@@ -194,6 +195,17 @@
 T = TypeVar("T")


+@lru_cache
+def _cached_resolve_s3_region(bucket: str) -> Optional[str]:
+    from pyarrow.fs import resolve_s3_region
+
+    try:
+        return resolve_s3_region(bucket=bucket)
+    except (OSError, TypeError):
+        logger.warning(f"Unable to resolve region for bucket {bucket}")
+        return None
+
+
 class UnsupportedPyArrowTypeException(Exception):
     """Cannot convert PyArrow type to corresponding Iceberg type."""

@@ -414,23 +426,22 @@ def _initialize_oss_fs(self) -> FileSystem:
         return S3FileSystem(**client_kwargs)

     def _initialize_s3_fs(self, netloc: Optional[str]) -> FileSystem:
-        from pyarrow.fs import S3FileSystem, resolve_s3_region
+        from pyarrow.fs import S3FileSystem

-        # Resolve region from netloc(bucket), fallback to user-provided region
         provided_region = get_first_property_value(self.properties, S3_REGION, AWS_REGION)

-        try:
-            bucket_region = resolve_s3_region(bucket=netloc)
-        except (OSError, TypeError):
-            bucket_region = None
-            logger.warning(f"Unable to resolve region for bucket {netloc}, using default region {provided_region}")
-
-        bucket_region = bucket_region or provided_region
-        if bucket_region != provided_region:
-            logger.warning(
-                f"PyArrow FileIO overriding S3 bucket region for bucket {netloc}: "
-                f"provided region {provided_region}, actual region {bucket_region}"
-            )
+        # Do this when we don't provide the region at all, or when we explicitly enable it
+        if provided_region is None or property_as_bool(self.properties, S3_RESOLVE_REGION, False) is True:
+            # Resolve region from netloc(bucket), fallback to user-provided region
+            # Only supported by buckets hosted by S3
+            bucket_region = _cached_resolve_s3_region(bucket=netloc) or provided_region
+            if provided_region is not None and bucket_region != provided_region:
+                logger.warning(
+                    f"PyArrow FileIO overriding S3 bucket region for bucket {netloc}: "
+                    f"provided region {provided_region}, actual region {bucket_region}"
+                )
+        else:
+            bucket_region = provided_region

         client_kwargs: Dict[str, Any] = {
             "endpoint_override": self.properties.get(S3_ENDPOINT),

tests/io/test_pyarrow.py

Lines changed: 2 additions & 2 deletions
@@ -2285,14 +2285,14 @@ def _s3_region_map(bucket: str) -> str:
         raise OSError("Unknown bucket")

     # For a pyarrow io instance with configured default s3 region
-    pyarrow_file_io = PyArrowFileIO({"s3.region": user_provided_region})
+    pyarrow_file_io = PyArrowFileIO({"s3.region": user_provided_region, "s3.resolve-region": "true"})
     with patch("pyarrow.fs.resolve_s3_region") as mock_s3_region_resolver:
         mock_s3_region_resolver.side_effect = _s3_region_map

         # The region is set to provided region if bucket region cannot be resolved
         with caplog.at_level(logging.WARNING):
             assert pyarrow_file_io.new_input("s3://non-exist-bucket/path/to/file")._filesystem.region == user_provided_region
-        assert f"Unable to resolve region for bucket non-exist-bucket, using default region {user_provided_region}" in caplog.text
+        assert "Unable to resolve region for bucket non-exist-bucket" in caplog.text

         for bucket_region in bucket_regions:
             # For s3 scheme, region is overwritten by resolved bucket region if different from user provided region
