GH-32123: [R] Expose azure blob filesystem#49553
Conversation
|
Thanks for opening a pull request! If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format? or See also: |
|
|
|
The CI is failing because the Azure C++ SDK depends on libxml2.
|
|
It looks like the failing CI job is from a warning in R CMD check about a non-API call to a C function, due to all the changes in R-devel. I don't think this is due to our PR. At this point, I'm fairly confident that we've implemented the Azure features correctly. The only thing I'm not sure about is the setup in @jonkeane, @thisisnic, @assignUser Please let me or @Collinbrown95 know if there are any changes needed to this PR. |
|
Thanks for the PR @marberts - if you rebase from the main branch now, the non-API call stuff should be resolved. Your best bet for the This PR where the GCS bindings were added might help: #13404, but let us know if you have any questions! |
|
We'll also need additional CI jobs to test this - the PR I linked to above contains examples of what we added there. |
|
Great, thanks @thisisnic! |
|
To do list.
|
|
@github-actions crossbow submit -g r |
|
|
@github-actions crossbow submit -g r |
|
Revision: 4da7280 Submitted crossbow builds: ursacomputing/crossbow @ actions-92325d6543 |
|
@github-actions crossbow submit -g r |
|
|
Some of those CI failures are from things which have been fixed on main - you'll need to rebase your branch. |
The hypothesis is that the errors in the build logs are caused by incompatibilities between WIL (Windows Implementation Library) and MinGW GCC, so install them directly instead of trying to build from source.
46a4770 to
c728602
Compare
|
I figured out the rebase and upstream changes are now correctly reflected in this PR. The big diff for github.packages.yml was due to a change in line ending from my editor; it's all good now and the diff makes sense. |
jonkeane
left a comment
There was a problem hiding this comment.
Thank you for persisting with this. I know rebases can be a pain when the go sideways! A few questions / comments around installation + confirming that we have the right dependencies. The operative code looks ok, though I did not dig too too deeply on it. Do you happen to have a public azure bucket that I could point this at and try it out locally?
I'm also going to run crossbow to run the tests.
| # pkg-config --libs libcurl | ||
| GCS_LIBS="-lcurl -lnormaliz -lssh2 -lgdi32 -lssl -lcrypto -lcrypt32 -lwldap32 \ | ||
| -lz -lws2_32 -lnghttp2 -ldbghelp" | ||
| # AZURE_LIBS="-lcurl -lssl -lxml2" |
There was a problem hiding this comment.
These are holdovers from when you were trying these on Windows, yeah?
There was a problem hiding this comment.
Yes, I left these commented out in case progress gets made in the future around the incompatibility with mingw and the azure SDK stuff. Happy to either remove them or make it clearer in the comments.
| print_warning("requires libcurl-devel (rpm) or libcurl4-openssl-dev (deb)") | ||
| arrow_s3 <- FALSE | ||
| arrow_gcs <- FALSE | ||
| arrow_azure <- FALSE |
There was a problem hiding this comment.
Is libcurl and openssl necessary for Azure as well? These additions seem to indicate yes, but I want to confirm that
There was a problem hiding this comment.
Although I did not test without openssl and libcurl, the azure c++ sdk requires them so I'm pretty sure they're needed.
| } else if (!cmake_find_package("libxml2", NULL, env_var_list)) { | ||
| print_warning("requires libxml2-devel (rpm), or libxml2-dev (deb), libxml2 (brew)", "AZURE") | ||
| arrow_azure <- FALSE | ||
| } |
There was a problem hiding this comment.
If I'm following correctly, this check will run regardless of if someone is trying to setup Azure. Is it possible to make this only run if someone has requested Azure?
|
@github-actions crossbow submit -g r |
|
Revision: 004e725 Submitted crossbow builds: ursacomputing/crossbow @ actions-f9bf4c7b3a |
Awesome, thanks for taking a look. Unfortunately I don't have a public container for Azure. If you want to try it locally you can use azurite: https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azurite. This is what we're using in the test suite. The only other tests I did was to try connecting to a made up storage container to check that I was getting the expected errors. |
Rationale for this change
This PR adds support for Azure. The Arrow R package already has support for AWS and GCS, and the Arrow C++ library has had support for Azure for a couple years now. Support for Azure is already available in pyarrow.
This would close #32123.
What changes are included in this PR?
A new class
AzureFileSystemthat's analogous toS3FileSystem/GcsFileSystem, along with a helper functionaz_container()that's analogous tos3_bucket()/gcs_bucket().Updates to
src/filesystem.cppto interact with the machinery inarrow/filesystem/azurefs.h.Updates to the configuration and build scripts to support building with support for Azure.
Updates to the vignettes on cloud storage, installation, and developer setup.
Are these changes tested?
Yes. See
tests/testthat/test-azure.R.Are there any user-facing changes?
Yes. There is a new function
az_container(), serving the analogous role tos3_bucket()/gcs_bucket(), along with an R6 classAzureFileSystem, again serving the same role asS3FileSystemandGcsFileSystem. There is also a functionarrow_with_azure()to indicate if Arrow was built with support for Azure.