feat: switch single-file dataset downloads to use browser native downloads #3995
Conversation
// Initialize S3-compatible presigner for LakeFS S3 Gateway
private lazy val s3Presigner: S3Presigner = {
  val fullUri = new URI(StorageConfig.lakefsEndpoint)
Can we avoid this override?
I don't believe so. If we don't override the endpoint that our s3Presigner should go through, it will use the default AWS endpoint. The generated presigned URL will then point to AWS instead of LakeFS and won't be able to access the target file.
I don't understand why AWS is involved here. AWS provides libraries that work with MinIO because MinIO uses the same protocol, so that should not cause any issue for us. Currently, the path to MinIO is defined in the LakeFS config (called blockstore), and the previous S3 signed-URL generation used it. Why is it not used anymore? Maybe a better question is: why does your implementation need to override the base URL when the previous one didn't?
The previous presigner used LakeFS's own API, so it already implicitly knew its own endpoint through the LakeFS config. However, the s3Presigner is a generic AWS SDK tool that MinIO happens to be compatible with. It isn't intended solely for LakeFS. As such, it doesn't know that we want to use it specifically for LakeFS, and instead defaults to the AWS S3 endpoint.
If LakeFS has the URL, why don't you get it from there instead of from an environment variable? That environment variable is not correct in the production environment.
Sorry, as an amendment to my previous explanation, the LakeFS presigner actually does have the LakeFS endpoint explicitly specified:
private lazy val apiClient: ApiClient = {
  val client = new ApiClient()
  client.setApiKey(StorageConfig.lakefsPassword)
  client.setUsername(StorageConfig.lakefsUsername)
  client.setPassword(StorageConfig.lakefsPassword)
  client.setServers(
    List(
      new ServerConfiguration(
        StorageConfig.lakefsEndpoint,
        "LakeFS API server endpoint",
        new java.util.HashMap[String, ServerVariable]()
      )
    ).asJava
  )
  client
}
This apiClient is used to generate the pre-signed URL, and it uses the same StorageConfig.lakefsEndpoint that I use to override the s3Presigner endpoint. The only difference is that I parse the endpoint to strip the "api/v1" path for the s3Presigner. Could this be causing the incorrect URL issue you mention?
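As a concrete illustration of that parsing step, here is a minimal sketch (not the actual PR code; the helper name and the example endpoint value are assumptions) of deriving the S3 Gateway base URL from the configured LakeFS API endpoint by dropping the path component:

```scala
import java.net.URI

// Hypothetical helper (illustration only): derive the S3 Gateway base URL
// from the LakeFS API endpoint by keeping scheme/host/port and dropping
// the "/api/v1" path that only the LakeFS API client needs.
def s3GatewayBase(lakefsEndpoint: String): String = {
  val full = new URI(lakefsEndpoint)
  // Rebuild the URI without path, query, or fragment.
  new URI(full.getScheme, null, full.getHost, full.getPort, null, null, null).toString
}

// Example endpoint value is an assumption for illustration.
println(s3GatewayBase("http://localhost:8000/api/v1")) // http://localhost:8000
```

The resulting base URL is what an S3-compatible presigner would be pointed at, while the full endpoint (including api/v1) stays with the LakeFS API client.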
The LakeFS presigner can be found under LakeFSStorageClient.scala:
https://github.com/apache/texera/blob/b7944f3129973fe8a73e92c09e7594b2bff4930b/common/workflow-core/src/main/scala/org/apache/amber/core/storage/util/LakeFSStorageClient.scala
@madisonmlin Please follow our PR template. Check #3960 as an example.
Resolved review thread on file-service/src/main/scala/org/apache/texera/service/util/S3StorageClient.scala
The new S3 presigner uses the same environment-variable endpoint as the original LakeFS presigner. The apiClient used by the LakeFS presigner, referenced above, sets this server endpoint. The original LakeFS presigner can be found under LakeFSStorageClient.scala.
@madisonmlin Unfortunately, after iterating over your implementation, I realized that it is wrong since it exposes LakeFS to the outside, and I could not find any suitable implementation in their library. I contacted the developers of LakeFS on Slack, and they asked me to create an issue and request the feature. The issue is now open, and I will get back to this PR after they add this feature.
@aicam Thanks. Does the new finding affect any PR?
What changes were proposed in this PR?
This pull request changes single-file dataset downloads to have the browser, rather than JavaScript, handle the download. Having the browser handle downloads enables the browser's file download UI, which includes a progress bar when downloading larger files.
Previously, we would use LakeFS API to generate a pre-signed URL for the file and request it as a Blob, then write the Blob to the local filesystem. Because the download is handled programmatically in JavaScript instead of by the browser, the browser’s download UI is not shown. Now, we instead create a hidden link element in the frontend that points to the pre-signed URL and programmatically click it. This lets the browser handle the file transfer and display its native download UI.
Additional Challenges
The previous pre-signed URL was generated through the LakeFS API, which doesn't support custom headers such as Content-Disposition. Without this header, the browser cannot correctly set the filename and file type, leading to a generic name (e.g. 'd3eo286eh7os77u8d0k0,Gw_-cWconI7VjtJviVRvW8nHb6NvkYRSIIx6cOy-RCs') and MIME type. To address this issue, we instead use the S3 Gateway API to generate the pre-signed URL, which allows headers to be customized. Use of headers like Content-Disposition in GET requests to specify the filename is documented in the official AWS documentation (https://docs.aws.amazon.com/AmazonS3/latest/API/API_GetObject.html).
Overview of Changes
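To make the Content-Disposition mechanism concrete, here is a hedged sketch (the helper name and quoting are illustrative, not the PR's code) of building the header value that makes the browser save the file under its real name:

```scala
// Illustrative helper: build a Content-Disposition header value to attach
// when presigning the GET request, so the browser saves the file with its
// original name instead of the opaque object key.
def contentDisposition(fileName: String): String = {
  // Escape backslashes and double quotes inside the quoted-string (RFC 6266).
  val escaped = fileName.replace("\\", "\\\\").replace("\"", "\\\"")
  "attachment; filename=\"" + escaped + "\""
}

println(contentDisposition("results.csv")) // attachment; filename="results.csv"
```

With an S3-compatible presigner such as the AWS SDK v2 one, a value like this would be supplied as the response content-disposition override on the GetObject request before presigning, so the header is baked into the signed URL.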
Any related issues, documentation, discussions?
Addresses issue #3842, which is a sub-issue of #3404.
How was this PR tested?
Dataset download functionality was manually tested locally across several browsers (Chrome, Edge, Firefox, Brave), and with larger file sizes that would trigger the browser download UI's progress bar. @aicam also tested the dataset download on a Kubernetes deployment.
Was this PR authored or co-authored using generative AI tooling?
No.