
Conversation

@madisonmlin (Contributor) commented Oct 21, 2025:

What changes were proposed in this PR?

This pull request changes single-file dataset downloads to have the browser, rather than JavaScript, handle the download. Having the browser handle downloads enables the browser's file download UI, which includes a progress bar when downloading larger files.

Previously, we would use LakeFS API to generate a pre-signed URL for the file and request it as a Blob, then write the Blob to the local filesystem. Because the download is handled programmatically in JavaScript instead of by the browser, the browser’s download UI is not shown. Now, we instead create a hidden link element in the frontend that points to the pre-signed URL and programmatically click it. This lets the browser handle the file transfer and display its native download UI.
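The browser-driven approach can be sketched as follows (a minimal illustration; `triggerBrowserDownload` is a hypothetical name, not the actual Texera code):

```typescript
// Trigger a browser-native download by programmatically clicking a
// hidden link that points at the pre-signed URL. The browser performs
// the transfer itself, so its native download UI (progress bar
// included) is shown; JavaScript never touches the file contents.
function triggerBrowserDownload(presignedUrl: string): void {
  const link = document.createElement("a");
  link.href = presignedUrl;     // pre-signed URL for the target file
  link.style.display = "none";  // keep the element invisible
  document.body.appendChild(link);
  link.click();                 // hand the transfer off to the browser
  document.body.removeChild(link);
}
```

Because the browser owns the transfer, JavaScript cannot observe whether the download ultimately succeeds or fails, which is why the success/error notifications were removed in this PR.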

Additional Challenges

The previous pre-signed URL was generated through the LakeFS API, which doesn't support custom headers such as Content-Disposition. Without this header, the browser cannot determine the correct filename or MIME type, so downloads end up with generic names (e.g. 'd3eo286eh7os77u8d0k0,Gw_-cWconI7VjtJviVRvW8nHb6NvkYRSIIx6cOy-RCs') and a generic content type. To address this, we instead use the S3 Gateway API to generate the pre-signed URL, which allows these headers to be customized.

Use of headers like Content-Disposition with GET requests to specify filename is documented in the official AWS documentation (https://docs.aws.amazon.com/AmazonS3/latest/API/API_GetObject.html).
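Concretely, S3 exposes these overrides as `response-*` query parameters on the GetObject request; a pre-signed URL carrying them causes S3 to send back the requested headers in its response. A sketch of the resulting URL shape (the helper name `withDownloadHeaders` is hypothetical, and in a real presigner these parameters must be included before signing, since they are covered by the signature):

```typescript
// Attach the response-header overrides that the S3 GetObject API
// supports as query parameters. In a real presigner these must be part
// of the signed request rather than appended after the fact.
function withDownloadHeaders(
  presignedUrl: string,
  fileName: string,
  contentType: string
): string {
  const url = new URL(presignedUrl);
  // Ask S3 to respond with: Content-Disposition: attachment; filename="..."
  url.searchParams.set(
    "response-content-disposition",
    `attachment; filename="${fileName}"`
  );
  // Ask S3 to respond with the desired Content-Type
  url.searchParams.set("response-content-type", contentType);
  return url.toString();
}
```

For example, passing "data.csv" and "text/csv" yields a URL whose response carries a proper filename instead of an opaque object key.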

Overview of Changes

  • Added a method to generate a pre-signed URL via the S3 Gateway API, with the flexibility to specify the file name and type.
  • Updated the frontend and backend to programmatically click a pre-signed URL generated via the new method for single-file dataset downloads.
  • Removed success and error notifications for single-file dataset downloads (the outcome of a browser-native download cannot be observed from JavaScript).

Any related issues, documentation, discussions?

Addresses issue #3842, which is a sub-issue of #3404.

How was this PR tested?

Dataset download functionality was manually tested locally across several browsers (Chrome, Edge, Firefox, Brave), and with larger file sizes that would trigger the browser download UI's progress bar. @aicam also tested the dataset download on a Kubernetes deployment.

Was this PR authored or co-authored using generative AI tooling?

No.

@github-actions bot added the frontend label (Changes related to the frontend GUI service) Oct 21, 2025

  // Initialize S3-compatible presigner for LakeFS S3 Gateway
  private lazy val s3Presigner: S3Presigner = {
    val fullUri = new URI(StorageConfig.lakefsEndpoint)
    // ...
Contributor:

Can we avoid this override?

@madisonmlin (Contributor Author) replied Oct 28, 2025:

I don't believe so. If we don't override to specify the endpoint our s3Presigner should go through, it will use the default AWS endpoint. Then, the generated presigned URL will point to AWS instead of LakeFS and won't be able to access the target file.

Contributor:

I don't understand why AWS is involved here. AWS provides libraries that work with MinIO because MinIO uses the same protocol, so this should not cause any issue for us. Currently, the path to MinIO is defined in the LakeFS config (called blockstore), and the previous pre-signed URL generation used it. Why is it not used anymore? Maybe a better question is: why does your implementation need to override the base URL when the previous one didn't?

@madisonmlin (Contributor Author) replied Oct 29, 2025:

The previous presigner used LakeFS's own API, so it already implicitly knew its own endpoint through the LakeFS config. However, the s3Presigner is a generic AWS SDK tool that MinIO happens to be compatible with. It isn't intended solely for LakeFS. As such, it doesn't know that we want to use it specifically for LakeFS, and instead defaults to the AWS S3 endpoint.

Contributor:

If LakeFS has the URL, why don't you get it from there instead of from an environment variable? This environment variable is not correct in the production environment.

@madisonmlin (Contributor Author) replied:

Sorry, as an amendment to my previous explanation, the LakeFS presigner actually does have the LakeFS endpoint explicitly specified:

  private lazy val apiClient: ApiClient = {
    val client = new ApiClient()
    client.setApiKey(StorageConfig.lakefsPassword)
    client.setUsername(StorageConfig.lakefsUsername)
    client.setPassword(StorageConfig.lakefsPassword)
    client.setServers(
      List(
        new ServerConfiguration(
          StorageConfig.lakefsEndpoint,
          "LakeFS API server endpoint",
          new java.util.HashMap[String, ServerVariable]()
        )
      ).asJava
    )
    client
  }

This apiClient is used to generate the pre-signed URL. It uses the same StorageConfig.lakefsEndpoint that I use to override the s3Presigner endpoint. The only difference is that I parse to remove "api/v1" from the endpoint for the s3Presigner. Could this be causing the incorrect URL issue you mention?
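(For illustration, the trimming described above amounts to removing the trailing API path from the configured endpoint; a hypothetical sketch, not the actual Scala code:)

```typescript
// Derive the S3 Gateway base endpoint from the configured LakeFS API
// endpoint by stripping the trailing "/api/v1" path segment.
// Hypothetical helper, not the actual Texera code.
function toGatewayEndpoint(lakefsApiEndpoint: string): string {
  const url = new URL(lakefsApiEndpoint);
  url.pathname = url.pathname.replace(/\/api\/v1\/?$/, "");
  // Drop the trailing slash left over when the path becomes empty
  return url.toString().replace(/\/$/, "");
}
```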

@chenlica (Contributor):

@madisonmlin Please follow our PR template. Check #3960 as an example.

@madisonmlin madisonmlin marked this pull request as ready for review October 28, 2025 22:46
@madisonmlin (Contributor Author):

The new S3 presigner uses the same environment-variable endpoint as the original LakeFS presigner. For reference, here is the apiClient used by the LakeFS presigner, which sets the server endpoint (StorageConfig.lakefsEndpoint):

  private lazy val apiClient: ApiClient = {
    val client = new ApiClient()
    client.setApiKey(StorageConfig.lakefsPassword)
    client.setUsername(StorageConfig.lakefsUsername)
    client.setPassword(StorageConfig.lakefsPassword)
    client.setServers(
      List(
        new ServerConfiguration(
          StorageConfig.lakefsEndpoint,
          "LakeFS API server endpoint",
          new java.util.HashMap[String, ServerVariable]()
        )
      ).asJava
    )
    client
  }

The original LakeFS presigner can be found in LakeFSStorageClient.scala: https://github.com/apache/texera/blob/b7944f3129973fe8a73e92c09e7594b2bff4930b/common/workflow-core/src/main/scala/org/apache/amber/core/storage/util/LakeFSStorageClient.scala

@aicam (Contributor) commented Nov 14, 2025:

@madisonmlin Unfortunately, after iterating on your implementation, I realized it is not viable: it exposes LakeFS to the outside, and I could not find support for this in their library. I contacted the LakeFS developers on Slack, and they asked me to open an issue requesting the feature. The issue is now open, and I will get back to this PR once they add the feature.

@aicam aicam closed this Nov 14, 2025
@chenlica (Contributor):

@aicam Thanks. Does the new finding affect any PR?
