@bobbai00 (Contributor) commented Sep 22, 2025

Overview

This PR lets the file upload API (POST /dataset/{did}/upload) support both multipart and non-multipart uploads. This is needed because in some cases, e.g. when loading example data while installing Texera in single-node mode, the multipart upload cannot reach the host network.

Changes

  • Modified the file upload endpoint to handle both multipart and non-multipart uploads.

@bobbai00 bobbai00 self-assigned this Sep 22, 2025
@bobbai00 bobbai00 force-pushed the fix/add-more-flexible-upload branch from 3b7a272 to bab69e2 Compare September 22, 2025 06:37
@bobbai00 bobbai00 changed the title from "fix(dataset): allow file upload api to upload without using multipart upload" to "fix(dataset): allow file upload api not to use multipart-upload" Sep 22, 2025
@bobbai00 bobbai00 requested a review from aicam September 22, 2025 06:39
@bobbai00 bobbai00 added the fix label Sep 22, 2025
@chenlica (Contributor)

@aicam and @aglinxinyuan : please review this PR.

@bobbai00 bobbai00 force-pushed the fix/add-more-flexible-upload branch from c99aa9c to 3b93199 Compare September 22, 2025 21:05
Add environment variable support for all configuration properties in default.conf,
following the same pattern used in gui.conf. This allows deployment configurations
to override defaults without modifying the configuration file directly.

Environment variables added:
- CONFIG_SERVICE_ALWAYS_RESET_CONFIGURATIONS_TO_DEFAULT_VALUES
- GUI_LOGO_LOGO, GUI_LOGO_MINI_LOGO, GUI_LOGO_FAVICON
- GUI_TABS_* for all tab configurations
- DATASET_* for dataset upload configurations

@aglinxinyuan (Contributor) left a comment

Why don't we introduce another endpoint? And why is the multipart upload unable to reach the host network when Texera is installed in single-node mode? It is not good practice to introduce C whenever A does not work with B; the codebase will become extremely large and unmanageable.

config-service {
# Setting to true resets all site settings in the database to the defaults defined in this file.
always-reset-configurations-to-default-values = false
always-reset-configurations-to-default-values = ${?CONFIG_SERVICE_ALWAYS_RESET_CONFIGURATIONS_TO_DEFAULT_VALUES}

Contributor

This change does not belong in this PR.

Contributor Author

I will revert this change later. It is needed for building the current release image.

@bobbai00 (Contributor Author) commented Sep 23, 2025

I agree with the principle.

In the single-node case, all containers run in their own internal network and cannot reach the host machine via localhost. Therefore, when the example dataset is uploaded via multipart upload, the service fails to PUT http://localhost:8080/presigned-url.

Since the current upload API forces multipart upload, I added a new parameter to toggle it. I didn't make it a separate endpoint because, at a high level, both ways (multipart or not) are logically file uploads and should therefore be handled by the same endpoint.

@aglinxinyuan (Contributor) commented Sep 23, 2025

I think we should organize functions by their underlying logic rather than by what their names suggest at a high level. I reviewed the code carefully, and with this flag in place the function skips essentially 99% of its existing logic, so none of the code is reused. In effect, the function now contains two separate implementations: if the flag is true, it executes logic A; otherwise, it executes logic B. This check should happen on the frontend: if we want to do A, call API A, and if we want to do B, call API B. By adding one more flag to the API parameters, we are pushing logic that should be handled at the frontend into the backend.

That aside, I still don’t fully understand the issue: why does the service fail on PUT http://localhost:8080/presigned-url, while LakeFSStorageClient is able to access the service without problems?

@bobbai00 (Contributor Author)

The key difference lies in where the code executes and which network it uses:

LakeFS Storage Client works because:

  • The LakeFS container and the microservices are in the same internal network, so the client can reach LakeFS through its internal service hostname, e.g. http://lakefs:8000.

Multipart upload fails because:

  • The presigned URL is generated with the prefix http://localhost:8000/ (the address at which MinIO is reachable from the host machine).
  • This URL is returned to the client, which in this case is the example-data-loader container.
  • When the client tries to PUT to localhost:8000, it attempts to connect to port 8000 inside its own container, not on the host machine.
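
To make the failure mode concrete, here is a minimal sketch (not Texera code) that checks whether a URL handed to a client points at a loopback host. A client running in a separate container resolves such a URL to itself, whereas a service hostname on the shared internal network stays reachable. The function name `is_loopback_url` and the sample URLs are assumptions made for illustration only.

```python
import ipaddress
from urllib.parse import urlsplit


def is_loopback_url(url: str) -> bool:
    """Return True if the URL's host is 'localhost' or a loopback IP literal."""
    host = urlsplit(url).hostname
    if host is None:
        return False
    if host == "localhost" or host.endswith(".localhost"):
        return True
    try:
        # IP literals such as 127.0.0.1 or ::1 are loopback addresses.
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal, e.g. an internal service name such as "lakefs".
        return False


# Hypothetical URLs mirroring the two cases described above.
presigned = "http://localhost:8000/bucket/object?signature=example"
internal = "http://lakefs:8000/api/v1/repositories"
print(is_loopback_url(presigned), is_loopback_url(internal))
```

A check like this could be used by a loader running inside a container to detect early that a presigned URL will not leave its own network namespace.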

@bobbai00 (Contributor Author)

In this case, what's your suggestion on the API design for supporting non-multipart upload?

@aglinxinyuan (Contributor)

We can introduce one more endpoint just for non-multipart upload.
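
As an illustration of the two designs discussed in this thread (a sketch under stated assumptions, not the actual Texera implementation), the snippet below contrasts a single handler that branches on a flag with two dedicated handlers behind a small route table. All names here are hypothetical, including the `use_multipart` parameter and the `/upload/direct` path.

```python
from typing import Callable, Dict


def upload_multipart(did: int, data: bytes) -> str:
    # Placeholder for the existing presigned-URL, part-by-part flow.
    return f"multipart upload of {len(data)} bytes to dataset {did}"


def upload_direct(did: int, data: bytes) -> str:
    # Placeholder for a single-request upload handled entirely by the service.
    return f"direct upload of {len(data)} bytes to dataset {did}"


# Design A: one endpoint whose behavior is selected by a flag.
def upload_with_flag(did: int, data: bytes, use_multipart: bool = True) -> str:
    if use_multipart:
        return upload_multipart(did, data)
    return upload_direct(did, data)


# Design B: one endpoint per upload method; the caller picks the route.
ROUTES: Dict[str, Callable[[int, bytes], str]] = {
    "POST /dataset/{did}/upload": upload_multipart,
    "POST /dataset/{did}/upload/direct": upload_direct,  # hypothetical path
}


def dispatch(route: str, did: int, data: bytes) -> str:
    # Look up the handler registered for the route and invoke it.
    return ROUTES[route](did, data)
```

Both designs end up calling the same two functions; the difference is whether the choice is made by a parameter inside one handler or by the caller selecting a route.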

@chenlica (Contributor)

@bobbai00 @aglinxinyuan : It may be easier to have a meeting to discuss this topic, then include the discussion result here.

@aglinxinyuan (Contributor)

For code review, it’s better to keep the discussion here. I don’t have any concerns with the design — the discussion is more about coding style.

@chenlica (Contributor)

If a discussion can be done efficiently here, we can do so. If not, please feel free to do an offline discussion and summarize the decision.

@github-actions github-actions bot added the backend Anything related to backend services label Oct 11, 2025