Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: support storage_options param when reading from table #139

Closed
kazdy opened this issue Sep 14, 2024 · 3 comments · Fixed by #163
Closed

feat: support storage_options param when reading from table #139

kazdy opened this issue Sep 14, 2024 · 3 comments · Fixed by #163
Assignees
Labels
Milestone

Comments

@kazdy
Copy link
Contributor

kazdy commented Sep 14, 2024

To integrate hudi-rs with AWS SDK for Pandas (aws wrangler), we must be able to pass boto_session related aws authentication params (mostly AWS_* params) directly and not only rely on env variable inference.

I want to propose adding an option to handle this:

storage_options = {"AWS_ACCESS_KEY_ID": "xxxx", "AWS_SECRET_ACCESS_KEY":"xxxx", "AWS_SECRET_ACCESS_TOKEN":"xxxx"}
hudi_table = HudiTable("/tmp/trips_table", storage_options=storage_options)
records = hudi_table.read_snapshot()

Although I want to add this for S3, it should work for other storage backends.
I'm happy to contribute and add this.

@xushiyan xushiyan added this to the release-0.2.0 milestone Sep 14, 2024
@xushiyan
Copy link
Member

@kazdy sounds good. feel free to take this up and send a pr

@kazdy
Copy link
Contributor Author

kazdy commented Sep 24, 2024

I'll wait until #72 gets merged.
I did the first strawman impl and it requires some refactoring in the Table itself.

@xushiyan I also have some questions about this, maybe you can give me your opinion on these:

  1. Should we rename Table to HudiTable?
  2. I don't know why Timeline and FileSystemView both use separate storage instances, can't they share it, maybe there's a reason why it's done this way I can't see atm?
  3. Does it make sense to introduce something that will hold both Timeline and FileSystemView (basically table state) and expose coherent API?

thanks

@xushiyan
Copy link
Member

hey @kazdy

  1. we keep name Table within hudi-core to avoid redundant prefix; everything in hudi-core is about Hudi. When import to other crates, we can give it an alias like HudiTable. We can also add an alias in hudi crate for external facing API when needed. As of now, no strong need for this.

  2. Timeline is responsible for data stored in timeline files under .hoodie/, and FileSystemView is responsible for the data stored under the table excluding .hoodie/. It's good to keep things less coupled, unless there is a need for sharing - it's a stateless client performing IO anyway. Maybe you can make a case about why sharing it?

  3. Currently Table holds Timeline and FileSystemView. You want to elaborate on what you meant by coherent API?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants