To integrate hudi-rs with AWS SDK for pandas (awswrangler), we must be able to pass the boto_session-related AWS authentication parameters (mostly AWS_* parameters) directly, rather than relying only on environment-variable inference.
I want to propose adding an option to handle this:
storage_options = {"AWS_ACCESS_KEY_ID": "xxxx", "AWS_SECRET_ACCESS_KEY": "xxxx", "AWS_SESSION_TOKEN": "xxxx"}
hudi_table = HudiTable("/tmp/trips_table", storage_options=storage_options)
records = hudi_table.read_snapshot()
Although I want to add this for S3, it should work for other storage backends.
I'm happy to contribute and add this.
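For illustration, here is a minimal sketch of how such a storage_options dict could be built from boto-style credentials. The helper `storage_options_from_credentials` and the `FrozenCredentials` stand-in are hypothetical names, not part of hudi-rs or awswrangler. The key names follow the standard AWS environment variables; whether hudi-rs accepts exactly these keys is an assumption.

```python
from typing import Dict, NamedTuple, Optional


class FrozenCredentials(NamedTuple):
    """Hypothetical stand-in exposing the same fields as botocore's frozen credentials."""
    access_key: str
    secret_key: str
    token: Optional[str] = None


def storage_options_from_credentials(creds, region: Optional[str] = None) -> Dict[str, str]:
    """Map credential fields onto AWS_* keys (assumed to be what the table accepts)."""
    options = {
        "AWS_ACCESS_KEY_ID": creds.access_key,
        "AWS_SECRET_ACCESS_KEY": creds.secret_key,
    }
    # Session tokens exist only for temporary credentials, so the key is optional.
    if creds.token:
        options["AWS_SESSION_TOKEN"] = creds.token
    if region:
        options["AWS_REGION"] = region
    return options


# Example with placeholder values instead of a real session.
storage_options = storage_options_from_credentials(
    FrozenCredentials("xxxx", "xxxx", "xxxx"), region="us-east-1"
)
```

With boto3, the credentials object would likely come from `boto3.Session().get_credentials().get_frozen_credentials()`, which exposes `access_key`, `secret_key`, and `token`.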
I'll wait until #72 gets merged.
I did a first strawman implementation, and it requires some refactoring of Table itself.
@xushiyan I also have some questions about this. Could you give me your opinion on these?
Should we rename Table to HudiTable?
I don't know why Timeline and FileSystemView each use a separate storage instance. Can't they share one? Maybe there's a reason for this design that I can't see at the moment.
Does it make sense to introduce something that holds both Timeline and FileSystemView (basically the table state) and exposes a coherent API?
We keep the name Table within hudi-core to avoid a redundant prefix; everything in hudi-core is about Hudi. When importing it into other crates, we can give it an alias like HudiTable. We can also add an alias in the hudi crate for the external-facing API when needed. As of now, there is no strong need for this.
Timeline is responsible for data stored in timeline files under .hoodie/, and FileSystemView is responsible for the data stored under the table excluding .hoodie/. It's good to keep things less coupled unless there is a need for sharing; it's a stateless client performing IO anyway. Maybe you can make a case for why they should share it?
Currently Table holds Timeline and FileSystemView. Could you elaborate on what you mean by a coherent API?
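The layout described in this reply could be sketched roughly as follows. This is a hypothetical Python illustration, not the actual hudi-core Rust code: the `Storage` class, the constructor signatures, and the attribute names are all assumptions. It only shows Table owning both components, with each one building its own stateless storage client from the same options.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Storage:
    """Hypothetical stateless IO client scoped to a base path."""
    base_path: str
    options: Dict[str, str]


class Timeline:
    """Covers timeline files under .hoodie/ (sketch only)."""

    def __init__(self, table_path: str, options: Dict[str, str]):
        self.storage = Storage(f"{table_path.rstrip('/')}/.hoodie", dict(options))


class FileSystemView:
    """Covers data files under the table path, excluding .hoodie/ (sketch only)."""

    def __init__(self, table_path: str, options: Dict[str, str]):
        self.storage = Storage(table_path.rstrip("/"), dict(options))


class Table:
    """Holds both components; each gets its own storage client built from the same options."""

    def __init__(self, table_path: str, storage_options: Optional[Dict[str, str]] = None):
        options = dict(storage_options or {})
        self.timeline = Timeline(table_path, options)
        self.file_system_view = FileSystemView(table_path, options)


table = Table("/tmp/trips_table", {"AWS_REGION": "us-east-1"})
```

Because the clients hold no state beyond their configuration, passing the same options to both keeps the two components decoupled without duplicating any credential handling.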