drop fastparquet and use pyarrow. this is required on latest versions of dask #70

Open
gnzsnz opened this issue Mar 22, 2024 · 4 comments

Comments

@gnzsnz
Contributor

gnzsnz commented Mar 22, 2024

Drop fastparquet as the default engine and implement pyarrow.

The latest versions of dask use pyarrow as the default engine, and pandas is heading in the same direction.

@gnzsnz
Contributor Author

gnzsnz commented Mar 22, 2024

pull request #71 created

@gnzsnz
Contributor Author

gnzsnz commented May 6, 2024

After a series of commits PR #71 provides a working version of pystore.

In its current state, the master branch and the latest PyPI package are not usable (at least not with the latest versions of parquet and dask). At the very least, requirements.txt would need an update restricting the pandas and dask versions. I have not identified the right combination.

This PR has 2 problems:

  • Multi-threading is causing data loss, at least for appends. The cause seems to be that every append now writes into a temp item (named "__" + item), then deletes the current item and renames the temp item. This is required because dask does not support overwriting a dataset with the same dataframe. There are two ways to solve this: load the full dataset into memory and then write, or write into a temp file and then move it. I have chosen the latter as a fix. In multi-threaded mode the write runs in parallel with the delete and move operations, which causes problems. This is currently an open issue.
  • Any data stored in metadata.json is not moved into the new metadata file, pystore_metadata.json. The rename is needed to avoid conflicts with pyarrow, so any operational datastore would require a script to update its metadata.
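To illustrate the first problem, here is a minimal sketch of the write-to-temp, delete, rename sequence with a lock around it, so that no thread can delete or move an item while another is still writing. This is not the code in PR #71: `replace_item`, `write_fn`, and the single module-level lock are hypothetical names chosen for illustration, and it only uses the standard library.

```python
import shutil
import threading
from pathlib import Path

# Hypothetical: one lock for every item. A per-item lock would allow
# unrelated items to be written in parallel.
_swap_lock = threading.Lock()


def replace_item(collection_dir, item, write_fn):
    """Write into a temp item ("__" + item), then swap it in for the current item.

    write_fn(temp_path) is an assumed callback that must finish writing before
    it returns (with dask, for example, a to_parquet call that computes eagerly).
    """
    collection_dir = Path(collection_dir)
    current = collection_dir / item
    temp = collection_dir / ("__" + item)

    # Write, delete, and rename happen as one critical section.
    with _swap_lock:
        if temp.exists():
            shutil.rmtree(temp)  # leftover from an interrupted run
        temp.mkdir(parents=True)
        write_fn(temp)
        if current.exists():
            shutil.rmtree(current)
        temp.rename(current)
    return current
```

The lock serializes appends, which trades throughput for safety. It only protects threads inside one process; separate processes writing to the same datastore would need a different mechanism.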

Besides that, pystore is now working.
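For the metadata problem, a migration script could look roughly like the sketch below. It assumes that the old metadata.json files sit somewhere under the datastore root and that copying each one to pystore_metadata.json in the same directory is enough; `migrate_metadata` is a hypothetical helper, not part of pystore.

```python
import shutil
from pathlib import Path


def migrate_metadata(store_root):
    """Copy every metadata.json under store_root to pystore_metadata.json.

    Existing pystore_metadata.json files are left untouched, and the old
    files are kept so the migration can be re-run or rolled back.
    """
    migrated = []
    for old in Path(store_root).rglob("metadata.json"):
        new = old.with_name("pystore_metadata.json")
        if not new.exists():
            shutil.copy2(old, new)
            migrated.append(new)
    return migrated
```

Running it twice is harmless: the second run finds the new files already in place and copies nothing.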

@ranaroussi please review this PR

@ranaroussi if you need help maintaining this package, I volunteer.

@gnzsnz
Contributor Author

gnzsnz commented May 9, 2024

fix for multi-threading applied in f8e2dfd

still making up my mind whether this is THE fix, or just a step in the right direction

@ranaroussi I would appreciate your comments

@gnzsnz
Contributor Author

gnzsnz commented Jun 18, 2024

any update on this?
