Based on this code, each append loads all of the existing data into memory to check for duplicates and then rewrites the entire Parquet dataset. For items with ~100k existing records, doing this from multiple threads drives memory usage to 100% for every single-record append.
```python
try:
    if epochdate or ("datetime" in str(data.index.dtype) and
                     any(data.index.nanosecond) > 0):
        data = utils.datetime_to_int64(data)
    old_index = dd.read_parquet(self._item_path(item, as_string=True),
                                columns=[], engine=self.engine
                                ).index.compute()
    data = data[~data.index.isin(old_index)]
except Exception:
    return

if data.empty:
    return

if data.index.name == "":
    data.index.name = "index"

# combine old dataframe with new
current = self.item(item)
new = dd.from_pandas(data, npartitions=1)
combined = dd.concat([current.data, new]).drop_duplicates(keep="last")
```
Hi @ovresko, for information, I have started an alternative lib, oups, which has some similarities with pystore. Please beware that this is my first project, but I would gladly accept any feedback on it.
@ranaroussi, I am aware this post may not be welcome, and I am sorry if it comes across as rude. Please remove it if it does.
Why not use fastparquet's `write` method to append the data (its `append` argument accepts True / False / "overwrite")?
https://fastparquet.readthedocs.io/en/latest/api.html#fastparquet.write
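For illustration, a minimal sketch of that suggestion, assuming a hive-style item directory and a small `new_rows` frame (both are hypothetical placeholders, not part of pystore's API):

```python
import pandas as pd
import fastparquet

# Hypothetical new rows to append; in pystore this would be the deduplicated
# `data` frame from the snippet above.
new_rows = pd.DataFrame(
    {"price": [101.5]},
    index=pd.DatetimeIndex(["2024-01-02 09:30:00"], name="index"),
)

item_path = "pystore_data/collection/item"  # hypothetical item directory

# append=True adds a new row group to the already-existing dataset instead of
# reading the old data back and rewriting everything; file_scheme="hive"
# keeps one file per row group in a directory-style dataset. The dataset must
# already exist and the schema must match the new data.
fastparquet.write(
    item_path,
    new_rows,
    file_scheme="hive",
    write_index=True,
    append=True,
)
```

This avoids rewriting the existing row groups; the duplicate check against the old index (as in the snippet above, which reads only the index via `columns=[]`) would still run before the write.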