pyarrow: TypeError: __cinit__() got an unexpected keyword argument 'times' #75

Open

JAC28 opened this issue Oct 17, 2024 · 0 comments

I stumbled over the following bizarre error when writing data to a collection:

in pyarrow._parquet.ParquetWriter.__cinit__()
TypeError: __cinit__() got an unexpected keyword argument 'times'

The error is caused by:

in Collection.write(self, item, data, metadata, npartitions, overwrite, epochdate, reload_items, **kwargs)
dd.to_parquet(data, self._item_path(item, as_string=True), overwrite=overwrite, compression="snappy", engine=self.engine, **kwargs)
in to_parquet(df, path, compression, write_index, append, overwrite, ignore_divisions, partition_on, storage_options, custom_metadata, write_metadata_file, compute, compute_kwargs, schema, name_function, filesystem, engine, **kwargs)

After some research I found this line to be responsible:

if (1 == data.index.nanosecond).any() and "times" not in kwargs:

as it adds the keyword argument 'times', which is then forwarded through every layer but is not understood by dask or pyarrow's ParquetWriter. The argument is injected whenever any index timestamp has a 1 in its nanosecond component, which can easily happen with measured or imported data:

import pandas as pd
import numpy as np
from pystore import store

index = pd.date_range('1/1/2024 00:00:00', '1/1/2024 10:00:00', freq='1s')
index += pd.to_timedelta(np.random.default_rng().integers(low=0, high=10, size=len(index)), unit='ns')  # add random nanosecond fragments, e.g. measurement inaccuracies
columns = ["A", "B", "C"]
data = np.random.rand(len(index), len(columns))
df = pd.DataFrame(data=data, index=index, columns=columns)
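To see the trigger in isolation, here is a minimal sketch (mirroring the nanosecond check quoted above) of how a single stray 1 ns fragment flips the condition, and how rounding clears it:

```python
import numpy as np
import pandas as pd

# Index whose timestamps carry stray nanosecond fragments
idx = pd.date_range("2024-01-01", periods=5, freq="1s")
idx += pd.to_timedelta([0, 1, 0, 3, 0], unit="ns")

# This mirrors the check pystore performs before injecting 'times'
has_ns_fragment = (1 == idx.nanosecond).any()
print(has_ns_fragment)  # True -> 'times' would be added to kwargs

# Rounding to microseconds removes the fragments, so the check passes
rounded = idx.round("us")
print((1 == rounded.nanosecond).any())  # False
```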

If you try to save this data to a collection this will fail:

Store = store("ExampleStore")
collection = Store.collection("TestCollection")
collection.write("TestItem", df, overwrite=False)

while rounding the index beforehand will succeed:

df.index = df.index.round(freq="us")  # round to microsecond resolution
collection.write("TestItemRoundedIndex", df, overwrite=False)

I can't understand why the argument is inserted at this point. Does it date back to the version where fastparquet was the engine? Most users probably won't work at nanosecond resolution, but if an entry with a 1 ns fragment occurs by chance, due to inaccuracies, measuring devices or similar, tracking down the cause is difficult.
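If 'times' really is a leftover from the fastparquet engine ('times' is a fastparquet-specific write option), one possible fix on the pystore side would be to strip it before forwarding kwargs to dask when another engine is in use. The function below is a hypothetical sketch, not pystore's actual code:

```python
def sanitize_write_kwargs(engine, **kwargs):
    # 'times' is a fastparquet-specific write option; pyarrow's
    # ParquetWriter rejects unknown keywords, so drop it for any
    # other engine instead of forwarding it blindly.
    if engine != "fastparquet":
        kwargs.pop("times", None)
    return kwargs

# 'times' survives for fastparquet but is dropped for pyarrow
print(sanitize_write_kwargs("pyarrow", times="int96", compression="snappy"))
# → {'compression': 'snappy'}
```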
