fix handling values out of bounds in pandas #272

Closed
wants to merge 2 commits from the pandas-out-of-bounds branch

Conversation

@lucasfcosta (Member) commented Nov 30, 2024

I JUST LOVE DATES IN PYTHON DON'T YOU??????????????

I only got three hours of sleep last night because of this PR. It was great.

It solves a bug that would cause our queries to break when handling timestamps like -62135596800000000 (microseconds since the Unix epoch, i.e. 0001-01-01T00:00:00 UTC).

ArrowInvalid - Casting from timestamp[us, tz=UTC] to timestamp[ns] would result in out of bounds timestamp: -62135596800000000
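
For context, here's a minimal sketch that reproduces the same error with pyarrow alone (exact wording may vary by pyarrow version):

```python
import pyarrow as pa

# 0001-01-01T00:00:00 UTC expressed as microseconds since the Unix epoch.
too_old = pa.array([-62135596800000000], type=pa.timestamp("us", tz="UTC"))

# The default (safe) cast refuses to overflow into nanoseconds, so this raises
# ArrowInvalid: "... would result in out of bounds timestamp: -62135596800000000".
too_old.cast(pa.timestamp("ns"))
```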

The reason this was such a painful pull request is that:

  1. Google's BigQuery library simply ignores the dtypes you pass it. Therefore I couldn't just tell it to force timestamp columns to datetime64[us] when creating the pyarrow table (which is then used to create the pandas data frame). Even if I could, I would eventually have found out that it wouldn't have worked either.
    That conversion still wouldn't work because it would still be possible for values to be too small for pyarrow to handle, which would make it crash.
  2. The right way to handle this was to ask pyarrow for timestamp_as_object when converting the BQ table to pandas.
    Cast non-nanosecond timestamps (np.datetime64) to objects. This is useful in pandas version 1.x if you have timestamps that don’t fit in the normal date range of nanosecond timestamps (1678 CE-2262 CE). Non-nanosecond timestamps are supported in pandas version 2.0. If False, all timestamps are converted to datetime64 dtype.

    The problem with using pyarrow directly is that I had to pass this argument myself, because the BigQuery library doesn't take any arguments that allow us to control its internal behaviour.
    This lack of control led me to use to_arrow_iterable instead of to_dataframe_iterable.
    I then had to call to_pandas on each chunk so that I could pass timestamp_as_object=True (see the first sketch after this list).
  3. Once I did that, things were still breaking because Pandas is absolutely great and can only handle dates as old as 1677-09-21 00:12:43.145225. That meant it still broke when calling pd.to_datetime (see the second sketch after this list).

    Ah, but can't you just store dates in micros, millis, or seconds?
    NO, YOU CAN'T.
    Even if you only want a date with second or microsecond resolution, pandas will still store it internally in nanoseconds. There is no option in pandas to store a timestamp outside of the above-mentioned range. (Source)
    Note that you also can't just coerce out-of-bounds values to NaT and replace them later, because you will end up storing the fallback value as nanos anyway 🤷‍♂️

  4. "Ah, but then it was fine, no?" — asks the reader.
    NO, IT WASN'T FINE.
    It turns out we use Pandas' read_parquet. Can you guess what it does?
    YES, IT TRIES TO USE PANDAS' DATETIME TYPE FOR THE TIMESTAMP COLUMNS, SO IT BREAKS AGAIN.
    The way to solve this issue was to read the parquet as a pyarrow.Table so that we could pass timestamp_as_object=True to to_pandas once again (see the last sketch below).
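
Here's a minimal sketch of the approach from point 2, assuming a standard BigQuery client; the query and table name are placeholders, not the actual code in this PR:

```python
from google.cloud import bigquery

client = bigquery.Client()
rows = client.query("SELECT * FROM `my_project.my_dataset.my_table`").result()

# to_dataframe_iterable() gives us no way to forward options to pyarrow, so we
# iterate over Arrow record batches and do the pandas conversion ourselves.
for record_batch in rows.to_arrow_iterable():
    # timestamp_as_object=True keeps out-of-range timestamps as Python datetime
    # objects instead of coercing them into datetime64[ns].
    chunk = record_batch.to_pandas(timestamp_as_object=True)
    # ... process the chunk ...
```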
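
And a quick illustration of the pandas limitation from point 3 (behaviour as of pandas 1.x, where everything is stored as datetime64[ns]):

```python
import pandas as pd

print(pd.Timestamp.min)  # 1677-09-21 00:12:43.145224193
print(pd.Timestamp.max)  # 2262-04-11 23:47:16.854775807

# The value from the original error, interpreted as microseconds since the epoch,
# falls outside that range, so this raises pandas.errors.OutOfBoundsDatetime.
pd.to_datetime(-62135596800000000, unit="us")
```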
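
Finally, the read_parquet workaround from point 4, sketched with a placeholder file name:

```python
import pyarrow.parquet as pq

# pd.read_parquet() would convert timestamp columns to datetime64[ns] and blow up
# again, so read the file as a pyarrow Table and control the conversion ourselves.
table = pq.read_table("chunk.parquet")
df = table.to_pandas(timestamp_as_object=True)
```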

It doesn't sound that bad when you read this step-by-step explanation, but I can assure you that figuring all of this out was terrible. I just don't understand how Python docs can be so difficult to navigate.

Anyway, that's my rant.

@lucasfcosta force-pushed the pandas-out-of-bounds branch 4 times, most recently from b3fd5a8 to 7a9553f on December 1, 2024 at 15:54
@vieiralucas (Member) commented:

fixed by 45bf2a7

@vieiralucas closed this Dec 2, 2024