fix handling values out of bounds in pandas #272

Closed
wants to merge 2 commits from the pandas-out-of-bounds branch

Conversation

@lucasfcosta (Member) commented Nov 30, 2024

I JUST LOVE DATES IN PYTHON DON'T YOU??????????????

I only got three hours of sleep last night because of this PR. It was great.

It solves a bug that would cause our queries to break when handling timestamps like -62135596800000000 (microseconds since the Unix epoch, i.e. 0001-01-01T00:00:00 UTC).

ArrowInvalid - Casting from timestamp[us, tz=UTC] to timestamp[ns] would result in out of bounds timestamp: -62135596800000000
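
For context, here's a minimal sketch that reproduces the same error with pyarrow alone (exact wording may vary by pyarrow version):

```python
import pyarrow as pa

# 0001-01-01T00:00:00 UTC expressed as microseconds since the Unix epoch.
too_old = pa.array([-62135596800000000], type=pa.timestamp("us", tz="UTC"))

# The default (safe) cast refuses to overflow into nanoseconds, so this raises
# ArrowInvalid: "... would result in out of bounds timestamp: -62135596800000000".
too_old.cast(pa.timestamp("ns"))
```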

The reason this was such a painful pull request is that:

  1. Google's BigQuery library simply ignores the dtypes you pass it. Therefore I couldn't just tell it to force timestamp columns to datetime64[us] when creating the pyarrow table (which is then used to create the pandas data frame). Even if I could, I would eventually have found out that it wouldn't have worked either.
    That conversion still wouldn't work because it would still be possible for values to be too small for pyarrow to handle, which would make it crash.
  2. The right way to handle this was to ask pyarrow for timestamp_as_object when converting the BQ table to pandas.
    Cast non-nanosecond timestamps (np.datetime64) to objects. This is useful in pandas version 1.x if you have timestamps that don’t fit in the normal date range of nanosecond timestamps (1678 CE-2262 CE). Non-nanosecond timestamps are supported in pandas version 2.0. If False, all timestamps are converted to datetime64 dtype.

    The problem with using pyarrow directly is that I had to pass this argument myself, because the BigQuery library doesn't take any arguments that allow us to control its internal behaviour.
    This lack of control led me to use to_arrow_iterable instead of to_dataframe_iterable.
    I then had to call to_pandas on each chunk so that I could pass timestamp_as_object=True (see the first sketch after this list).
  3. Once I did that, things were still breaking because Pandas is absolutely great and can only handle dates as old as 1677-09-21 00:12:43.145225. That meant it still broke when calling pd.to_datetime (see the second sketch after this list).

    Ah, but can't you just store dates in micros, millis, or seconds?
    NO, YOU CAN'T.
    Even if you only want a date with second or microsecond resolution, pandas will still store it internally in nanoseconds. There is no option in pandas to store a timestamp outside of the above-mentioned range. (Source)
    Note that you also can't just coerce out-of-bounds values to NaT and replace them later, because you will end up storing the fallback value as nanos anyway 🤷‍♂️

  4. "Ah, but then it was fine, no?" — asks the reader.
    NO, IT WASN'T FINE.
    It turns out we use Pandas' read_parquet. Can you guess what it does?
    YES, IT TRIES TO USE PANDAS' DATETIME TYPE FOR THE TIMESTAMP COLUMNS, SO IT BREAKS AGAIN.
    The way to solve this issue was to read the parquet as a pyarrow.Table so that we could pass timestamp_as_object=True to to_pandas once again (see the last sketch below).
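
Here's a minimal sketch of the approach from point 2, assuming a standard BigQuery client; the query and table name are placeholders, not the actual code in this PR:

```python
from google.cloud import bigquery

client = bigquery.Client()
rows = client.query("SELECT * FROM `my_project.my_dataset.my_table`").result()

# to_dataframe_iterable() gives us no way to forward options to pyarrow, so we
# iterate over Arrow record batches and do the pandas conversion ourselves.
for record_batch in rows.to_arrow_iterable():
    # timestamp_as_object=True keeps out-of-range timestamps as Python datetime
    # objects instead of coercing them into datetime64[ns].
    chunk = record_batch.to_pandas(timestamp_as_object=True)
    # ... process the chunk ...
```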
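
And a quick illustration of the pandas limitation from point 3 (behaviour as of pandas 1.x, where everything is stored as datetime64[ns]):

```python
import pandas as pd

print(pd.Timestamp.min)  # 1677-09-21 00:12:43.145224193
print(pd.Timestamp.max)  # 2262-04-11 23:47:16.854775807

# The value from the original error, interpreted as microseconds since the epoch,
# falls outside that range, so this raises pandas.errors.OutOfBoundsDatetime.
pd.to_datetime(-62135596800000000, unit="us")
```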
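
Finally, the read_parquet workaround from point 4, sketched with a placeholder file name:

```python
import pyarrow.parquet as pq

# pd.read_parquet() would convert timestamp columns to datetime64[ns] and blow up
# again, so read the file as a pyarrow Table and control the conversion ourselves.
table = pq.read_table("chunk.parquet")
df = table.to_pandas(timestamp_as_object=True)
```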

It doesn't sound that bad when you read this step-by-step explanation, but I can assure you that figuring all of this out was terrible. I just don't understand how Python docs can be so difficult to navigate.

Anyway, that's my rant.

@lucasfcosta force-pushed the pandas-out-of-bounds branch 4 times, most recently from b3fd5a8 to 7a9553f on December 1, 2024 at 15:54
@vieiralucas (Member) commented:

fixed by 45bf2a7

@vieiralucas closed this Dec 2, 2024