fix handling values out of bounds in pandas #272
Closed
I JUST LOVE DATES IN PYTHON DON'T YOU??????????????
I only got three hours of sleep last night because of this PR. It was great.
It solves a bug that would cause our queries to break when handling timestamps like `-62135596800000000`.
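For context, here's what that magic number decodes to, assuming it's microseconds relative to the Unix epoch (a minimal sketch using plain `datetime`):

```python
import datetime

# -62135596800000000 µs before the Unix epoch lands exactly on
# 0001-01-01 00:00:00, i.e. datetime.min — the kind of sentinel value
# some databases store for "no date".
epoch = datetime.datetime(1970, 1, 1)
print(epoch + datetime.timedelta(microseconds=-62135596800000000))
# 0001-01-01 00:00:00
```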
The reason this was such a painful pull-request is that pandas doesn't support microsecond-precision datetime `dtypes`. Therefore I couldn't just tell it to force timestamp columns to `datetime64[us]` when creating a pyarrow table (which is then used to create a pandas data frame). Even if I could, I would eventually have found out that it wouldn't work anyway: it would still be possible for numbers to be too small for `pyarrow` to handle, which would make it crash.
What I actually needed was to get `pyarrow` to use `timestamp_as_object` when reading the BQ table. The problem is that I had to pass this argument myself, because the BigQuery library doesn't take any arguments that allow us to control its internal behaviour.

This lack of control led me to use `to_arrow_iterable` instead of `to_dataframe_iterable`. I then had to call `to_pandas` for each chunk so that I could pass `timestamp_as_object=True`.
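In code, the approach looks roughly like this (a sketch, not this PR's exact code — the query string and variable names are made up):

```python
import pandas as pd
from google.cloud import bigquery

client = bigquery.Client()
rows = client.query("SELECT ts FROM dataset.table").result()

chunks = []
for batch in rows.to_arrow_iterable():
    # timestamp_as_object=True makes pyarrow hand back plain Python
    # datetimes instead of trying (and failing) to fit everything
    # into pandas' datetime64[ns].
    chunks.append(batch.to_pandas(timestamp_as_object=True))

df = pd.concat(chunks, ignore_index=True)
```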
Pandas can't represent timestamps before `1677-09-21 00:12:43.145225` (that's the lower bound of its nanosecond-precision datetimes). That means it used to break when calling `pd.to_datetime`.
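You can see the bound for yourself (minimal repro; the exact error message varies by pandas version):

```python
import pandas as pd

print(pd.Timestamp.min)  # 1677-09-21 00:12:43.145224193

# Anything earlier overflows the int64 nanosecond counter:
pd.to_datetime(-62135596800000000, unit="us")
# raises pandas.errors.OutOfBoundsDatetime
```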
So after all of that, everything was fine, right? NO, IT WASN'T FINE.
It turns out we also use Pandas' `read_parquet`. Can you guess what it does? YES, IT TRIES TO USE PANDAS' DATETIME TYPE FOR THE TIMESTAMP COLUMNS, SO IT BREAKS AGAIN.
The way to solve this issue was to read the parquet as a `pyarrow.Table` so that we could pass `to_pandas` the `timestamp_as_object=True` arg once again.
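Something along these lines (again a sketch; the file path is illustrative):

```python
import pyarrow.parquet as pq

# pd.read_parquet(path) would coerce timestamps to datetime64[ns]
# and blow up, so go through pyarrow explicitly instead.
table = pq.read_table("query-result.parquet")
df = table.to_pandas(timestamp_as_object=True)
```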
It doesn't sound that bad when you read this step-by-step explanation, but I can assure you that figuring all of this out was terrible. I just don't understand how Python docs can be so difficult to navigate.
Anyway, that's my rant.