Member

@theroggy theroggy commented Oct 17, 2024

This PR improves support for datetime columns, mainly in read_dataframe and write_dataframe.

In general, the PR tries to ensure that:

  • datetime column data from a file can be read into a GeoDataFrame without data loss. For this, two parameters have been added to read_dataframe:
    • datetime_as_string: return datetime columns as ISO8601 strings.
    • mixed_offsets_as_utc:
      • True: always return datetime columns as pandas datetime64 columns. If a column contains, for example, data with mixed timezone offsets, the datetimes are converted to UTC, because pandas datetime64 columns don't support such data. This was the behaviour before this PR and remains the default.
      • False: return the datetime column values with timezone information as they were read from the file. In this case, mixed timezone columns are returned as object columns with Python datetime values, so that the timezone information is not lost. Use this option if you want datetime data to be round-tripped correctly in most situations. This is also roughly the default behaviour of the pandas.to_datetime function in pandas < 3.
  • the treatment of datetimes is, as far as possible, consistent whether or not arrow is used. For use_arrow=True, there are several situations where GDAL 3.11 is needed to get correct results.
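A minimal usage sketch of the two new parameters, assuming a pyogrio version that includes this PR. The parameter names come from the description above; the helper `read_with_offsets`, its `mode` argument and the `DATETIME_MODES` mapping are hypothetical names used only for illustration.

```python
# Hypothetical mapping from a readable mode name to the keyword arguments
# described in this PR. Only the keyword names come from the PR text.
DATETIME_MODES = {
    # Datetime columns are returned as ISO8601 strings.
    "string": {"datetime_as_string": True},
    # Mixed-offset columns are returned as object columns of datetime values.
    "preserve": {"mixed_offsets_as_utc": False},
    # Default behaviour: mixed-offset columns are converted to UTC.
    "utc": {"mixed_offsets_as_utc": True},
}


def read_with_offsets(path, mode="utc", **kwargs):
    """Read a file with pyogrio, choosing how datetime columns are returned.

    Illustrative helper, not part of pyogrio.
    """
    if mode not in DATETIME_MODES:
        raise ValueError(f"unknown mode: {mode!r}")

    # Imported lazily so the sketch can be defined without pyogrio installed.
    from pyogrio import read_dataframe

    return read_dataframe(path, **DATETIME_MODES[mode], **kwargs)
```

For example, `read_with_offsets("data.gpkg", mode="preserve", use_arrow=True)` would request the round-trip friendly behaviour; the file name is a placeholder.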

More specifically:

  • Fix: when a GPKG was read with use_arrow, naive datetimes (no timezone) were interpreted as being UTC. So a naive time of 05:00 h was interpreted as 05:00 UTC.
  • Fix: when a .fgb file was read with use_arrow, the timezone of timezone-aware datetime columns was dropped, so 05:00+05:00 was read as 05:00.
  • Fix: when a file was written with use_arrow, the timezone of datetime columns with any timezone other than UTC was dropped, so 05:00+05:00 was written as 05:00 (a naive datetime). This affected all file types.
  • When reading datetimes with use_arrow, don't convert/represent them as being in UTC time if they have another timezone offset in the dataset.
  • Add support for writing columns with mixed timezones. Typically, the column needs to be of the object type with pandas.Timestamp or datetime objects in it, as "standard" pandas datetime64 columns don't support mixed timezone offsets in a column.
  • Add support for reading mixed timezone datetimes. These are returned in an object column with datetime objects.
  • For the cases with use_arrow, the fixes typically require GDAL >= 3.11 (OGRLayer::GetArrowStream(): add a DATETIME_AS_STRING=YES/NO option, OSGeo/gdal#11213).
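The distinctions in the list above can be illustrated with the Python standard library alone. This is a sketch of the concepts, not of the pyogrio code: the date is arbitrary and all variable names are made up for the example.

```python
from datetime import datetime, timedelta, timezone

plus5 = timezone(timedelta(hours=5))
naive = datetime(2024, 10, 17, 5, 0)                # 05:00, no time zone
aware = datetime(2024, 10, 17, 5, 0, tzinfo=plus5)  # 05:00+05:00

# Problem 1: a naive 05:00 is labelled as 05:00 UTC.
# This invents time zone information that was never in the data.
misread_as_utc = naive.replace(tzinfo=timezone.utc)

# Problem 2: the offset of 05:00+05:00 is dropped, leaving a naive 05:00.
# The instant in time can no longer be recovered.
offset_dropped = aware.replace(tzinfo=None)

# Converting to UTC keeps the instant but loses the original offset.
as_utc = aware.astimezone(timezone.utc)

# Mixed offsets survive a round trip when each value carries its own offset,
# for example as datetime objects or as ISO8601 strings.
mixed = [aware, datetime(2024, 10, 17, 5, 0, tzinfo=timezone(timedelta(hours=-3)))]
iso = [dt.isoformat() for dt in mixed]
roundtripped = [datetime.fromisoformat(s) for s in iso]
```

Keeping one offset per value is what an object column of datetime values allows, whereas a single datetime64 column has one time zone for the whole column.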

Resolves #487
Resolves #123
Resolves #553

@theroggy theroggy changed the title ENH: deal properly with naive datetimes with arrow TST: add tests exposing some issues with datetimes with arrow? Oct 18, 2024
Member

@jorisvandenbossche jorisvandenbossche left a comment

Thanks for diving into this and improving the test coverage!

@theroggy theroggy changed the title TST: add tests exposing some issues with datetimes with arrow? ENH: improve datetime support with arrow for GDAL >= 3.11 Jan 16, 2025
@theroggy theroggy changed the title ENH: improve datetime support with arrow for GDAL >= 3.11 ENH: improve read support for naive and mixed datetimes with arrow for GDAL >= 3.11 Jan 16, 2025
@theroggy theroggy changed the title ENH: improve read support for naive and mixed datetimes with arrow for GDAL >= 3.11 ENH: improve read support for datetimes with arrow for GDAL >= 3.11 Jan 16, 2025
@theroggy theroggy changed the title ENH: improve read support for datetimes with arrow for GDAL >= 3.11 ENH: improve read support for datetime columns with arrow for GDAL >= 3.11 Jan 16, 2025
@theroggy theroggy changed the title ENH: improve read support for datetime columns with arrow for GDAL >= 3.11 ENH: improve support for datetime columns with mixed or naive times Jan 17, 2025
@theroggy theroggy marked this pull request as ready for review January 18, 2025 08:43
Member

@jorisvandenbossche jorisvandenbossche left a comment

@theroggy thanks for further looking into this!

I do have some doubts about how much effort we should do to cover corner cases and what the desired default behaviour should be, see my comments below.

@theroggy theroggy marked this pull request as ready for review November 12, 2025 23:39
Member

@jorisvandenbossche jorisvandenbossche left a comment

Some doc comments (I am happy to push some rephrased suggestions), and I noticed one more issue in the datetime conversion logic.

Comment on lines +586 to +588
assert is_datetime64_dtype(df.datetime_col)
assert df.datetime_col.iloc[0] == pd.Timestamp(1670, 1, 1, 9, 0, 0)
assert df.datetime_col.iloc[0].unit == "ms"
Member

Is there a reason you removed the assert_series_equal(df.datetime_col, exp_dates) from a previous iteration?

Member Author

@theroggy theroggy Nov 21, 2025

Not 100% sure, but I suppose I didn't find an easy way to create the exp_dates series for pandas < 3.0 with use_arrow=True, as pd.to_datetime overflows on pandas < 3.0. As an alternative, I replaced it with a check that the value and unit are correct.
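For context, an overflow like this is what one would expect if the conversion goes through nanosecond resolution: a signed 64-bit count of nanoseconds since 1970 only reaches back to roughly the year 1677, so the test value in 1670 does not fit, while the same value in milliseconds does. The sketch below checks that arithmetic with the standard library only. It does not call pandas, and the helper `epoch_offset` is a hypothetical name.

```python
from datetime import datetime, timedelta

INT64_MIN = -(2**63)
EPOCH = datetime(1970, 1, 1)


def epoch_offset(value, unit):
    """Return the offset of a naive datetime from 1970-01-01 in the given unit."""
    micros = (value - EPOCH) // timedelta(microseconds=1)
    if unit == "ns":
        return micros * 1000
    if unit == "ms":
        return micros // 1000
    raise ValueError(f"unsupported unit: {unit!r}")


# The value used in the test under discussion.
test_value = datetime(1670, 1, 1, 9, 0, 0)

# Does the offset fit in a signed 64-bit integer at each resolution?
fits_ns = epoch_offset(test_value, "ns") >= INT64_MIN
fits_ms = epoch_offset(test_value, "ms") >= INT64_MIN
```

Under these assumptions the value is only representable at the coarser resolution, which matches the `unit == "ms"` assertion in the test.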

@jorisvandenbossche
Member

I will take a look at the failing tests (I assume my latest suggestion causes some issue with older pandas versions).

@jorisvandenbossche
Member

All green! I want to do some more clarifications in the tests, but that can wait for later ;)

I also made some small changes to the docstrings. The diff makes it look like everything changed because the text was reflowed, but they were only minor edits, I think.

@theroggy
Member Author

> All green! I want to do some more clarifications in the tests, but that can wait for later ;)

Great!

> I also made some small changes to the docstrings. The diff makes it look like everything changed because the text was reflowed, but they were only minor edits, I think.

OK! Minor edits indeed. As you changed "timezone" to "time zone" in some places, I suppose you have a preference for that, so I made the change consistently throughout the code base.
