
BUG: pd.read_parquet raises exception filtering on Period type columns #62769

@spillz


Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

#!/usr/bin/env python3
# Repro: pandas.read_parquet(filters=...) does not accept pandas Period values,
# and there is no documented way to pass the correct physical scalar via pandas API.
#
# Expected: Either accept Period in filters (map to physical storage), or document
# an official helper to build Arrow-coercible filters from pandas logical types.

import os, sys, json, tempfile
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

print("Versions:")
print("  python  :", sys.version.split()[0])
print("  pandas  :", pd.__version__)
print("  pyarrow :", pa.__version__)
print()

# --- build a tiny dataset: ints, Period[M], datetimes ---
months = pd.period_range("2024-01", "2024-12", freq="M")
df = pd.DataFrame({
    "n": np.arange(1, 13, dtype="int64"),
    "per_m": months,                                 # pandas Period[M]
    "ts_m": months.to_timestamp(how="start")         # datetime64[ns], first of month
})

# write to a temp dir/file
tmpdir = tempfile.mkdtemp(prefix="pandas_period_filter_repro_")
path = os.path.join(tmpdir, "mini.parquet")
df.to_parquet(path, engine="pyarrow", index=False)

print("Wrote:", path)
print()

# --- inspect physical schema + pandas metadata ---
dset = ds.dataset(path, format="parquet")
schema = dset.schema
print("Arrow physical schema:", schema)
pmeta = json.loads((schema.metadata or {}).get(b"pandas", b"{}").decode() or "{}")
cur_meta = {c["name"]: c for c in pmeta.get("columns", [])}
print("Pandas metadata for columns:")
for k, v in cur_meta.items():
    if "metadata" in v and isinstance(v["metadata"], dict):
        v = dict(v, metadata={"keys": list(v["metadata"].keys())})
    print(" ", k, "→ pandas_type:", v.get("pandas_type"), "metadata:", v.get("metadata"))
print()

# Helper for pretty printing results
def show(label, pdf):
    print(label)
    print(pdf.sort_values(["n"]).reset_index(drop=True))
    print()

# --- 1) numeric filter: works ---
pdf_num = pd.read_parquet(
    path,
    engine="pyarrow",
    columns=["n", "per_m", "ts_m"],
    filters=[("n", ">=", 7)]
)
show("Numeric filter n>=7 (EXPECTED TO WORK):", pdf_num)

# --- 2) timestamp range filter: works ---
start = pd.Timestamp("2024-07-01")
pdf_ts = pd.read_parquet(
    path,
    engine="pyarrow",
    columns=["n", "per_m", "ts_m"],
    filters=[("ts_m", ">=", start)]
)
show("Timestamp range filter July 2024 (EXPECTED TO WORK):", pdf_ts)

# --- 3a) period inequality filter (>=): fails (cannot convert Period) ---
m = pd.Period("2024-07", freq="M")
try:
    pdf_per = pd.read_parquet(
        path,
        engine="pyarrow",
        columns=["n", "per_m", "ts_m"],
        filters=[("per_m", ">=", m)]
    )
    show("Period equality filter per_m >= 2024-07 (UNEXPECTED, but if this prints it worked):", pdf_per)
except Exception as e:
    print("Period equality filter per_m>=Period('2024-07','M') raised (EXPECTED BUG/UNSUPPORTED):")
    print(" ", type(e).__name__ + ":", e)
    print()

# --- 3b) period ordinal inequality filter (>=): also fails via pandas filters ---
# The column's storage is integer (ordinals), but Arrow exposes it as the
# pandas.period extension type, so comparing it against a plain Python int
# (m.ordinal) finds no matching compute kernel. The pandas metadata mapping
# used on read is not applied to filter values.
try:
    pdf_ord = pd.read_parquet(
        path,
        engine="pyarrow",
        columns=["n", "per_m", "ts_m"],
        filters=[("per_m", ">=", m.ordinal)]  # try integer ordinal directly
    )
    show("Period ordinal filter per_m>=m.ordinal (CURRENT BEHAVIOR):", pdf_ord)
    if pdf_ord.empty:
        print("  Note: No rows returned. Ordinal inequality through pandas filters did not match.\n")
except Exception as e:
    print("Period ordinal filter raised:")
    print(" ", type(e).__name__ + ":", e)
    print()

Issue Description

I created a small Parquet file with 3 columns:

  • n – integers 1 .. 12

  • per_m – period[M] values (2024-01 .. 2024-12)

  • ts_m – timestamps (2024-01-01 .. 2024-12-01)

Then used pandas.read_parquet(..., filters=[...]) to test inequality filters (>=) on each column.

Expected Behavior

All three columns should be filterable. For a Period column, either:

  • pandas should accept Period scalars in filters=... and coerce them to the correct Arrow scalar, or

  • there should be a documented way to build a filter that matches the underlying Arrow storage.
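Until something like that exists, one possible workaround is to redirect the Period predicate to a timestamp column, since timestamp filters already work. The sketch below assumes the file carries a timestamp column derived from the periods with `to_timestamp(how="start")`, as `ts_m` is in the repro. The helper name `period_filters_to_timestamp` and its `column_map` argument are hypothetical and not part of pandas.

```python
import pandas as pd

def period_filters_to_timestamp(filters, column_map):
    """Rewrite (col, op, Period) tuples to (ts_col, op, Timestamp).

    Hypothetical helper, not a pandas API. column_map maps a Period column
    name to a timestamp column holding each period's start time. Because
    period order and start-time order agree, the comparison operator can
    be kept unchanged.
    """
    rewritten = []
    for col, op, val in filters:
        if isinstance(val, pd.Period) and col in column_map:
            rewritten.append((column_map[col], op, val.to_timestamp(how="start")))
        else:
            rewritten.append((col, op, val))
    return rewritten

rewritten = period_filters_to_timestamp(
    [("per_m", ">=", pd.Period("2024-07", freq="M")), ("n", ">=", 7)],
    {"per_m": "ts_m"},
)
print(rewritten)
```

The rewritten list can then be passed as `filters=` to `pd.read_parquet`. This only helps when such a companion timestamp column was written to the file.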

Observed behavior

Numeric (n >= 7) ✅ works

Timestamp (ts_m >= '2024-07-01') ✅ works

Period (per_m >= Period('2024-07','M')) ❌ fails

Filtering the Period column with a Period scalar raises:

ArrowInvalid: Could not convert Period('2024-07', 'M') with type Period: did not recognize Python value type when inferring an Arrow data type

Filtering with Period.ordinal instead (the integer that matches the extension type's underlying Arrow storage) raises:

ArrowNotImplementedError: Function 'greater_equal' has no kernel matching input types (extension<pandas.period<ArrowPeriodType>>, int16)
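For reference, the ordinal of a monthly Period counts months since 1970-01, so the value passed in the second attempt can be checked by hand. The sketch below uses only the standard library; `monthly_ordinal` is a hypothetical name, and its result is expected to equal `pd.Period("2024-07", freq="M").ordinal`.

```python
from datetime import date

def monthly_ordinal(d: date) -> int:
    """Months elapsed since 1970-01 (the Period[M] ordinal convention)."""
    return (d.year - 1970) * 12 + (d.month - 1)

ordinal = monthly_ordinal(date(2024, 7, 1))
print(ordinal)  # 654

# 654 does not fit in 8 bits but does fit in 16, which is consistent with
# the int16 reported on the right-hand side of the comparison.
fits_int8 = -2**7 <= ordinal < 2**7
fits_int16 = -2**15 <= ordinal < 2**15
print(fits_int8, fits_int16)  # False True
```

So the integer itself is the right value. The failure appears to come from the left-hand side being typed as the period extension rather than as a plain integer.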

Installed Versions

INSTALLED VERSIONS

commit : 9c8bc3e
python : 3.13.2
python-bits : 64
OS : Windows
OS-release : 11
Version : 10.0.26100
machine : AMD64
processor : AMD64 Family 26 Model 36 Stepping 0, AuthenticAMD
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : English_United States.1252

pandas : 2.3.3
numpy : 2.2.4
pytz : 2025.2
dateutil : 2.9.0.post0
pip : 24.3.1
Cython : None
sphinx : None
IPython : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.13.4
blosc : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2025.9.0
html5lib : None
hypothesis : None
gcsfs : None
jinja2 : 3.1.6
lxml.etree : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.5
pandas_gbq : None
psycopg2 : None
pymysql : None
pyarrow : 21.0.0
pyreadstat : None
pytest : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.15.3
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlsxwriter : None
zstandard : None
tzdata : 2025.2
qtpy : None
pyqt5 : None

Full Program Output:

Versions:
python : 3.13.2
pandas : 2.3.3
pyarrow : 21.0.0

Wrote: C:\Users\damie\AppData\Local\Temp\pandas_period_filter_repro_min05yqt\mini.parquet

Arrow physical schema: n: int64
per_m: extension<pandas.period<ArrowPeriodType>>
ts_m: timestamp[ns]
-- schema metadata --
pandas: '{"index_columns": [], "column_indexes": [], "columns": [{"name":' + 403
Pandas metadata for columns:
n → pandas_type: int64 metadata: None
per_m → pandas_type: object metadata: None
ts_m → pandas_type: datetime metadata: None

Numeric filter n>=7 (EXPECTED TO WORK):
    n    per_m       ts_m
0   7  2024-07 2024-07-01
1   8  2024-08 2024-08-01
2   9  2024-09 2024-09-01
3  10  2024-10 2024-10-01
4  11  2024-11 2024-11-01
5  12  2024-12 2024-12-01

Timestamp range filter July 2024 (EXPECTED TO WORK):
    n    per_m       ts_m
0   7  2024-07 2024-07-01
1   8  2024-08 2024-08-01
2   9  2024-09 2024-09-01
3  10  2024-10 2024-10-01
4  11  2024-11 2024-11-01
5  12  2024-12 2024-12-01

Period equality filter per_m>=Period('2024-07','M') raised (EXPECTED BUG/UNSUPPORTED):
ArrowInvalid: Could not convert Period('2024-07', 'M') with type Period: did not recognize Python value type when inferring an Arrow data type

Period ordinal filter raised:
ArrowNotImplementedError: Function 'greater_equal' has no kernel matching input types (extension<pandas.period<ArrowPeriodType>>, int16)
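For contrast, the same comparison works once the data is in memory, so a fallback is to read the file without a Period filter and apply the predicate in pandas afterwards. This gives up predicate pushdown and reads every row group. The sketch below builds the frame directly rather than reading Parquet, to keep it self-contained.

```python
import pandas as pd

# Same shape as the repro data, built in memory instead of read from disk.
months = pd.period_range("2024-01", "2024-12", freq="M")
df = pd.DataFrame({"n": range(1, 13), "per_m": months})

# Comparing a period[M] column against a Period scalar is supported in pandas.
m = pd.Period("2024-07", freq="M")
filtered = df[df["per_m"] >= m]
print(filtered["n"].tolist())  # [7, 8, 9, 10, 11, 12]
```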

Labels: Arrow (pyarrow functionality), Bug, IO Parquet (parquet, feather)
