pydata egress: crashes on DataFrame columns with non-identifier names #116

@discreteds

Problem

pydata.egress converts DataFrame column names to namedtuple field names during the to_list_of_dataclasses() path. If a column name isn't a valid Python identifier (e.g., foo-bar, the reserved word class, or a name containing spaces or special characters), the namedtuple construction fails.

Reproduction

import polars as pl
from mountainash.pydata import PydataEgress

df = pl.DataFrame({"valid_col": [1], "foo-bar": [2], "class": [3]})
# Converting df to dataclasses or named tuples through egress
# (the to_list_of_dataclasses() path) → crash on "foo-bar" and "class"
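
The underlying failure can be shown without pydata at all. The sketch below assumes egress builds its row type with the standard library's collections.namedtuple (the issue says "namedtuple construction" but does not name the exact call); the Row type name is hypothetical.

```python
from collections import namedtuple

# Same column names as the DataFrame in the reproduction above.
columns = ["valid_col", "foo-bar", "class"]

error_message = None
try:
    # Assumption: egress does something equivalent to this internally.
    namedtuple("Row", columns)
except ValueError as exc:
    # namedtuple raises ValueError for names that are not valid
    # identifiers, and also for Python keywords such as "class".
    error_message = str(exc)
    print(f"namedtuple rejected the columns: {error_message}")

# The standard library's own escape hatch: rename=True replaces each
# invalid field name with a positional name (_1, _2, ...).
RenamedRow = namedtuple("Row", columns, rename=True)
print(RenamedRow._fields)
```

Note that rename=True discards the original names, so it avoids the crash but is probably not what a user wants for unmapped provider columns.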

Context

Discovered while designing mountainash-wearables' data querying layer. When conforming API responses with keep_only_mapped=False, unmapped provider columns flow into egress. Their names are arbitrary: some are valid identifiers (average_heartrate_bpm), but others are not (foo-bar, or even reserved words), and those crash egress.

Workaround: Use keep_only_mapped=True in TypeSpecs to avoid the issue entirely. Raw data is preserved via a sidecar list, not through unmapped columns.

Suggestion

Egress should either:

  1. Sanitize column names before namedtuple construction (replace invalid characters with _, add a prefix to names that start with a digit, and rename reserved words)
  2. Skip/warn on non-identifier columns rather than crashing
  3. Use a dict-based intermediate instead of namedtuple when column names aren't all valid identifiers
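
Option 1 could look roughly like the sketch below. This is an illustration under stated assumptions, not pydata's implementation: the function name sanitize_column_names, the col_ prefix, and the numeric de-duplication suffix are all hypothetical choices. A col_ prefix is used instead of a bare underscore because namedtuple also rejects field names that start with an underscore.

```python
import keyword
import re
from collections import namedtuple


def sanitize_column_names(columns):
    """Map arbitrary column names to unique, namedtuple-safe identifiers.

    Hypothetical helper: the name and the renaming rules are
    illustrative assumptions, not part of pydata.
    """
    seen = set()
    sanitized = []
    for name in columns:
        # Replace anything outside [A-Za-z0-9_] with an underscore.
        candidate = re.sub(r"\W", "_", str(name), flags=re.ASCII)
        # namedtuple rejects names starting with a digit or underscore,
        # so prefix anything that does not start with a letter.
        if not candidate or not candidate[0].isalpha():
            candidate = "col_" + candidate
        # Reserved words such as "class" get a trailing underscore.
        if keyword.iskeyword(candidate):
            candidate += "_"
        # Two different inputs can sanitize to the same name
        # ("foo-bar" and "foo_bar"), so de-duplicate with a suffix.
        unique, suffix = candidate, 2
        while unique in seen:
            unique = f"{candidate}_{suffix}"
            suffix += 1
        seen.add(unique)
        sanitized.append(unique)
    return sanitized


original = ["valid_col", "foo-bar", "class", "1st", "foo_bar"]
safe = sanitize_column_names(original)
print(dict(zip(original, safe)))

# The sanitized names are accepted by namedtuple.
Row = namedtuple("Row", safe)
row = Row(1, 2, 3, 4, 5)
print(row)
```

Keeping the original-to-sanitized mapping around (the dict printed above) would let egress warn about renamed columns, which overlaps with option 2. With polars, the same mapping could be applied via DataFrame.rename before conversion.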

This is low priority since keep_only_mapped=True avoids it, but it's a surprising failure mode for users who don't expect column names to matter.
