ENH: read_csv with usecols shouldn't change column order #61386

Open
1 of 3 tasks
amarvin opened this issue May 1, 2025 · 5 comments
Assignees
Labels
Enhancement Needs Triage Issue that has not been reviewed by a pandas team member

Comments

@amarvin

amarvin commented May 1, 2025

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

The documentation for pandas.read_csv(usecols=[...]) says that the list of columns is treated like an unordered set (updated in #18673 and #53763), so the returned DataFrame won't necessarily have its columns in the order given. This differs from other pandas readers (e.g., pandas.read_parquet(columns=[...])). I think the order should be preserved. If usecols is converted to a set, I think it should instead be converted to an OrderedSet or to the keys of a collections.OrderedDict (or just a dict in Python 3.7+, where insertion order is guaranteed).

Feature Description

import pandas as pd

# Example CSV file (replace with your actual file)
csv_data = """
col1,col2,col3,col4
A,1,X,10
B,2,Y,20
C,3,Z,30
"""

with open("example.csv", "w") as f:
    f.write(csv_data)

# Desired column order
desired_order = ['col3', 'col1', 'col4']

# Read CSV with usecols (selects the columns but ignores the order of the list)
df = pd.read_csv("example.csv", usecols=desired_order)

print(df)  # columns come back in file order (col1, col3, col4), not the requested order

# Reindex the DataFrame to enforce the desired order (a popular workaround that I think shouldn't be required)
# One solution is to apply this line inside `read_csv` when the `usecols` kwarg is given
df = df[desired_order]

print(df)  # requested column order (col3, col1, col4)

Alternative Solutions

Instead of converting usecols to a set, use the keys of a dict, which preserve insertion order in Python 3.7+.

Additional Context

No response

@amarvin amarvin added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels May 1, 2025
@amarvin
Author

amarvin commented May 1, 2025

This is also an issue for pandas.read_excel(usecols=[...]).

@amarvin
Author

amarvin commented May 1, 2025

Others are confused by the current behaviour too and have to use a workaround: https://stackoverflow.com/a/40024462/6068036
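
The workaround amounts to selecting the columns and then reindexing by the same list. A minimal sketch, assuming `usecols` is a list of column labels (not positions or a callable); the helper name `read_csv_ordered` is hypothetical and not part of pandas:

```python
import io

import pandas as pd


def read_csv_ordered(source, usecols, **kwargs):
    """Hypothetical helper: read only `usecols` and return them in the order given.

    Assumes `usecols` is a list of column labels.
    """
    # Drop duplicate labels while keeping first-seen order.
    ordered = list(dict.fromkeys(usecols))
    df = pd.read_csv(source, usecols=ordered, **kwargs)
    # read_csv ignores the order of `usecols`, so reindex explicitly.
    return df[ordered]


csv_data = "col1,col2,col3,col4\nA,1,X,10\nB,2,Y,20\nC,3,Z,30\n"
df = read_csv_ordered(io.StringIO(csv_data), ["col3", "col1", "col4"])
print(list(df.columns))
```

Wrapping the reindex in a helper keeps call sites short, but every caller still has to know to use it, which is the inconvenience this issue is about.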

@eicchen
Contributor

eicchen commented May 3, 2025

take

@AnkitPrasad364

Replace

if usecols:
    usecols = set(usecols)

With

if usecols:
    usecols = dict.fromkeys(usecols)  # preserves order
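
To illustrate what this suggestion relies on: a `set` does not guarantee iteration order, while `dict.fromkeys` keeps first-seen order and still supports fast membership checks. This is a standalone sketch in plain Python. The snippet above is the commenter's paraphrase, and the sketch below does not reproduce the actual pandas internals.

```python
usecols = ["col3", "col1", "col4", "col1"]  # note the duplicate "col1"

# dict.fromkeys de-duplicates like a set but keeps insertion order (Python 3.7+).
ordered = dict.fromkeys(usecols)
print(list(ordered))

# Membership tests work the same way as with a set.
print("col1" in ordered, "col2" in ordered)

# A set holds the same labels, but its iteration order is not guaranteed.
unordered = set(usecols)
print(unordered == set(ordered))
```

One caveat: a dict does not support set operators directly, so any code that does set arithmetic on `usecols` would need `ordered.keys()` instead.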

@eicchen
Contributor

eicchen commented May 6, 2025

@mroeschke could I get your opinion on this before I dig deeper into it? You were the last person to work with the function (_validate_usecols_arg) and I'm mainly worried about backwards compatibility rather than feasibility. But considering that pandas is having a major version update, it could be justifiable.
