Error when joining dataframes with duplicate column names if dataframes generated from file #14147

Open
@fullstart

Describe the bug

I encountered an error when joining dataframes with duplicate column names if the dataframes were generated by reading a file (I tried CSV and Parquet).
Dataframes produced from a Python dict join without problems.

I tested with the latest version of DataFusion on Windows.

To Reproduce

The join works fine with dataframes created from a dict:

from datafusion import SessionContext
ctx = SessionContext()
x1 = ctx.from_pydict({'id1': [1, 2, 4, 5, 6], 'col2': [3, 4, 3, 5, 2], 'col3': [3, 4, 1, 2, 3]})
x2 = ctx.from_pydict({'id1': [1, 2, 4, 5, 6], 'col2': [3, 4, 3, 5, 2], 'col3': [5, 6, 7, 8, 9]})
x1.join(x2, on="id1")
Out[16]:
DataFrame()
+-----+------+------+-----+------+------+
| id1 | col2 | col3 | id1 | col2 | col3 |
+-----+------+------+-----+------+------+
| 1   | 3    | 3    | 1   | 3    | 5    |
| 2   | 4    | 4    | 2   | 4    | 6    |
| 4   | 3    | 1    | 4   | 3    | 7    |
| 5   | 5    | 2    | 5   | 5    | 8    |
| 6   | 2    | 3    | 6   | 2    | 9    |
+-----+------+------+-----+------+------+

Continuing with the same dataframes written to CSV and read back from file, the join fails:

x1.write_csv("df1.csv")
x2.write_csv("df2.csv")

x1_f = ctx.read_csv("df1.csv")
x2_f = ctx.read_csv("df2.csv")

x1_f.join(x2_f, on="id1")
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
Cell In[21], line 1
----> 1 x1_f.join(x2_f, on="id1")

File ~\prj\datafusion_test\venv\Lib\site-packages\datafusion\dataframe.py:468, in DataFrame.join(self, right, on, how, left_on, right_on, join_keys)
    465 if isinstance(right_on, str):
    466     right_on = [right_on]
--> 468 return DataFrame(self.df.join(right.df, how, left_on, right_on))

Exception: Schema error: No field named id1. Valid fields are "?table?"."1", "?table?"."3".
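
A possible reading of the error message, offered as an assumption and not a confirmed diagnosis: the valid fields it lists ("1" and "3") match the values in the first row of x1 (1, 3, 3). That is what one would see if df1.csv contained no header line and the reader took the first data row as the column names. The sketch below uses only the Python standard library, not DataFusion, to reproduce that naming effect. The helper `field_names_from_first_row` is hypothetical and exists only for illustration.

```python
import csv
import io


def field_names_from_first_row(csv_text):
    """Return the column names a reader would see if it treats row 1 as the header."""
    reader = csv.reader(io.StringIO(csv_text))
    return next(reader)


# Rows of x1 from the report, serialized WITHOUT a header line.
# Assumption: this mimics the content of df1.csv; the report does not show the file.
rows = zip([1, 2, 4, 5, 6], [3, 4, 3, 5, 2], [3, 4, 1, 2, 3])
buf = io.StringIO()
csv.writer(buf).writerows(rows)
headerless_csv = buf.getvalue()

names = field_names_from_first_row(headerless_csv)
print(names)           # ['1', '3', '3']
print("id1" in names)  # False
```

If this is what happened, inspecting the schema of x1_f or opening df1.csv directly would show whether the `id1,col2,col3` header line is present.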

Expected behavior

No response

Additional context

No response

Labels: bug
