Describe the bug
Joining dataframes that have duplicate column names fails when the dataframes were created by reading files (I tried CSV and Parquet).
Dataframes created from a Python dict join without problems.
I tested with the latest version of DataFusion on Windows.
To Reproduce
Joining works with dataframes created from a dict:
from datafusion import SessionContext
ctx = SessionContext()
x1 = ctx.from_pydict({'id1': [1, 2, 4, 5, 6], 'col2': [3, 4, 3, 5, 2], 'col3': [3, 4, 1, 2, 3]})
x2 = ctx.from_pydict({'id1': [1, 2, 4, 5, 6], 'col2': [3, 4, 3, 5, 2], 'col3': [5, 6, 7, 8, 9]})
x1.join(x2, on="id1")
Out[16]:
DataFrame()
+-----+------+------+-----+------+------+
| id1 | col2 | col3 | id1 | col2 | col3 |
+-----+------+------+-----+------+------+
| 1   | 3    | 3    | 1   | 3    | 5    |
| 2   | 4    | 4    | 2   | 4    | 6    |
| 4   | 3    | 1    | 4   | 3    | 7    |
| 5   | 5    | 2    | 5   | 5    | 8    |
| 6   | 2    | 3    | 6   | 2    | 9    |
+-----+------+------+-----+------+------+
Continuing with the same data written to files and read back, the join fails:
x1.write_csv("df1.csv")
x2.write_csv("df2.csv")
x1_f = ctx.read_csv("df1.csv")
x2_f = ctx.read_csv("df2.csv")
x1_f.join(x2_f, on="id1")
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
Cell In[21], line 1
----> 1 x1_f.join(x2_f, on="id1")
File ~\prj\datafusion_test\venv\Lib\site-packages\datafusion\dataframe.py:468, in DataFrame.join(self, right, on, how, left_on, right_on, join_keys)
465 if isinstance(right_on, str):
466 right_on = [right_on]
--> 468 return DataFrame(self.df.join(right.df, how, left_on, right_on))
Exception: Schema error: No field named id1. Valid fields are "?table?"."1", "?table?"."3".
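A possible reading of this error, which I have not confirmed: the valid fields "1" and "3" match the first data row of x1 (1, 3, 3), which is what I would expect if df1.csv was written without a header row and then read back with the first line treated as the header. The sketch below only illustrates that effect with Python's standard csv module. The file contents are hypothetical (the actual file is not shown here), and the csv module is a stand-in, not how DataFusion parses files. This would also not by itself explain the Parquet case.

```python
import csv
import io

# Hypothetical contents of df1.csv IF it was written without a header row.
# This is an assumption; the real file is not shown in this report.
headerless = "1,3,3\n2,4,4\n4,3,1\n5,5,2\n6,2,3\n"

# A reader that assumes the first line is a header ends up using the
# first data row as the column names.
reader = csv.reader(io.StringIO(headerless))
inferred_names = next(reader)
remaining_rows = list(reader)

print(inferred_names)            # ['1', '3', '3']
print(len(remaining_rows))       # 4
print("id1" in inferred_names)   # False
```

If this is what happens, one data row is consumed as the header and no column named id1 exists in the dataframe read from the file, which would be consistent with the schema error above.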
Expected behavior
No response
Additional context
No response