Describe the bug, including details regarding any error messages, version, and platform.
Description
If a file has one line of JSON, and there is no line ending at the end of the file, pyarrow.dataset.dataset will correctly infer a schema, but then fail to load the values from that row.
Reproduction
Bug case:
echo -n '{"field": 1}' > test.json
python3 -c 'import pyarrow.dataset; print(pyarrow.dataset.dataset("test.json", format="json").to_table().to_pandas())'
   field
0    NaN
You'll notice it's inferring the schema correctly in all cases (it knows there's one field named field). If I change the data type to string, it also changes the default null value correctly, even though the value shouldn't be null:
Debug info
Python: 3.12.3
PyArrow: 19.0.0
OS: Linux - Ubuntu 24.04.1
(I'm reporting this against the Python component because that's what's easy for me to test locally and where I saw the issue, but I assume that the actual bug is lower in the stack.)
Component(s)
Python
raulcd changed the title from "A file with one line of JSON without a line ending is incorrectly loaded when used as a Dataset" to "[Python][C++] A file with one line of JSON without a line ending is incorrectly loaded when used as a Dataset" on Feb 6, 2025
With a newline it works as I'd expect:
With multiple lines but no final trailing newline it also works as I'd expect:
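A rough sketch for checking the three cases side by side, assuming pyarrow is installed. The file names and the helpers `write_cases` and `load_as_dataset` are made up for illustration and are not part of the original report; the files are written as raw bytes so that no trailing newline is added by accident.

```python
import tempfile
from pathlib import Path

# Hypothetical file names for the three cases discussed in this report.
CASES = {
    "one_line_no_newline.json": '{"field": 1}',
    "one_line_with_newline.json": '{"field": 1}\n',
    "two_lines_no_newline.json": '{"field": 1}\n{"field": 2}',
}


def write_cases(directory):
    """Write each case verbatim (no newline translation) and return the paths."""
    paths = []
    for name, text in CASES.items():
        path = Path(directory) / name
        path.write_bytes(text.encode("utf-8"))
        paths.append(path)
    return paths


def load_as_dataset(path):
    """Load one JSON file through the dataset API and return its columns as a dict."""
    # Imported lazily so that write_cases can be used without pyarrow.
    import pyarrow.dataset as ds

    return ds.dataset(str(path), format="json").to_table().to_pydict()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for path in write_cases(tmp):
            try:
                print(path.name, load_as_dataset(path))
            except ImportError:
                print(path.name, "pyarrow is not installed; skipping load")
```

Going by the report, only the first file should come back with a null in `field` on an affected version; the other two should load their values.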
AFAICT, this only affects datasets. pyarrow.json.read_json() works just fine, for example:
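A minimal sketch of that comparison, assuming pyarrow is installed. The helper names `write_single_line` and `compare_readers` are hypothetical; the two pyarrow calls are the ones named in this report.

```python
import tempfile
from pathlib import Path


def write_single_line(path):
    """Write one JSON object with no trailing newline, like `echo -n`."""
    path = Path(path)
    path.write_bytes(b'{"field": 1}')
    return path


def compare_readers(path):
    """Read the same file with pyarrow.json.read_json and with the dataset API."""
    # Imported lazily so that write_single_line can be used without pyarrow.
    import pyarrow.dataset as ds
    import pyarrow.json as pj

    via_read_json = pj.read_json(str(path)).to_pydict()
    via_dataset = ds.dataset(str(path), format="json").to_table().to_pydict()
    return via_read_json, via_dataset


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = write_single_line(Path(tmp) / "test.json")
        try:
            via_read_json, via_dataset = compare_readers(target)
            print("read_json:", via_read_json)
            print("dataset:  ", via_dataset)
        except ImportError:
            print("pyarrow is not installed; skipping comparison")
```

If the two printed dicts differ, the problem is specific to the dataset code path rather than to JSON parsing in general.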