[Python][C++] A file with one line of JSON without a line ending is incorrectly loaded when used as a Dataset #45394

Open
mikix opened this issue Jan 30, 2025 · 0 comments

Describe the bug, including details regarding any error messages, version, and platform.

Description

If a file contains a single line of JSON with no line ending at the end of the file, pyarrow.dataset.dataset correctly infers the schema but then fails to load the values from that row: they come back as null.

Reproduction

Bug case:

echo -n '{"field": 1}' > test.json
python3 -c 'import pyarrow.dataset; print(pyarrow.dataset.dataset("test.json", format="json").to_table().to_pandas())'
   field
0    NaN

With a newline it works as I'd expect:

echo '{"field": 1}' > test.json
python3 -c 'import pyarrow.dataset; print(pyarrow.dataset.dataset("test.json", format="json").to_table().to_pandas())'
   field
0    1

With multiple lines but no final trailing newline it also works as I'd expect:

echo -en '{"field": 1}\n{"field": 2}' > test.json
python3 -c 'import pyarrow.dataset; print(pyarrow.dataset.dataset("test.json", format="json").to_table().to_pandas())'
   field
0    1
1    2
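
The three cases above can also be driven from a single Python script, which avoids depending on how a particular shell's echo handles -n and -e. This is only a sketch: the case names and the helpers write_case, load_rows, and run_all are hypothetical, introduced here for illustration. The pyarrow calls are the same ones used in the one-liners, plus Table.to_pylist() in place of to_pandas().

```python
import os
import tempfile

# The three file bodies from the report; only the first has no newline at all.
# The dictionary keys are illustrative names, not anything pyarrow defines.
CASES = {
    "single_line_no_newline": b'{"field": 1}',
    "single_line_with_newline": b'{"field": 1}\n',
    "two_lines_no_final_newline": b'{"field": 1}\n{"field": 2}',
}


def write_case(directory, name):
    """Write one case to <directory>/<name>.json and return its path."""
    path = os.path.join(directory, name + ".json")
    with open(path, "wb") as f:  # binary mode, so no newline translation
        f.write(CASES[name])
    return path


def load_rows(path):
    """Load a JSON file through the dataset API and return its rows as dicts."""
    # Imported lazily so write_case() stays usable without pyarrow installed.
    import pyarrow.dataset as ds

    return ds.dataset(path, format="json").to_table().to_pylist()


def run_all():
    """Return {case name: rows} for every case, using a temporary directory."""
    with tempfile.TemporaryDirectory() as tmp:
        return {name: load_rows(write_case(tmp, name)) for name in CASES}
```

Calling run_all() requires pyarrow. Going by the outputs above, on PyArrow 19.0.0 the first case would be expected to show a null for field while the other two show the real values.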

You'll notice it's inferring the schema correctly in all cases (it knows there's one field named field). If I change the data type to string, the null placeholder changes to match (None instead of NaN), even though the value shouldn't be null at all:

echo -n '{"field": "value"}' > test.json
python3 -c 'import pyarrow.dataset; print(pyarrow.dataset.dataset("test.json", format="json").to_table().to_pandas())'
   field
0    None
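
To separate "the schema was inferred" from "the values were loaded" without going through pandas, the dataset's schema and the table's null counts can be inspected directly. A minimal sketch follows. The function inspect_json_dataset and the keys of the dictionary it returns are hypothetical names chosen for this example. Dataset.schema, Table.num_rows, Table.column_names, Table.column(), and ChunkedArray.null_count are existing pyarrow attributes.

```python
def inspect_json_dataset(path):
    """Summarize what the dataset API inferred and what it actually loaded."""
    # Imported lazily so that defining this helper does not require pyarrow.
    import pyarrow.dataset as ds

    dataset = ds.dataset(path, format="json")
    table = dataset.to_table()
    return {
        # Field name -> inferred Arrow type, rendered as a string.
        "schema": {
            name: str(typ)
            for name, typ in zip(dataset.schema.names, dataset.schema.types)
        },
        "num_rows": table.num_rows,
        # Per-column null count; a non-zero count for a file that contains
        # no nulls indicates the values were dropped during loading.
        "null_counts": {
            name: table.column(name).null_count for name in table.column_names
        },
    }
```

For the bug case, a correct schema and row count combined with a non-zero null count for field would match the behaviour described above.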

AFAICT, this only affects datasets. pyarrow.json.read_json() works just fine, for example:

echo -n '{"field": 1}' > test.json
python3 -c 'import pyarrow.json; print(pyarrow.json.read_json("test.json").to_pandas())'
   field
0      1
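
Since the newline-terminated file loads correctly, one possible stopgap is to make sure a file ends with a newline before handing it to the dataset API. The sketch below uses only the standard library. ensure_trailing_newline is a hypothetical helper, not part of pyarrow, and it modifies the file in place, so it only suits files you are allowed to rewrite. It is a workaround based on the observations above, not a fix for the underlying bug.

```python
import os


def ensure_trailing_newline(path):
    """Append a newline to a non-empty file that does not already end with one.

    Returns True if the file was modified, False otherwise.
    """
    if os.path.getsize(path) == 0:
        return False  # nothing to terminate
    with open(path, "rb+") as f:
        f.seek(-1, os.SEEK_END)  # position on the last byte
        if f.read(1) == b"\n":
            return False
        f.seek(0, os.SEEK_END)  # explicit seek before switching to writing
        f.write(b"\n")
    return True
```

After calling ensure_trailing_newline("test.json"), the file has the same contents as the one written by echo without -n in the working case above.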

Debug info

Python: 3.12.3
PyArrow: 19.0.0
OS: Linux - Ubuntu 24.04.1

(I'm reporting this against the Python component because that's what's easy for me to test locally and where I saw the issue, but I assume that the actual bug is lower in the stack.)

Component(s)

Python

JOBIN-SABU added a commit to JOBIN-SABU/arrow that referenced this issue Feb 6, 2025
raulcd changed the title from "A file with one line of JSON without a line ending is incorrectly loaded when used as a Dataset" to "[Python][C++] A file with one line of JSON without a line ending is incorrectly loaded when used as a Dataset" Feb 6, 2025