Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SEDONA-699] Fix issue with not closing parquet files. #1749

Merged
merged 3 commits into from
Jan 17, 2025

Conversation

Imbruced
Copy link
Member

@Imbruced Imbruced commented Jan 7, 2025

Did you read the Contributor Guide?

Is this PR related to a JIRA ticket?

  • Yes, the URL of the associated JIRA ticket is https://issues.apache.org/jira/browse/SEDONA-XXX. The PR name follows the format [SEDONA-XXX] my subject.

  • No:

    • this is a documentation update. The PR name follows the format [DOCS] my subject
    • this is a CI update. The PR name follows the format [CI] my subject

What changes were proposed in this PR?

How was this patch tested?

Did this PR include necessary documentation updates?

  • Yes, I am adding a new API. I am using the current SNAPSHOT version number in vX.Y.Z format.
  • Yes, I have updated the documentation.
  • No, this PR does not affect any public API so no need to change the documentation.

@Imbruced
Copy link
Member Author

Imbruced commented Jan 7, 2025

For some file size thresholds and a certain number of them (in my case, 15 files with 30 MB in size), the data frame show gets stuck. Closing parquet files seems to fix it; I am unsure if we want to add one huge file and copy it in a test to replicate it.

@Kontinuation
Copy link
Member

Kontinuation commented Jan 8, 2025

Thank you for fixing this. I have just found that the way we are reading Parquet file metadata from footer is not optimal. We should simply use ParquetFileReader.readFooter(configuration, path, SKIP_ROW_GROUPS).getFileMetaData to read the file metadata, then we don't have to bother closing the readers. It also does less work by skip reading row groups, as row groups of huge files can be quite large.

I don't think we need to add heavy weight tests for this by the way, as the fix is very straightforward, and it is not easy to write simple and quick tests for it.

@Imbruced
Copy link
Member Author

Imbruced commented Jan 8, 2025

@Kontinuation I'll adjust to use the readFooter method. Also, I'll create a ticket.

@Imbruced Imbruced force-pushed the fix-issue-with-reading-geoparquet-metadata branch from 0006471 to 289c9e7 Compare January 16, 2025 05:45
@Imbruced Imbruced changed the title Fix issue with not closing parquet files. [SEDONA-699] Fix issue with not closing parquet files. Jan 16, 2025
@Imbruced Imbruced marked this pull request as ready for review January 16, 2025 06:10
@Imbruced Imbruced requested a review from jiayuasu as a code owner January 16, 2025 06:10
@jiayuasu jiayuasu requested a review from Kontinuation January 16, 2025 06:50
@Imbruced
Copy link
Member Author

@jiayuasu do we proceed with that? My only concern is the readFooter function is deprecated for spark3.3.

@jiayuasu jiayuasu added the bug label Jan 17, 2025
@jiayuasu jiayuasu added this to the sedona-1.7.1 milestone Jan 17, 2025
@jiayuasu jiayuasu merged commit 0e0fb9f into master Jan 17, 2025
39 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants