PARQUET-2417: Add statistics support to geometry logical type #2971
Merged: wgtmac merged 101 commits into apache:master from zhangfengcdt:feature-apache-parquet-2417-geospatial on May 18, 2025
6a2e051
PARQUET-2471: Add support for geometry logical type
zhangfengcdt d354012
fix types
zhangfengcdt 969b696
Refactor BoundingBox and GeometryTypes
zhangfengcdt 87ee8ea
revert naming changes
zhangfengcdt d67e03b
revert TestDecimalUtils
zhangfengcdt ccf1c4a
revert more
zhangfengcdt 35342c2
refactor statistics
zhangfengcdt bcfefb7
modify EnvelopeCovering
zhangfengcdt cf615f6
add more unit tests
zhangfengcdt c6ae733
update comments
zhangfengcdt 80a629e
add comment for envelope converging expand calculation
zhangfengcdt e0ec9ef
Fix the boundingbox initial values in constructor
zhangfengcdt 7c728b1
Update the poc implementation for the changes to the spec
zhangfengcdt fa36d06
remove print
zhangfengcdt 2a1c62a
implement evelop covering for spherical coordinates
zhangfengcdt d0e7d3d
throw a not-implemented exception for the covering statistics when th…
zhangfengcdt 30d64be
Merge branch 'master' of github.com:apache/parquet-java into feature-…
zhangfengcdt 29a86b5
remove unused comment codes
zhangfengcdt 0c4b8b4
address some review comments
zhangfengcdt a56e9ad
revert changes that are not desired
zhangfengcdt 1ae3d99
refactor toString and remove test scope of jts-core in parquet-hadoop…
zhangfengcdt 198642d
refactor the converings to be map to avoid ordering issues in stats m…
zhangfengcdt 0cf2de9
address review comments
zhangfengcdt 4f0f7ed
fix formating
zhangfengcdt a378a9a
address comments to remove string parsing (to be consistent with spec)
zhangfengcdt 698325a
update according to the changes to the upstream pqrquet-format pr
zhangfengcdt d65ba8e
remove coverings
zhangfengcdt 1b0f5b9
add GeometryStatistics to ColumnMetaData
zhangfengcdt 342e400
more code cleanup for covering
zhangfengcdt 3cc9b68
add toParquetGeometryStatistics
zhangfengcdt dc05cfd
fix check errors
zhangfengcdt 6f1d586
Merge branch 'master' of github.com:apache/parquet-java into feature-…
zhangfengcdt e4e3cae
Merge branch 'master' of github.com:apache/parquet-java into feature-…
zhangfengcdt 69d950f
change and remove the encoding and edges from geometry type (spec cha…
zhangfengcdt d296c6a
fix unit tests
zhangfengcdt 93e28b7
handle the wraparound case for X values
zhangfengcdt e688d05
support GEOGRAPHY type
zhangfengcdt 01ac560
revert import changes
zhangfengcdt f9585cd
address pr review comments
zhangfengcdt b805cf4
fix formatting issue
zhangfengcdt 6aa7028
refactor geography logic type
zhangfengcdt 4536e5f
revert the edge algorithm to use string to avoid loop dependency
zhangfengcdt f188341
add enum EdgeInterpolationAlgorithm for the geography edge algorithm
zhangfengcdt 214fd39
address pr comments: rename, hash, and equal
zhangfengcdt 907acce
remove column index changes
zhangfengcdt ef83a2e
revert unnecessary changes
zhangfengcdt c5d542b
refactor GeospatialStatistics
zhangfengcdt c1fb37b
refactor writers
zhangfengcdt dac9ad0
refactor to use colomn value collector instead of binary statistics
zhangfengcdt e0a319e
revert getLogicalTypeAnnotation
zhangfengcdt 4fd177d
fix TestColumnChunkPageWriteStore
zhangfengcdt 9f0482e
remove DummyBoundingBox
zhangfengcdt ebd45a9
handle noop builder and null values
zhangfengcdt 0114701
refactor normalizeLongitude
zhangfengcdt 4cd8dfd
add ShowGeospatialStatisticsCommand and tests
zhangfengcdt 5ec6546
add shouldNormalizeLongitude
zhangfengcdt 23af52f
add allowWraparound
zhangfengcdt d0081be
Merge remote-tracking branch 'apache/master' into feature-apache-parq…
zhangfengcdt 375de55
add allowWraparound in bbox to ParquetProperties
zhangfengcdt 92f45c2
update to use the latest parquet-format release
zhangfengcdt 1ecb5b6
improve handling null, empty, and nan cases in boundingbox
zhangfengcdt 4c10575
remove wrap around logic
zhangfengcdt 8937907
set initial state of bbox to NaN
zhangfengcdt 928109d
add more unit tests to boundingbox
zhangfengcdt 9d08f92
fix comments
zhangfengcdt f751eb3
make sure no NaN is saved to parquet meta
zhangfengcdt 9916908
upgrade jts version
zhangfengcdt df461cc
address pr review comments
zhangfengcdt f3c4979
address pr review comments
zhangfengcdt 7b98a10
refactor BoundingBox
zhangfengcdt 79c80c8
revamp check valid in the geo statistics
zhangfengcdt f6610e7
update handling of invalid and empty bbox
zhangfengcdt 3fdef44
Merge remote-tracking branch 'apache/master' into feature-apache-parq…
zhangfengcdt 91e9d26
fix unit tests
zhangfengcdt af23c11
add update + merge test cases with NaN inputs
zhangfengcdt 819b3e3
Merge remote-tracking branch 'apache/master' into feature-apache-parq…
zhangfengcdt d8ffe0e
address some pr comments
zhangfengcdt e88268d
add test to TestTypeBuilders
zhangfengcdt 398dd0b
add more tests and fix some comments
zhangfengcdt efe8875
fix fmt
zhangfengcdt 455d15b
fix the test dependency issue
zhangfengcdt 5e6579f
fix dependency issue
zhangfengcdt eecd915
clean up the code and add more unit tests
zhangfengcdt 5c63b4e
Merge remote-tracking branch 'apache/master' into feature-apache-parq…
zhangfengcdt ccd0715
address pr comments
zhangfengcdt ef8e343
address pr comment to only check the first coordinate for types update
zhangfengcdt 81eb57d
change the behavior of merge null or invalid statistics
zhangfengcdt e904e3e
change null or invalid only
zhangfengcdt 882ef38
address review comments - part 1
zhangfengcdt e9d2426
address review comments - part 2
zhangfengcdt 7955496
address review comments
zhangfengcdt 74e94ab
fix tests
zhangfengcdt 9f3d31a
update the test to reflect the changes of "Failed to parse WKB geomet…
zhangfengcdt 136bbf4
fix testInvalidGeometryPresented
zhangfengcdt 945ecdf
fix spotless errors
zhangfengcdt f5dd1d0
address review comments - part 1
zhangfengcdt dbc8798
add validity check before checkking any bbox values
zhangfengcdt 33d3b77
split xy into x and y for valid and empty check
zhangfengcdt 1957dca
fix scala fmt issue
zhangfengcdt aec732e
rename geometry package to geospatial package
zhangfengcdt 476dfff
clean up code
zhangfengcdt
110 changes: 110 additions & 0 deletions
...et-cli/src/main/java/org/apache/parquet/cli/commands/ShowGeospatialStatisticsCommand.java
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,110 @@ | ||
| /* | ||
| * Licensed to the Apache Software Foundation (ASF) under one | ||
| * or more contributor license agreements. See the NOTICE file | ||
| * distributed with this work for additional information | ||
| * regarding copyright ownership. The ASF licenses this file | ||
| * to you under the Apache License, Version 2.0 (the | ||
| * "License"); you may not use this file except in compliance | ||
| * with the License. You may obtain a copy of the License at | ||
| * | ||
| * http://www.apache.org/licenses/LICENSE-2.0 | ||
| * | ||
| * Unless required by applicable law or agreed to in writing, | ||
| * software distributed under the License is distributed on an | ||
| * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
| * KIND, either express or implied. See the License for the | ||
| * specific language governing permissions and limitations | ||
| * under the License. | ||
| */ | ||
| package org.apache.parquet.cli.commands; | ||
| | ||
| import com.beust.jcommander.Parameter; | ||
| import com.beust.jcommander.Parameters; | ||
| import com.google.common.base.Preconditions; | ||
| import com.google.common.collect.Lists; | ||
| import java.io.IOException; | ||
| import java.util.List; | ||
| import org.apache.commons.text.TextStringBuilder; | ||
| import org.apache.parquet.cli.BaseCommand; | ||
| import org.apache.parquet.column.statistics.geospatial.GeospatialStatistics; | ||
| import org.apache.parquet.hadoop.ParquetFileReader; | ||
| import org.apache.parquet.hadoop.metadata.BlockMetaData; | ||
| import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData; | ||
| import org.apache.parquet.hadoop.metadata.ParquetMetadata; | ||
| import org.apache.parquet.schema.MessageType; | ||
| import org.slf4j.Logger; | ||
| | ||
| @Parameters(commandDescription = "Print geospatial statistics for a Parquet file") | ||
| public class ShowGeospatialStatisticsCommand extends BaseCommand { | ||
| | ||
| public ShowGeospatialStatisticsCommand(Logger console) { | ||
| super(console); | ||
| } | ||
| | ||
| @Parameter(description = "<parquet path>") | ||
| List<String> targets; | ||
| | ||
| @Override | ||
| @SuppressWarnings("unchecked") | ||
| public int run() throws IOException { | ||
| Preconditions.checkArgument(targets != null && !targets.isEmpty(), "A Parquet file is required."); | ||
| Preconditions.checkArgument(targets.size() == 1, "Cannot process multiple Parquet files."); | ||
| | ||
| String source = targets.get(0); | ||
| try (ParquetFileReader reader = ParquetFileReader.open(getConf(), qualifiedPath(source))) { | ||
| ParquetMetadata footer = reader.getFooter(); | ||
| MessageType schema = footer.getFileMetaData().getSchema(); | ||
| | ||
| console.info("\nFile path: {}", source); | ||
| | ||
| List<BlockMetaData> rowGroups = footer.getBlocks(); | ||
| for (int index = 0, n = rowGroups.size(); index < n; index++) { | ||
| printRowGroupGeospatialStats(console, index, rowGroups.get(index), schema); | ||
| console.info(""); | ||
| } | ||
| } | ||
| | ||
| return 0; | ||
| } | ||
| | ||
| private void printRowGroupGeospatialStats(Logger console, int index, BlockMetaData rowGroup, MessageType schema) { | ||
| int maxColumnWidth = Math.max( | ||
| "column".length(), | ||
| rowGroup.getColumns().stream() | ||
| .map(col -> col.getPath().toString().length()) | ||
| .max(Integer::compare) | ||
| .orElse(0)); | ||
| | ||
| console.info(String.format("\nRow group %d\n%s", index, new TextStringBuilder(80).appendPadding(80, '-'))); | ||
| | ||
| String formatString = String.format("%%-%ds %%-15s %%-40s", maxColumnWidth); | ||
| console.info(String.format(formatString, "column", "bounding box", "geospatial types")); | ||
| | ||
| for (ColumnChunkMetaData column : rowGroup.getColumns()) { | ||
| printColumnGeospatialStats(console, column, schema, maxColumnWidth); | ||
| } | ||
| } | ||
| | ||
| private void printColumnGeospatialStats( | ||
| Logger console, ColumnChunkMetaData column, MessageType schema, int columnWidth) { | ||
| GeospatialStatistics stats = column.getGeospatialStatistics(); | ||
| | ||
| if (stats != null && stats.isValid()) { | ||
| String boundingBox = | ||
| stats.getBoundingBox() != null ? stats.getBoundingBox().toString() : "-"; | ||
| String geospatialTypes = stats.getGeospatialTypes() != null | ||
| ? stats.getGeospatialTypes().toString() | ||
| : "-"; | ||
| String formatString = String.format("%%-%ds %%-15s %%-40s", columnWidth); | ||
| console.info(String.format(formatString, column.getPath(), boundingBox, geospatialTypes)); | ||
| } else { | ||
| String formatString = String.format("%%-%ds %%-15s %%-40s", columnWidth); | ||
| console.info(String.format(formatString, column.getPath(), "-", "-")); | ||
| } | ||
| } | ||
| | ||
| @Override | ||
| public List<String> getExamples() { | ||
| return Lists.newArrayList("# Show geospatial statistics for a Parquet file", "sample.parquet"); | ||
| } | ||
| } | ||
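
The column alignment in `printRowGroupGeospatialStats` relies on a two-stage `String.format`: the first call turns `%%-%ds %%-15s %%-40s` into a concrete row format whose first field is as wide as the longest column path, and the second call fills that format in. The following is a minimal standalone sketch of that technique, assuming plain strings in place of `ColumnChunkMetaData` paths. The class `ColumnWidthFormatDemo` and its methods are hypothetical and are not part of this PR.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; not part of parquet-cli.
public class ColumnWidthFormatDemo {

  // Stage 1: "%%" escapes to a literal "%", while "%d" receives the width,
  // producing a row format such as "%-15s %-15s %-40s".
  static String rowFormat(List<String> columnPaths) {
    int width = "column".length();
    for (String path : columnPaths) {
      width = Math.max(width, path.length());
    }
    return String.format("%%-%ds %%-15s %%-40s", width);
  }

  // Stage 2: apply the generated format; "-" stands in for missing statistics.
  static String formatRow(String format, String column, String boundingBox, String types) {
    return String.format(
        format, column, boundingBox != null ? boundingBox : "-", types != null ? types : "-");
  }

  public static void main(String[] args) {
    List<String> paths = Arrays.asList("geom", "properties.name");
    String format = rowFormat(paths);
    // Header row, then one placeholder row per column path.
    System.out.println(formatRow(format, "column", "bounding box", "geospatial types"));
    for (String path : paths) {
      System.out.println(formatRow(format, path, null, null));
    }
  }
}
```

Note that `%-15s` sets a minimum width and does not truncate, so a value longer than 15 characters in the bounding box field shifts the geospatial types field to the right.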
37 changes: 37 additions & 0 deletions
...li/src/test/java/org/apache/parquet/cli/commands/ShowGeospatialStatisticsCommandTest.java
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,37 @@ | ||
| /* | ||
| * Licensed to the Apache Software Foundation (ASF) under one | ||
| * or more contributor license agreements. See the NOTICE file | ||
| * distributed with this work for additional information | ||
| * regarding copyright ownership. The ASF licenses this file | ||
| * to you under the Apache License, Version 2.0 (the | ||
| * "License"); you may not use this file except in compliance | ||
| * with the License. You may obtain a copy of the License at | ||
| * | ||
| * http://www.apache.org/licenses/LICENSE-2.0 | ||
| * | ||
| * Unless required by applicable law or agreed to in writing, | ||
| * software distributed under the License is distributed on an | ||
| * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
| * KIND, either express or implied. See the License for the | ||
| * specific language governing permissions and limitations | ||
| * under the License. | ||
| */ | ||
| package org.apache.parquet.cli.commands; | ||
| | ||
| import java.io.File; | ||
| import java.io.IOException; | ||
| import java.util.Arrays; | ||
| import org.apache.hadoop.conf.Configuration; | ||
| import org.junit.Assert; | ||
| | ||
||
| public class ShowGeospatialStatisticsCommandTest extends ParquetFileTest { | ||
| @Test | ||
| public void testShowGeospatialStatisticsCommand() throws IOException { | ||
| File file = parquetFile(); | ||
| ShowGeospatialStatisticsCommand command = new ShowGeospatialStatisticsCommand(createLogger()); | ||
| command.targets = Arrays.asList(file.getAbsolutePath()); | ||
| command.setConf(new Configuration()); | ||
| Assert.assertEquals(0, command.run()); | ||
| } | ||
| } | ||