[WIP] Expose spatial partitioning from SpatialRDD #1751

Draft
wants to merge 9 commits into
base: master

paleolimbot
Member

@paleolimbot paleolimbot commented Jan 10, 2025

Did you read the Contributor Guide?

Is this PR related to a JIRA ticket?

Closes #1268.

What changes were proposed in this PR?

This PR exposes spatial partitioning information from the SpatialRDD API. Sedona is exceptionally good at spatial partitioning, and the spatial community would love to have access to this information!

There are two pieces of information that would be helpful:

  • The actual partition boundaries
  • A partitioned RDD that remembers the partition identifier (i.e., partitioned results).

There are a few ideas in this PR. The boundaries seem straightforward, but I'm too new to the RDD API to know what the options are for returning these things.
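To make the two pieces of information concrete, here is a minimal plain-Java sketch of what "boundaries" and "objects that remember their partition identifier" could look like. It does not use Sedona or Spark: `GridSketch`, `Box`, `Placed`, and `place` are hypothetical stand-ins invented for illustration, and the intersection test is only an assumption about how a grid-style partitioner might assign objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for illustration only; these are NOT Sedona classes.
final class GridSketch {

    /** Axis-aligned bounding box, used both for grid cells and object extents. */
    static final class Box {
        final double minX, minY, maxX, maxY;

        Box(double minX, double minY, double maxX, double maxY) {
            this.minX = minX;
            this.minY = minY;
            this.maxX = maxX;
            this.maxY = maxY;
        }

        boolean intersects(Box o) {
            return minX <= o.maxX && o.minX <= maxX && minY <= o.maxY && o.minY <= maxY;
        }
    }

    /** An object tagged with the id of the partition it was placed in. */
    static final class Placed {
        final int partitionId;
        final String name;

        Placed(int partitionId, String name) {
            this.partitionId = partitionId;
            this.name = name;
        }
    }

    /** Emits one (partitionId, object) pair for every grid cell the extent touches. */
    static List<Placed> place(List<Box> boundaries, String name, Box extent) {
        List<Placed> out = new ArrayList<>();
        for (int id = 0; id < boundaries.size(); id++) {
            if (boundaries.get(id).intersects(extent)) {
                out.add(new Placed(id, name));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Two side-by-side cells: the index in this list is the partition id.
        List<Box> boundaries = Arrays.asList(new Box(0, 0, 5, 10), new Box(5, 0, 10, 10));

        // "a" sits inside the first cell; "b" straddles the shared edge.
        for (Placed p : place(boundaries, "a", new Box(1, 1, 2, 2))) {
            System.out.println(p.name + " -> partition " + p.partitionId);
        }
        for (Placed p : place(boundaries, "b", new Box(4, 4, 6, 6))) {
            System.out.println(p.name + " -> partition " + p.partitionId);
        }
    }
}
```

In this sketch the object that straddles two cells is emitted twice, once per partition id, which is the source of the duplicates discussed in the review comments.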

How was this patch tested?

Working on it!

Did this PR include necessary documentation updates?

  • Yes, I am adding a new API. I am using the current SNAPSHOT version number in vX.Y.Z format.
  • Yes, I have updated the documentation. (Or will when the API is settled)

Member Author

@paleolimbot paleolimbot left a comment

This would also benefit from a SpatialPartitioner that removes duplicates, since duplicates are usually not desired when partitioning. One option is to wrap an existing SpatialPartitioner, consume the result of placeObject, and deterministically choose one of the results.
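As a rough illustration of the "deterministically choose one" idea, the following plain-Java sketch collapses the placements produced for a single object down to one. It is not the PR's implementation and does not touch Sedona's classes: `DedupSketch` and `keepLowestId` are hypothetical names, `Map.Entry<Integer, T>` stands in for the `Tuple2<Integer, T>` pairs that placeObject returns, and keeping the smallest partition id is just one possible deterministic rule.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Sedona.
final class DedupSketch {

    /**
     * Reduces the (partitionId, object) pairs emitted for one object to at most
     * one pair, keeping the pair with the smallest partition id. The result does
     * not depend on the order in which the placements arrive.
     */
    static <T> Iterator<Map.Entry<Integer, T>> keepLowestId(
            Iterator<Map.Entry<Integer, T>> placements) {
        Map.Entry<Integer, T> best = null;
        while (placements.hasNext()) {
            Map.Entry<Integer, T> next = placements.next();
            if (best == null || next.getKey() < best.getKey()) {
                best = next;
            }
        }
        if (best == null) {
            // The object fell outside every partition: emit nothing.
            return Collections.<Map.Entry<Integer, T>>emptyIterator();
        }
        return Collections.singletonList(best).iterator();
    }

    public static void main(String[] args) {
        // One geometry that was placed in partitions 3 and 1.
        List<Map.Entry<Integer, String>> placements = Arrays.asList(
                new SimpleEntry<Integer, String>(3, "geom"),
                new SimpleEntry<Integer, String>(1, "geom"));

        Iterator<Map.Entry<Integer, String>> kept = keepLowestId(placements.iterator());
        while (kept.hasNext()) {
            Map.Entry<Integer, String> e = kept.next();
            System.out.println(e.getValue() + " kept in partition " + e.getKey());
        }
    }
}
```

A wrapper partitioner could apply a reduction like this to the output of the wrapped partitioner's placeObject. Whether "lowest id" is the right rule for downstream operations such as joins is left open here.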

Comment on lines +330 to +348
  public JavaPairRDD<Integer, T> spatialPartitioningWithIds(GridType gridType, int numPartitions)
      throws Exception {
    calc_partitioner(gridType, numPartitions);
    return spatialPartitioningWithIds(partitioner);
  }

  public JavaPairRDD<Integer, T> spatialPartitioningWithIds(final SpatialPartitioner partitioner) {
    this.partitioner = partitioner;
    return this.rawSpatialRDD
        .flatMapToPair(
            new PairFlatMapFunction<T, Integer, T>() {
              @Override
              public Iterator<Tuple2<Integer, T>> call(T spatialObject) throws Exception {
                return partitioner.placeObject(spatialObject);
              }
            })
        .partitionBy(partitioner);
  }

Member Author

This is probably not needed: the changes in the Adapter that preserve the partitioning of spatialPartitionedRDD in the output data frame should eliminate the need to keep any identifier alongside the partition.

        val stringRow = extractUserData(geom)
        castRowToSchema(stringRow = stringRow, schema = schema)
      })
    val rdd = spatialRDD.rawSpatialRDD.rdd.mapPartitions(

Member Author

(But moved to a different overload, since most of the time this will introduce duplicates.)

Suggested change
- val rdd = spatialRDD.rawSpatialRDD.rdd.mapPartitions(
+ val rdd = spatialRDD.spatialPartitionedRDD.rdd.mapPartitions(

Development

Successfully merging this pull request may close these issues.

Preserve Spatial Partitioning From RDD to Dataframe