[SPARK-56454][DOCS][FOLLOWUP] Document supported SRIDs in geospatial types #55207
pratham76 wants to merge 1 commit into apache:master
@szehon-ho @cloud-fan @uros-db Could you have a look at the doc additions for #54780? Thanks!

@szehon-ho gentle ping!
szehon-ho left a comment
Thanks for adding SRID documentation, @pratham76 — the "Commonly Used SRIDs" table is a useful addition. I have some suggestions to improve accuracy and reduce redundancy with existing content on the page.
Also, minor note: the PR description references #54780, but that PR was closed (not merged). You may want to update the description to reference the correct merged work.
| | 2154 | RGF93 / Lambert-93 | French national coordinate system | France-specific mapping and GIS |
| | 32633 | WGS 84 / UTM zone 33N | Universal Transverse Mercator, zone 33 North | Central Europe (6°E to 12°E) |
| | 32634 | WGS 84 / UTM zone 34N | Universal Transverse Mercator, zone 34 North | Eastern Europe (12°E to 18°E) |
| | 32635 | WGS 84 / UTM zone 35N | Universal Transverse Mercator, zone 35 North | Eastern Europe/Western Asia (18°E to 24°E) |
Consider adding a CRS Identifier column. Spark maps SRIDs to CRS strings internally, and these strings are visible to users in df.schema.json() output and in Parquet/Delta/Iceberg storage metadata. For example, GEOMETRY(4326) stores as geometry(OGC:CRS84) in JSON schema — not EPSG:4326. This is a common source of confusion.
The key mappings are:
| SRID | CRS Identifier |
|---|---|
| 0 | SRID:0 |
| 3857 | EPSG:3857 |
| 4326 | OGC:CRS84 |
| 4267 | OGC:CRS27 |
| 4269 | OGC:CRS83 |
Also worth noting which SRIDs are valid for GEOGRAPHY vs GEOMETRY. For instance, GEOMETRY(3857) works but GEOGRAPHY(3857) will error because 3857 is a projected (non-geographic) CRS. That's a real pitfall for users.
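To make the mapping above concrete, here is a minimal Python sketch of the lookup. It is only a model of the table in this comment, not Spark's implementation: the function name `crs_identifier` is hypothetical, and the fallback to an `EPSG:` prefix for unlisted SRIDs is a simplifying assumption.

```python
# Hypothetical model of the SRID -> CRS identifier mapping listed above.
# Only the five listed values come from this thread; nothing here is Spark code.
OGC_OVERRIDES = {
    4326: "OGC:CRS84",
    4267: "OGC:CRS27",
    4269: "OGC:CRS83",
}


def crs_identifier(srid: int) -> str:
    """Return the CRS string this sketch expects for a given SRID."""
    if srid == 0:
        return "SRID:0"
    # OGC overrides win; every other SRID is ASSUMED to be an EPSG code here.
    return OGC_OVERRIDES.get(srid, f"EPSG:{srid}")


for srid in (0, 3857, 4326, 4267, 4269):
    print(srid, "->", crs_identifier(srid))
```

The point of the sketch is that the integer a user writes in `GEOMETRY(4326)` and the string that appears in schema JSON are two views of the same entry.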
Noted! I have added a CRS identifier column to the table, containing the mappings stated above. I have also introduced a Type column that indicates whether each SRID is valid for GEOGRAPHY, GEOMETRY, or both.
Along with these, I have added some notes based on the above comments. Please let me know if this helps. Thanks!
| | 32634 | WGS 84 / UTM zone 34N | Universal Transverse Mercator, zone 34 North | Eastern Europe (12°E to 18°E) |
| | 32635 | WGS 84 / UTM zone 35N | Universal Transverse Mercator, zone 35 North | Eastern Europe/Western Asia (18°E to 24°E) |
|
| The registry includes many additional SRIDs for various UTM zones, national coordinate systems, and other projections. For a complete list, refer to the [EPSG Geodetic Parameter Dataset](https://epsg.org/).
The registry also includes ESRI entries (e.g., ESRI:102100), not just EPSG. And it's pinned to PROJ 9.7.1 — not synced live with EPSG. The link to epsg.org could be misleading since users may find SRIDs there that aren't in Spark's registry, or miss ESRI SRIDs that are. Consider referencing the actual registry CSV or at least mentioning the PROJ version and ESRI inclusion.
Thanks for pointing this out, have updated the note.
|
| #### Using Different SRIDs
|
| **Creating tables with specific SRIDs:**
Most of the examples in sections "Using Different SRIDs", "Converting between SRIDs", and "SRID Validation" repeat what the page already covers in "Creating Tables" (lines 62–79) and "Built-in Geospatial Functions" (lines 129–137). Consider replacing them with examples that show genuinely new behavior:
- SRID validation error: The 99999 case is useful — keep it.
- GEOGRAPHY vs GEOMETRY pitfall: Show that `GEOGRAPHY(3857)` errors because 3857 is non-geographic. This is a real user trap not documented elsewhere.
- OGC CRS strings in metadata: Show that `df.schema.json()` for `GEOMETRY(4326)` contains `OGC:CRS84`, so users know what to expect in Parquet/storage metadata.
Thank you for the comments! have updated the examples.
| );
| ```
|
| **Converting between SRIDs:**
The heading "Converting between SRIDs" implies coordinate reprojection, but ST_SetSrid only changes metadata. Suggest renaming to something like "Setting or Changing SRID Metadata".
Also, the example changes a point from SRID 4326 (lat/lon in degrees) to 3857 (Web Mercator in meters) — this produces a semantically incorrect result since the coordinates are still degree values but now labeled as meters. A better example would set SRID on data that was created without one, e.g. SRID 0 → 4326, which is the common real-world use case. The existing doc already shows an ST_SetSrid example (line 136) that does this correctly.
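To see why relabeling 4326 as 3857 is semantically wrong, the following Python sketch contrasts the unchanged coordinates with what an actual reprojection would produce. It uses the standard spherical Web Mercator forward formulas; the helper name `to_web_mercator` is hypothetical, and nothing here calls Spark.

```python
import math
from typing import Tuple

EARTH_RADIUS_M = 6378137.0  # WGS 84 semi-major axis, the radius Web Mercator uses


def to_web_mercator(lon_deg: float, lat_deg: float) -> Tuple[float, float]:
    """Project lon/lat in degrees (SRID 4326) to Web Mercator meters (SRID 3857)."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y


lon, lat = 1.0, 2.0  # POINT(1 2) in degrees

# Changing only the SRID label leaves the numbers untouched.
relabeled = (lon, lat)
# A real reprojection changes the numbers by several orders of magnitude.
reprojected = to_web_mercator(lon, lat)

print("relabeled only:", relabeled)
print("reprojected:   ", reprojected)
```

A point that is relabeled but not reprojected ends up roughly a meter from the origin of the Web Mercator plane instead of more than a hundred kilometers away, which is the mismatch this comment describes.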
Thank you, noted, have removed the repeated section.
|
| ```sql
| -- Valid: 4326 is in the registry
| SELECT ST_GeomFromWKB(X'0101000000000000000000F03F0000000000000040', 4326);
The 4326 and 3857 examples here repeat what's already shown in the "Built-in Geospatial Functions" section above. Consider trimming to just the 99999 error case — that's the genuinely new and useful example. You could also add a GEOGRAPHY(3857) failure example here, since that's a real pitfall not documented elsewhere.
Thanks! incorporated
|
| #### SRID 0 (Unspecified)
|
| SRID 0 represents an unspecified or unknown coordinate system. It is allowed for GEOMETRY types but should be used with caution:
A few issues here:
- "should be used with caution" is overstated. SRID 0 is the default for `ST_GeomFromWKB(wkb)` and is actively used in `CREATE TABLE` (e.g., `CREATE TABLE t (geom GEOMETRY(0)) USING PARQUET` in the test suite). It's a standard convention (PostGIS uses the same).
- Missing GEOGRAPHY restriction: SRID 0 is not valid for GEOGRAPHY types (it's registered as non-geographic, so `GeographicSpatialReferenceSystemMapper` rejects it). This is important to document.
- Could be confused with `GEOMETRY(ANY)`: Worth clarifying that `GEOMETRY(0)` means a fixed SRID of 0 (Cartesian, no defined CRS), not "per-row SRID." Per-row SRIDs use `GEOMETRY(ANY)`.
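The distinction between a fixed SRID of 0 and per-row SRIDs can be pictured with a small conceptual model in Python. This is only an illustration of the wording above, not how Spark enforces column types; the function `column_accepts` and the use of `None` to stand for `GEOMETRY(ANY)` are assumptions made for the sketch.

```python
from typing import List, Optional


def column_accepts(column_srid: Optional[int], row_srids: List[int]) -> bool:
    """Conceptual check of whether a set of per-row SRIDs fits a column.

    column_srid=None models GEOMETRY(ANY); an integer models a fixed-SRID
    column such as GEOMETRY(0) or GEOMETRY(4326).
    """
    if column_srid is None:
        # GEOMETRY(ANY): each row may carry its own SRID.
        return True
    # Fixed SRID: every row must carry exactly that SRID.
    return all(srid == column_srid for srid in row_srids)


print(column_accepts(0, [0, 0, 0]))        # all rows share SRID 0
print(column_accepts(0, [0, 4326]))        # mixed SRIDs in a GEOMETRY(0) column
print(column_accepts(None, [0, 4326]))     # mixed SRIDs in a GEOMETRY(ANY) column
```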
Thanks! incorporated
Thank you @szehon-ho for the review comments. I had referenced #54780 as this seems to be the PR referenced in the JIRA. It also seems to be the one that was merged. Please let me know if I missed anything, and whether the changes are okay. Thanks!
6b2bdf7 to 41989b4
The Apache Spark community recommends filing a proper JIRA issue for traceability, @pratham76. Please create a JIRA ID and use it in the PR title.

Thanks @dongjoon-hyun for notifying, have updated the PR title to point to the corresponding JIRA issue.

@szehon-ho Could you have a look at the changes? I've addressed all the above review comments. Please let me know if any other improvements are needed. Thanks!

@szehon-ho gentle ping!
|
| ### Supported SRIDs
|
| Spark includes a pre-built registry of standard Spatial Reference Identifiers (SRIDs) from the PROJ database, with overrides to support OGC standards. This registry enables validation and proper handling of coordinate systems for geospatial data.
Since users can query PROJ themselves, we should just list the OGC overrides.
|
| #### Commonly Used SRIDs
|
| | SRID | CRS Identifier | Name | Type | Description | Typical Use Case |
The "Type" column says GEOMETRY only or GEOGRAPHY or GEOMETRY per SRID. This frames the restriction backwards and requires readers to memorize per-SRID compatibility. The actual rule is simple:
- GEOMETRY accepts all SRIDs in the registry (geographic + projected + SRID 0)
- GEOGRAPHY only accepts geographic SRIDs (lat/lon coordinate systems)
I'd suggest:
- State this rule clearly before the table (once), rather than repeating it per row.
- Replace the "Type" column with "CRS Type" showing `Geographic` or `Projected`, which is the intrinsic property of the CRS. The GEOMETRY/GEOGRAPHY compatibility follows naturally: if `Geographic`, it works with both GEOGRAPHY and GEOMETRY; if `Projected`, GEOMETRY only.
Also consider dropping the "Typical Use Case" column — between "Name" and "Description" it's already covered, and 6 columns is very wide in Markdown.
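The two-line rule above is simple enough to express as a lookup. The Python sketch below models it for the five SRIDs discussed in this thread; the dictionary `CRS_KIND`, the label `"Cartesian"` for SRID 0, and the function `type_accepts_srid` are hypothetical names chosen for illustration, and the real registry contains far more entries.

```python
# Intrinsic CRS kind for the SRIDs discussed in this thread (illustrative subset).
CRS_KIND = {
    0: "Cartesian",      # no defined CRS; registered as non-geographic
    4326: "Geographic",
    4267: "Geographic",
    4269: "Geographic",
    3857: "Projected",
}


def type_accepts_srid(type_name: str, srid: int) -> bool:
    """Apply the rule: GEOMETRY takes any registered SRID, GEOGRAPHY only geographic ones."""
    kind = CRS_KIND.get(srid)
    if kind is None:
        return False  # not in this sketch's registry at all
    if type_name == "GEOMETRY":
        return True
    if type_name == "GEOGRAPHY":
        return kind == "Geographic"
    raise ValueError(f"unknown type: {type_name}")


for type_name in ("GEOMETRY", "GEOGRAPHY"):
    for srid in (0, 4326, 3857, 99999):
        print(f"{type_name}({srid}):", type_accepts_srid(type_name, srid))
```

Deriving compatibility from the CRS kind, rather than storing it per row, is the same simplification proposed for the table.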
| CREATE TABLE locations (id BIGINT, point GEOMETRY(4326));
|
| -- The schema will show OGC:CRS84, not EPSG:4326
| SELECT schema_of_json('{"point": ...}');
schema_of_json is a function for inferring a schema from a JSON string value — it doesn't inspect a table's schema. This example won't produce the claimed output.
A correct way to show this would be:

    CREATE TABLE locations (id BIGINT, point GEOMETRY(4326)) USING PARQUET;
    DESCRIBE locations;
    -- point column shows: geometry(4326)
    -- In Scala/Python the JSON schema shows the CRS string:
    --   spark.table("locations").schema.json() contains "geometry(OGC:CRS84)"

Or simply remove this sub-section and fold the key point ("GEOMETRY(4326) stores as geometry(OGC:CRS84) in schema JSON and Parquet metadata") into the existing "CRS Identifier Mapping" bullet in the Important Notes. That bullet already says this.
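As a rough illustration of what inspecting that JSON could look like, the Python sketch below parses a hand-written schema string and pulls out the CRS identifier of each geometry column. The JSON literal is an assumed example shaped like Spark's struct schema JSON with the `geometry(OGC:CRS84)` type string mentioned above; it was not produced by running Spark, and `geometry_crs` is a hypothetical helper.

```python
import json
import re

# ASSUMED example of schema JSON for the `locations` table; not real Spark output.
schema_json = json.dumps({
    "type": "struct",
    "fields": [
        {"name": "id", "type": "long", "nullable": True, "metadata": {}},
        {"name": "point", "type": "geometry(OGC:CRS84)", "nullable": True, "metadata": {}},
    ],
})

GEOMETRY_TYPE = re.compile(r"^geometry\((.+)\)$")


def geometry_crs(schema: str) -> dict:
    """Map each geometry column name to the CRS string found in its type."""
    result = {}
    for field in json.loads(schema)["fields"]:
        field_type = field["type"]
        # Nested types are JSON objects, so only plain strings are inspected.
        if isinstance(field_type, str):
            match = GEOMETRY_TYPE.match(field_type)
            if match:
                result[field["name"]] = match.group(1)
    return result


print(geometry_crs(schema_json))
```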
| ```sql
| -- Error: 99999 is not a valid SRID in the registry
| SELECT ST_GeomFromWKB(X'0101000000000000000000F03F0000000000000040', 99999);
| -- Throws error: Invalid SRID
The error examples here show invented text. The actual error is:
[ST_INVALID_SRID_VALUE] Invalid or unsupported SRID (spatial reference identifier) value: <srid>.
Either show the real error class/message, or just say -- Throws ST_INVALID_SRID_VALUE without inventing the phrasing. Users who search for the error text in the doc should be able to find it. Same applies to the GEOGRAPHY(3857) example below.
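One reason to document the exact error class is that users and tooling tend to match on it. The Python sketch below builds the message from the template quoted above and checks for the error class prefix; both helper names are hypothetical, and the code only manipulates strings rather than catching a real Spark exception.

```python
ERROR_CLASS = "ST_INVALID_SRID_VALUE"


def invalid_srid_message(srid: int) -> str:
    """Fill the <srid> placeholder of the message template quoted in this comment."""
    return (
        f"[{ERROR_CLASS}] Invalid or unsupported SRID "
        f"(spatial reference identifier) value: {srid}."
    )


def is_invalid_srid_error(message: str) -> bool:
    """Return True when a message starts with the documented error class."""
    return message.startswith(f"[{ERROR_CLASS}]")


message = invalid_srid_message(99999)
print(message)
print(is_invalid_srid_error(message))
print(is_invalid_srid_error("Throws error: Invalid SRID"))
```

The invented phrasing from the draft does not match the check, which is why a user searching the docs for the real error text would not find it.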
| | 4267 | `OGC:CRS27` | NAD27 | GEOGRAPHY or GEOMETRY | North American Datum 1927 | Legacy North American data |
| | 4269 | `OGC:CRS83` | NAD83 | GEOGRAPHY or GEOMETRY | North American Datum 1983 | North American mapping |
| | 3857 | `EPSG:3857` | Web Mercator | GEOMETRY only | Pseudo-Mercator projection | Web maps (Google Maps, OpenStreetMap, Bing Maps) |
| | 2154 | `EPSG:2154` | RGF93 / Lambert-93 | GEOMETRY only | French national coordinate system | France-specific mapping and GIS |
The table mixes truly common SRIDs (0, 4326, 3857) with region-specific ones (2154, 32633–32635) that most users won't encounter. Consider trimming to the 4–5 most universal entries (0, 4326, 4267, 4269, 3857) and noting that the full registry includes thousands more. The existing bullet at the bottom already says this, so these extra rows add length without proportional value.
| -- Output includes: geometry(OGC:CRS84)
| ```
|
| This CRS identifier is also stored in Parquet, Delta, and Iceberg metadata, so downstream tools see `OGC:CRS84` rather than `EPSG:4326`.
It depends on the data source; we should just mention Spark's native Parquet data source.
| | 32635 | `EPSG:32635` | WGS 84 / UTM zone 35N | GEOMETRY only | Universal Transverse Mercator, zone 35 North | Eastern Europe/Western Asia (18°E to 24°E) |
|
| **Important Notes:**
| * **GEOGRAPHY vs GEOMETRY**: Only geographic (latitude/longitude) SRIDs can be used with GEOGRAPHY types. Projected coordinate systems like Web Mercator (3857) or UTM zones, as well as SRID 0, work only with GEOMETRY.
If the GEOMETRY/GEOGRAPHY SRID rule is stated once before the table (per the earlier inline comment on the table header), this bullet restates the same thing. Consider removing it.
| **Important Notes:**
| * **GEOGRAPHY vs GEOMETRY**: Only geographic (latitude/longitude) SRIDs can be used with GEOGRAPHY types. Projected coordinate systems like Web Mercator (3857) or UTM zones, as well as SRID 0, work only with GEOMETRY.
| * **SRID 0**: Represents Cartesian coordinates with no defined CRS. `GEOMETRY(0)` means a fixed SRID of 0 (all rows use SRID 0), not per-row SRIDs. For per-row SRIDs, use `GEOMETRY(ANY)`.
| * **CRS Identifier Mapping**: When you create `GEOMETRY(4326)`, it stores as `geometry(OGC:CRS84)` in JSON schema, not `EPSG:4326`. This is an OGC standard override.
This is technically accurate, but the CRS string (OGC:CRS84 vs EPSG:4326) only surfaces if you call df.schema.json() programmatically or inspect Parquet/Iceberg file metadata directly. In all user-facing contexts (DESCRIBE, printSchema(), dtypes, error messages), users see the integer SRID (geometry(4326)). This is a niche detail that may cause more confusion than it resolves — consider dropping it, or shortening to a parenthetical like: "Note: the programmatic schema.json() API and storage-level metadata use CRS strings (e.g. OGC:CRS84) rather than integer SRIDs."
| * **SRID 0**: Represents Cartesian coordinates with no defined CRS. `GEOMETRY(0)` means a fixed SRID of 0 (all rows use SRID 0), not per-row SRIDs. For per-row SRIDs, use `GEOMETRY(ANY)`.
| * **CRS Identifier Mapping**: When you create `GEOMETRY(4326)`, it stores as `geometry(OGC:CRS84)` in JSON schema, not `EPSG:4326`. This is an OGC standard override.
| * **Registry Source**: The SRID registry is based on PROJ 9.7.1 and includes both EPSG and ESRI coordinate systems (e.g., `ESRI:102100`). The registry is pinned to this PROJ version and not synced live with external databases.
| * The registry includes many additional SRIDs for various UTM zones, national coordinate systems, and other projections from both EPSG and ESRI sources.
This restates what the "Registry Source" bullet directly above already says. Consider removing.
| * **Registry Source**: The SRID registry is based on PROJ 9.7.1 and includes both EPSG and ESRI coordinate systems (e.g., `ESRI:102100`). The registry is pinned to this PROJ version and not synced live with external databases.
| * The registry includes many additional SRIDs for various UTM zones, national coordinate systems, and other projections from both EPSG and ESRI sources.
|
| #### CRS Identifiers in Metadata
The schema_of_json example is incorrect (see earlier inline comment), and the only useful content here — that GEOMETRY(4326) uses OGC:CRS84 in metadata — is already covered by the "CRS Identifier Mapping" bullet above. Consider removing this entire sub-section.
| -- Throws error: SRID 3857 is not valid for GEOGRAPHY (projected coordinate system)
| ```
|
| This is a common pitfall: Web Mercator (3857) and UTM zones are projected (planar) coordinate systems and can only be used with GEOMETRY, not GEOGRAPHY. Only geographic (latitude/longitude) SRIDs like 4326, 4267, or 4269 work with GEOGRAPHY.
This repeats the GEOGRAPHY/GEOMETRY rule for the third time (pre-table intro, Important Notes bullet, and now here). The code example above already demonstrates the pitfall. Consider removing this paragraph.
To tie together the inline comments: the section has good content but a lot of redundancy. Here's a suggestion for the overall shape after trimming (~25 lines instead of ~60):

Thanks @szehon-ho for the detailed comments. I have incorporated all the changes you mentioned above. Please have a look in case any more changes are required.

Gentle ping @szehon-ho. I've addressed all the comments from your previous review. Could you take another look when you get a chance?
| persisted — the [Parquet](https://github.com/apache/parquet-format/blob/master/Geospatial.md)
| and [Iceberg](https://github.com/apache/iceberg/blob/master/format/spec.md) geospatial
| specifications require a fixed SRID per column.
| * The registry is based on PROJ 9.7.1 and includes both EPSG and ESRI coordinate systems.
Let's be more specific here. Say, Spark 4.2 => PROJ 9.7.1; we can expand the table later.
Let's also include the OGC overrides exactly (and in a separate table).

Do we want to have a pinned version as 4.2.0 here? Also, when we say a separate table for OGC overrides, how do we want to keep the content?

Yeah, something like a 'since version' table maybe? Example from other docs: https://spark.apache.org/docs/latest/sql-ref-ansi-compliance.html
Yes, separate table, I am thinking:
- PROJ version table
- OGC overrides table
- Commonly used SRIDs
My initial hunch was that there's no huge value in 'commonly used SRIDs', as it can probably be found on the web, but I can go either way.
The first two are more important imo, as they exactly describe the algorithm to select supported SRIDs.

Agree with your thoughts. I have updated the doc to have a table mapping Spark releases to the pinned PROJ version. I have also added an OGC override table, but now the table with commonly used SRIDs seems a bit repetitive. Do let me know your thoughts. Thanks!
| | 3857 | `EPSG:3857` | Web Mercator | Projected | Pseudo-Mercator projection used by web mapping services |
|
| **Notes:**
| * `GEOMETRY(0)` means a fixed SRID of 0 (all rows use SRID 0), not per-row SRIDs.
nit: a bit repetitive, can remove second part (as covered by the second bullet)
| * `GEOMETRY(0)` means a fixed SRID of 0 (all rows use SRID 0), not per-row SRIDs.
| For per-row SRIDs, use `GEOMETRY(ANY)`.
| * `GEOMETRY(ANY)` and `GEOGRAPHY(ANY)` are valid for in-memory and query use, but cannot be
| persisted — the [Parquet](https://github.com/apache/parquet-format/blob/master/Geospatial.md)
Let's rephrase (we are not saying it cannot be persisted in general); there may be other formats later that support it. In any case, it's orthogonal to Spark (as compute).
Persistence Notes - Iceberg (...) and Parquet (...) cannot persist GEOMETRY(ANY) and GEOGRAPHY(ANY)

@uros-db can you also take a look?

@szehon-ho @uros-db Have addressed all the above comments, PTAL
Doc accurately reflects the registry implementation — verified PROJ 9.7.1, the OGC overrides (4326/4267/4269), SRID 0 handling, default SRIDs for ST_GeomFromWKB/ST_GeogFromWKB, and the ST_INVALID_SRID_VALUE error class against the source. One small URL nit below.
| **Notes:**
| * `GEOMETRY(0)` means a fixed SRID of 0. For per-row SRIDs, use `GEOMETRY(ANY)`.
| * [Parquet](https://github.com/apache/parquet-format/blob/master/Geospatial.md)
| and [Iceberg](https://github.com/apache/iceberg/blob/master/format/spec.md) geospatial
apache/iceberg's default branch is main; the /master URL works via GitHub redirect but isn't canonical.
Suggested change (before, then after):
| and [Iceberg](https://github.com/apache/iceberg/blob/master/format/spec.md) geospatial
| and [Iceberg](https://github.com/apache/iceberg/blob/main/format/spec.md) geospatial
Thanks @cloud-fan. I have fixed this. Could you please let me know whether this can be merged if no more changes are needed?
| | 3857 | `EPSG:3857` | Web Mercator | Projected | Pseudo-Mercator projection used by web mapping services |
|
| **Notes:**
| * `GEOMETRY(0)` means a fixed SRID of 0. For per-row SRIDs, use `GEOMETRY(ANY)`.
Suggested change (before, then after):
| * `GEOMETRY(0)` means a fixed SRID of 0. For per-row SRIDs, use `GEOMETRY(ANY)`.
| * `GEOMETRY(0)` means a fixed SRID of 0. For mixed per-row SRIDs, use `GEOMETRY(ANY)`.
Thanks @uros-db, have accommodated this change.
uros-db left a comment
LGTM, thanks @pratham76! Let's just make sure to include this in 4.2, cc @cloud-fan.
Also cc @szehon-ho PTAL and make sure to update this section after PROJ upgrade.
thanks, merging to master/4.x/4.2!
…types

### What changes were proposed in this pull request?
#54780 added support for a pre-built SRID registry with standard spatial reference systems, but seems like corresponding documentation is missed out, adding through this PR.

### Why are the changes needed?
Users would know supported SRIDs

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Only Doc Changes.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #55207 from pratham76/srid-doc.

Authored-by: Pratham Manja <prathammanja76@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 3bb6a67)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
#54780 added support for a pre-built SRID registry with standard spatial reference systems, but the corresponding documentation seems to have been missed. This PR adds it.

### Why are the changes needed?
So that users know which SRIDs are supported.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Only doc changes.

### Was this patch authored or co-authored using generative AI tooling?
No