Skip to content

Fixes to the notebooks and adding GeoParquet generation notebook #52

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

mbforr
Copy link
Contributor

@mbforr mbforr commented Apr 1, 2025

No description provided.

Copy link

gitnotebooks bot commented Apr 1, 2025

Found 3 changed notebooks. Review the changes at https://app.gitnotebooks.com/wherobots/wherobots-examples/pull/52

@mbforr mbforr requested a review from james-willis April 1, 2025 18:57
@RoboDonut RoboDonut self-requested a review April 22, 2025 20:59
Copy link
Contributor

@RoboDonut RoboDonut left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@rbavery
Copy link
Member

rbavery commented Apr 25, 2025

@mbforr looks like this is good to merge if you don't have other local changes.

"---\n",
"\n",
"```python\n",
"mobile = sedona.read.format(\"parquet\")\\\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can probably be written more idiomatically as:

sedona.read.parquet("s3://ookla-open-data/parquet/performance/").where("type = 'mobile'")

"---\n",
"\n",
"\n",
"```python\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you do what I suggested in the above comment you will get the fields you want from the path in the df.

" .withColumn(\"year\", regexp_extract(\"file_path\", r\"year=(\\d+)\", 1)) \\\n",
" .withColumn(\"quarter\", regexp_extract(\"file_path\", r\"quarter=(\\d+)\", 1)) \\\n",
" .withColumn(\"geometry\", expr(\"ST_GeomFromText(tile)\")) \\\n",
" .withColumn(\"bbox\", expr(\"struct(st_xmin(ST_GeomFromText(tile)) as xmin, st_ymin(ST_GeomFromText(tile)) as ymin, st_xmax(ST_GeomFromText(tile)) as xmax, st_ymax(ST_GeomFromText(tile)) as ymax) as bbox\")) \\\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dont repeatedly call ST_GeomFromText. use the geometry column above.

" .withColumn(\"quarter\", regexp_extract(\"file_path\", r\"quarter=(\\d+)\", 1)) \\\n",
" .withColumn(\"geometry\", expr(\"ST_GeomFromText(tile)\")) \\\n",
" .withColumn(\"bbox\", expr(\"struct(st_xmin(ST_GeomFromText(tile)) as xmin, st_ymin(ST_GeomFromText(tile)) as ymin, st_xmax(ST_GeomFromText(tile)) as xmax, st_ymax(ST_GeomFromText(tile)) as ymax) as bbox\")) \\\n",
" .withColumn(\"geohash\", expr(\"ST_GeoHash(ST_GeomFromText(tile), 10)\")) \\\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same. reuse geometry column here

" .withColumn(\"geometry\", expr(\"ST_GeomFromText(tile)\")) \\\n",
" .withColumn(\"bbox\", expr(\"struct(st_xmin(ST_GeomFromText(tile)) as xmin, st_ymin(ST_GeomFromText(tile)) as ymin, st_xmax(ST_GeomFromText(tile)) as xmax, st_ymax(ST_GeomFromText(tile)) as ymax) as bbox\")) \\\n",
" .withColumn(\"geohash\", expr(\"ST_GeoHash(ST_GeomFromText(tile), 10)\")) \\\n",
" .selectExpr(\"*\", ''' \"mobile\" as type''') \\\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this wont be needed once you change the loading logic.

}
],
"source": [
"from pyspark.sql.functions import input_file_name, regexp_extract\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all the same comments from this cell

"id": "d4022da3-5a43-445e-bc11-ccc1a490452f",
"metadata": {},
"source": [
"```python\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this not the default value? seems uneeded

"---\n",
"\n",
"```python\n",
"mobile = mobile.repartition(1)\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are you doing this when you repartitionByRange in the cell below?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe this is a way of ensuring one file per partition.

@james-willis
Copy link
Contributor

Matt Powers has a good article about Hive Style Partitioning: https://delta.io/blog/pros-cons-hive-style-partionining/

"```python\n",
" .orderBy(expr(\"ST_GeoHash(ST_GeomFromText(tile), 6)\")) \\\n",
"```\n",
"**What this does:**\n",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this "what this does" could be omitted since code is self explanatory. repetitive "what this does" could be removed and markdown explanations refactored to more narrative style.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not be showing creating/writing PROJJSON by hand in the notebook cell. Better to use a library like proj or geopandas to create the projjson geometry and crs representation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

4 participants