Fixes to the notebooks and adding GeoParquet generation notebook #52

mbforr · 2025-04-01T18:57:31Z

No description provided.

gitnotebooks · 2025-04-01T18:57:34Z

Found 3 changed notebooks. Review the changes at https://app.gitnotebooks.com/wherobots/wherobots-examples/pull/52

RoboDonut

LGTM.

rbavery · 2025-04-25T22:04:47Z

~~@mbforr looks like this is good to merge if you don't have other local changes.~~

james-willis · 2025-05-06T04:28:33Z

Reading_and_Writing_Data/Creating_Efficient_GeoParquet_Files.ipynb

+    "---\n",
+    "\n",
+    "```python\n",
+    "mobile = sedona.read.format(\"parquet\")\\\n",


this can probably be written more idiomatically as:

sedona.read.parquet("s3://ookla-open-data/parquet/performance/").where("type = 'mobile'")

james-willis · 2025-05-06T04:30:09Z

Reading_and_Writing_Data/Creating_Efficient_GeoParquet_Files.ipynb

+    "---\n",
+    "\n",
+    "\n",
+    "```python\n",


if you do what I suggested in the above comment you will get the fields you want from the path in the df.
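
To illustrate the point about path-derived fields: with Hive-style directories such as `type=mobile/year=2019/quarter=1`, Spark's partition discovery turns each `key=value` segment below the base path into a column, so `input_file_name()` plus `regexp_extract` should not be needed. The pure-Python sketch below only mimics that parsing for illustration. The helper name and the example file name are assumptions, and Spark additionally infers column types, which this sketch does not.

```python
import re

# Matches one Hive-style path segment such as "year=2019".
_SEGMENT = re.compile(r"^([^=/]+)=([^/]+)$")


def partition_values(file_path: str, base_path: str) -> dict:
    """Return the key=value directory segments found below base_path.

    Hypothetical helper: it mimics what Spark's partition discovery
    derives from the directory layout, but keeps every value as a string.
    """
    if file_path.startswith(base_path):
        relative = file_path[len(base_path):]
    else:
        relative = file_path
    values = {}
    # The last path component is the file name, so it is skipped.
    for segment in relative.strip("/").split("/")[:-1]:
        match = _SEGMENT.match(segment)
        if match:
            values[match.group(1)] = match.group(2)
    return values


base = "s3://ookla-open-data/parquet/performance/"
# Illustrative object key; the file name is an assumption.
path = base + "type=mobile/year=2019/quarter=1/tiles.parquet"
print(partition_values(path, base))
```

Reading from the base path therefore yields `type`, `year`, and `quarter` as ordinary columns that can be filtered with `where`.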

james-willis · 2025-05-06T04:33:32Z

Reading_and_Writing_Data/Creating_Efficient_GeoParquet_Files.ipynb

+    "    .withColumn(\"year\", regexp_extract(\"file_path\", r\"year=(\\d+)\", 1)) \\\n",
+    "    .withColumn(\"quarter\", regexp_extract(\"file_path\", r\"quarter=(\\d+)\", 1)) \\\n",
+    "    .withColumn(\"geometry\", expr(\"ST_GeomFromText(tile)\")) \\\n",
+    "    .withColumn(\"bbox\", expr(\"struct(st_xmin(ST_GeomFromText(tile)) as xmin, st_ymin(ST_GeomFromText(tile)) as ymin, st_xmax(ST_GeomFromText(tile)) as xmax, st_ymax(ST_GeomFromText(tile)) as ymax) as bbox\")) \\\n",


Don't call ST_GeomFromText repeatedly. Use the geometry column defined above.

james-willis · 2025-05-06T04:33:47Z

Reading_and_Writing_Data/Creating_Efficient_GeoParquet_Files.ipynb

+    "    .withColumn(\"quarter\", regexp_extract(\"file_path\", r\"quarter=(\\d+)\", 1)) \\\n",
+    "    .withColumn(\"geometry\", expr(\"ST_GeomFromText(tile)\")) \\\n",
+    "    .withColumn(\"bbox\", expr(\"struct(st_xmin(ST_GeomFromText(tile)) as xmin, st_ymin(ST_GeomFromText(tile)) as ymin, st_xmax(ST_GeomFromText(tile)) as xmax, st_ymax(ST_GeomFromText(tile)) as ymax) as bbox\")) \\\n",
+    "    .withColumn(\"geohash\", expr(\"ST_GeoHash(ST_GeomFromText(tile), 10)\")) \\\n",


Same here: reuse the geometry column.
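
A minimal sketch of the suggestion in this thread, written with `selectExpr` so that the WKT in `tile` is parsed once and every later expression refers to the resulting `geometry` column. The function name is hypothetical; the column names `tile`, `geometry`, `bbox`, and `geohash` come from the diff above, and `df` is assumed to be a DataFrame from a Sedona-enabled session.

```python
def add_spatial_columns(df):
    """Parse the WKT once, then derive bbox and geohash from `geometry`.

    Hypothetical helper; expects a Spark DataFrame with a WKT column
    named `tile` and Sedona's ST_ functions registered.
    """
    # Step 1: the only call to ST_GeomFromText.
    with_geometry = df.selectExpr("*", "ST_GeomFromText(tile) AS geometry")

    # Step 2: derived columns reference `geometry`. A separate select is
    # used so the alias from step 1 is already a real column.
    return with_geometry.selectExpr(
        "*",
        "struct("
        "ST_XMin(geometry) AS xmin, ST_YMin(geometry) AS ymin, "
        "ST_XMax(geometry) AS xmax, ST_YMax(geometry) AS ymax"
        ") AS bbox",
        "ST_GeoHash(geometry, 10) AS geohash",
    )
```

The same chain could be kept as `withColumn` calls; the only change that matters is that `bbox` and `geohash` read from `geometry` instead of re-parsing `tile`.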

james-willis · 2025-05-06T04:34:11Z

Reading_and_Writing_Data/Creating_Efficient_GeoParquet_Files.ipynb

+    "    .withColumn(\"geometry\", expr(\"ST_GeomFromText(tile)\")) \\\n",
+    "    .withColumn(\"bbox\", expr(\"struct(st_xmin(ST_GeomFromText(tile)) as xmin, st_ymin(ST_GeomFromText(tile)) as ymin, st_xmax(ST_GeomFromText(tile)) as xmax, st_ymax(ST_GeomFromText(tile)) as ymax) as bbox\")) \\\n",
+    "    .withColumn(\"geohash\", expr(\"ST_GeoHash(ST_GeomFromText(tile), 10)\")) \\\n",
+    "    .selectExpr(\"*\", ''' \"mobile\" as type''') \\\n",


This won't be needed once you change the loading logic.

james-willis · 2025-05-06T04:34:54Z

Reading_and_Writing_Data/Creating_Efficient_GeoParquet_Files.ipynb

+    }
+   ],
+   "source": [
+    "from pyspark.sql.functions import input_file_name, regexp_extract\n",


All the same comments apply to this cell.

james-willis · 2025-05-06T04:35:52Z

Reading_and_Writing_Data/Creating_Efficient_GeoParquet_Files.ipynb

+   "id": "d4022da3-5a43-445e-bc11-ccc1a490452f",
+   "metadata": {},
+   "source": [
+    "```python\n",


Is this not the default value? It seems unneeded.

james-willis · 2025-05-06T04:37:33Z

Reading_and_Writing_Data/Creating_Efficient_GeoParquet_Files.ipynb

+    "---\n",
+    "\n",
+    "```python\n",
+    "mobile = mobile.repartition(1)\n",


why are you doing this when you repartitionByRange in the cell below?

maybe this is a way of ensuring one file per partition.
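
If the goal is one output file per Hive partition, a commonly used pattern is to repartition on the same columns that are passed to `partitionBy`, rather than collapsing everything with `repartition(1)`. The sketch below assumes `year` and `quarter` are the partition columns and `geohash` is the sort key, as in the diff in the earlier threads. The function name, the output path argument, and the `overwrite` mode are illustrative choices, not what the notebook does.

```python
def write_one_file_per_partition(df, output_path):
    """Write GeoParquet with a single file in each year/quarter directory.

    Hypothetical helper; assumes `df` has `year`, `quarter`, and `geohash`
    columns and comes from a Sedona-enabled Spark session.
    """
    (
        df.repartition("year", "quarter")   # all rows of one year/quarter land in one task
        .sortWithinPartitions("geohash")    # keep nearby tiles adjacent within each task
        .write.mode("overwrite")
        .partitionBy("year", "quarter")     # Hive-style year=/quarter= directories
        .format("geoparquet")
        .save(output_path)
    )
```

This trades write parallelism for fewer files: each `year`/`quarter` directory is written by a single task.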

james-willis · 2025-05-06T04:40:51Z

Matt Powers has a good article about Hive Style Partitioning: https://delta.io/blog/pros-cons-hive-style-partionining/

rbavery · 2025-05-06T18:30:53Z

Reading_and_Writing_Data/Creating_Efficient_GeoParquet_Files.ipynb

+    "```python\n",
+    "    .orderBy(expr(\"ST_GeoHash(ST_GeomFromText(tile), 6)\")) \\\n",
+    "```\n",
+    "**What this does:**\n",


This "What this does" section could be omitted, since the code is self-explanatory. The repeated "What this does" headings could be removed and the markdown explanations refactored into a more narrative style.

rbavery · 2025-05-09T17:41:33Z

Reading_and_Writing_Data/Creating_Efficient_GeoParquet_Files.ipynb

We should not show PROJJSON being created or written by hand in the notebook cell. Better to use a library like proj or geopandas to create the PROJJSON geometry and CRS representation.
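
One way this could look, as a sketch only: `pyproj` (the Python bindings for PROJ) can emit PROJJSON for a CRS, so the notebook would not need a hand-written JSON literal. The function name is hypothetical, and `EPSG:4326` is an assumption about the CRS of the tiles, not something stated in this thread.

```python
import json

from pyproj import CRS


def crs_as_projjson(crs_input="EPSG:4326"):
    """Return the PROJJSON for a CRS as a plain dict, generated by PROJ.

    Hypothetical helper; the default EPSG:4326 is an assumption.
    """
    return CRS.from_user_input(crs_input).to_json_dict()


projjson = crs_as_projjson()
# Serialise for embedding in metadata or for display in the notebook.
projjson_text = json.dumps(projjson)
```

With geopandas, the equivalent object is available from a GeoDataFrame's `crs` attribute, which is also a `pyproj.CRS`.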

Fixes to the notebooks and adding GeoParquet generation notebook

26e00a5

mbforr requested a review from james-willis April 1, 2025 18:57

RoboDonut self-requested a review April 22, 2025 20:59

RoboDonut approved these changes Apr 22, 2025


james-willis requested changes May 6, 2025


rbavery requested changes May 6, 2025


rbavery requested changes May 9, 2025

