-
Notifications
You must be signed in to change notification settings - Fork 2
Fixes to the notebooks and adding GeoParquet generation notebook #52
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Found 3 changed notebooks. Review the changes at https://app.gitnotebooks.com/wherobots/wherobots-examples/pull/52 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
|
"---\n", | ||
"\n", | ||
"```python\n", | ||
"mobile = sedona.read.format(\"parquet\")\\\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this can probably be written more idiomatically as:
sedona.read.parquet("s3://ookla-open-data/parquet/performance/").where("type = 'mobile'")
"---\n", | ||
"\n", | ||
"\n", | ||
"```python\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you do what I suggested in the above comment you will get the fields you want from the path in the df.
" .withColumn(\"year\", regexp_extract(\"file_path\", r\"year=(\\d+)\", 1)) \\\n", | ||
" .withColumn(\"quarter\", regexp_extract(\"file_path\", r\"quarter=(\\d+)\", 1)) \\\n", | ||
" .withColumn(\"geometry\", expr(\"ST_GeomFromText(tile)\")) \\\n", | ||
" .withColumn(\"bbox\", expr(\"struct(st_xmin(ST_GeomFromText(tile)) as xmin, st_ymin(ST_GeomFromText(tile)) as ymin, st_xmax(ST_GeomFromText(tile)) as xmax, st_ymax(ST_GeomFromText(tile)) as ymax) as bbox\")) \\\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dont repeatedly call ST_GeomFromText. use the geometry column above.
" .withColumn(\"quarter\", regexp_extract(\"file_path\", r\"quarter=(\\d+)\", 1)) \\\n", | ||
" .withColumn(\"geometry\", expr(\"ST_GeomFromText(tile)\")) \\\n", | ||
" .withColumn(\"bbox\", expr(\"struct(st_xmin(ST_GeomFromText(tile)) as xmin, st_ymin(ST_GeomFromText(tile)) as ymin, st_xmax(ST_GeomFromText(tile)) as xmax, st_ymax(ST_GeomFromText(tile)) as ymax) as bbox\")) \\\n", | ||
" .withColumn(\"geohash\", expr(\"ST_GeoHash(ST_GeomFromText(tile), 10)\")) \\\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same. reuse geometry column here
" .withColumn(\"geometry\", expr(\"ST_GeomFromText(tile)\")) \\\n", | ||
" .withColumn(\"bbox\", expr(\"struct(st_xmin(ST_GeomFromText(tile)) as xmin, st_ymin(ST_GeomFromText(tile)) as ymin, st_xmax(ST_GeomFromText(tile)) as xmax, st_ymax(ST_GeomFromText(tile)) as ymax) as bbox\")) \\\n", | ||
" .withColumn(\"geohash\", expr(\"ST_GeoHash(ST_GeomFromText(tile), 10)\")) \\\n", | ||
" .selectExpr(\"*\", ''' \"mobile\" as type''') \\\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this wont be needed once you change the loading logic.
} | ||
], | ||
"source": [ | ||
"from pyspark.sql.functions import input_file_name, regexp_extract\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all the same comments from this cell
"id": "d4022da3-5a43-445e-bc11-ccc1a490452f", | ||
"metadata": {}, | ||
"source": [ | ||
"```python\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this not the default value? seems uneeded
"---\n", | ||
"\n", | ||
"```python\n", | ||
"mobile = mobile.repartition(1)\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are you doing this when you repartitionByRange in the cell below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe this is a way of ensuring one file per partition.
Matt Powers has a good article about Hive Style Partitioning: https://delta.io/blog/pros-cons-hive-style-partionining/ |
"```python\n", | ||
" .orderBy(expr(\"ST_GeoHash(ST_GeomFromText(tile), 6)\")) \\\n", | ||
"```\n", | ||
"**What this does:**\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this "what this does" could be omitted since code is self explanatory. repetitive "what this does" could be removed and markdown explanations refactored to more narrative style.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should not be showing creating/writing PROJJSON by hand in the notebook cell. Better to use a library like proj or geopandas to create the projjson geometry and crs representation.
No description provided.