
Commit

Merge branch 'main' of https://github.com/hotosm/osm-rawdata
spwoodcock committed Oct 24, 2023
2 parents c83ebe0 + 09f109f commit 99e41c1
Showing 2 changed files with 112 additions and 64 deletions.
70 changes: 56 additions & 14 deletions docs/overture.md
@@ -18,7 +18,7 @@ time. Each file has features spread across the planet, instead of a
subset in a geographical region. If you wish to get all the data for a
region, you have to load all 120 files into a database.

While Overture recommends using [Amazon
Athena](https://aws.amazon.com/athena/) or [Microsoft
Synapse](https://learn.microsoft.com/en-us/azure/synapse-analytics/get-started-create-workspace),
you can also use a database.
Expand All @@ -30,28 +30,22 @@ import a parquet file into postgres. In these cases the database
schema will resemble the Overture schema. Since HOT maintains its own
database schema that is also optimized for query performance, you can
use the [importer](https://hotosm.github.io/osm-rawdata/importer/)
program to import into the Underpass schema. The importer utility can
parse any of the data files that use the V2 schema into GeoJSON.

## Schema

There are two versions of the file schema. The original schema had
fewer columns, and each data type had a schema oriented towards that
data type. The new schema (Oct 2023) is larger, but all the data
types are supported in the same schema.

The schema used in the Overture data files is [documented here](
https://docs.overturemaps.org/reference). This document is just a
summary with some implementation details.

### Buildings

* id: tmp_[Giant HEX number]
* updatetime: The last time a feature was updated
* version: The version of the feature
* names: The names of the building
* height: The height of the feature in meters
* numfloors: The number of floors in the building
* class: The type of building: residential, commercial, etc...
* geometry: The feature geometry
* sources: A list of dataset sources with optional recordId
* level: This appears to be unused
* bbox: A bounding box of the feature

The current list of buildings datasets is:

* Austin Building Footprints Year 2013 2D Buildings
@@ -69,6 +63,51 @@ The current list of buildings datasets is:
* USGS Lidar
* Washington DC Open Data 3D Buildings

Since the Microsoft ML Buildings and the OpenStreetMap data are
available elsewhere, and are more up-to-date for global coverage, all
of the other datasets are US-only at this time.

The primary columns of interest to OSM are the number of building
floors, the height in meters, and the name if it has one. These
columns are not set in all of the datasets, but where they exist, they
can be added to OSM during conflation.
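
A minimal sketch of that conflation mapping might look like the following (the OSM tag choices here are assumptions for illustration, not HOT's actual conflation logic):

```python
def building_tags(props: dict) -> dict:
    """Map a few Overture building columns to OSM-style tags.

    Only columns that are present and non-empty are mapped; this is a
    sketch, not the real conflation implementation.
    """
    tags = {"building": "yes"}
    if props.get("class"):
        tags["building"] = props["class"]       # e.g. residential, commercial
    if props.get("height"):
        tags["height"] = str(props["height"])   # meters
    if props.get("numfloors"):
        tags["building:levels"] = str(props["numfloors"])
    if props.get("names"):
        tags["name"] = props["names"]
    return tags
```
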

As a warning, the USGS Lidar dataset has many really bad building
geometries, so only the height column is useful, if accurate.

### Places

The *places* data are POIs. This appears to cover amenities, and
contains tags related to that OSM category. This dataset is from
Meta, and the data appears to be derived from Facebook.

The columns that are of interest to OSM are:

* freeform - The address of the amenity, although the format is not
consistent
* socials - An array of social media links for this amenity
* phone - The phone number if it has one
* websites - The website URL if it has one
* value - The name of the amenity if known
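
For illustration, those columns could be mapped to OSM tags along these lines (a sketch based on the column list above; the tag mapping itself is an assumption):

```python
def place_tags(props: dict) -> dict:
    """Sketch: map Overture *places* columns to OSM-style tags."""
    tags = {}
    if props.get("value"):
        tags["name"] = props["value"]
    if props.get("phone"):
        tags["phone"] = props["phone"]
    if props.get("websites"):
        # websites is a list, but OSM's website tag takes a single URL
        tags["website"] = props["websites"][0]
    if props.get("freeform"):
        tags["addr:full"] = props["freeform"]
    # socials has no standard single OSM key, so it is skipped here
    return tags
```
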

### Highways

In the current highway *segment* data files, the only source is
OSM. In that case it's better to use up-to-date OSM data. It'll be
interesting to see if Overture imports the publicly available
highway datasets from the USGS, or some state governments. That would
be very useful.

The Overture *segments* data files are equivalent to an OSM way, with
tags specific to that highway linestring. There are separate data
files for *connections*, which are equivalent to an OSM relation.
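
Conceptually the mapping looks like this (all field names are illustrative, not the actual Overture or OSM schema):

```python
# An Overture *segment* plays the role of an OSM way: one linestring
# plus highway tags. A *connection* groups segments by id, much as an
# OSM relation groups ways.
segment = {
    "id": "seg-1",
    "geometry": "LINESTRING (0 0, 1 1)",
    "tags": {"highway": "residential"},
}

connection = {
    "id": "conn-1",
    "members": ["seg-1", "seg-2"],
}

# A segment belongs to a connection when its id is listed as a member.
is_member = segment["id"] in connection["members"]
```
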

### Admin Boundaries

The administrative boundaries data contains only OSM data, so there
is no reason to care about these files.

## Special Columns

### names
Expand All @@ -81,6 +120,9 @@ a language value as well.
* alternate
* short

Each of these can have multiple values, each of which consists of a
value and the language.
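
A sketch of picking one value out of that structure (the nesting shown follows the description above; treat the exact layout as an assumption):

```python
def pick_name(names: dict, lang: str = "en"):
    """Sketch: choose a display name from an Overture-style names column.

    `names` is assumed to look like
    {"common": [{"value": "...", "language": "en"}, ...], "official": [...]}.
    """
    preference = ("official", "common", "alternate", "short")
    # first pass: exact language match, in preference order
    for variant in preference:
        for item in names.get(variant) or []:
            if item.get("language") == lang:
                return item.get("value")
    # fallback: first value of any variant
    for variant in preference:
        if names.get(variant):
            return names[variant][0].get("value")
    return None
```
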

### sources

The sources column is an array with two entries. The first entry is
106 changes: 56 additions & 50 deletions osm_rawdata/overture.py
@@ -35,7 +35,7 @@
from codetiming import Timer

# Instantiate logger
log = logging.getLogger('osm-rawdata')

class Overture(object):
def __init__(self,
Expand All @@ -52,27 +52,9 @@ def __init__(self,
self.filespec = filespec
log.debug(f"Read {len(self.data)} entries from {filespec}")

    def parse(self,
              data: Series,
              ):
        # log.debug(data)
        entry = dict()
        # timer = Timer(text="importParquet() took {seconds:.0f}s")
Expand All @@ -85,30 +67,57 @@ def parse(self,
            if key == 'geometry':
                geom = wkb.loads(value)
            if type(value) == ndarray:
                if type(value[0]) == dict:
                    for k1, v1 in value[0].items():
                        if v1 is not None:
                            if type(v1) == ndarray:
                                import epdb; epdb.st()
                            entry[k1] = v1
                else:
                    # FIXME: for now the data only has one entry in the array,
                    # but this could change.
                    if type(value[0]) == ndarray:
                        import epdb; epdb.st()
                    entry[key] = value[0]
                continue
            if key == 'sources' and type(value) == list:
                if type(value[0]) == ndarray:
                    import epdb; epdb.st()
                if 'dataset' in value[0]:
                    entry['source'] = value[0]['dataset']
                    if 'recordId' in value[0] and value[0]['recordId'] is not None:
                        entry['record'] = value[0]['recordId']
                    if 'confidence' in value[0] and value[0]['confidence'] is not None:
                        entry['confidence'] = value[0]['confidence']
                else:
                    entry['source'] = value[0]['dataset']
                    if value[0]['recordId'] is not None:
                        entry['record'] = value[0]['recordId']
                    if value[0]['confidence'] is not None:
                        entry['confidence'] = value[0]['confidence']
            if type(value) == dict:
                if key == 'bbox':
                    continue
                # the names column is the only dictionary we care about
                for k1, v1 in value.items():
                    if v1 is None:
                        continue
                    if type(v1) == dict:
                        # print(f"DICT: {key} = {value}")
                        for k2, v2 in v1.items():
                            if v2 is None:
                                continue
                            if type(v2) == ndarray:
                                for k3, v3 in v2.tolist()[0].items():
                                    if v3 is not None:
                                        entry[k3] = v3
                            elif type(v2) == str:
                                entry[k2] = v2
                        continue
                    # FIXME: we should use the language to adjust the name tag
                    # lang = v1[0]['language']
                    if k1 == 'common':
                        entry['loc_name'] = v1[0]['value']
                    if k1 == 'official':
                        entry['name'] = v1[0]['value']
                    if k1 == 'alternate':
                        entry['alt_name'] = v1[0]['value']
                    # print(f"ROW: {k1} = {v1}")
        # timer.stop()
        return Feature(geometry=geom, properties=entry)

Expand All @@ -126,7 +135,6 @@ def main():
    parser.add_argument("-v", "--verbose", action="store_true", help="verbose output")
    parser.add_argument("-i", "--infile", required=True, help="Input file")
    parser.add_argument("-o", "--outfile", default='overture.geojson', help="Output file")
    parser.add_argument("-c", "--category", choices=categories, required=True, help="Data category")

    args = parser.parse_args()

Expand All @@ -150,21 +158,19 @@ def main():
    for index in overture.data.index:
        spin.next()
        feature = overture.data.loc[index]
        entry = overture.parse(feature)
        if entry['properties']['dataset'] != 'OpenStreetMap':
            features.append(entry)

    if len(features) > 0:
        file = open(args.outfile, 'w')
        geojson.dump(FeatureCollection(features), file)
        timer.stop()
        log.info(f"Wrote {args.outfile}")
    else:
        log.info(f"There was no non-OSM data in {args.infile}")

    spin.finish()

if __name__ == "__main__":
    """This is just a hook so this file can be run standalone during development."""
