Commit

First commit
rcorrero committed Jul 4, 2020
0 parents commit dfba246
Showing 12 changed files with 429 additions and 0 deletions.
13 changes: 13 additions & 0 deletions .gitignore
@@ -0,0 +1,13 @@
# Text editor backups #
#######################
*~
*.pyc
*.pyo

# Irrelevant background files #
###############################
/meta/

# Personal files #
##################
/poisson/private/
11 changes: 11 additions & 0 deletions LICENSE
@@ -0,0 +1,11 @@
Copyright 2020 Richard Correro

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
32 changes: 32 additions & 0 deletions README.md
@@ -0,0 +1,32 @@
poisson — Richard Correro
==============================

poisson is a Python module for small vessel detection in optical satellite imagery. This module provides a framework for training ship detection models and for using them to identify vessels in satellite imagery from [Planet](https://www.planet.com/).

This repository contains the module itself, a trained model and its associated files, working notes from the module's creation, and several papers relevant to vessel detection methods.

Repository Structure
------------
```
.
├── LICENSE
├── README.md
├── notes
│   ├── Panoptis\ ǀ\ Imagery\ Processing\ Pipeline.md
│   └── Poisson\ ǀ\ Development.md
├── papers
│   ├── remotesensing-10-00511.pdf
│   └── vessel_detect_survey.pdf
├── poisson
│   ├── panoptis
│   └── stropheus
└── setup.py
```

Support
-----------
Poisson was developed with the support of a research grant from the Stanford University Department of Statistics.

------------
Created by Richard Correro in 2020. Contact me at rcorrero at stanford dot edu
46 changes: 46 additions & 0 deletions notes/Panoptis ǀ Imagery Processing Pipeline.md
@@ -0,0 +1,46 @@
---
title: Panoptis | Imagery Processing Pipeline
created: '2020-06-25T20:23:26.524Z'
modified: '2020-07-02T21:08:49.188Z'
---

# Panoptis | Imagery Processing Pipeline

This is a working paper recording the development of _panoptis_, a satellite image processing pipeline designed for [poisson](https://github.com/rcorrero/poisson). πᾰνόπτηϛ means "all-seeing" in Attic Greek ([Woodhouse](http://artflsrv02.uchicago.edu/cgi-bin/efts/dicos/woodhouse_test.pl?keyword=^All-seeing,%20adj.)).

## Architecture
The image processing pipeline itself is written entirely in Python using several third-party packages. For now my intention is to build panoptis for use on a single machine until satisfactory performance is attained, at which point I will refactor the code, containerize it, and run it at scale using a container orchestration framework. Satellite image processing is [embarrassingly parallel](https://en.wikipedia.org/wiki/Embarrassingly_parallel) in that the task may be separated by area of interest (AOI), image type, band, etc. When searching for objects in near-shore open ocean, large areas must be analyzed, so any interesting application requires large-scale image processing. To run on several replicates, the code may be structured so that each instance processes images listed in a shared database, and writes the statistics of interest to a second database collecting the results from all replicates. __Update:__ Initial calculations suggest that a trained model running on a single machine should be able to handle the largest AOIs we will need to label. Containerization is likely unnecessary, but this is not certain.
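
A minimal sketch of that structure, assuming a hypothetical SQLite work-queue table (`images`) and results table (`results`), with `process_image` standing in for the pipeline described below:

```python
import json
import sqlite3


def run_worker(db_path):
    conn = sqlite3.connect(db_path)
    while True:
        # Claim one pending image; each replicate runs this same loop.
        # (A real multi-worker deployment needs an atomic claim here.)
        row = conn.execute(
            "SELECT id, uri FROM images WHERE state = 'pending' LIMIT 1"
        ).fetchone()
        if row is None:
            break  # Work queue is empty.
        img_id, uri = row
        conn.execute("UPDATE images SET state = 'claimed' WHERE id = ?",
                     (img_id,))
        conn.commit()
        stats = process_image(uri)  # Hypothetical: download, clip, detect.
        # Write this replicate's results to the shared results table.
        conn.execute("INSERT INTO results (image_id, stats) VALUES (?, ?)",
                     (img_id, json.dumps(stats)))
        conn.execute("UPDATE images SET state = 'done' WHERE id = ?",
                     (img_id,))
        conn.commit()
    conn.close()
```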


The pipeline consists of the following operations, in order (a code sketch follows the list):
1. Accessing imagery from storage (Google Cloud Storage in my case)
2. Clipping the imagery to the AOI
3. Identifying land and other noise in the imagery (e.g. cloud cover)
4. Object detection using a set of techniques (so that objects of different sizes may be detected)
5. Object classification by size, shape, and other factors of interest
6. Writing detected object data to a database or dataframe
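
Concretely, the stages chain as in the sketch below; every function named is a placeholder for the corresponding stage, not an implemented API:

```python
def run_pipeline(object_name, aoi_geojson):
    raw = fetch_from_storage(object_name)      # 1. Access imagery from storage.
    clipped = clip_to_aoi(raw, aoi_geojson)    # 2. Clip the imagery to the AOI.
    cleaned = mask_land_and_clouds(clipped)    # 3. Mask land and cloud noise.
    candidates = detect_objects(cleaned)       # 4. Multi-scale object detection.
    detections = classify_objects(candidates)  # 5. Classify by size, shape, etc.
    write_results(detections)                  # 6. Persist detected-object data.
    return detections
```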

The basic statistics we need for each detected object (see the record sketch after this list) are its
- Location (Lat/Long or other coordinate system)
- Time (timestamp of image in which the object is detected)
- Size
- Other classification data based on the above
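
As a sketch, these fields map onto a small record type; the field names, types, and units here are illustrative, not a fixed schema:

```python
from dataclasses import dataclass


@dataclass
class DetectedObject:
    lat: float             # Location: latitude (WGS84).
    lon: float             # Location: longitude (WGS84).
    timestamp: str         # Time: acquisition timestamp of the source image.
    length_m: float        # Size: estimated length in meters.
    width_m: float         # Size: estimated width in meters.
    label: str = "vessel"  # Classification derived from the fields above.
```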


## The Learning Problem
Achieving acceptable performance and generalizability requires framing vessel detection as a machine learning problem.

As described by [this survey](https://doi.org/10.1016/j.rse.2017.12.033), the learning workflow is:
1. Mask land using a coastline shapefile (see the sketch after this list)
2. Correct environmental distortions in the images and mask any areas covered by thick clouds
3. Using image processing techniques, identify potential vessels
4. Using a trained discriminator, label candidate vessels as `vessel` or `not vessel`
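
A minimal sketch of step one, assuming rasterio and geopandas and a shapefile whose polygons cover land; paths and names are illustrative:

```python
import geopandas as gpd
import rasterio
import rasterio.mask


def mask_land(image_path, coastline_shp):
    land = gpd.read_file(coastline_shp)  # Polygons covering land.
    with rasterio.open(image_path) as src:
        # Reproject the land polygons into the image's CRS before masking.
        land = land.to_crs(src.crs)
        # invert=True masks the pixels *inside* the land polygons,
        # leaving only open water for the detector.
        masked, transform = rasterio.mask.mask(
            src, land.geometry, invert=True, nodata=0)
    return masked, transform
```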

Steps three and four may be combined into a single discriminator which takes corrected images as input. The decision of whether to combine these steps will be made based on the structure of previous vessel detection algorithms. If it seems possible to train discriminators with acceptable performance that do not require a separate candidate-detection step, then I'll do that.

The logical avenue for development is toward deeper models, but there is likely wisdom in beginning with a shallow model and developing the infrastructure necessary to train it – the training socket. Once this is built, model refinement and, importantly, the development of deeper models may proceed easily and at the same level of abstraction: model design and implementation. By abstracting away the finicky details of image preprocessing, I/O, etc., I can focus on designing models which yield better performance.

### Signal Source

[This dataset](https://www.iuii.ua.es/datasets/masati/) contains images of land and sea with seaborne vessels labeled with bounding boxes. [This dataset](https://www.kaggle.com/c/airbus-ship-detection/overview) contains roughly a quarter-million land and sea images with similar labels, of various resolutions and clearly gathered from several different imaging platforms. The latter lacks the clear categorization which the former sports, but its much larger size makes it more attractive as a first training set.
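
If the Airbus set is used, a practical first step is decoding its labels, which (as I understand the competition format) are run-length-encoded masks over 768×768 images, column-major and 1-indexed. A sketch of a decoder, to be verified against the actual CSV:

```python
import numpy as np


def rle_decode(encoded, shape=(768, 768)):
    # Pairs of (1-indexed start pixel, run length) over a flattened,
    # column-major (Fortran-order) image.
    mask = np.zeros(shape[0] * shape[1], dtype=np.uint8)
    nums = [int(x) for x in encoded.split()]
    for start, length in zip(nums[0::2], nums[1::2]):
        mask[start - 1:start - 1 + length] = 1
    return mask.reshape(shape, order='F')
```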

21 changes: 21 additions & 0 deletions notes/Poisson ǀ Development.md
@@ -0,0 +1,21 @@
---
title: Poisson | Development
created: '2020-06-26T17:02:47.180Z'
modified: '2020-07-04T17:43:16.524Z'
---

# Poisson | Development

At the highest level this project encompasses the design, development, and implementation of a satellite imagery processing pipeline. This pipeline takes images of near-shore open seas and extracts statistics relevant to the study of the behavior of small- to medium-sized vessels and other objects. The focus is on small vessels because they are unlikely to use active reporting systems such as [VMS](https://en.wikipedia.org/wiki/Vessel_monitoring_system) or [AIS](https://en.wikipedia.org/wiki/Automatic_identification_system). Consequently, much of the illegal, unreported, and unregulated ([IUU](https://en.wikipedia.org/wiki/Illegal,_unreported_and_unregulated_fishing)) fishing activity globally is done by [smaller vessels](http://biblioimarpe.imarpe.gob.pe/bitstream/123456789/2328/1/THESIS%20final%20-%20post%20defense.pdf#page=159).

## Architecture
The first milestone is to develop the satellite image processing pipeline, _panoptis_. The inputs to panoptis are raw satellite images (GeoTIFF files). Panoptis processes these images, identifies vessels on the water, and creates a dataset containing vessel locations, sizes (length, width, area, bounding boxes, etc.), and timestamps associated with the time at which each image was captured. Panoptis can be thought of as three separate components strung together sequentially (sketched in code after the list):

1. Image preprocessor
2. Vessel detector (the "model")
3. Data postprocessor
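
A sketch of that composition; each function is a placeholder for the component above, not an implemented API:

```python
def panoptis(raw_geotiffs):
    preprocessed = preprocess(raw_geotiffs)  # 1. Image preprocessor.
    detections = model(preprocessed)         # 2. Vessel detector (the "model").
    return postprocess(detections)           # 3. Data postprocessor.
```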

Components one and three are scaffolding which supports the main development: the model. The model itself is by far the most computationally complex part of poisson because it must identify vessels in raw satellite imagery.

To create and train a model with acceptable performance, we need a training socket, called _stropheus_. This handles the I/O for the model, as well as hyperparameter selection, performance analysis, and report generation (describing the performance of the model).
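
A sketch of the training-socket interface this implies; every name here is illustrative, not stropheus's actual API:

```python
def train_and_report(model, hyperparams, data_dir):
    # All helpers here are hypothetical placeholders for socket duties.
    train_set, val_set = load_datasets(data_dir)   # I/O for the model.
    configure(model, hyperparams)                  # Hyperparameter selection.
    model.fit(train_set.images, train_set.labels)  # Model training.
    metrics = evaluate(model, val_set)             # Performance analysis.
    write_report(metrics, hyperparams)             # Report generation.
    return metrics
```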
Binary file added papers/remotesensing-10-00511.pdf
Binary file not shown.
Binary file added papers/vessel_detect_survey.pdf
Binary file not shown.
99 changes: 99 additions & 0 deletions poisson/panoptis/clip_imgs.py
@@ -0,0 +1,99 @@
import errno
import os

# Create the Cloud Storage JSON API service client.
from googleapiclient.discovery import build
from googleapiclient.http import MediaFileUpload, MediaIoBaseDownload


GOOGLE_APPLICATION_CREDENTIALS = os.getenv('APPLICATION_CREDENTIALS')
BUCKET_NAME = os.getenv('BUCKET_NAME')
GEO_FILTER_PATH = os.getenv('GEO_FILTER_PATH')
PATH_PREFIX = os.getenv('PATH_PREFIX')
ORDER_ID = os.getenv('ORDER_ID')
ITEM_TYPE = os.getenv('ITEM_TYPE')
ITEM_ID_PATH = os.getenv('ITEM_ID_PATH')
DL_IMAGE_PATH = os.getenv('DL_IMAGE_PATH')
BAND_ID = os.getenv('BAND_ID')


def download_img(dl_path, id_num):
    gcs_service = build('storage', 'v1')
    if not os.path.exists(os.path.dirname(dl_path)):
        try:
            os.makedirs(os.path.dirname(dl_path))
        except OSError as exc:  # Guard against race condition
            if exc.errno != errno.EEXIST:
                raise
    with open(dl_path, 'wb') as f:
        # Download the file from the Google Cloud Storage bucket.
        # The object name in the bucket mirrors the local download path.
        request = gcs_service.objects().get_media(bucket=BUCKET_NAME,
                                                  object=dl_path)
        media = MediaIoBaseDownload(f, request)
        print('Downloading image ', id_num, '...')
        print('Download Progress: ')
        done = False
        while not done:
            prog, done = media.next_chunk()
            print(prog.progress())

    print('Image ', id_num, ' downloaded.')
    return dl_path


def clip_img(img, id_num):
    img_cropped = img[:-4] + '_cropped.tif'
    if not os.path.exists(os.path.dirname(img_cropped)):
        try:
            os.makedirs(os.path.dirname(img_cropped))
        except OSError as exc:  # Guard against race condition
            if exc.errno != errno.EEXIST:
                raise
    print('Clipping image ', id_num, '...')
    # Shell out to gdalwarp to crop the GeoTIFF to the AOI cutline.
    cmd = 'gdalwarp -of GTiff -cutline ' + GEO_FILTER_PATH + ' -crop_to_cutline ' \
        + DL_IMAGE_PATH + img + ' ' + DL_IMAGE_PATH + img_cropped
    response = os.system(cmd)
    if response != 0:
        raise RuntimeError('Clip command exited with nonzero status. Status: '
                           + str(response))
    return img_cropped


def upload_img(img_clipped, item_id, ul_path, bucket_name):
    gcs_service = build('storage', 'v1')
    media = MediaFileUpload(img_clipped,
                            mimetype='image/tiff',
                            resumable=True)

    request = gcs_service.objects().insert(bucket=bucket_name,
                                           name=ul_path,
                                           media_body=media)

    print('Uploading image ', item_id, '...')
    response = None
    while response is None:
        # _ is a placeholder for a progress object that we ignore.
        # (Our file is small, so we skip reporting progress.)
        _, response = request.next_chunk()
    print('Upload complete')
    return response


if __name__ == '__main__':
    inpath = PATH_PREFIX + ORDER_ID + '/' + ITEM_TYPE + '/'
    with open(ITEM_ID_PATH) as f:
        item_ids = f.read().splitlines()
    for id_num, item_id in enumerate(item_ids):
        dl_path = inpath + item_id + BAND_ID + '.tif'
        ul_path = PATH_PREFIX + ORDER_ID + '/clipped/' \
            + ITEM_TYPE + '/' + item_id + BAND_ID + '.tif'
        img = download_img(dl_path, id_num)
        img_clipped = clip_img(img, id_num)
        response = upload_img(img_clipped, item_id, ul_path, BUCKET_NAME)
        #print(response)
    print('Done.')
100 changes: 100 additions & 0 deletions poisson/panoptis/download_items.py
@@ -0,0 +1,100 @@
import json
import os
import time

import requests
from requests.auth import HTTPBasicAuth


PLANET_API_KEY = os.getenv('PL_API_KEY')
PLANET_USER = os.getenv('PL_USER')
PLANET_PASSWORD = os.getenv('PL_PASSWORD')
ORDER_NAME = os.getenv('ORDER_NAME')
ITEM_ID_PATH = os.getenv('ITEM_ID_PATH')
PATH_PREFIX = os.getenv('PATH_PREFIX')

GOOGLE_CREDENTIALS = os.getenv('APPLICATION_CREDENTIALS')
BUCKET_NAME = os.getenv('BUCKET_NAME')

orders_url = 'https://api.planet.com/compute/ops/orders/v2'
auth = HTTPBasicAuth(PLANET_API_KEY, '')
headers = {'content-type': 'application/json'}
user = PLANET_USER
password = PLANET_PASSWORD
name = ORDER_NAME
subscription_id = 0
item_type = "PSScene4Band" # Make env var
product_bundle = "analytic"
single_archive = False
archive_filename = "test_01"
bucket = BUCKET_NAME
path_prefix = PATH_PREFIX
email = True


def create_request(user, password, name, subscription_id, item_ids, item_type,
                   product_bundle, single_archive, archive_filename,
                   bucket, credentials, path_prefix, email):
    # Note: user and password are currently unused; requests authenticate
    # with the API key via HTTP basic auth.
    request = {
        "name": name,
        "subscription_id": subscription_id,
        "products": [
            {
                "item_ids": item_ids,
                "item_type": item_type,
                "product_bundle": product_bundle
            }
        ],
        "delivery": {
            "single_archive": single_archive,
            #"archive_filename": archive_filename,
            "google_cloud_storage": {
                "bucket": bucket,
                "credentials": credentials,
                "path_prefix": path_prefix
            }
        },
        "notifications": {
            "email": email
        },
        "order_type": "full"
    }
    return request


def place_order(request, auth):
    response = requests.post(orders_url, data=json.dumps(request),
                             auth=auth, headers=headers)
    print("Response ok? ", response.ok)
    order_id = response.json()['id']
    print("Order id: ", order_id)
    order_url = orders_url + '/' + order_id
    return order_url


def poll_for_success(order_url, auth, num_loops=100):
    print('Order status: ')
    count = 0
    while count < num_loops:
        count += 1
        r = requests.get(order_url, auth=auth)
        response = r.json()
        state = response['state']
        print(state)
        end_states = ['success', 'failed', 'partial']
        if state in end_states:
            break
        time.sleep(10)


if __name__ == '__main__':
    with open(ITEM_ID_PATH) as f:
        item_ids = f.read().splitlines()
    with open(GOOGLE_CREDENTIALS) as f:
        credentials = f.read()
    request = create_request(user, password, name, subscription_id,
                             item_ids, item_type, product_bundle, single_archive,
                             archive_filename, bucket, credentials,
                             path_prefix, email)
    order_url = place_order(request, auth)
    poll_for_success(order_url, auth)
1 change: 1 addition & 0 deletions poisson/panoptis/make_cred_str.sh
@@ -0,0 +1 @@
cat google_creds.json | base64 | tr -d '\n' > google_creds_str.txt