
Commit cd29d25: first commit (0 parents)

40 files changed: +169414 -0 lines

README.md (+98 lines)

This repository contains the code used to generate the environmental data used in the analysis for [[PAPER TITLE HERE]]. We provide both our generated data and our source code in the hope that they will facilitate future analyses. Refer to the methods section for a more detailed explanation of data preparation.

The final data files used in our analysis are located in "final_data/". They are "final_data/ghi_matched_master_cleaned_plus_zcta.tsv" and "final_data/zcta_master_with_pollution.tsv". These files match those used in our analysis to within rounding error.

- "final_data/zcta_master_with_pollution.tsv" contains each ZIP Code Tabulation Area (ZCTA) internal point matched to its nearest-neighbor environmental metric in each category; each ZCTA gets one row. This file was used in our analysis to assign environmental exposures to each patient in our study, since patients could be approximately localized to a ZCTA.

- "final_data/ghi_matched_master_cleaned_plus_zcta.tsv" is used to generate high-resolution maps of environmental variables and risk ratios. In this file, each point of measurement for GHI and DNI has been matched to its nearest neighbor for every other environmental variable. This permits plotting up to the resolution of GHI and DNI, our highest-resolution data.

To generate these data files from scratch, run "./code/sh_run_all.sh".
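
As a quick sanity check, the final tables can be loaded directly. A minimal sketch, assuming pandas is installed; the ID column names ("ZCTA5CE10" and "zcta") are assumptions based on this repo's intermediate files and the note below, and the backtick is this pipeline's missing-value marker:

import pandas as pd

# The pipeline writes TSVs with "`" as the NA marker (see the write.table
# calls in the R scripts). ZCTA codes are read as strings to preserve
# leading zeros. Column names here are assumptions.
zcta = pd.read_csv("final_data/zcta_master_with_pollution.tsv",
                   sep="\t", na_values="`", dtype={"ZCTA5CE10": str})
ghi = pd.read_csv("final_data/ghi_matched_master_cleaned_plus_zcta.tsv",
                  sep="\t", na_values="`", dtype={"zcta": str})
print(zcta.head())
print(ghi.head())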

Notes:
- The zcta column in final_data/ghi_matched_master_cleaned_plus_zcta.tsv refers to the nearest ZCTA internal point, which is not necessarily the ZCTA within which the GHI/DNI latitude-longitude point resides.
- Data generated in this repo matches our analysis data to 5 decimal places.
- An improvement to the mapping code would be to map each environmental variable at its native resolution, rather than at GHI resolution. This would produce crisper maps, because the Voronoi cells would be larger, with straight edges.

Citations for Data Sources:
- ZCTA information (internal point coordinates) obtained from R's tigris package.
- Elevation information from USGS Lidar Explorer: "https://prd-tnm.s3.amazonaws.com/LidarExplorer/index.html#/"
  - Select "DEM", "Show where DEMs exist?", then "more info", and click to download the 1 arc-second data.
- GHI and DNI information from the NSRDB Viewer: "https://maps.nrel.gov/nsrdb-viewer"
  - Select the GOES PSM v3 dropdown, and download "Multi Year PSM Direct Normal Irradiance" and "Multi Year PSM Global Horizontal Irradiance".
- Weather data from NOAA: "https://www.ncei.noaa.gov/pub/data/normals/1981-2010/" (a download sketch in Python follows this list)
  - Our project used the 1981-2010 30-year Climate Normals, but newer data has since become available.
  - Download "allstations.txt" from "https://www.ncei.noaa.gov/pub/data/normals/1981-2010/station-inventories/"
  - Download the following from "https://www.ncei.noaa.gov/pub/data/normals/1981-2010/products/precipitation/":
    - ann-prcp-normal.txt
    - ann-snow-normal.txt
    - djf-prcp-normal.txt
    - djf-snow-normal.txt
    - jja-prcp-normal.txt
    - jja-snow-normal.txt
  - Download the following from "https://www.ncei.noaa.gov/pub/data/normals/1981-2010/products/temperature/":
    - ann-dutr-normal.txt
    - ann-tavg-normal.txt
    - ann-tmax-normal.txt
    - ann-tmin-normal.txt
    - djf-tavg-normal.txt
    - jja-tavg-normal.txt
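
The downloads above can be scripted. A sketch using only the Python standard library, saving into the directory the weather-processing script reads from (raw_data_sources/weather_data/); the products/temperature/ path is assumed from NOAA's normals directory layout:

import os
import urllib.request

# Fetch the NOAA 1981-2010 normals inputs listed above into the directory
# the weather-processing script expects.
BASE = "https://www.ncei.noaa.gov/pub/data/normals/1981-2010/"
OUT_DIR = "raw_data_sources/weather_data/"
os.makedirs(OUT_DIR, exist_ok=True)

targets = ["station-inventories/allstations.txt"]
targets += ["products/precipitation/" + f for f in [
    "ann-prcp-normal.txt", "ann-snow-normal.txt", "djf-prcp-normal.txt",
    "djf-snow-normal.txt", "jja-prcp-normal.txt", "jja-snow-normal.txt"]]
targets += ["products/temperature/" + f for f in [
    "ann-dutr-normal.txt", "ann-tavg-normal.txt", "ann-tmax-normal.txt",
    "ann-tmin-normal.txt", "djf-tavg-normal.txt", "jja-tavg-normal.txt"]]

for path in targets:
    dest = os.path.join(OUT_DIR, os.path.basename(path))
    print(f"downloading {BASE + path} -> {dest}")
    urllib.request.urlretrieve(BASE + path, dest)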
The raw weather data is provided in a fixed-width format that is less intuitive.
The following key to understanding the data format is taken from
https://www1.ncdc.noaa.gov/pub/data/normals/1981-2010/readme.txt

"""
A. FORMAT OF ANNUAL/SEASONAL FILES
(ann-*.txt, djf-*.txt, mam-*.txt, jja-*.txt, son-*.txt)

Each file contains the annual/seasonal values of one parameter at all
qualifying stations. There is one record (line) per station.

The variables in each record include the following:

Variable   Columns   Type
----------------------------
STNID       1-11     Character
VALUE      19-23     Integer
FLAG       24-24     Character
----------------------------

These variables have the following definitions:

STNID is the GHCN-Daily station identification code. See the lists in the
station-inventories directory.
VALUE is the annual/seasonal value.
FLAG is the completeness flag for the annual/seasonal value. See Flags
section below.

E. FORMAT OF STATION INVENTORIES
(*-inventory.txt, allstations.txt)

Each file contains one station per line.

The variables in each record include the following:
------------------------------
Variable    Columns   Type
------------------------------
ID           1-11     Character
LATITUDE    13-20     Real
LONGITUDE   22-30     Real
ELEVATION   32-37     Real
STATE       39-40     Character
NAME        42-71     Character
GSNFLAG     73-75     Character
HCNFLAG     77-79     Character
WMOID       81-85     Character
METHOD*     87-99     Character
------------------------------

UNITS:
hundredths of inches for average monthly/seasonal/annual precipitation,
month-to-date/year-to-date precipitation, and percentiles of precipitation,
e.g., "1" is 0.01" and "1486" is 14.86"

tenths of inches for average monthly/seasonal/annual snowfall,
month-to-date/year-to-date snowfall, and percentiles of snowfall,
e.g., "39" is 3.9"
"""
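
As a worked example of the key above, the following sketch parses one annual-file record, converting the 1-indexed, inclusive column ranges to Python slices (columns 19-23 become [18:23]) and applying the stated units; the sample line is hypothetical:

def parse_value_line(line):
    """Parse one record of an ann-/djf-/mam-/jja-/son- normals file."""
    stnid = line[0:11]        # STNID: columns 1-11
    value = int(line[18:23])  # VALUE: columns 19-23
    flag = line[23:24]        # FLAG:  column 24
    return stnid, value, flag

# Hypothetical record: a station with an annual precipitation normal of
# 1486 hundredths of inches and completeness flag "C".
line = "USW00094728".ljust(18) + " 1486" + "C"
stnid, value, flag = parse_value_line(line)
print(stnid, value / 100.0, flag)  # precipitation: 1486 -> 14.86 inches
# Snowfall files use tenths of inches instead: 39 -> 3.9 inches.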

code/01_pull_ZCTA_INTPT.R (+12 lines)

# Pull ZCTA internal-point coordinates from the Census via tigris and write
# them to a TSV for the downstream matching scripts.
library(rgdal)
library(sf)
library(ggplot2)
library(tigris)
options(tigris_use_cache=T)

zip_df <- zctas()
# Keep only the ZCTA code and its internal-point longitude/latitude.
zcta_intpt = st_drop_geometry(zip_df[,c('ZCTA5CE10','INTPTLON10','INTPTLAT10')])

write.table(x=zcta_intpt,file="processed_data/zcta_intpt.tsv",
            sep="\t",na='`',
            quote=F,row.names=F)

code/02_pull_DNI_latlon.R (+19 lines)

library(sf)
library(ggplot2)

# Read the NSRDB DNI shapefile; each polygon is a grid cell with a 'dni' value.
shapefile = read_sf("raw_data_sources/nsrdb_v3_0_1_1998_2016_dni/nsrdb_v3_0_1_1998_2016_dni.shp")
head(shapefile)

# Use each cell's centroid as the measurement point.
centroid_coords = st_coordinates(st_centroid(shapefile))
longitude = centroid_coords[,1]
latitude = centroid_coords[,2]
head(centroid_coords)

shapefile <- cbind(shapefile,longitude)
shapefile <- cbind(shapefile,latitude)
head(shapefile)

relevant_data = st_drop_geometry(shapefile[,c("dni","longitude","latitude")])
write.table(x=relevant_data,file="processed_data/dni_lonlat.tsv",
            sep="\t",na='`',
            quote=F,row.names=F)

code/03_pull_GHI_latlon.R (+18 lines)

library(sf)
library(ggplot2)

# Read the NSRDB GHI shapefile; each polygon is a grid cell with a 'ghi' value.
shapefile = read_sf("raw_data_sources/nsrdb_v3_0_1_1998_2016_ghi/nsrdb_v3_0_1_1998_2016_ghi.shp")
# Use each cell's centroid as the measurement point.
centroid_coords = st_coordinates(st_centroid(shapefile))
longitude = centroid_coords[,1]
latitude = centroid_coords[,2]
head(shapefile)
head(centroid_coords)

shapefile <- cbind(shapefile,longitude)
shapefile <- cbind(shapefile,latitude)
head(shapefile)

relevant_data = st_drop_geometry(shapefile[,c("ghi","longitude","latitude")])
write.table(x=relevant_data,file="processed_data/ghi_lonlat.tsv",
            sep="\t",na='`',
            quote=F,row.names=F)

code/04_pull_DEM_ul_lr_zmean.R (+11 lines)

library(sf)
library(ggplot2)

# Read the DEM tile data; each record carries its bounding-box corners
# (upper-left and lower-right) and the tile's mean elevation (zmean).
DEM_data=st_read('raw_data_sources/FESM_1.gpkg')
print(head(DEM_data))

ul_lr_zmean = st_drop_geometry(DEM_data[,c('lrlat','lrlon','ullat','ullon','zmean')])

write.table(x=ul_lr_zmean,file="processed_data/DEM_ul_lr_zmean.tsv",
            sep="\t", na='`',
            quote=F,row.names=F)

+50 lines

def append_station_map(station_map, input_txt):
    """Append the metric from input_txt to each station's row in station_map.

    Called once per metric file; the output header below must list the
    metrics in the same order as files_to_process.
    """
    with open(input_txt, 'r') as fin:
        for r in fin:
            station_id = r[0:11]       # STNID: columns 1-11 (see README key)
            value = r[18:23].strip()   # VALUE: columns 19-23
            station_map[station_id].append(value)

directory = 'raw_data_sources/weather_data/'

# Map station_id --> [lon, lat], using the fixed-width station inventory.
station_map = {}
with open(directory + 'allstations.txt', 'r') as fin:
    for r in fin:
        station_id = r[0:11]   # ID:        columns 1-11
        lat = r[12:20]         # LATITUDE:  columns 13-20
        lon = r[21:30]         # LONGITUDE: columns 22-30
        station_map[station_id] = [lon, lat]

files_to_process = [
    'ann-tavg-normal.txt', 'ann-tmax-normal.txt', 'ann-tmin-normal.txt',
    'ann-dutr-normal.txt', 'djf-tavg-normal.txt', 'jja-tavg-normal.txt',
    'ann-prcp-normal.txt', 'ann-snow-normal.txt', 'djf-prcp-normal.txt',
    'djf-snow-normal.txt', 'jja-prcp-normal.txt', 'jja-snow-normal.txt'
]
for fname in files_to_process:
    input_txt = directory + fname
    append_station_map(station_map, input_txt)

# Column names like 'ann_tavg', derived from the first 8 chars of each file name.
files_header = [s[0:8].replace('-', '_') for s in files_to_process]
header = ['longitude', 'latitude'] + files_header
valid_length = len(header)
header_str = '\t'.join(header) + '\n'
print(f'header is {header_str}')

accepted_rows = 0
with open('processed_data/lonlat_weather_metrics.tsv', 'w') as fout:
    fout.write(header_str)
    for stnid, metric_row in station_map.items():
        # Keep only stations that appeared in every metric file.
        if len(metric_row) == valid_length:
            row_str = '\t'.join(metric_row) + '\n'
            fout.write(row_str)
            accepted_rows += 1

# Non-accepted rows are stations missing at least one metric; we believe
# dropping them is not an issue (see the accompanying document for details).
print(f"total stations: {len(station_map)}")
print(f"accepted stations: {accepted_rows}")

code/06_make_zcta_to_zmean.py (+37 lines)

import numpy as np
from sklearn.neighbors import BallTree
import python_utils as pu

def make_DEM_centroids(DEM_list):
    """Collapse each DEM tile's bounding box to its center point."""
    DEM_centroids_list = []
    header = DEM_list.pop(0)
    hd = {col: i for i, col in enumerate(header)}
    new_header = ['longitude', 'latitude', 'zmean']
    DEM_centroids_list.append(new_header)
    for row in DEM_list:
        longitude = (float(row[hd['ullon']]) + float(row[hd['lrlon']])) / 2
        latitude = (float(row[hd['ullat']]) + float(row[hd['lrlat']])) / 2
        DEM_centroids_list.append([longitude, latitude, row[hd['zmean']]])
    return DEM_centroids_list

DEM_zmeans_list = pu.make_list_from_tsv("processed_data/DEM_ul_lr_zmean.tsv")
DEM_centroids_list = make_DEM_centroids(DEM_zmeans_list)
zcta_list = pu.custom_read_zcta_tsv("processed_data/zcta_intpt.tsv")

# Drop the header rows before building the coordinate arrays.
_ = zcta_list.pop(0)
_ = DEM_centroids_list.pop(0)
# BallTree's haversine metric expects (latitude, longitude) in radians,
# so order the columns as (lat, lon) for both point sets.
zcta_array = np.deg2rad(np.array([[row[2], row[1]] for row in zcta_list]))
DEM_array = np.deg2rad(np.array([[row[1], row[0]] for row in DEM_centroids_list]))

# For each ZCTA internal point, find the nearest DEM tile centroid.
tree = BallTree(DEM_array, metric='haversine')
distances, indices = tree.query(zcta_array, k=1)
indices = list(np.ravel(indices))

zcta_zmean_list = [[row[0], DEM_centroids_list[indices[i]][-1]] for i, row in enumerate(zcta_list)]

with open("processed_data/zcta_zmean.tsv", 'w') as fout:
    fout.write("ZCTA\tzmean\n")
    for row in zcta_zmean_list:
        fout.write(f"{row[0]}\t{row[1]}\n")

code/07_make_zcta_ghi_zcta_dni.py (+25 lines)

import python_utils as pu

GHI_list = pu.make_list_from_tsv("processed_data/ghi_lonlat.tsv")
DNI_list = pu.make_list_from_tsv("processed_data/dni_lonlat.tsv")
zcta_list = pu.custom_read_zcta_tsv("processed_data/zcta_intpt.tsv")

# Match each ZCTA internal point to its nearest GHI measurement point.
# In ghi_lonlat.tsv the columns are (ghi, longitude, latitude).
zcta_ghi_list = pu.match_centroids(zcta_list,
                                   GHI_list,
                                   col_pos_dict={'metric': [0], 'lon': 1, 'lat': 2})
with open("processed_data/zcta_ghi.tsv", 'w') as fout:
    fout.write("ZCTA\tghi\n")
    for row in zcta_ghi_list:
        fout.write(f"{row[0]}\t{row[1]}\n")

# match_centroids consumes the header row of zcta_list, so put a filler
# header back before matching a second time against DNI.
zcta_list.insert(0, 'replacement_header_filler')
zcta_dni_list = pu.match_centroids(zcta_list,
                                   DNI_list,
                                   col_pos_dict={'metric': [0], 'lon': 1, 'lat': 2})
with open("processed_data/zcta_dni.tsv", 'w') as fout:
    fout.write("ZCTA\tdni\n")
    for row in zcta_dni_list:
        fout.write(f"{row[0]}\t{row[1]}\n")

+24 lines

import python_utils as pu

weather_list = pu.make_list_from_tsv("processed_data/lonlat_weather_metrics.tsv")
zcta_list = pu.custom_read_zcta_tsv("processed_data/zcta_intpt.tsv")

# In lonlat_weather_metrics.tsv the first two columns are coordinates and
# every remaining column is a weather metric to carry over.
weather_header = weather_list[0]
col_pos_dict = {'lon': 0,
                'lat': 1}
metric_inds = list(range(len(weather_list[0])))
metric_inds.remove(col_pos_dict['lon'])
metric_inds.remove(col_pos_dict['lat'])
col_pos_dict['metric'] = metric_inds

# Match each ZCTA internal point to its nearest weather station.
zcta_weather_list = pu.match_centroids(zcta_list, weather_list, col_pos_dict)

with open("processed_data/zcta_weather.tsv", 'w') as fout:
    header_str = "ZCTA\t" + '\t'.join(weather_header[2:]) + "\n"
    fout.write(header_str)
    for row in zcta_weather_list:
        # NOAA normals use the special value -7777 for totals that are
        # non-zero but would round to zero; treat those as 0.0.
        row = [0.0 if el == -7777 else el for el in row]
        row_str = "\t".join([str(el) for el in row]) + "\n"
        fout.write(row_str)

code/09_make_ZCTA_master_info.R (+60 lines)

library(rgdal)
library(sf)
library(ggplot2)
library(tigris)
library(plyr)
options(tigris_use_cache=T)

# Return the metric column names of a processed TSV (everything but the ID column).
get_col_names = function(table_file){
  table = read.table(file=table_file,
                     header=T, sep='\t',
                     na.strings='`',
                     stringsAsFactors = F)
  col_names = colnames(table)[-1]
  print('column_names = ')
  print(col_names)
  return(col_names)
}

# Left-join one processed table onto the ZCTA master by ZCTA code. The ID
# column is read as character (to preserve leading zeros) and all remaining
# columns as numeric; the column count is taken from the header line.
merge_tables = function(other_table_file,zcta_master){
  other_table = read.table(file=other_table_file,
                           colClasses = c(
                             'character',
                             rep('numeric',count.fields(textConnection(readLines(other_table_file,n=1)),sep='\t')-1)
                           ),
                           header=T, sep='\t',
                           na.strings='`',
                           stringsAsFactors = F)

  other_table <- rename(other_table,c("ZCTA"="ZCTA5CE10"))
  head(other_table)
  zcta_plus_other = merge(zcta_master,other_table,by="ZCTA5CE10",all.x=T)
  print(head(zcta_plus_other))

  return(zcta_plus_other)
}

zcta_master <- zctas()
head(zcta_master)

files_to_process = c("processed_data/zcta_zmean.tsv","processed_data/zcta_ghi.tsv","processed_data/zcta_dni.tsv","processed_data/zcta_weather.tsv")
column_list = c()
for (file in files_to_process) {
  zcta_master = merge_tables(file,zcta_master)
  column_list = c(column_list,get_col_names(file))
}

data_out = st_drop_geometry(zcta_master[,c('ZCTA5CE10','INTPTLON10','INTPTLAT10',column_list)])
print("The head of the data out is:")
print(head(data_out))
write.table(x=data_out,file="processed_data/zcta_master_info.tsv",
            sep="\t",na='`',
            quote=F,row.names=F)

# print("making dni by zip")
# ggplot(data=zcta_master)+
#   geom_sf(data=zcta_master,aes(fill=dni),size=0.01)
# ggsave("processed_data/dni_by_zip.pdf")
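
The merge loop above is a sequence of left joins on the ZCTA code. An equivalent sketch in pandas, reading the intermediate zcta_intpt.tsv instead of calling tigris; file and column names are taken from this repo's scripts:

import pandas as pd

def read_tsv(path, id_col):
    # "`" is this pipeline's NA marker; ZCTA codes stay strings to keep
    # leading zeros.
    return pd.read_csv(path, sep="\t", na_values="`", dtype={id_col: str})

master = read_tsv("processed_data/zcta_intpt.tsv", "ZCTA5CE10")
for path in ["processed_data/zcta_zmean.tsv",
             "processed_data/zcta_ghi.tsv",
             "processed_data/zcta_dni.tsv",
             "processed_data/zcta_weather.tsv"]:
    other = read_tsv(path, "ZCTA").rename(columns={"ZCTA": "ZCTA5CE10"})
    master = master.merge(other, on="ZCTA5CE10", how="left")

master.to_csv("processed_data/zcta_master_info.tsv",
              sep="\t", na_rep="`", index=False)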
4.65 KB binary file not shown.
