
Commit cd29d25: first commit (0 parents)

40 files changed: +169414 -0 lines

README.md (+98 lines)

This repository contains the code used to generate the environmental data used in the analysis for [[PAPER TITLE HERE]]. We provide both our generated data and our source code in the hope that they will facilitate future analyses. Refer to the methods section for a more detailed explanation of data preparation.

The final data files used in our analysis are located in "final_data/". They are "final_data/ghi_matched_master_cleaned_plus_zcta.tsv" and "final_data/zcta_master_with_pollution.tsv". These files match those used in our analysis to within rounding error.

- "final_data/zcta_master_with_pollution.tsv" contains each ZIP Code Tabulation Area (ZCTA) internal point matched to its nearest-neighbor environmental metric in each category; each ZCTA gets one row. This file was used in our analysis to assign environmental exposures to each patient in our study, since patients could be approximately localized to a ZCTA.

- "final_data/ghi_matched_master_cleaned_plus_zcta.tsv" is used to generate high-resolution maps of environmental variables and risk ratios. In this file, each point of measurement for GHI and DNI has been matched to its nearest neighbor for every other environmental variable. This permits plotting up to the resolution of GHI and DNI, our highest-resolution data.

To generate these data files from scratch, run "./code/sh_run_all.sh".
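
As a quick sanity check, the final tables can be loaded directly. A minimal sketch, assuming pandas is installed; the ID column names ("ZCTA5CE10" and "zcta") are assumptions based on this repo's intermediate files and the note below, and the backtick is this pipeline's missing-value marker:

import pandas as pd

# The pipeline writes TSVs with "`" as the NA marker (see the write.table
# calls in the R scripts). ZCTA codes are read as strings to preserve
# leading zeros. Column names here are assumptions.
zcta = pd.read_csv("final_data/zcta_master_with_pollution.tsv",
                   sep="\t", na_values="`", dtype={"ZCTA5CE10": str})
ghi = pd.read_csv("final_data/ghi_matched_master_cleaned_plus_zcta.tsv",
                  sep="\t", na_values="`", dtype={"zcta": str})
print(zcta.head())
print(ghi.head())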

Notes:
- The zcta column in final_data/ghi_matched_master_cleaned_plus_zcta.tsv refers to the nearest ZCTA internal point, which is not necessarily the ZCTA within which the GHI/DNI latitude-longitude point resides.
- Data generated in this repo matches our analysis data to 5 decimal places.
- An improvement to the mapping code would be to map each environmental variable at its native resolution, rather than at GHI resolution. This would produce crisper maps, because the Voronoi cells would be larger, with straight edges.

Citations for Data Sources:
- ZCTA information (internal point coordinates) obtained from R's tigris package.
- Elevation information from USGS Lidar Explorer: "https://prd-tnm.s3.amazonaws.com/LidarExplorer/index.html#/"
  - Select "DEM", "Show where DEMs exist?", then "more info", and click to download the 1 arc-second data.
- GHI and DNI information from the NSRDB Viewer: "https://maps.nrel.gov/nsrdb-viewer"
  - Select the GOES PSM v3 dropdown, and download "Multi Year PSM Direct Normal Irradiance" and "Multi Year PSM Global Horizontal Irradiance".
- Weather data from NOAA: "https://www.ncei.noaa.gov/pub/data/normals/1981-2010/" (a download sketch in Python follows this list)
  - Our project used the 1981-2010 30-year Climate Normals, but newer data has since become available.
  - Download "allstations.txt" from "https://www.ncei.noaa.gov/pub/data/normals/1981-2010/station-inventories/"
  - Download the following from "https://www.ncei.noaa.gov/pub/data/normals/1981-2010/products/precipitation/":
    - ann-prcp-normal.txt
    - ann-snow-normal.txt
    - djf-prcp-normal.txt
    - djf-snow-normal.txt
    - jja-prcp-normal.txt
    - jja-snow-normal.txt
  - Download the following from "https://www.ncei.noaa.gov/pub/data/normals/1981-2010/products/temperature/":
    - ann-dutr-normal.txt
    - ann-tavg-normal.txt
    - ann-tmax-normal.txt
    - ann-tmin-normal.txt
    - djf-tavg-normal.txt
    - jja-tavg-normal.txt
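
The downloads above can be scripted. A sketch using only the Python standard library, saving into the directory the weather-processing script reads from (raw_data_sources/weather_data/); the products/temperature/ path is assumed from NOAA's normals directory layout:

import os
import urllib.request

# Fetch the NOAA 1981-2010 normals inputs listed above into the directory
# the weather-processing script expects.
BASE = "https://www.ncei.noaa.gov/pub/data/normals/1981-2010/"
OUT_DIR = "raw_data_sources/weather_data/"
os.makedirs(OUT_DIR, exist_ok=True)

targets = ["station-inventories/allstations.txt"]
targets += ["products/precipitation/" + f for f in [
    "ann-prcp-normal.txt", "ann-snow-normal.txt", "djf-prcp-normal.txt",
    "djf-snow-normal.txt", "jja-prcp-normal.txt", "jja-snow-normal.txt"]]
targets += ["products/temperature/" + f for f in [
    "ann-dutr-normal.txt", "ann-tavg-normal.txt", "ann-tmax-normal.txt",
    "ann-tmin-normal.txt", "djf-tavg-normal.txt", "jja-tavg-normal.txt"]]

for path in targets:
    dest = os.path.join(OUT_DIR, os.path.basename(path))
    print(f"downloading {BASE + path} -> {dest}")
    urllib.request.urlretrieve(BASE + path, dest)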
The raw weather data is provided in a fixed-width format that is less intuitive.
The following key to understanding the data format is taken from
https://www1.ncdc.noaa.gov/pub/data/normals/1981-2010/readme.txt

"""
A. FORMAT OF ANNUAL/SEASONAL FILES
(ann-*.txt, djf-*.txt, mam-*.txt, jja-*.txt, son-*.txt)

Each file contains the annual/seasonal values of one parameter at all
qualifying stations. There is one record (line) per station.

The variables in each record include the following:

Variable   Columns   Type
----------------------------
STNID       1-11     Character
VALUE      19-23     Integer
FLAG       24-24     Character
----------------------------

These variables have the following definitions:

STNID is the GHCN-Daily station identification code. See the lists in the
station-inventories directory.
VALUE is the annual/seasonal value.
FLAG is the completeness flag for the annual/seasonal value. See Flags
section below.

E. FORMAT OF STATION INVENTORIES
(*-inventory.txt, allstations.txt)

Each file contains one station per line.

The variables in each record include the following:
------------------------------
Variable    Columns   Type
------------------------------
ID           1-11     Character
LATITUDE    13-20     Real
LONGITUDE   22-30     Real
ELEVATION   32-37     Real
STATE       39-40     Character
NAME        42-71     Character
GSNFLAG     73-75     Character
HCNFLAG     77-79     Character
WMOID       81-85     Character
METHOD*     87-99     Character
------------------------------

UNITS:
hundredths of inches for average monthly/seasonal/annual precipitation,
month-to-date/year-to-date precipitation, and percentiles of precipitation,
e.g., "1" is 0.01" and "1486" is 14.86"

tenths of inches for average monthly/seasonal/annual snowfall,
month-to-date/year-to-date snowfall, and percentiles of snowfall,
e.g., "39" is 3.9"
"""
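
As a worked example of the key above, the following sketch parses one annual-file record, converting the 1-indexed, inclusive column ranges to Python slices (columns 19-23 become [18:23]) and applying the stated units; the sample line is hypothetical:

def parse_value_line(line):
    """Parse one record of an ann-/djf-/mam-/jja-/son- normals file."""
    stnid = line[0:11]        # STNID: columns 1-11
    value = int(line[18:23])  # VALUE: columns 19-23
    flag = line[23:24]        # FLAG:  column 24
    return stnid, value, flag

# Hypothetical record: a station with an annual precipitation normal of
# 1486 hundredths of inches and completeness flag "C".
line = "USW00094728".ljust(18) + " 1486" + "C"
stnid, value, flag = parse_value_line(line)
print(stnid, value / 100.0, flag)  # precipitation: 1486 -> 14.86 inches
# Snowfall files use tenths of inches instead: 39 -> 3.9 inches.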

code/01_pull_ZCTA_INTPT.R (+12 lines)

# Pull ZCTA internal-point coordinates from the Census via tigris and write
# them to a TSV for the downstream matching scripts.
library(rgdal)
library(sf)
library(ggplot2)
library(tigris)
options(tigris_use_cache=T)

zip_df <- zctas()
# Keep only the ZCTA code and its internal-point longitude/latitude.
zcta_intpt = st_drop_geometry(zip_df[,c('ZCTA5CE10','INTPTLON10','INTPTLAT10')])

write.table(x=zcta_intpt,file="processed_data/zcta_intpt.tsv",
            sep="\t",na='`',
            quote=F,row.names=F)

code/02_pull_DNI_latlon.R (+19 lines)

library(sf)
library(ggplot2)

# Read the NSRDB DNI shapefile; each polygon is a grid cell with a 'dni' value.
shapefile = read_sf("raw_data_sources/nsrdb_v3_0_1_1998_2016_dni/nsrdb_v3_0_1_1998_2016_dni.shp")
head(shapefile)

# Use each cell's centroid as the measurement point.
centroid_coords = st_coordinates(st_centroid(shapefile))
longitude = centroid_coords[,1]
latitude = centroid_coords[,2]
head(centroid_coords)

shapefile <- cbind(shapefile,longitude)
shapefile <- cbind(shapefile,latitude)
head(shapefile)

relevant_data = st_drop_geometry(shapefile[,c("dni","longitude","latitude")])
write.table(x=relevant_data,file="processed_data/dni_lonlat.tsv",
            sep="\t",na='`',
            quote=F,row.names=F)

code/03_pull_GHI_latlon.R (+18 lines)

library(sf)
library(ggplot2)

# Read the NSRDB GHI shapefile; each polygon is a grid cell with a 'ghi' value.
shapefile = read_sf("raw_data_sources/nsrdb_v3_0_1_1998_2016_ghi/nsrdb_v3_0_1_1998_2016_ghi.shp")
# Use each cell's centroid as the measurement point.
centroid_coords = st_coordinates(st_centroid(shapefile))
longitude = centroid_coords[,1]
latitude = centroid_coords[,2]
head(shapefile)
head(centroid_coords)

shapefile <- cbind(shapefile,longitude)
shapefile <- cbind(shapefile,latitude)
head(shapefile)

relevant_data = st_drop_geometry(shapefile[,c("ghi","longitude","latitude")])
write.table(x=relevant_data,file="processed_data/ghi_lonlat.tsv",
            sep="\t",na='`',
            quote=F,row.names=F)

code/04_pull_DEM_ul_lr_zmean.R (+11 lines)

library(sf)
library(ggplot2)

# Read the DEM tile data; each record carries its bounding-box corners
# (upper-left and lower-right) and the tile's mean elevation (zmean).
DEM_data=st_read('raw_data_sources/FESM_1.gpkg')
print(head(DEM_data))

ul_lr_zmean = st_drop_geometry(DEM_data[,c('lrlat','lrlon','ullat','ullon','zmean')])

write.table(x=ul_lr_zmean,file="processed_data/DEM_ul_lr_zmean.tsv",
            sep="\t", na='`',
            quote=F,row.names=F)

+50 lines

def append_station_map(station_map, input_txt):
    """Append the metric from input_txt to each station's row in station_map.

    Called once per metric file; the output header below must list the
    metrics in the same order as files_to_process.
    """
    with open(input_txt, 'r') as fin:
        for r in fin:
            station_id = r[0:11]       # STNID: columns 1-11 (see README key)
            value = r[18:23].strip()   # VALUE: columns 19-23
            station_map[station_id].append(value)

directory = 'raw_data_sources/weather_data/'

# Map station_id --> [lon, lat], using the fixed-width station inventory.
station_map = {}
with open(directory + 'allstations.txt', 'r') as fin:
    for r in fin:
        station_id = r[0:11]   # ID:        columns 1-11
        lat = r[12:20]         # LATITUDE:  columns 13-20
        lon = r[21:30]         # LONGITUDE: columns 22-30
        station_map[station_id] = [lon, lat]

files_to_process = [
    'ann-tavg-normal.txt', 'ann-tmax-normal.txt', 'ann-tmin-normal.txt',
    'ann-dutr-normal.txt', 'djf-tavg-normal.txt', 'jja-tavg-normal.txt',
    'ann-prcp-normal.txt', 'ann-snow-normal.txt', 'djf-prcp-normal.txt',
    'djf-snow-normal.txt', 'jja-prcp-normal.txt', 'jja-snow-normal.txt'
]
for fname in files_to_process:
    input_txt = directory + fname
    append_station_map(station_map, input_txt)

# Column names like 'ann_tavg', derived from the first 8 chars of each file name.
files_header = [s[0:8].replace('-', '_') for s in files_to_process]
header = ['longitude', 'latitude'] + files_header
valid_length = len(header)
header_str = '\t'.join(header) + '\n'
print(f'header is {header_str}')

accepted_rows = 0
with open('processed_data/lonlat_weather_metrics.tsv', 'w') as fout:
    fout.write(header_str)
    for stnid, metric_row in station_map.items():
        # Keep only stations that appeared in every metric file.
        if len(metric_row) == valid_length:
            row_str = '\t'.join(metric_row) + '\n'
            fout.write(row_str)
            accepted_rows += 1

# Non-accepted rows are stations missing at least one metric; we believe
# dropping them is not an issue (see the accompanying document for details).
print(f"total stations: {len(station_map)}")
print(f"accepted stations: {accepted_rows}")

code/06_make_zcta_to_zmean.py (+37 lines)

import numpy as np
from sklearn.neighbors import BallTree
import python_utils as pu

def make_DEM_centroids(DEM_list):
    """Collapse each DEM tile's bounding box to its center point."""
    DEM_centroids_list = []
    header = DEM_list.pop(0)
    hd = {col: i for i, col in enumerate(header)}
    new_header = ['longitude', 'latitude', 'zmean']
    DEM_centroids_list.append(new_header)
    for row in DEM_list:
        longitude = (float(row[hd['ullon']]) + float(row[hd['lrlon']])) / 2
        latitude = (float(row[hd['ullat']]) + float(row[hd['lrlat']])) / 2
        DEM_centroids_list.append([longitude, latitude, row[hd['zmean']]])
    return DEM_centroids_list

DEM_zmeans_list = pu.make_list_from_tsv("processed_data/DEM_ul_lr_zmean.tsv")
DEM_centroids_list = make_DEM_centroids(DEM_zmeans_list)
zcta_list = pu.custom_read_zcta_tsv("processed_data/zcta_intpt.tsv")

# Drop the header rows before building the coordinate arrays.
_ = zcta_list.pop(0)
_ = DEM_centroids_list.pop(0)
# BallTree's haversine metric expects (latitude, longitude) in radians,
# so order the columns as (lat, lon) for both point sets.
zcta_array = np.deg2rad(np.array([[row[2], row[1]] for row in zcta_list]))
DEM_array = np.deg2rad(np.array([[row[1], row[0]] for row in DEM_centroids_list]))

# For each ZCTA internal point, find the nearest DEM tile centroid.
tree = BallTree(DEM_array, metric='haversine')
distances, indices = tree.query(zcta_array, k=1)
indices = list(np.ravel(indices))

zcta_zmean_list = [[row[0], DEM_centroids_list[indices[i]][-1]] for i, row in enumerate(zcta_list)]

with open("processed_data/zcta_zmean.tsv", 'w') as fout:
    fout.write("ZCTA\tzmean\n")
    for row in zcta_zmean_list:
        fout.write(f"{row[0]}\t{row[1]}\n")

code/07_make_zcta_ghi_zcta_dni.py (+25 lines)

import python_utils as pu

GHI_list = pu.make_list_from_tsv("processed_data/ghi_lonlat.tsv")
DNI_list = pu.make_list_from_tsv("processed_data/dni_lonlat.tsv")
zcta_list = pu.custom_read_zcta_tsv("processed_data/zcta_intpt.tsv")

# Match each ZCTA internal point to its nearest GHI measurement point.
# In ghi_lonlat.tsv the columns are (ghi, longitude, latitude).
zcta_ghi_list = pu.match_centroids(zcta_list,
                                   GHI_list,
                                   col_pos_dict={'metric': [0], 'lon': 1, 'lat': 2})
with open("processed_data/zcta_ghi.tsv", 'w') as fout:
    fout.write("ZCTA\tghi\n")
    for row in zcta_ghi_list:
        fout.write(f"{row[0]}\t{row[1]}\n")

# match_centroids consumes the header row of zcta_list, so put a filler
# header back before matching a second time against DNI.
zcta_list.insert(0, 'replacement_header_filler')
zcta_dni_list = pu.match_centroids(zcta_list,
                                   DNI_list,
                                   col_pos_dict={'metric': [0], 'lon': 1, 'lat': 2})
with open("processed_data/zcta_dni.tsv", 'w') as fout:
    fout.write("ZCTA\tdni\n")
    for row in zcta_dni_list:
        fout.write(f"{row[0]}\t{row[1]}\n")

+24 lines

import python_utils as pu

weather_list = pu.make_list_from_tsv("processed_data/lonlat_weather_metrics.tsv")
zcta_list = pu.custom_read_zcta_tsv("processed_data/zcta_intpt.tsv")

# In lonlat_weather_metrics.tsv the first two columns are coordinates and
# every remaining column is a weather metric to carry over.
weather_header = weather_list[0]
col_pos_dict = {'lon': 0,
                'lat': 1}
metric_inds = list(range(len(weather_list[0])))
metric_inds.remove(col_pos_dict['lon'])
metric_inds.remove(col_pos_dict['lat'])
col_pos_dict['metric'] = metric_inds

# Match each ZCTA internal point to its nearest weather station.
zcta_weather_list = pu.match_centroids(zcta_list, weather_list, col_pos_dict)

with open("processed_data/zcta_weather.tsv", 'w') as fout:
    header_str = "ZCTA\t" + '\t'.join(weather_header[2:]) + "\n"
    fout.write(header_str)
    for row in zcta_weather_list:
        # NOAA normals use the special value -7777 for totals that are
        # non-zero but would round to zero; treat those as 0.0.
        row = [0.0 if el == -7777 else el for el in row]
        row_str = "\t".join([str(el) for el in row]) + "\n"
        fout.write(row_str)

code/09_make_ZCTA_master_info.R (+60 lines)

library(rgdal)
library(sf)
library(ggplot2)
library(tigris)
library(plyr)
options(tigris_use_cache=T)

# Return the metric column names of a processed TSV (everything but the ID column).
get_col_names = function(table_file){
  table = read.table(file=table_file,
                     header=T, sep='\t',
                     na.strings='`',
                     stringsAsFactors = F)
  col_names = colnames(table)[-1]
  print('column_names = ')
  print(col_names)
  return(col_names)
}

# Left-join one processed table onto the ZCTA master by ZCTA code. The ID
# column is read as character (to preserve leading zeros) and all remaining
# columns as numeric; the column count is taken from the header line.
merge_tables = function(other_table_file,zcta_master){
  other_table = read.table(file=other_table_file,
                           colClasses = c(
                             'character',
                             rep('numeric',count.fields(textConnection(readLines(other_table_file,n=1)),sep='\t')-1)
                           ),
                           header=T, sep='\t',
                           na.strings='`',
                           stringsAsFactors = F)

  other_table <- rename(other_table,c("ZCTA"="ZCTA5CE10"))
  head(other_table)
  zcta_plus_other = merge(zcta_master,other_table,by="ZCTA5CE10",all.x=T)
  print(head(zcta_plus_other))

  return(zcta_plus_other)
}

zcta_master <- zctas()
head(zcta_master)

files_to_process = c("processed_data/zcta_zmean.tsv","processed_data/zcta_ghi.tsv","processed_data/zcta_dni.tsv","processed_data/zcta_weather.tsv")
column_list = c()
for (file in files_to_process) {
  zcta_master = merge_tables(file,zcta_master)
  column_list = c(column_list,get_col_names(file))
}

data_out = st_drop_geometry(zcta_master[,c('ZCTA5CE10','INTPTLON10','INTPTLAT10',column_list)])
print("The head of the data out is:")
print(head(data_out))
write.table(x=data_out,file="processed_data/zcta_master_info.tsv",
            sep="\t",na='`',
            quote=F,row.names=F)

# print("making dni by zip")
# ggplot(data=zcta_master)+
#   geom_sf(data=zcta_master,aes(fill=dni),size=0.01)
# ggsave("processed_data/dni_by_zip.pdf")
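
The merge loop above is a sequence of left joins on the ZCTA code. An equivalent sketch in pandas, reading the intermediate zcta_intpt.tsv instead of calling tigris; file and column names are taken from this repo's scripts:

import pandas as pd

def read_tsv(path, id_col):
    # "`" is this pipeline's NA marker; ZCTA codes stay strings to keep
    # leading zeros.
    return pd.read_csv(path, sep="\t", na_values="`", dtype={id_col: str})

master = read_tsv("processed_data/zcta_intpt.tsv", "ZCTA5CE10")
for path in ["processed_data/zcta_zmean.tsv",
             "processed_data/zcta_ghi.tsv",
             "processed_data/zcta_dni.tsv",
             "processed_data/zcta_weather.tsv"]:
    other = read_tsv(path, "ZCTA").rename(columns={"ZCTA": "ZCTA5CE10"})
    master = master.merge(other, on="ZCTA5CE10", how="left")

master.to_csv("processed_data/zcta_master_info.tsv",
              sep="\t", na_rep="`", index=False)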
4.65 KB binary file not shown.
