-
Notifications
You must be signed in to change notification settings - Fork 16
/
Copy path09_NetCDF_ReadWrite.Rmd
384 lines (286 loc) · 13.7 KB
/
09_NetCDF_ReadWrite.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
---
title: "Read and Write Data from a NetCDF"
subtitle: "Using NetCDF and Raster Functions"
author: "Gal Koss and Jude Bayham"
output: html_document
editor_options:
chunk_output_type: console
---
```{r setup, include=FALSE}
library(tidyverse)
library(lubridate)
library(raster)
library(sf)
library(knitr)
library(mapview)
library(ggmap)
library(ggthemes)
library(kableExtra)
library(data.table)
library(ncdf4)
library(stars)
knitr::opts_chunk$set(echo = T,
results = 'hide',
message = F,
warning = F,
fig.align = "center")
```
<br>
<br>
## Before You Start
<br>
Gridded data is often stored in a [netCDFs](https://pjbartlein.github.io/REarthSysSci/raster_intro.html) format – a multidimensional file format that resembles a raster brick. A netCDF file contains data with a specific structure: a two-dimensional spatial grid (e.g., longitude and latitude) and a third dimension which is usually date or time. This structure is convenient for weather data measured on a consistent grid over time. One such dataset is called [gridMET](http://www.climatologylab.org/gridmet.html) which maintains a gridded dataset of weather variables at 4km resolution.
<br>
<br>
This section describes two methods for extracting data from a netCDF, using gridMET data as an example. We first will use the [`raster`](https://cran.r-project.org/web/packages/raster/raster.pdf) library, treating the netCDF as a raster object. Then, we will use the [`ncdf4`](https://pjbartlein.github.io/REarthSysSci/netCDF.html) library to access the data with its netCDF-specific functions.
<br>
<br>
<br>
## Reading NetCDF with Raster
<br>
A netCDF file is both spatially and temporally explicit, and can be handled similarly to a raster object. To start, we download from a gridmet API using the `downloader` package. gridMET data is also available in the [Google Earth Engine Data Catalog](https://developers.google.com/earth-engine/datasets/), which can be accessed with the R library [`rgee`](https://github.com/r-spatial/rgee) We set the destination file name (what to call the file and where we want to it to be), and the mode to `wb` for a binary download. Then we use `raster::brick()` to ingest the raster file.
```{r eval = T}
#--- download gridMET precipitation 2018 ---#
downloader::download(
url = str_c("http://www.northwestknowledge.net/metdata/data/pr_2018.nc"),
destfile = "pr_2018.nc",
mode = 'wb'
)
#--- ingest the file ---#
brick <- raster::brick("pr_2018.nc") # can ignore warning
brick
```
```{r echo = FALSE, results = "show"}
brick
```
We see that this is a 3 dimensional object of class `RasterBrick`, with 365 “slices” (nlayers) in the brick. Each slice has 810,810 grid cells, of dimensions 585x1386. The resolution is given in the same units as the crs (WGS84), in this case decimal degrees. Each cell has width and height of approximately 1/24$^{th}$ of a degree, or 4 km. The number of layers corresponds to the days in 2018, which are given as the days since January 1$^{st}$, 1900. This is also the convention used in the naming of each slice.
<br>
<br>
<br>
### Extracting Data
<br>
To access the values, we can either index the objects, or we can use `raster::values`. Here we use the empty index values to take the entirety of the brick. If we knew the cell's index values (for instance, the 100$^{th}$ row and cell), we could extract the data using that index: `brick[100, 100]`.
```{r eval = T}
#--- indexing the brick, all ---#
vals_brick1 <- brick[,]
#--- using the values function ---#
vals_brick2 <- raster::values(brick)
identical(vals_brick1, vals_brick2)
```
```{r echo=FALSE, results="show"}
identical(vals_brick1, vals_brick2)
```
The result is a matrix with 810,810 rows and 365 columns: we have a row for each grid cell, and a column for each day. With this information, we can turn the matrix into a more usable data frame.
```{r }
#--- check the dimension ---#
dim(vals_brick2)
#--- take a look ---#
vals_brick2[1:10, 1:10]
```
As can be seen above, column names are the days since January 1$^{st}$, 1990 with prefix "X."
First, using `base::names` we can extract the dates. Then we remove the leading "X", convert from character to numeric and then make it a date using the origin from the raster.
Then, using `raster::coordinates` we get the latitude and longitude of each cell as a matrix we can index.
We can construct a data.frame by renaming the columns as dates, creating new columns with the latitude and longitude coordinates.
```{r eval = F}
#--- get the dates ---#
dates <-
as.Date(
as.numeric(str_remove(names(brick), "X")),
origin = "1900-01-01"
)
#--- get the coordinates ---#
coords <- raster::coordinates(brick)
#--- value dataframe ---#
df <-
vals_brick2 %>%
as.data.frame() %>%
rename_all( ~ str_c(dates)) %>%
mutate(
lon = coords[,1],
lat = coords[,2]
)
#--- value tidy ---#
df_tidy <-
df %>%
pivot_longer(
-c(lon, lat),
names_to = "date",
values_to = "value"
) %>%
mutate(date = ymd(date))
```
```{r }
dt <-
df %>%
data.table() %>%
melt(id.var = c("lon", "lat")) %>%
setnames("variable", "date")
```
---
Note, that we can also use `terra::rast` or `stars::read_stars` to ingest the netCDF as a raster, with similar results.
```{r eval = F}
slice <- terra::rast("pr_2018.nc")
tic()
temp <- terra::values(slice) %>% data.table()
toc()
slice <- stars::read_stars("pr_2018.nc")
```
<br>
<br>
<br>
### Subsetting Days
<br>
We can access an individual slice either by calling the named layer of the brick, or by indexing the brick.
```{r eval = T}
#--- calling the named layer ---#
slice1 <- brick$X43099
#--- indexing the brick ---#
slice2 <- brick[[1]]
identical(slice1,slice2)
```
```{r echo=FALSE, results="show"}
identical(slice1,slice2)
```
We can even use the indexing when reading in the raster brick, to limit what we read into memory.
```{r eval = F}
#--- indexing as we read in ---#
slice3 <- raster::brick("pr_2018.nc")[[1]]
```
The resulting slice -- whichever way you slice it -- is a single RasterLayer.
<br>
<br>
<br>
### Subsetting Cells
<br>
In addition to accessing individual slices of the brick, we can also pull a column out of it, taking all days' observations for a group of cells. For instance, to get the data for all counties in Colorado, we can use `raster::crop()` on the brick and the `sf` object of counties.
```{r eval = T}
#--- preparing the county sf ---#
colorado_counties <- tigris::counties(state = "CO")
#--- cropping the brick to the counties ---#
column <- raster::crop(brick,colorado_counties)
column
```
```{r echo=FALSE, results="show"}
column
```
We still have 365 layers, but now only 96x169 grids, covering Colorado. Unlike when we extract a slice and were left with a layer, in extracting a column we are left with a RasterBrick once again.
<br>
<br>
<br>
<br>
<br>
<br>
## Reading NetCDF with NCDF4
<br>
The `ncdf4` package is specially built to handle netCDF4 objects, and can in many cases be easier, if not faster, than using `raster` functions. Again, to start, we download from a gridmet API using the `downloader` package. Similary, we still set the destination file name (what to call the file and where we want to it to be), and the mode to `wb` for a binary download.
This time, instead of ingesting the file as a RasterBrick or RasterLayer, we open it as a `ncdf4` object, a list. Note that open indicates an open connection to an object that is not yet in memory, similar to a database connection.
```{r eval = T}
#--- download gridMET precipitation 2018 ---#
downloader::download(url=str_c("http://www.northwestknowledge.net/metdata/data/pr_2018.nc"),
destfile = "pr_2018.nc",
mode = 'wb')
#--- ingest the file ---#
nc <- nc_open("pr_2018.nc")
nc
```
```{r echo=FALSE, results="show"}
nc
```
We can once again see that we have a spatially and temporally explicit object in 3 dimensions: longitude, latitude, and time. We can see that there are 365 units of time in this file, and that the unit is day. For each day, we have 585x1386=810,810 cells. We also see the projection, and global attributes for all files in `gridMET`.
To access the values used for each dimension, we can use the function `ncvar_get()`, passing in the name of connection object, nc, and the name of the dimension. Doing so gives us a vector: in the case of the grid centroids, the data correspond to the grid formed by all combinations of latitude and longitude. We use `expand.grid` to construct all combinations. For the dates, we can convert the date number to a date value, and we can also get the name of the variable in the file using `names(nc$var)`.
```{r eval = T}
#--- access the coordinates ---#
nc_lat <- ncvar_get(nc = nc, varid = "lat")
nc_lon <- ncvar_get(nc = nc, varid = "lon")
nc_coords <- expand.grid(lon=nc_lon,lat=nc_lat)
#--- access the days ---#
nc_dates <- lubridate::as_date(ncvar_get(nc = nc, varid = "day"),origin="1900-01-01")
#--- access the variable name ---#
var_id <- names(nc$var)
```
<br>
<br>
<br>
### Extracting Data
<br>
We can extract the data using the same function `ncvar_get`. When we use `ncvar_get` on the variable, we are pulling the 3-dimensional data out of the netCDF. While we lose the spatial and temporal data stored in the file, we are left with an array that contains only the data values in the same dimensions. We can then pair the data with the locations and time as needed as we operate on that multi-dimensional data using tidy techniques to run our analysis.
You will notice the empty index values in the brackets with 3 index places; these correspond to the order of dimensions in the file, longitude, latitude, and day. See documentation of `nc_get()` to access subsets of the array across any dimension if the correct indices are known.
```{r eval = T}
#--- calling the named layer ---#
nc_data <- ncvar_get(nc = nc, varid = var_id)[,,]
dim(nc_data)
```
```{r echo=FALSE, results="show"}
dim(nc_data)
```
We can put the data into something much more tidy-able by using these dimensions to create a rectangular dataframe. We rename the columns, bind the columns with the coordinates, and we have something we can put into tidy data.
```{r eval = F}
#--- value dataframe ---#
dim <- c(prod(dim(nc_data)[1:2]),dim(nc_data)[3])
df <- array(nc_data, dim = dim) %>%
as_tibble(.name_repair = "universal") %>%
rename_all(~str_c(nc_dates)) %>%
bind_cols(nc_coords,.)
#--- value tidy ---#
df_tidy <- df %>%
pivot_longer(-c(lon,lat),
names_to = "date",
values_to = "value") %>%
mutate(date=ymd(date))
```
In the last step is when you can introduce polygons over which to aggregate. You can join `df` with a bridge that maps each grid centroid to a polygon of interest. Then you can aggregate (`group_by(polygon,day) %>% summarise`) over time and space to get data on a relevant scale.
<br>
<br>
<br>
### Subsetting Cells or Days
<br>
To subset the data by location or day, you can use arguments in the `ncvar_get` function, or you can take advantage of the indexing when you read the data into a multi-dimensional array. The `ncvar_get` function can take lists (of length equal to the number of file dimensions) for the `start` and `count` arguments. Below is one example of how to get the data from all grids in September from the netCDF, passing the index for the first day to the `start` argument and the number of days to the `count` argument. In both cases the time is the third item in the list (it is always last), and we use $1$s as placeholders in `start` and $-1$ as placeholders in `count`.
```{r eval = T}
#--- preparing the index ---#
nc_dates_sep <- tibble(dates = as_date(ncvar_get(nc = nc, varid = "day"),origin="1900-01-01")) %>%
rownames_to_column("index") %>%
filter(month(dates) == 9)
dates_count <- length(nc_dates_sep$index[1]:nc_dates_sep$index[nrow(nc_dates_sep)])
#--- subsetting by days ---#
nc_data <- ncvar_get(nc = nc, varid = var_id,
start = c(1,1,as.numeric(nc_dates_sep$index[1])),
count = c(-1,-1,dates_count))
dates_index <- nc_dates_sep$index[1]:nc_dates_sep$index[nrow(nc_dates_sep)]
#--- subsetting by days ---#
nc_data1 <- ncvar_get(nc = nc, varid = var_id)[,,dates_index]
```
Alternatively, you can treat the output of `ncvar_get` as an object that can be subsetted. Using the same dates, we pass index values into the third index location.
```{r eval = T}
dates_index <- nc_dates_sep$index[1]:nc_dates_sep$index[nrow(nc_dates_sep)]
#--- subsetting by days ---#
nc_data2 <- ncvar_get(nc = nc, varid = var_id)[,,dates_index]
identical(nc_data1,nc_data2)
```
```{r echo=F, results="show"}
identical(slice1,slice2)
```
This can be a bit easier to use for subsetting by location, since it does not require contiguous index values. To use this approach to subset by location of grid cells, you pass through the grids' index values for column (longitude) and row (latitude) into the data's first two index locations, respectively.
<br>
<br>
<br>
## r
```{r }
slice <- raster::brick("pr_2018.nc")[[1]] # can ignore warning
#--- make tibble of coordinates ---#
slicecoords <-
tibble(
lon = xyFromCell(slice, 1:ncell(slice))[,1],
lat = xyFromCell(slice, 1:ncell(slice))[,2]
) %>%
rownames_to_column(var = "cell") %>%
mutate(cell = as.numeric(cell))
```
```{r }
r <- raster(nrow = 3, ncol = 3)
values(r) <- 1:ncell(r)
s <- raster(nrow = 10, ncol = 10)
plot(s)
s <- resample(r, s, method = "bilinear")
plot(s)
plot(r)
```