README.Rmd

---
title: "CEFI"
output: github_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

[NOAA's Physical Science Laboratory (PSL)](https://psl.noaa.gov/) [Climate Ecosystems and Fisheries Initiative Portal](https://psl.noaa.gov/cefi_portal/) serves historical and forecast data useful in ecological studies. Data is served using a [THREDDS](https://psl.noaa.gov/thredds/catalog/Projects/CEFI/regional_mom6/catalog.html) catalog, but PSL also makes [tabular catalogs](https://psl.noaa.gov/cefi_portal/var_list_northwest_atlantic_hist_run.html) available, too.

# Requirements

 + [R v4.1+](https://www.r-project.org/)
 + [rlang](https://CRAN.R-project.org/package=rlang)
 + [dplyr](https://CRAN.R-project.org/package=dplyr)
 + [sf](https://CRAN.R-project.org/package=sf)
 + [stars](https://CRAN.R-project.org/package=stars)
 + [jsonlite](https://CRAN.R-project.org/package=jsonlite)
 + [tidync](https://CRAN.R-project.org/package=tidync)

# Installation

```
remotes::install_github("BigelowLab/cefi")
```

# Usage

Load the libraries needed.
```{r}
suppressPackageStartupMessages({
  library(rnaturalearth)
  library(cefi)
  library(tidync)
  library(stars)
  library(dplyr)
})
```

## Catalogs

CEFI offers THREDDS catalogs which are great for mining programmatically, but they also provide simple tabular catalogs: one for historical runs and one for forecasts.  (Yes, the first variable is "Varible_Name".)  These are easy for users to navigate. 

```{r}
uri = catalog_uri(region = "Northwest Atlantic", period = "history")
hist = read_catalog(uri) |>
  dplyr::glimpse()
```
```{r}
uri = catalog_uri(region = "Northwest Atlantic", period = "forecast")
fcst = read_catalog(uri) |>
  dplyr::glimpse()
```

Assuming these catalogs will remain and stay up-to-date, we can leverage them in lieu of coding up software to navigate the THREDDS catalogs.

# Getting data

To get data select one row from either catalog, and open that resource which we see in R as a [tidync](https://docs.ropensci.org/tidync/) object useful for navigating and extracting netcdf files.

### Historical data

Let's start with historical data. 

```{r}
nc = hist |>
  dplyr::filter(Variable_Name == "btm_o2") |>
  cefi_open()
nc
```

Note that the spatial dimensions of the "active" grid are defined by `xh` nd `yh` not `x` and `y` or `lon` and `lat`.  That is a choice of the creator of the NetCDF resource, and so we use those names for filtering.

Time requires some background understanding of the CEFI architecture coupled with the `tidync` approach to navigating the NetCDF object.  To ease that for the user we have created a function called `cefi_time()` which is a wrapper around `tidync::hyper_transforms()`, but adds a column called `time_` which can be either 'Date' of 'POSIXct' class.  

```{r}
cefi_time(nc)
```


Next we filter the array with `cefi_filter()`, so that we can collect a subset of the data. This function is a wrapper around the `tidync::hyper_filter()` function; we wrap because the time coordinate in the NetCDF files is not in user-friendly format.  Note that it is important that `time` comes before any other filtering element.

```{r}
nc = cefi_filter(nc, 
                 time = as.Date(c("1995-12-16", "1995-12-20")), 
                 xh = xh > -75 & xh < -60,
                 yh = yh > 40 & yh < 50)
```

It is important to understand that the actual data is not loaded into R (yet), think of this a prefiltering step.  To actually get the data we can call `cefi_var`.  You can explore more about prefiltering at the [tidync website](https://docs.ropensci.org/tidync/).

Now we can load the data into R, using `cefi_var()`.

```{r}
s = cefi_var(nc)
s
```

And finally we can plot the result.

```{r}
coast = rnaturalearth::ne_coastline(scale = "medium", returnclass = "sf") |>
  sf::st_geometry() |>
  sf::st_crop(sf::st_bbox(s)) 

plot_coast = function(){
  plot(sf::st_geometry(coast), add = TRUE, col = "darkorange")
}

plot(s, hook = plot_coast, key.pos = NULL)
```


### Forecast data

Now let's look at forecast data.

```{r}
nc = fcst |>
  dplyr::filter(Variable_Name == "tob" & Time_of_Initialization == "2022-12") |>
  cefi_open()

nc
```

Here time is measured in `lead` time (in months) relative to an initalization date.  But note that the actually time varying dimension is called `lead` even though we address it in the more familiar `time`. Once again we can use `cefi_time()` to expose the time-varying dimension relative to the starting date.

```{r}
cefi_time(nc)
```


```{r}
nc = cefi_filter(nc, 
                 time = as.Date(c("1994-12-01", "1995-02-01")), 
                 xh = xh > -75 & xh < -60,
                 yh = yh > 40 & yh < 50)
```                 

Unlike the historical runs, the forecast results include one or more ensemble member results for each time period. These are identified as `member` which you might think of as replicates.  

Here we show retrieving all of the members.

```{r}
s = cefi_var(nc, collapse_fun = NULL)
s
```

To see the ensemble results for one date, we can slice-and-dice using indexing.In this case, there are two variables, `tob` (temperature of bottom) and `tob_anom` temperature of bottom anomaly.  Below we show only `tob`.

```{r}
times = st_get_dimension_values(s, "time")
plot(s['tob'][,,,,1], hook = plot_coast)
```


Requesting all members for a given time is possible as shown above, but a more common use case is to extract a summary (mean or median) or a measure of variability (standard deviation or variance).

```{r}
s = cefi_var(nc, collapse_fun = mean)
s
```

Note that the dimensionality is reduced because we computed the mean of the ensembles at each time. 

```{r collaped_mean_plot}
plot(s['tob'], hook = plot_coast, main = "Mean Temp of Bottom")
```


### Granularity

Here we highlight the granularity of the data source by zooming in on the New England coastline. You can see that large embayments such as Penobscot and Narragansett bays are not included. 

```{r granularity}
plot(dplyr::slice(s['tob'], "time",  1),
     xlim = c(-72 , -63),
     ylim = c(39, 45),
     reset = FALSE,
     axes = TRUE)
plot(coast, add = TRUE, col = "orange")
```