Authors: Mike Johnson + Lynker Spatial Team
Accurate reservoir locations are essential for hydrologic modeling because reservoirs alter the natural flow regime by storing, releasing, and redistributing water across space and time. These operations directly influence downstream streamflow, flood peaks, drought severity, water availability, and ecosystem conditions. Today’s NWM only accounts for ~500 reservoirs across CONUS, which is incomplete for many forecasting and planning applications. To extend the scope of reservoir locations, data from other resources is needed.
The National Inventory of Dams (NID) provides broad coverage but variable location quality (on-reservoir, on-flowline, generalized, sometimes wrong). Even small positional errors can connect a dam/reservoir to the wrong flowline or waterbody, degrading routing of inflows/outflows and reducing model skill for discharge, storage, and evapotranspiration. That in turn undermines flood forecasting, drought planning, and environmental flow assessments.
Other datasets often have better locations but are incomplete or inconsistent in other ways, particularly in spatial coverage. Critically, each dataset also opens doors for data assimilation, parameterization, and ML training on historic time series. By grounding our reference reservoirs with precise geographic contexts and aligning them to a shared hydrographic fabric, we get a representation of regulated flow that better reflects the coupled human–natural water cycle and is an asset for community efforts like those at geoconnex and as NOAA/NWS POIs in the NWM.
Our goal is to build a harmonized set of reference reservoirs (proxied
by dams) that are geospatially consistent with the hydrofabric used in
USGS and NOAA/NWS modeling. We treat NID as the global set to validate
and enrich, assign stable synthetic IDs (dam_id = "ls-*"), and use
multiple contexts to correct locations and enhance attributes.
Strategy (evidence aggregation):
- build candidate pairs via spatial proximity within tuned per-context radii,
- compute name similarity (Jaro–Winkler) from cleaned strings,
- rank contexts by reliability and derived evidence,
- select a best realization per dam, with diagnostics.
Per-dam output: A chosen realization (context + ID), snap distance (m), name similarity, number of supporting contexts, and offset from the original NID point.
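The name-similarity step in the strategy above can be sketched in base R. This is a minimal illustration, not the workflow's own code: the function names (`clean_name`, `jaro`, `jaro_winkler`) and the list of generic tokens stripped during cleaning are assumptions. The sketch returns a similarity (1 = identical); a distance form would be 1 minus this value.

```r
# Hypothetical name cleaning: upper-case, drop punctuation, and remove generic
# tokens (the token list is an assumption made for this example).
clean_name <- function(x) {
  x <- toupper(x)
  x <- gsub("[^A-Z0-9 ]", " ", x)
  x <- gsub("\\b(DAM|RESERVOIR|LAKE)\\b", " ", x, perl = TRUE)
  trimws(gsub("[[:space:]]+", " ", x))
}

# Jaro similarity between two single strings.
jaro <- function(a, b) {
  s1 <- strsplit(a, "")[[1]]
  s2 <- strsplit(b, "")[[1]]
  n1 <- length(s1)
  n2 <- length(s2)
  if (n1 == 0 && n2 == 0) return(1)
  if (n1 == 0 || n2 == 0) return(0)
  window <- max(0, floor(max(n1, n2) / 2) - 1)
  m1 <- logical(n1)
  m2 <- logical(n2)
  for (i in seq_len(n1)) {
    lo <- max(1, i - window)
    hi <- min(n2, i + window)
    if (lo > hi) next
    for (j in lo:hi) {
      if (!m2[j] && s1[i] == s2[j]) {
        m1[i] <- TRUE
        m2[j] <- TRUE
        break
      }
    }
  }
  m <- sum(m1)
  if (m == 0) return(0)
  # Half the number of matched characters that are out of order
  t <- sum(s1[m1] != s2[m2]) / 2
  (m / n1 + m / n2 + (m - t) / m) / 3
}

# Winkler adjustment: boost for a shared prefix of up to four characters.
jaro_winkler <- function(a, b, p = 0.1) {
  j <- jaro(a, b)
  s1 <- strsplit(a, "")[[1]]
  s2 <- strsplit(b, "")[[1]]
  l <- 0
  for (i in seq_len(min(4, length(s1), length(s2)))) {
    if (s1[i] == s2[i]) l <- l + 1 else break
  }
  j + l * p * (1 - j)
}

jw_example <- jaro_winkler(clean_name("Hoover Dam"), clean_name("hoover reservoir"))
```

Cleaning before scoring matters here because generic words such as "Dam" would otherwise inflate or deflate the score for unrelated features.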
- NID (cleaned, EPSG:5070, synthetic IDs dam_id = "ls-*"). Baseline catalog (USACE). High inclusivity; variable positional accuracy. Synthetic IDs provide stable tracking.
- Lynker Spatial hydrofabric flowlines (ref_fab_fp) + waterbodies (ref_fab_wb). National hydrographic backbone (v2.3). Consistent topology for flowlines and waterbodies aligned to modeling needs.
- OpenStreetMap (OSM): water polygons, water lines, dam lines. Volunteer geographic data adding local detail; quality and coverage vary regionally.
- GNIS. USGS naming authority for natural/cultural features (dams, lakes, reservoirs), used for robust naming comparisons.
- ResOpsUS. Reservoir operations and attributes useful for modeling and water management.
- HILARRI. Curated links among NID (2024), GRanD (v1.3), and EHA (2024), connecting dams, reservoirs, and hydropower plants (ORNL/DOE).
- GOODD. Global dam compilation (>38k) with attributes supporting large-scale analyses.
- NWM (optionally re-linked to WB IDs). NOAA’s hydrologic modeling system. Reservoir POIs can be re-indexed to hydrofabric WBs to improve geometric alignment.
- Bring Your Own. The method is extensible so that anyone can add a dataset by specifying a unique ID, search radius, and rank weight; it will be harmonized with the principal data resources.
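As a rough illustration of the "Bring Your Own" idea, a context can be thought of as one row in a small registry holding its unique ID column, search radius, and rank. The registry layout, the `add_context()` helper, the ID column names, and the `state_dams` entry below are all hypothetical; only the radii and ranks of the first three rows come from this document.

```r
# Hypothetical context registry. Radii and ranks for these rows mirror the
# values documented for ref_int, gnis, and ref_fab_fp; id_col names are assumed.
contexts <- data.frame(
  context  = c("ref_int", "gnis", "ref_fab_fp"),
  id_col   = c("ref_int_id", "gnis_id", "flowpath_id"),
  radius_m = c(2000, 2000, 1500),
  rank     = c(0L, 1L, 2L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: validate and append a user-supplied context.
add_context <- function(registry, context, id_col, radius_m, rank) {
  stopifnot(
    !context %in% registry$context,  # context keys must be unique
    is.numeric(radius_m), radius_m > 0,
    rank >= 0
  )
  rbind(
    registry,
    data.frame(context = context, id_col = id_col, radius_m = radius_m,
               rank = as.integer(rank), stringsAsFactors = FALSE)
  )
}

# Invented example: a state dam inventory with a 1 km radius at rank 1.
contexts <- add_context(contexts, "state_dams", "state_dam_id", 1000, 1)
```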
# stitched outputs (written by the runner)
res_rds <- "output/reference-reservoirs.rds"
res <- readRDS(res_rds) |>
  dplyr::filter(!is.na(X)) |>
  sf::st_as_sf(coords = c("X","Y"), crs = 5070, remove = FALSE)
CONUS is divided into ~100 km cells. We process only tiles that
intersect dams. Each tile runs independently (bounded memory; smaller
candidate pools). Per-tile results are written to RDS; a final pass
stitches tiles, resolving overlaps by preferring more supporting
contexts (n) then closer snaps.
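The overlap rule in the stitching pass (more supporting contexts first, then the closer snap) can be sketched with base R sorting. The toy rows and the `tile` column are invented for illustration; `n` and `realization_snap_m` follow the column names used elsewhere in this document.

```r
# Toy per-tile results: the same dam may appear in more than one tile.
tile_rows <- data.frame(
  dam_id = c("ls-1", "ls-1", "ls-2", "ls-2", "ls-3"),
  tile   = c(10, 11, 10, 11, 12),
  n      = c(3, 5, 4, 4, 2),
  realization_snap_m = c(20, 80, 150, 35, 60),
  stringsAsFactors = FALSE
)

# Sort so the preferred row comes first within each dam:
# larger n wins, ties are broken by the smaller snap distance.
ord <- order(tile_rows$dam_id, -tile_rows$n, tile_rows$realization_snap_m)
sorted <- tile_rows[ord, ]

# Keep the first (preferred) row per dam.
stitched <- sorted[!duplicated(sorted$dam_id), ]
```

In this toy case ls-1 is taken from tile 11 because it has more supporting contexts, and ls-2 is taken from tile 11 because the counts tie and its snap is closer.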
source("R/utils_fin.R")
#> Spherical geometry (s2) switched off
conus <- AOI::aoi_get(state = "conus") |> st_transform(5070)
tiles <- make_conus_grid(st_union(conus), cell_km = 100)
if (!is.null(res)) {
  ggplot2::ggplot() +
    ggplot2::geom_sf(data = res, alpha = 0.15, size = 0.25) +
    ggplot2::geom_sf(data = tiles, fill = NA, color = "brown", size = 0.2) +
    ggplot2::labs(title = "Reservoirs", subtitle = "EPSG:5070",
                  x = NULL, y = NULL) +
    ggplot2::theme_minimal()
} else {
  plot.new(); title("Dam points plot skipped (no X/Y)")
}
The NID defines the global set we validate, supplement, and
standardize. Because NID IDs can be duplicated and locations imprecise,
we assign stable synthetic IDs (dam_id = "ls-*") and treat NID like any
other context in scoring, but privileged as the anchor. Outputs retain
NID identifiers while updated coordinates, names, and attributes can be
adopted from the best realization across contexts. This preserves
continuity with the most complete inventory while systematically
improving accuracy via GNIS names, GOODD’s footprint, hydrofabric
topology, and OSM detail, producing features that are geoconnex-ready and
compatible with NWS POIs.
A context is an external dataset/layer (e.g., gnis, goodd, ref_fab_fp,
osm_ww_poly) against which NID dams are compared. For each dam and
context, we:
- generate candidate pairs within a tuned search radius,
- compute snap distance and name similarity (JW), and
- filter/rank to a single best match per (dam, context).
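The three steps above can be sketched for a single context in base R. This is an illustrative stand-in, not the workflow's code: it uses plain Euclidean distance, which is reasonable for projected coordinates in metres such as EPSG:5070, and it ranks by snap distance only. The `best_match()` helper and the toy coordinates are assumptions.

```r
# Toy dams and context features, coordinates in projected metres.
dams <- data.frame(dam_id = c("ls-1", "ls-2"),
                   X = c(0, 5000), Y = c(0, 0),
                   stringsAsFactors = FALSE)
cand <- data.frame(id = c("a", "b", "c"),
                   cx = c(300, 1200, 6000), cy = c(400, 0, 0),
                   stringsAsFactors = FALSE)

# Hypothetical helper: best candidate per dam within a search radius.
best_match <- function(dams, cand, radius_m) {
  pairs <- merge(dams, cand, by = NULL)  # Cartesian product of dams x candidates
  pairs$snap_m <- sqrt((pairs$X - pairs$cx)^2 + (pairs$Y - pairs$cy)^2)
  pairs <- pairs[pairs$snap_m <= radius_m, ]           # radius gate
  pairs <- pairs[order(pairs$dam_id, pairs$snap_m), ]  # closest first
  pairs[!duplicated(pairs$dam_id), c("dam_id", "id", "snap_m")]
}

best <- best_match(dams, cand, radius_m = 1500)
```

A dam with no candidate inside the radius simply drops out of the result here; name similarity would enter as an additional sort key after snap distance.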
Two derived contexts are also created by intersecting waterbodies and flowlines in each data family:
- ref_int: intersections of ref_fab_wb × ref_fab_fp
- osm_int: intersections of osm_ww_poly × osm_ww_lines
These provide strong geometry/topology anchors.
- 0 – Intersection evidence: ref_int, osm_int (geometry + topology; strongest).
- 1 – Curated/named: gnis, resops, goodd, osm_dam_lines, hillari.
- 2 – Direct/core geometries: osm_ww_poly, osm_ww_lines, ref_fab_fp, ref_fab_wb, nwm (re-linked), nid.
- Tributary penalty: if river implies TR/OS/TRIB, add +5 to rank. Within any tier, smaller snap and smaller JW win.
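A minimal sketch of how the rank tiers and tributary penalty could combine into a single ordering is shown below. The rank values come from the tiers above. The `tributary` flag per match row is hypothetical (the text does not show how it is derived from river), and JW is assumed to be stored as a distance (0 = identical), which is consistent with "smaller JW wins".

```r
# Rank per context, taken from the tiers listed above.
rank_map <- c(ref_int = 0, osm_int = 0,
              gnis = 1, resops = 1, goodd = 1, osm_dam_lines = 1, hillari = 1,
              osm_ww_poly = 2, osm_ww_lines = 2, ref_fab_fp = 2,
              ref_fab_wb = 2, nwm = 2, nid = 2)

# Toy best-per-context matches for one dam.
matches <- data.frame(
  dam_id    = c("ls-1", "ls-1", "ls-1"),
  context   = c("ref_int", "gnis", "nid"),
  snap_m    = c(40, 15, 0),
  jw        = c(NA, 0.05, 0),          # assumed JW distance; NA sorts last
  tributary = c(TRUE, FALSE, FALSE),   # hypothetical flag
  stringsAsFactors = FALSE
)

# Effective rank = tier rank + 5 when the tributary flag is set.
matches$rank <- unname(rank_map[matches$context]) +
  ifelse(matches$tributary, 5, 0)

# Lowest rank first, then smaller snap, then smaller JW.
ord <- order(matches$dam_id, matches$rank, matches$snap_m, matches$jw)
sorted <- matches[ord, ]
best <- sorted[!duplicated(sorted$dam_id), ]
```

Here the ref_int match is demoted to rank 5 by the penalty, so the gnis match (rank 1) is selected ahead of nid (rank 2).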
Per tile:
- Load dams (NID) and clip contexts.
- Build representative points per context: points (identity), lines (midpoints/endpoints), polygons (point-on-surface).
- Generate candidates via st_is_within_distance (per-context radius) with a KNN fallback gated by the same radius.
- Score (snap distance, JW), apply tributary penalty; reduce to best per (dam, context).
- Build a wide table of IDs (one column per context), select best realization per dam, compute QA (offset from NID), and distance to flowpath.
- Write tile RDS and append a manifest row.
Context | Search Distance (m) | Rank | Group | Notes |
---|---|---|---|---|
ref_int | 2000 | 0 | Anchors / Derived | Intersections of ref_fab_wb × ref_fab_fp; highest-confidence geometry. |
osm_int | 2000 | 0 | Anchors / Derived | Intersections of osm_ww_poly × osm_ww_lines; strong topology signal. |
gnis | 2000 | 1 | Curated / Named | USGS names; authoritative nomenclature, variable location quality. |
resops | 2000 | 1 | Curated / Named | Reservoir ops/attributes useful for modeling. |
osm_dam_lines | 1500 | 1 | Curated / Named | OSM dam features; coverage varies. |
hillari | 2000 | 1 | Curated / Named | Links dams–reservoirs–plants (ORNL/DOE). |
goodd | 2000 | 1 | Curated / Named | Global dam footprint/attributes. |
osm_ww_lines | 1500 | 2 | Direct / Network | Dense/noisy; short radius reduces false hits. |
osm_ww_poly | 1500 | 2 | Direct / Network | Strong geometric anchors for reservoirs. |
ref_fab_fp | 1500 | 2 | Direct / Fabric | Topologically consistent flowlines. |
ref_fab_wb | 2000 | 2 | Direct / Fabric | Waterbodies as spatial anchors. |
nwm | 2000 | 2 | Direct / POIs | Often mislocated; improved when re-indexed to WBs. |
nid | 2000 | 2 | Core Dataset | Baseline set for validation & enrichment; stable synthetic IDs. |
Risk / Complexity | Why it matters | Mitigation in this workflow |
---|---|---|
Mis-snap to wrong flowline/waterbody | Broken routing; bad inflow/outflow accounting | Per-context radii; intersections (ref_int/osm_int); rank 0 |
Duplicate/ambiguous IDs & names | Double-counting or missed joins | Synthetic dam_id, string prep + JW, cross-context tallies (n) |
Noisy/shifted geometries (esp. NWM, NID) | High false positives; unstable matches | Rep points, short radii (750 m), KNN fallback within same gate |
Seasonal shoreline changes | Point-on-surface drift vs. dam location | Prefer dam-aligned contexts; intersections; multi-context voting |
Tile edge effects | Missed candidates near boundaries | Buffered tile search; global stitch preferring n then distance |
Nonstationarity / updates over time | Drift between versions; reproducibility | Tile manifests, context IDs, rank map documented |
Licensing & attribution (OSM) | Compliance and redistribution | Keep source IDs/contexts; document license provenance |
A separate but related task in developing the reference reservoir set is
to (1) identify reservoirs suitable for the RFC-DA system and (2)
provide a consistent set of reservoir parameters for use in the National
Water Model (NWM) under this scheme. In the current NWM, all reservoir
parameters excluding WeirE, LkArea, and LkMxE are populated with
a single default value. Our goal is to provide a more defensible and
consistent set of parameters using a combination of NID attributes and
DEM-based surrogates, anchored by the reservoir surface and dam toe
elevations. When this proves impossible, we default to the primary NWM
values (WeirC = 0.4, WeirL = 10 m, OrficeC = 0.1, OrficeA = 1 m², ifd =
0.899).
For this first version (v1), we included only reservoirs within 1 km of
a reference flowpath that had an associated OSM or reference waterbody
with an area greater than 0.2 km². Once identified, two key DEM-based
measures were derived: (1) the mean elevation of the OSM and/or
reference waterbody and (2) the elevation of the dam toe (dam_elev).
These were extracted from the 1/3 arc-second (10 m) 3DEP DEM. From these
anchors, we computed a suite of reservoir parameters required for RFC-DA
in the NWM, using NID attributes wherever possible. When NID values were
missing or inconsistent, we applied a transparent set of heuristics and,
as a last resort, the fixed defaults currently used in the NWM. An
optional flag (use_hazard = TRUE) was enabled in this v1 release to
modestly increase weir length (WeirL) and orifice area (OrficeA) or
bias orifice coefficients upward for significant and high-hazard dams,
ensuring more conservative estimates. By default this flag is off
(FALSE) to avoid introducing policy-driven noise when using the function
elsewhere.
The derived variables include: H_m (hydraulic height), LkArea
(reservoir area, m²), WeirE (crest elevation), LkMxE (maximum pool
elevation), OrficeE (invert elevation), WeirC (weir coefficient),
WeirL/Dam_Length (weir length), OrficeC (orifice coefficient),
OrficeA (orifice area), and ifd (fixed constant).
Hydraulic height (H_m) is selected in priority order from
structural_height, dam_height, hydraulic_height, and finally
nid_height. When direct measurements were absent, elevation fractions
were applied: crest ≈ dam_elev + 0.90 * H_m, invert ≈ dam_elev + 0.15
* H_m, max pool ≈ wb_elev + 0.10 * H_m or dam_elev + 1.00 * H_m.
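A short worked example of the height selection and elevation fractions follows. The `first_valid()` helper and the input values are invented for illustration, heights are assumed to already be in metres, and the choice between the two max-pool formulas is assumed here to depend on whether wb_elev is available.

```r
# Hypothetical helper: first non-missing, positive value in priority order.
first_valid <- function(...) {
  vals <- c(...)
  vals <- vals[!is.na(vals) & vals > 0]
  if (length(vals)) vals[[1]] else NA_real_
}

# Invented attributes for one dam (metres).
dam <- list(structural_height = NA, dam_height = 25,
            hydraulic_height = 22, nid_height = 27)
dam_elev <- 300  # dam toe elevation from the DEM
wb_elev  <- 318  # mean waterbody elevation from the DEM

# Priority order: structural, dam, hydraulic, then NID height.
H_m <- first_valid(dam$structural_height, dam$dam_height,
                   dam$hydraulic_height, dam$nid_height)

WeirE   <- dam_elev + 0.90 * H_m  # crest elevation
OrficeE <- dam_elev + 0.15 * H_m  # invert elevation
# Assumption: use the waterbody anchor when present, else the dam toe.
LkMxE <- if (!is.na(wb_elev)) wb_elev + 0.10 * H_m else dam_elev + 1.00 * H_m
```

With these inputs H_m is 25 m, giving a crest of 322.5 m, an invert of 303.75 m, and a maximum pool of 320.5 m.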
Storage attributes (nid_storage, normal_storage, max_storage) were
converted from acre-feet to cubic meters (1 ac-ft = 1233.48 m³) and,
when paired with LkArea, used to approximate mean depth.
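As a worked example of the conversion and mean-depth approximation, using invented storage and area values:

```r
AC_FT_TO_M3 <- 1233.48          # conversion factor stated above

normal_storage_acft <- 12000    # invented NID storage value (acre-feet)
LkArea_m2 <- 2.5e6              # invented reservoir area (m^2)

storage_m3   <- normal_storage_acft * AC_FT_TO_M3  # volume in cubic metres
mean_depth_m <- storage_m3 / LkArea_m2             # volume / area
```

Here 12,000 ac-ft converts to 14,801,760 m³, and dividing by 2.5 km² of surface area gives a mean depth of about 5.9 m.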
Coefficients and areas were inferred from categorical descriptors and dam height:
WeirC was set to 1.6 for broad-crested, 1.7 for ogee, 1.84 for
sharp-crested, and 1.6 for earthen dams when unspecified (Chow, 1959).
OrficeC was set to 0.62 for sharp-edged/sluice/pipe outlets, 0.80 for
gated or rounded entries, and defaulted to 0.1 otherwise (Chow, 1959).
Orifice areas (OrficeA) were assigned by dam height (<10 m → 0.5 m²;
10–30 m → 0.9 m²; ≥30 m → 1.5 m²), with a 1.2 m² override for concrete
or ogee dams. The coefficient ranges align with established values in
Open-Channel Hydraulics (Chow, 1959). The use of fractional height
surrogates for crest and invert levels is consistent with
screening-level approaches employed by FEMA and USACE when design
drawings are unavailable (FEMA, 2004). Storage-to-area ratios are a
standard method for approximating mean depth in reservoir studies
(USACE, 1995). Importantly, all surrogates are intended for
national-scale screening and modeling, not for site-specific engineering
or safety determinations. DEM-based anchors (dam_elev, wb_elev) may vary
with DEM quality, so regional refinements are encouraged where
higher-resolution data are available.
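The coefficient and area rules described above can be expressed as small lookup functions. This is a sketch under stated assumptions: the function names, the free-text descriptor arguments, and the keyword matching are hypothetical, and falling back to the NWM default WeirC of 0.4 for non-earthen dams with no spillway descriptor is an assumption based on the defaults listed earlier.

```r
# Normalise a possibly missing descriptor to lower-case text.
norm_txt <- function(x) if (is.na(x)) "" else tolower(x)

# Weir coefficient from spillway and dam descriptors (hypothetical inputs).
weir_coef <- function(spillway_type, dam_type = NA) {
  s <- norm_txt(spillway_type)
  d <- norm_txt(dam_type)
  if (grepl("ogee", s)) {
    1.7
  } else if (grepl("sharp", s)) {
    1.84
  } else if (grepl("broad", s)) {
    1.6
  } else if (s == "" && grepl("earth", d)) {
    1.6   # earthen dam, spillway unspecified
  } else {
    0.4   # assumed fallback: NWM default WeirC
  }
}

# Orifice coefficient from an outlet descriptor (hypothetical input).
orifice_coef <- function(outlet_type) {
  s <- norm_txt(outlet_type)
  if (grepl("sharp|sluice|pipe", s)) {
    0.62
  } else if (grepl("gate|round", s)) {
    0.80
  } else {
    0.1
  }
}

# Orifice area (m^2) by dam height class, with the concrete/ogee override.
orifice_area <- function(H_m, concrete_or_ogee = FALSE) {
  if (concrete_or_ogee) return(1.2)
  if (H_m < 10) 0.5 else if (H_m < 30) 0.9 else 1.5
}
```

For example, `orifice_area(20)` falls in the 10–30 m class and returns 0.9, while `orifice_area(20, concrete_or_ogee = TRUE)` returns the 1.2 m² override.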
The end result is a traceable, reproducible, and tunable framework where each dam–waterbody record is enriched with consistent hydraulic variables needed for RFC-DA in the NWM. The process was able to extend the scope of candidate reservoirs ~7x and offers a more refined set of attributes beyond global defaults.
For the reference reservoirs, a significant part of the workflow is heuristic-based (rank order, search radius, tributary penalty, etc.). These heuristics were developed through trial and error and expert judgement, and there is significant opportunity to refine them with regional calibrations or more manual investigation. As with any reference system, a source of truth was needed. In this first pass, NID was considered the truth, and external entities were used to refine locations and to populate more attributes and outlinks. In the future, creating a more complete “truth” dataset from the multiple sources could be considered, in particular from the OSM dam lines.
With respect to the hydraulic estimation, there are significant areas for enhancement now that this version 1 dataset is defined. Future work could include: (1) expanding the reservoir set by relaxing proximity and size thresholds, (2) incorporating additional data sources for reservoir surface and dam toe elevations (from all linked resources), (3) refining heuristics with regional calibrations or machine learning, and (4) integrating dynamic reservoir operation rules where available.
- To use this repo, all data is stored with the exception of OSM. All data, including OSM, can be downloaded with the directions in data/data_prep.R.
- Run workflow/01_process_tiles_nid.R. If new resources are added, be sure to include them in the ingest as well as provide a rank and radius.
- workflow/02_stich.R stitches the tiles together and adds preliminary info to define the reference-reservoir set (data/reference-reservoirs-v1.gpkg). This includes distance to flowpath and waterbody area.
- workflow/03_hydraulics.R selects the candidate reservoirs and adds the parameters needed for RFC-DA in the NWM.
- If you want to recreate the webmap, run the make file in scripts/tiles using the latest gpkg. Output can be viewed with pnpm dev --strictPort --port 8000
⸻
• Chow, V. T. (1959). Open-Channel Hydraulics. McGraw-Hill, New York.
• FEMA (2004). Federal Guidelines for Dam Safety: Selecting and Accommodating Inflow Design Floods for Dams. FEMA 94. Federal Emergency Management Agency, Washington, D.C.
• USACE (1995). Hydrologic Engineering Requirements for Reservoirs. Engineer Regulation ER 1110-2-240. U.S. Army Corps of Engineers, Washington, D.C.
⸻
if (exists("res") && nrow(res)) {
  p1 <- ggplot2::ggplot(res, ggplot2::aes(x = realization_snap_m)) +
    ggplot2::geom_histogram(bins = 50) +
    ggplot2::labs(title = "Snap distance (m)") + ggplot2::theme_minimal()
  p2 <- ggplot2::ggplot(res, ggplot2::aes(x = realization_jw)) +
    ggplot2::geom_histogram(bins = 50) +
    ggplot2::labs(title = "Name similarity (JW)") + ggplot2::theme_minimal()
  p3 <- ggplot2::ggplot(res, ggplot2::aes(x = n)) +
    ggplot2::geom_histogram(binwidth = 1) +
    ggplot2::scale_x_continuous(breaks = 0:10) +
    ggplot2::labs(title = "Supporting contexts per dam (n)") + ggplot2::theme_minimal()
  print(p1); print(p2); print(p3)
}
#> Warning: Removed 54654 rows containing non-finite outside the scale range
#> (`stat_bin()`).
if (exists("res") && nrow(res)) {
  ctx_cols <- c("gnis","resops","goodd","nwm","osm_ww_poly","osm_ww_lines",
                "osm_dam_lines","ref_fab_fp","ref_fab_wb","ref_int","osm_int","nid")
  have <- intersect(ctx_cols, names(res))
  if (length(have)) {
    long <- tidyr::pivot_longer(as.data.frame(res), dplyr::all_of(have),
                                names_to = "context", values_to = "id")
    long$has <- !is.na(long$id)
    ggplot2::ggplot(long, ggplot2::aes(x = context, fill = has)) +
      ggplot2::geom_bar() +
      ggplot2::coord_flip() +
      ggplot2::labs(title = "Context coverage (count of dams with a match)", y = "count", x = NULL) +
      ggplot2::theme_minimal()
  }
}