Commit bed4a78

Add a chunking layout paragraph with 2D example diagram
1 parent 1bc473d commit bed4a78

2 files changed: +23 -0 lines changed

images/chunking-patterns.png

2.08 MB

sections/cloud-scale-data.qmd

Lines changed: 23 additions & 0 deletions
@@ -289,6 +289,29 @@ chunk layout should be such that, under expected common usage patterns,
proximal chunks are more likely to be requested together. On average,
this will reduce the number of separate read requests a client must
issue to retrieve and piece together any particular desired data subset.

![](../images/chunking-patterns.png){.lightbox fig-align="center" width="90%"}


For example, the image above depicts three alternative chunking patterns
for a simple 2D spatial raster dataset. The approach on the left stores
together data that share the same latitude, whereas the approach on the
right stores together data that share the same longitude. In each case,
values that are close together along the orthogonal direction are split
across multiple chunks, which is not ideal. Meanwhile, the square tile
approach in the middle -- consistent with the COG format specification --
strikes a healthy balance, and overall is more effective at storing
together data that are spatially proximal in any given direction. Now
imagine a _stack_ of such rasters corresponding to data collected over
time -- in other words, a 3D array dataset with a time dimension. The
same principle holds in general: _cubes_ tend to keep data that are
proximal in both space and time in the same chunk, and are probably the
safest bet. However, an alternative strategy may be preferable if, for
example, the expected dominant use case is spatial analysis (consider
chunks that are spatially broad but shallow in the time dimension) or
time-series analysis (consider chunks that are spatially narrow but
long in the time dimension).
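
To make the trade-off concrete, here is a minimal sketch that counts how many chunks a query overlaps under different chunk shapes. It treats the chunk count as a rough proxy for the number of read requests, which assumes one request per chunk. The array size, the three chunk shapes, the two queries, and the helper name `chunks_touched` are all hypothetical values chosen for illustration; they do not come from any particular format or library.

```python
from math import prod


def chunks_touched(start, stop, chunk_shape):
    """Count the chunks that overlap the half-open index box [start, stop)."""
    per_axis = []
    for lo, hi, size in zip(start, stop, chunk_shape):
        first = lo // size        # index of the first chunk touched on this axis
        last = (hi - 1) // size   # index of the last chunk touched on this axis
        per_axis.append(last - first + 1)
    return prod(per_axis)


# Hypothetical (time, lat, lon) chunk shapes, each holding 262,144 values.
layouts = {
    "cube": (64, 64, 64),
    "spatially broad": (1, 512, 512),
    "temporally deep": (1024, 16, 16),
}

# Hypothetical queries against a 1024 x 1024 x 1024 array.
queries = {
    "one full map": ((0, 0, 0), (1, 1024, 1024)),
    "one full time series": ((0, 500, 500), (1024, 501, 501)),
}

results = {
    (q_name, l_name): chunks_touched(start, stop, shape)
    for q_name, (start, stop) in queries.items()
    for l_name, shape in layouts.items()
}

for (q_name, l_name), n in results.items():
    print(f"{q_name:>20} | {l_name:>15} | {n} chunks")
```

With these assumed numbers, the full map touches 256 cube chunks, 4 spatially broad chunks, or 4096 temporally deep chunks, while the full time series touches 16, 1024, or 1 respectively. The cube is never the best choice for either query, but it is also never the worst, which is one way to read "safest bet".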

In addition, chunks should almost certainly be compressed with a
suitable compression algorithm. Compression incurs some additional
compute time for decompression, but with a net benefit because it
