gadget-course/mfdb.Rmd at master · gadget-framework/gadget-course · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
---
title: "MFDB"
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Processing input data with MFDB
## Working with data

Fisheries stock assessment aims to provide a satisfactory level of catch based the estimated stock status. Typically this means the abundance or biomass of the resource in question. In all but rare cases it is next to impossible to observe these numbers directly as the fish species in question live in the deep oceanc. Therefore the most of the information on the population is derived from assumptions on the population dynamics combined with observations on various aspects of the population. This includes relative abundance/biomass from scientific surveys and catch size and/or age composition.

Assessing stock or ecosystem status is therefore often a exhaustive exercise of data gathering and aggregation. The assessment method often require that the data are pre-processed in a certain manner in order for them to be used and it is in this processing of the data is where the stock assessor needs to draw on insights to piece them together. Examples of this expert judgment process is the borrowing of age length keys between years and/or gears or filling up length distribution for certain days. On of the key utilities of Gadget is its ability to digest this disparate array of observations with minimal aggregation, ensuring that when observations are truly available for a stock/gear/time combination it can be compared with model output and effectively moves the "guess-work" into the model where is belongs.

Regardless of the modeling approach employed, fisheries stock assessment requires robust data management especially for those assessment that are used annually to produce advice on catch advice as it is imperative for whole process to be reproducible between years. There has been increased interest in developing tools/procedures to ensure this reproducibility which can be seen from project such as the ICES TAF and stockassessment.org. One such project, [MFDB](https://github.com/mareframe/mfdb), was developed explicitly to deal with formatting data for ecosystem models and Gadget models. In this chapter an overview of the key components of the MFDB system is provided with simple illustration based on Icelandic data.


## Description of the MFDB system

MFDB is essentially a custom built database system written to smooth out kinks in the process of collating the various snippets of data. The user interface is written entirely in R allowing the user to leverage the tools available in the R environment thus enabling automated processing of fisheries data into suitable forms for running ecosystem models against it such as Gadget

MFDB contains several distinct sets of functions:

* Create and connect to a PostgreSQL database schema (mfdb function)
* Import data into a PostgreSQL database (mfdb_import_* functions)
* Sample / group data from database (other mfdb_* functions)

Using this, you can install PostgreSQL locally and have a script automating the process of:

* Importing data from your sources
* Uploading into your local MFDB database
* Sampling / grouping this data
* Producing set of Gadget data files ready to be run by Gadget

Also, MFDB can be used to connect to a remote database and generate model files from that data.

Before doing anything with mfdb, it is worth knowing a bit about how data is stored. Broadly, there are 2 basic types of table in mfdb, **taxonomy** and **measurement** tables.

The measurement tables store all forms of sample data supported, at the finest available detail. These are then aggregated when using any of the mfdb query functions. All measurement data is separated by case study, so multiple case studies can be loaded into a database without conflicts.

Taxonomy tables store all possible values for terms and their meaning, to ensure consistency in the data. For example, species stores short-names and full latin names of all known species to MFDB, to ensure consistency in naming.

Most Taxonomies have defaults which are populated when the database is created, and their definitions are stored as data attached to this package. See `mfdb-data` for more information on these. Others, such as areacell and sampling_type are case study specific, and you will need to define your terms before you can import data.

### Installing MFDB

MFDB is available from CRAN and is intalled in the typical fashion:

```{r eval=FALSE}
install.packages('mfdb')
```

However MFDB requires that the user has access to a running postgreSQL database. To install postgreSQL on your local machine you can follow these [instructions](https://github.com/mareframe/mfdb/#prerequisites) but if you expect more users working on the same DB it may be wise to consider installing it on a (widely) accessible server.

## Importing data

Unless you are working with a remote database, you will need to populate the database at least once before you are able to do any querying. The steps your script needs to do are:

### Connect to database

Use the `mfdb()` function. This will create tables / populate taxonomies if necessary. The user opens the link with the following command:
```{r eval=FALSE}
mdb <- mfdb('Iceland')
```

where the string the `mfdb` represents the "case study", or simply just the data set the user want to link to. Note that this string can be set as anything allowing the user to maintain different versions of the data depending on the time (e.g. between assessment periods). If it is the first time the `mfdb` function is called with a particular case study name it will spew out a lot of gunk similar to:

```
2020-09-27 14:25:41 INFO:mfdb:No schema, creating test
2020-09-27 14:25:41 INFO:mfdb:Creating schema from scratch
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "case_study_pkey" for table "case_study"
NOTICE:  CREATE TABLE / UNIQUE will create implicit index "case_study_name_key" for table "case_study"
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "institute_pkey" for table "institute"
NOTICE:  CREATE TABLE / UNIQUE will create implicit index "institute_name_key" for table "institute"
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "gear_pkey" for table "gear"
NOTICE:  CREATE TABLE / UNIQUE will create implicit index "gear_name_key" for table "gear"
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "vessel_type_pkey" for table "vessel_type"
NOTICE:  CREATE TABLE / UNIQUE will create implicit index "vessel_type_name_key" for table "vessel_type"
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "market_category_pkey" for table "market_category"
NOTICE:  CREATE TABLE / UNIQUE will create implicit index "market_category_name_key" for table "market_category"
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "sex_pkey" for table "sex"
...

```
where the database is simply notifying the user that it is populating the case study schema.

If the user want to connect to a remote server you can specify the connection using the `db_params` argument:

```{r, eval=FALSE}
mdb <- mfdb('Iceland', db_params = list(host='mfdb.example.com',
                                        user = 'johndoe',
                                        pass = '1234'))
```
where the connection parameters are specified appropriately. In addition, the schema can be destroyed (e.g. when you want to start again) using `destroy_schema = TRUE` when connecting to the database.

### Defining taxonomies

The MFDB system has a number of taxonomies, some of which are populated by default while other are defined by the user when importing and will be linked to the data when imported and used to filter it when exported. These taxonomies can be are listed by calling:

```{r}
mfdb:::mfdb_taxonomy_tables
```

The following tables are populated by default but can be extended based on the different needs that may arise:

* **gear**
```{r gear, echo=FALSE}
DT::datatable(mfdb::gear)
```

* **maturity_stage**
```{r matruritystage, echo=FALSE}
DT::datatable(mfdb::maturity_stage)
```

* **sex**
```{r sex, echo=FALSE}
DT::datatable(mfdb::sex)
```

* **species**
```{r species, echo=FALSE}
DT::datatable(mfdb::species)
```

* **vessel_type**
```{r vesseltype, echo=FALSE}
DT::datatable(mfdb::vessel_type)
```

The other taxonomies such as area divisions, tow locations and vessels will need to be defined by the user.

### Define areas & divisions

MFDB models space in the following way:

* `areacell`
  + The finest level of detail stored in the database. Every measurement (e.g. temperature, length sample) is assigned to an areacell. This will generally correspond to ICES gridcells, however there is no requirement to do so. You might augment gridcell information with depth, or include divisions when the measurement doesn't correlate to a specific areacell.

* `division`
  + Collections of areacells, e.g. ICES subdivisions, or whatever is appropriate.

Finally, when querying, divisions are grouped together into named collections, for instance `mfdb_group(north = 1:3, south = 4:6)` will put anything in divisions 1–3 under an area named "north", 4–5 under an area named "south".

Before you can upload any measurements, you have to define the areacells that they will use. You do this using the `mfdb_import_area()` function. This allows you to import tables of area/division information, such as:

```{R, eval=FALSE}
mfdb_import_area(mdb,
                 data.frame(name = c('101', '102', '103', '401','402', '403'),
                            size = c(1,2,3,4,5,6))
```

In the case of the area around Iceland the grid cells are defined from coordinates $(x,y)$ using the following formula:

$$\Gamma(x,y) = 10*\left(100*(\lfloor{2*y}\rfloor/2 - 60) + \lfloor{x}\rfloor \right) + \begin{cases} 0 &  s(-(x - \lfloor{2*x}\rfloor/2)) \in \{2,4,6\} \\
1 & s(-(x - \lfloor{2*x}\rfloor/2)) = 4\\
2 & s(-(x - \lfloor{2*x}\rfloor/2)) = 1\\
3 & s(-(x - \lfloor{2*x}\rfloor/2)) = 7\\
4 & s(-(x - \lfloor{2*x}\rfloor/2)) = 3\\
\end{cases}  $$

and this is implemented in the `d2sr` function from the `geo`-package:
```{r eval=FALSE}
# install from github
#devtools::install_github('hafro/geo')
## note lon is assume from west to east not vice-versa
geo::d2sr(lat = 66, lon = 20)
```


The grid cells can then be split up into divisions based on a collection of grid cells as illustrated in the figure below:

```{r echo=FALSE,message=FALSE,warning=FALSE}
library(tidyverse)

#bcareas <- c(1015, 1014, 1018, 1011, 1012, 1023, 1023, 1021, 1032, 1031,
#             1042, 1041, 1052, 1053, 1051, 1054, 1061, 1071, 1081, 1082)


sr2d <-
  function (data, cell.col = "sr", col.names = c("lat", "lon"))
  {
    if (!("tbl_sql" %in% class(data))) {
        mod <- function(x, y) x%%y
    }
    sr <- data %>% dplyr::select_(.dots = stats::setNames(cell.col,
        "sr")) %>% dplyr::distinct() %>% dplyr::mutate(r = floor(sr/10)) %>%
        dplyr::mutate(s = round(sr - r * 10, 0) + 1, lat = floor(r/100)) %>%
        dplyr::mutate(lon = mod(r - lat * 100, 50)) %>% dplyr::mutate(halfb = (r -
        100 * lat - lon)/100) %>% dplyr::mutate(lon = -(lon +
        0.5), lat = lat + 60 + halfb + 0.25) %>% dplyr::mutate(lat = lat +
        ifelse(s == 1, 0, ifelse(s %in% 2:3, 0.125, -0.125)),
        lon = lon + ifelse(s == 1, 0, ifelse(s %in% c(2, 4),
            -0.25, 0.25))) %>% dplyr::select_(.dots = stats::setNames(c("sr",
        "lat", "lon"), c(cell.col, col.names)))
    data %>% dplyr::left_join(sr, by = "sr")
}

reitmapping <-
  read.table(
        system.file("demo-data", "reitmapping.tsv", package="mfdb"),
        header=TRUE,
        as.is=TRUE) %>%
  mutate(sr=as.numeric(GRIDCELL)) %>%
  sr2d()


scale_fill_crayola <- function(n = 100, ...) {

  # taken from RColorBrewer::brewer.pal(12, "Paired")
  pal <- c("#A6CEE3", "#1F78B4", "#B2DF8A", "#33A02C",
           "#FB9A99", "#E31A1C", "#FDBF6F", "#FF7F00",
           "#CAB2D6", "#6A3D9A", "#FFFF99", "#B15928")
  pal <- rep(pal, n)
  ggplot2::scale_fill_manual(values = pal, ...)

}


test <-
  plyr::ddply(reitmapping, 'SUBDIVISION',
        function(x){
          ##   2
          ## 1 x 3
          ##   4
          sides <- rep(10*x$GRIDCELL,each=4) + 1:4
          up <- geo::d2sr(x$lat + 0.125, x$lon)
          down <- geo::d2sr(x$lat - 0.126, x$lon)
          left <- geo::d2sr(x$lat, x$lon - 0.25)
          right <- geo::d2sr(x$lat, x$lon + 0.26)

          for(i in 1:nrow(x)){
            ## up
            if(up[i] %in% x$GRIDCELL){
              sides[4*(i-1)+2] <- NA
              sides[sides==(10*up[i]+4)] <- NA
            }
            ## down
            if(down[i] %in% x$GRIDCELL){
              sides[4*(i-1)+4] <- NA
              sides[sides==(10*down[i]+2)] <- NA
            }
            ## left
            if(left[i] %in% x$GRIDCELL){
              sides[4*(i-1)+1] <- NA
              sides[sides==(10*left[i]+3)] <- NA
            }
            ## right
            if(right[i] %in% x$GRIDCELL){
              sides[4*(i-1)+3] <- NA
              sides[sides==(10*right[i]+1)] <- NA
            }
          }
          sides <- sides[!is.na(sides)]
          corners <- plyr::ddply(data.frame(side=rep(sides,each=2)),
                       'side',
                       function(x){
                         loc <- geo::sr2d(floor(x$side[1]/10))
                         loc <- c(loc$lat,loc$lon)
                         if((x$side[1] %% 10) == 1){
                           up.loc <- loc+c(0.125,-0.25)
                           down.loc <- loc+c(-0.125,-0.25)
                         }
                         if((x$side[1] %% 10) == 3){
                           up.loc <- loc+c(0.125,0.25)
                           down.loc <- loc+c(-0.125,0.25)
                         }
                         if((x$side[1] %% 10) == 2){
                           up.loc <- loc+c(0.125,-0.25)
                           down.loc <- loc+c(0.125,0.25)
                         }
                         if((x$side[1] %% 10) == 4){
                           up.loc <- loc+c(-0.125,-0.25)
                           down.loc <- loc+c(-0.125,0.25)
                         }
                         tmp <- as.data.frame(rbind(up.loc,down.loc))
                         names(tmp) <- c('lat','lon')
                         return(tmp)
                       })

          corners$order <- NA
          corners$order[1:2] <- 1:2
          counter <- 1:nrow(corners)
          i <- 2
          for(order in 3:(length(sides))){
            tmp <- counter[is.na(corners$order) &
                           corners$lat == corners$lat[i] &
                           corners$lon == corners$lon[i]][1]
            corners$order[corners$side == corners$side[tmp]] <- order
            corners <- corners[-tmp,]
            i <- which(corners$order == order)

          }
          corners$side <- NULL
          return(arrange(corners,order))

        })

  ggplot(data.frame(lat=0,lon=0), aes(lon,lat)) +
#  ggplot(subset(dat,ar==2011),aes(lon,lat)) +
#  stat_density2d(aes(fill=..level..),geom='polygon') +
#  scale_fill_gradient(limits = c(0,0.2), low='yellow',high='red') +

  geom_polygon(aes(lon,lat,group=SUBDIVISION, fill = as.factor(SUBDIVISION)),
            data=test,
            #subset(test,SUBDIVISION %in% bcareas),
            size = 0.3) +
  geom_point(col='yellow') +
  geom_path(data=geo::gbdypi.100,lty=2,size = 0.3) +
  geom_path(data=geo::gbdypi.800,lty=2,size = 0.3) +
  #  geom_polygon(alpha=0.5,data=gbdypi.100,fill='gray90') +
  geom_path(data=geo::gbdypi.200,lty=2,size = 0.3) + #alpha=0.5,fill='gray90',
  geom_path(data=geo::gbdypi.500,lty=3,size = 0.3) +
  geom_path(data=geo::gbdypi.1000,lty=4,size = 0.3) +
  geom_path(aes(lon,lat,group=SUBDIVISION),
            data=test,
            #subset(test,SUBDIVISION %in% bcareas),
            size = 0.3) +
#  geom_text(aes(lon,lat,label=SUBDIVISION),
#            data=ddply(test,~SUBDIVISION, summarise, lat=mean(lat),
#              lon=mean(lon))) +
  geom_polygon(data = map_data("world",'Iceland'),
               aes(long,lat,group = group),
               col = 'black' ,fill = 'gray70',size = 0.3) +
  geom_polygon(data=map_data("world",'Greenland'),
               aes(long,lat,group = group),
               col = 'black', fill = 'gray70') +
  geom_polygon(data=geo::eyjar, col = 'black', fill = 'gray70',size = 0.3) +
  geom_polygon(data=geo::faeroes, col = 'black', fill = 'gray70') +
  coord_map('mercator', xlim=c(-40,0),ylim=c(60,70)) +
#  coord_map('mercator', xlim=c(-28,-9.5),ylim=c(62.5,68)) +
#  xlab('LONGITUDE') +
#  ylab('LATITUDE') +
  theme_bw() +
  theme(panel.border = element_rect(colour='black',size = 0.5),
        #legend.position=c(0.9, 0.85),
        panel.background = element_rect(fill='white'),#element_rect(fill='deepskyblue'),
        axis.ticks.length = unit(-5 , "points"),
        axis.ticks.margin = unit(10 * 1.25, "points"),
        axis.text.x = element_text(family='Arial',size=10),
        panel.grid=element_blank(),
        axis.title = element_blank(),
        legend.position = 'none') +
    scale_fill_crayola()
    #+
#  scale_x_continuous(breaks=seq(-26,-10,4))  +
#  scale_y_continuous(breaks=seq(63, 67, 2))# +
#    scale_x_continuous(breaks=seq(-35,0,4))  +
#    scale_y_continuous(breaks=seq(60, 70, 2)) +

#    geom_path(data = geo::twohmiles, lty = 2) #+
#  geom_path(data=icelandrivers,size=0.1,col='blue') +
#  geom_path(data=glaciers,size=0.5)

```

where the subdivision were selected based on a cluster analysis of survey data and depth contours (see the DST$^2$ final report). The mapping of area cells to division can be read from the `reitmapping.tsv` file which is installed as a part of MFDB. It can be loaded it into memory with the following command:

```{r eval=FALSE}
reitmapping <-
  read.table(
        system.file("demo-data", "reitmapping.tsv", package="mfdb"),
        header=TRUE,
        as.is=TRUE)
```


The areas and divisions of the Icelandic continental shelf area are defined by the following commands:

```{r eval=FALSE}
## import the area definitions into mfdb
mfdb_import_area(mdb,
                 reitmapping %>%
                   select(id,
                          name=GRIDCELL,
                          size) %>%
                   as.data.frame())

## create area division (used for e.g. bootstrapping)
mfdb_import_division(mdb,
                     reitmapping %>%
                       split(.$SUBDIVISION) %>%
                       map('GRIDCELL'))

```


### Define sampling types

Any survey data can have a sampling type defined, which then can be used when querying data. If you want to use a sampling type, then define it using `mfdb_import_sampling_type()`. It is useful to give the different survey and commercial samples from various sampling projects separate to be able handle them differently when defining the likelihood function for the model. The function requires three columns to be defined:

* `id`
* `name` key to be used for the samples
* `description` of the sampling type

An example of the definition of sampling types for Icelandic waters:

```{r eval=FALSE}
mfdb_import_sampling_type(mdb, data.frame(
  id = 1:18,
  name = c('SEA', 'IGFS','AUT','SMN','LND','LOG','INS','ACU','FLND','OLND','CAA',
           'CAP','GRE','FAER','RED','RAC','LOBS','ADH'),
  description = c('Sea sampling', 'Icelandic ground fish survey',
                  'Icelandic autumn survey','Icelandic gillnet survey',
                  'Landings','Logbooks','Icelandic nephrop survey',
                  'Acoustic capelin survey','Foreign vessel landings','Old landings (pre 1981)',
                  'Old catch at age','Capelin data','Eastern Greenland autumn survey',
                  'Faeroese summer survey','Redfish survey','Redfish accoustic survey',
                  'Icelandic nephrops survey','Ad-hoc surveys')))
```

### Define vessel taxonomy

Data that is imported can be linked to vessel where basic information on the vessel can attached to the vessel name/ID. The following columns will need to be present in the imported vessel list:

* `name` ID of the vessel, used in the tow taxonomy
* `length` of the vessel
* `tonnage` the capacity of the vessel
* `power` the engine power
* `full_name` gives the full name of the vessel
* `vessel_type` linked to the vessel type taxonomy

To import the vessel list the following command is used:

```{r eval=FALSE}
mfdb_import_vessel_taxonomy(mdb, vessel_registry)
```


### Define tow taxonomy

The tow taxonomy registers exact location and depth of the sample, in order for multiple samples can be linked to the same tow. To define a tow the input data frame can contain the following columns:

* `name` the ID of the tow
* `latitude` initial position of the tow
* `longitude`
* `end_latitude` end position of the tow
* `end_longitude`
* `depth` the depth measured in the tow
* `length` length of the tow
* `duration` time duration

Tows are imported in a similar fashion as vessel taxonomies:

```{r, eval=FALSE}
mfdb_import_tow_taxonomy(mdb, tows)
```


## Importing data

### Import temperature data

At this point, you can start uploading actual measurements. The easiest of which is temperature. Upload a table of areacell/month/temperature data using `mfdb_import_temperature()`:

```{r eval = FALSE}
## here some work on temperature would be nice to have instead of fixing the temperature
mfdb_import_temperature(mdb,
                        expand.grid(year=1900:2100,
                                    month=1:12,
                                    areacell = reitmapping$GRIDCELL,
                                    temperature = 3))

```

Obviously it it would be more useful to have actual temperature data stored in these area cells but since this is rarely used in Gadget models it will not be considered further here.

### Importing survey/commercial samples

The function `mfdb_import_survey()` allows the user to import biological samples from various sources. Ideally data should be uploaded in separate chunks based on the data source. For example, if you have length and age-length data, don't combine them in R, instead upload them separately and both will be used when querying for length data. This keeps the process simple, and allows you to swap out only the data that needs to be update as necessary. The function accepts data frames of the following format:

* `institute` *Optional*. An institute name, see `mfdb::institute` for possible values
* `tow` *Optional*. Tow defined previously with `mfdb_import_tow_taxonomy(...)`
* `gear` *Optional*. Gear name, see `mfdb::gear` for possible values
* `vessel` *Optional*. Vessel defined previously with `mfdb_import_vessel_taxonomy(...)`
* `sampling_type` *Optional*. A sampling_type, see `mfdb::sampling_type` for possible values
* `year` **Required** Year each sample was taken, e.g. `c(2000,2001)`
* `month` **Required**. Month (1–12) each sample was taken, e.g. `c(1,12)`
* `areacell` **Required**. Areacell sample was taken within
* `species` *Optional*, default `c(NA)`. Species of sample, see `mfdb::species` for possible values
* `age` *Optional*, default `c(NA)`. Age of sample, or mean age
* `sex` *Optional*, default `c(NA)`. Sex of sample, see `mfdb::sex` for possible values
* `maturity_stage`
* `length` *Optional*, default `c(NA)`. Length of sample / mean length of all samples
* `length_var` *Optional*, default `c(NA)`. Sample variance, if data is already aggregated
* `length_min` *Optional*, default `c(NA)`. Minimum theoretical length, if data is already aggregated
* `weight` *Optional*, default `c(NA)`. Weight of sample / mean weight of all samples
* `weight_var` *Optional*, default `c(NA)`. Sample variance, if data is already aggregated
* `weight_total` *Optional*, default `c(NA)`. Total weight of all samples, can be used with `count = NA` to represent an unknown number of samples
* `count` *Optional*, default `c(1)`. Number of samples this row represents (i.e. if the data is aggregated)

```{r eval = FALSE}
mfdb_import_survey(mdb,samples,data_source = 'the_source')
```

Note that the `data_source` argument gives a name for the uploaded data set and it allows you repeat the re-upload only this chunk of data if and when corrected data become available.

### Import stomach survey

Stomach surveys are imported in much the same way, however there are 2 data.frames, one representing predators, one preys. The column stomach_name links the two, which can contain any numeric / character value, as long as it is unique for predators and prey measurements are assigned to the correct stomach.

See `mfdb_import_survey` for more information or the demo directory for concrete examples.


## Querying data

There are a selection of querying functions available, all of which work same way. You give a set of parameters, each of which can be a vector of data you wish returned, for instance `year = 1998:2000` or `species = c('COD')`.

If also grouping by this column (i.e. 'year', 'timestep', 'area' and any other columns given, e.g. 'age'), then the parameter will control how this grouping works, e.g. maturity_stage = mfdb_group(imm = 1, mat = 2:5) will result in the maturity_stage column having either 'imm' or 'mat'. These will also be used to generate Gadget aggregation files later.

For example, the following queries the temperature table:

```
defaults <- list(
    area = mfdb_group("101" = ),
    timestep = mfdb_timestep_quarterly, # Group months to create 2 timesteps for each year
    year = 1996:2005)
agg_data <- mfdb_temperature(mdb, defaults)

```
All functions will result in a list of data.frame result tables (generally only one, unless you requested bootstrapping). Each are suitable for feeding into a gadget function to output into model files.

See `mfdb_sample_count` for more information or the demo directory for concrete examples.

## Uploading the DATRAS database to MFDB

The following gives an illustrative example of how the DATRAS survey data are uploaded to MFDB.

```{r, eval=FALSE}
# devtools::install_github('ices-tools-prod/icesDatras')
# devtools::install_github('ices-tools-prod/icesFO',dependencies = FALSE)
# devtools::install_github('fishvice/tidyices')
# devtools::install_github('hafro/geo')
library(icesDatras)
library(mfdb)
library(tidyices)
library(tidyverse)


mdb <- mfdb('datras')


yrs <- 2017  # years
qs <- c(1)   # quarters

hh <-
  icesDatras::getDATRAS(record = "HH",
                        survey = "NS-IBTS",
                        years = yrs,
                        quarters = qs) %>%
  dtrs_tidy_hh()


hl <-
  icesDatras::getDATRAS(record = "HL",
                        survey = "NS-IBTS",
                        years = yrs,
                        quarters = qs) %>%
  dtrs_tidy_hl(hh)


ca <-
  icesDatras::getDATRAS(record = "CA",
                        survey = "NS-IBTS",
                        years = yrs,
                        quarters = qs) %>%
  dtrs_tidy_ca()


## Setup the spatial taxonomy
hh <- hh %>%
  mutate(areacell = geo::d2sr(shootlat,shootlong))

hh %>%
  select(areacell) %>%
  distinct() %>%
  mutate(id = 1:n(),
         lat = geo::sr2d(areacell)$lat,
         lon = geo::sr2d(areacell)$lon,
         area = geo::srA(areacell)) %>%
  select(id, name=areacell,size = area) %>%
  mfdb_import_area(mdb,.)

## Setup sampling types

mfdb_import_sampling_type(mdb, data.frame(
  id = 1:3,
  name = c('NS-IBTS','SEA','LND'),
  description = c('North Sea IBTS','Catch samples','Landings')))

## upload tow taxonomy (optional)

# select(synis_id,ar,man,lat=kastad_n_breidd,lon=kastad_v_lengd,lat1=hift_n_breidd,lon1=hift_v_lengd,
# gear,sampling_type,depth=dypi_kastad,vessel,reitur,smareitur)

hh %>%
  select(sampling_type = survey,
         name = id,
         year,
         month = quarter,
         latitude = shootlat,
         longitude = shootlong,
         towlength = hauldur,
         depth,
         vessel = ship) %>%
  mfdb_import_tow_taxonomy(mdb,.)

## setup vessel taxonomy

hh %>%
  select(name = ship) %>%
  distinct() %>%
  mutate(length=NA,tonnage=NA,power=NA, full_name = name) %>%
mfdb_import_vessel_taxonomy(mdb,.)


## import length data

hh %>%
  left_join(hl) %>%
  select(sampling_type = survey,
         areacell,
         tow = id,
         year,
         month = quarter,
         latitude = shootlat,
         longitude = shootlong,
         towlength = hauldur,
         depth,
         vessel = ship,
         latin,
         length,
         sex,
         count = n)%>%
  mutate(latin = tolower(latin),
         gear = 'BMT') %>%
  left_join(mfdb::species %>%
              mutate(latin = gsub('.+\\((.+)\\)','\\1',description),
                     latin = tolower(latin)) %>%
              rename(species = name)) %>%
  mutate(species = as.character(species)) %>%
  select(-c(id,latin,description)) %>%
  as.data.frame() %>%
  mfdb_import_survey(mdb,.,data_source = 'DATRAS-lengths')

## import biological information

hh %>%
  left_join(ca) %>%
  select(sampling_type = survey,
         areacell,
         tow = id,
         year,
         month = quarter,
         latitude = shootlat,
         longitude = shootlong,
         towlength = hauldur,
         depth,
         maturity,
         weight = wgt,
         vessel = ship,
         latin,
         length,
         sex,
         count = n) %>%
  mutate(maturity_stage = pmin(as.numeric(maturity) - 60,5)) %>%
  mutate(latin = tolower(latin),
         gear = 'BMT') %>%
  left_join(mfdb::species %>%
              mutate(latin = gsub('.+\\((.+)\\)','\\1',description),
                     latin = tolower(latin)))%>%
  select(-c(id,latin,description)) %>%
  rename(species = name) %>%
  mutate(species = as.character(species)) %>%
  as.data.frame() %>%
  mfdb_import_survey(mdb,.,data_source = 'DATRAS-ages')


## upload landings

tibble(year = 2017,month = 1, areacell = -3548,
       species = 'WHG', weight_total = 1000,
       gear = 'BMT',sampling_type = 'LND') %>%
  as.data.frame() %>%
  mfdb_import_survey(mdb,.,data_source = 'test-landings')


```

## Dumping / Restoring a DB

You can also dump/import a dump from another host using the postgres `pg_dump` and `pg_restore` commands. You can dump/restore indvidual schemas (i.e. the case study you give to the `mfdb()` command), to list all the schemas installed run `SELECT DISTINCT(table_schema) FROM information_schema.tables` from psql. Note that if you use `mfdb('Baltic')`, the Postgres schema name will be lower-cased.

Create a dump of your chosen schema with the following command:

```
pg_dump --schema=baltic -F tar mf > baltic.tar
```

This will make a dump of the "baltic" case study into "baltic.tar". It can then be restored onto another computer with the following:

```
pg_restore --clean -d mf baltic.tar
```

If you already have a baltic schema you wish to preserve, you can rename it first by issuing ALTER SCHEMA baltic RENAME TO baltic_o in psql. Once the restore is done you can rename the new schema and put the name of the old schema back.