---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
# `rquery`
[`rquery`](https://winvector.github.io/rquery/) is a piped query generator based on [Codd's relational algebra](https://en.wikipedia.org/wiki/Relational_algebra) (updated to reflect lessons learned from working with [`R`](https://www.r-project.org), [`SQL`](https://en.wikipedia.org/wiki/SQL), and [`dplyr`]( https://CRAN.R-project.org/package=dplyr) at big data scale in production).
`rquery` is currently recommended for use with
[`data.table`](http://r-datatable.com) (via [`rqdatatable`](https://github.com/WinVector/rqdatatable/)),
[`PostgreSQL`](https://github.com/WinVector/rquery/blob/master/db_examples/RPostgreSQL.md), [`sparklyr`](https://github.com/WinVector/rquery/blob/master/db_examples/sparklyr.md), [`SparkR`](https://github.com/WinVector/rquery/blob/master/db_examples/SparkR.md), [`MonetDBLite`](https://github.com/WinVector/rquery/blob/master/db_examples/MonetDBLite.md), and (with non-window functionality) [`RSQLite`](https://CRAN.R-project.org/package=RSQLite). It can target various databases through its adapter layer.
To install: `devtools::install_github("WinVector/rquery")` or `install.packages("rquery")`.
Note: `rquery` is a "database first" design. This means choices are made that favor database implementation. These include: capturing the entire calculation prior to doing any work (and using recursive methods to inspect this object, which can limit the calculation depth to under 1000 steps at a time), preferring "tame column names" (which isn't a bad idea in `R` anyway as columns and variables are often seen as cousins), and not preserving row or column order (or supporting numeric column indexing). Also, `rquery` does have a fast in-memory implementation: [`rqdatatable`](https://CRAN.R-project.org/package=rqdatatable) (thanks to the [`data.table` package](https://CRAN.R-project.org/package=data.table)), so one can in fact use `rquery` without a database.

# Discussion
[`rquery`](https://github.com/WinVector/rquery) can be an excellent advanced `SQL`
training tool (it shows how to build some very deep `SQL` by composing `rquery` operators). Currently `rquery` is biased towards the `Spark` and `PostgreSQL` `SQL` dialects.
There are many prior relational algebra inspired specialized query languages. Just a few include:
* [`Alpha`](https://en.wikipedia.org/wiki/Alpha_(programming_language)) ~1971.
* `ISBL` (Information Systems Base Language) ~1973.
* [`QUEL`](https://en.wikipedia.org/wiki/QUEL_query_languages) ~1974.
* [`IBM System R`](https://en.wikipedia.org/wiki/IBM_System_R) ~1974.
* [`SQL`](https://en.wikipedia.org/wiki/SQL) ~1974.
* [`Tutorial D`](https://en.wikipedia.org/wiki/D_(data_language_specification)#Tutorial_D) ~1994.
* [`data.table`](http://r-datatable.com/) ~2006.
* [`LINQ`](https://msdn.microsoft.com/en-us/library/bb308959.aspx) ~2007.
* [`pandas`](http://pandas.pydata.org) ~2008.
* [`dplyr`](http://dplyr.tidyverse.org) ~2014.
`rquery` is realized as a thin translation to an underlying `SQL` provider. We are trying to put the Codd relational operators front and center (using the original naming, and back-porting `SQL` progress such as window functions to the appropriate relational operator).
The primary relational operators include:
* [`extend()`](https://winvector.github.io/rquery/reference/extend.html). Extend adds derived columns to a relation table. With a sufficiently powerful `SQL` provider this includes ordered and partitioned window functions. This operator also includes built-in [`seplyr`](https://winvector.github.io/seplyr/)-style [assignment partitioning](https://winvector.github.io/seplyr/articles/MutatePartitioner.html). `extend()` can also alter existing columns, though we note this is not always a relational operation (it can lose row uniqueness).
* [`project()`](https://winvector.github.io/rquery/reference/project.html). Project is usually *portrayed* as the equivalent of column selection, though the original definition includes aggregation. In our opinion the original relational nature of the operator is best captured by having it carry `SQL`'s "`GROUP BY`" aggregation functionality.
* [`natural_join()`](https://winvector.github.io/rquery/reference/natural_join.html). This is a specialized relational join operator, using all common columns as an equi-join condition.
* [`theta_join()`](https://winvector.github.io/rquery/reference/theta_join.html). This is the relational join operator allowing an arbitrary matching predicate.
* [`select_rows()`](https://winvector.github.io/rquery/reference/select_rows.html). This is Codd's relational row selection. Obviously `select` alone is an over-used and now ambiguous term (for example: it is already used as the "do it" verb in `SQL` and as the *column* selector in `dplyr`).
* [`rename_columns()`](https://winvector.github.io/rquery/reference/rename_columns.html). This operator renames sets of columns.
* [`set_indicator()`](https://winvector.github.io/rquery/reference/set_indicator.html). This operator produces a new column indicating set membership of a named column.
(Note `rquery` prior to version `1.2.1` used a `_nse()` suffix yielding commands such as `extend_nse()` instead of the newer `extend()` shown here).
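As a small sketch of the operators above, here is `project()` carrying `GROUP BY`-style aggregation, run locally through `rqdatatable` (the table and column names are made up for illustration):

```r
library("rquery")
library("rqdatatable")  # supplies a local data.table implementation

# a small made-up example table
d <- data.frame(group = c("a", "a", "b"),
                value = c(1, 2, 5))

# project() aggregates per group, in the spirit of SQL's GROUP BY
ops <- local_td(d) %.>%
  project(., groupby = "group",
          total := sum(value))

# apply the pipeline by piping the data into it
d %.>% ops
```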
The primary non-relational (traditional `SQL`) operators are:
* [`select_columns()`](https://winvector.github.io/rquery/reference/select_columns.html). This allows choice of columns (central to `SQL`), but is not a relational operator as it can damage row-uniqueness.
* [`orderby()`](https://winvector.github.io/rquery/reference/orderby.html). Row order is not a concept in the relational algebra (and also not maintained in most `SQL` implementations). This operator is only useful when used with its `limit=` option, or as the last step as data comes out of the relation store and is moved to `R` (where row-order is usually maintained).
* [`map_column_values()`](https://winvector.github.io/rquery/reference/map_column_values.html) re-maps values in columns (very useful for re-coding data; currently implemented as a [`sql_node()`](https://winvector.github.io/rquery/reference/sql_node.html)).
* [`unionall()`](https://winvector.github.io/rquery/reference/unionall.html) concatenates tables.
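For instance, `orderby()` with its `limit=` option can be sketched as follows (local execution via `rqdatatable`; the data is made up):

```r
library("rquery")
library("rqdatatable")

d <- data.frame(x = c(3, 1, 2))

# keep only the two smallest x values; the ordering is meaningful here
# because orderby() is the last step before the data returns to R
ops <- local_td(d) %.>%
  orderby(., cols = "x", limit = 2)

d %.>% ops
```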
And `rquery` supports higher-order operators (operators written in terms of other operators, both package-supplied and user-supplied):
* [`pick_top_k()`](https://winvector.github.io/rquery/reference/pick_top_k.html). Pick top `k` rows per group given a row ordering.
* [`assign_slice()`](https://winvector.github.io/rquery/reference/assign_slice.html). Conditionally assign sets of rows and columns a scalar value.
* [`if_else_op()`](https://winvector.github.io/rquery/reference/if_else_op.html). Simulate simultaneous if/else assignments.
`rquery` also has implementation helpers for building both `SQL`-nodes (nodes that are just `SQL` expressions) and non-`SQL`-nodes (nodes that are general functions of their input data values).
* [`sql_node()`](https://winvector.github.io/rquery/reference/sql_node.html)
* [`sql_expr_set()`](https://winvector.github.io/rquery/reference/sql_expr_set.html)
* [`non_sql_node()`](https://winvector.github.io/rquery/reference/non_sql_node.html)
* [`quantile_node()`](https://winvector.github.io/rquery/reference/quantile_node.html)
* [`rsummary_node()`](https://winvector.github.io/rquery/reference/rsummary_node.html)
The primary missing relational operators are:
* Union.
* Direct set difference, anti-join.
* Division.
One of the principles of `rquery` is to prefer expressive nodes, and not depend on complicated in-node expressions.
A great benefit of Codd's relational algebra is it gives one concepts to decompose complex data transformations into sequences of simpler transformations.
Some reasons `SQL` seems complicated include:
* `SQL`'s realization of sequencing as nested function composition.
* `SQL` uses some relational concepts as steps, others as modifiers and predicates.
A lot of the grace of the Codd theory can be recovered through the usual trick of changing function composition notation from `g(f(x))` to `x . f() . g()`. This experiment is asking (and not for the first time): "what if `SQL` were piped (composition expressed as a left-to-right flow, instead of a right-to-left nesting)?"
Let's work a non-trivial example: the `dplyr` pipeline
from [Let’s Have Some Sympathy For The Part-time R User](http://www.win-vector.com/blog/2017/08/lets-have-some-sympathy-for-the-part-time-r-user/).
```{r ex, warning=FALSE, message=FALSE}
library("rquery")
library("wrapr")
use_spark <- FALSE
if(use_spark) {
raw_connection <- sparklyr::spark_connect(version='2.2.0',
master = "local")
cname <- rq_connection_name(raw_connection)
rquery::setDBOption(raw_connection,
"create_options",
"USING PARQUET OPTIONS ('compression'='snappy')")
} else {
driver <- RPostgreSQL::PostgreSQL()
raw_connection <- DBI::dbConnect(driver,
host = 'localhost',
port = 5432,
user = 'johnmount',
password = '')
}
dbopts <- rq_connection_tests(raw_connection)
db <- rquery_db_info(connection = raw_connection,
is_dbi = TRUE,
connection_options = dbopts)
# copy data in so we have an example
d_local <- build_frame(
"subjectID", "surveyCategory" , "assessmentTotal", "irrelevantCol1", "irrelevantCol2" |
1L , "withdrawal behavior", 5 , "irrel1" , "irrel2" |
1L , "positive re-framing", 2 , "irrel1" , "irrel2" |
2L , "withdrawal behavior", 3 , "irrel1" , "irrel2" |
2L , "positive re-framing", 4 , "irrel1" , "irrel2" )
rq_copy_to(db, 'd',
d_local,
temporary = TRUE,
overwrite = TRUE)
# produce a handle to the existing table
d <- db_td(db, "d")
```
Note: in examples we use `rq_copy_to()` to create data. This is only for the purpose of having
easy portable examples. With big data the data is usually already in the remote database or
Spark system. The task is almost always to connect and work with this pre-existing remote data
and the method to do this is [`db_td()`](https://winvector.github.io/rquery/reference/db_td.html),
which builds a reference to a remote table given the table name. The suggested pattern for working with
remote tables is to get inputs via [`db_td()`](https://winvector.github.io/rquery/reference/db_td.html)
and land remote results with [`materialize()`](https://winvector.github.io/rquery/reference/materialize.html).
To work with local data one can copy data from memory to the database with [`rq_copy_to()`](https://winvector.github.io/rquery/reference/rq_copy_to.html)
and bring back results with [`execute()`](https://winvector.github.io/rquery/reference/execute.html) (though be aware
operation on remote non-memory data is `rquery`'s primary intent).
First we show the Spark/database version of the original example data:
```{r dbdat, warning=FALSE, message=FALSE}
class(db)
print(db)
class(d)
print(d)
# remote structure inspection
rstr(db, d$table_name)
# or execute the table representation to bring back data
d %.>%
execute(db, .) %.>%
knitr::kable(.)
```
Now we re-write the original calculation in terms of the `rquery` SQL generating operators.
```{r calc, warning=FALSE, message=FALSE}
scale <- 0.237
dq <- d %.>%
extend(.,
probability :=
exp(assessmentTotal * scale)) %.>%
normalize_cols(.,
"probability",
partitionby = 'subjectID') %.>%
pick_top_k(.,
partitionby = 'subjectID',
orderby = c('probability', 'surveyCategory'),
reverse = c('probability')) %.>%
rename_columns(., 'diagnosis' := 'surveyCategory') %.>%
select_columns(., c('subjectID',
'diagnosis',
'probability')) %.>%
orderby(., cols = 'subjectID')
```
(Note one can also use the named map builder alias `%:=%` if there is concern of aliasing with `data.table`'s definition of `:=`.)
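Both spellings build the same named mapping, as this small `wrapr` sketch shows:

```r
library("wrapr")

c('diagnosis' := 'surveyCategory')    # named character vector
c('diagnosis' %:=% 'surveyCategory')  # same result, alias spelling
```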
We then generate our result:
```{r res, warning=FALSE, message=FALSE}
result <- materialize(db, dq)
class(result)
result
DBI::dbReadTable(db$connection, result$table_name) %.>%
knitr::kable(.)
```
We see we have quickly reproduced the original result using the new database operators. This means such a calculation could easily be performed at a "big data" scale (using a database or `Spark`; in this case we would not take the results back, but instead use `CREATE TABLE tname AS` to build a remote materialized view of the results).
A bonus is, thanks to `data.table` and the `rqdatatable` packages we can run the exact same operator pipeline on local data.
```{r localrun, warning=FALSE, message=FALSE}
library("rqdatatable")
d_local %.>%
dq %.>%
knitr::kable(.)
```
Notice we applied the pipeline by piping data into it. This ability is a feature of the [dot arrow pipe](https://journal.r-project.org/archive/2018/RJ-2018-042/index.html) we are using here.
The actual `SQL` query that produces the database result is, in fact, quite involved:
```{r q, comment=''}
cat(to_sql(dq, db, source_limit = 1000))
```
The query is large, but due to its regular structure it should be very amenable to query optimization.
A feature to notice is: the query was automatically restricted to just columns actually needed from the source table to complete the calculation. This has the possibility of decreasing data volume and greatly speeding up query performance. Our [initial experiments](https://github.com/WinVector/rquery/blob/master/extras/PerfTest%2Emd) show `rquery` narrowed queries to be twice as fast as un-narrowed `dplyr` on a synthetic problem simulating large disk-based queries. We think if we connected directly to `Spark`'s relational operators (avoiding the `SQL` layer) we may be able to achieve even faster performance.
The above optimization is possible because the `rquery` representation is an intelligible tree of nodes, so we can interrogate the tree for facts about the query. For example:
```{r qfacts}
column_names(dq)
tables_used(dq)
columns_used(dq)
```
The additional record-keeping in the operator nodes allows checking and optimization (such as [query narrowing](http://www.win-vector.com/blog/2017/12/how-to-greatly-speed-up-your-spark-queries/)). The flow itself is represented as follows:
```{r pqp, comment=''}
cat(format(dq))
```
```{r diagram, fig.width=8, fig.height=8, eval=FALSE}
dq %.>%
op_diagram(.) %.>%
DiagrammeR::grViz(.)
```

`rquery` also includes a number of useful utilities (both as nodes and as functions).
```{r utils}
quantile_cols(db, "d")
rsummary(db, "d")
dq %.>%
quantile_node(.) %.>%
execute(db, .)
dq %.>%
rsummary_node(.) %.>%
execute(db, .)
```
We have found most big-data projects either require joining very many tables (something `rquery` join planners help with, please see [here](https://github.com/WinVector/rquery/blob/master/extras/JoinController%2Emd) and [here](https://github.com/WinVector/rquery/blob/master/extras/DependencySorting%2Emd)) or they require working with wide data-marts (where `rquery` query narrowing helps, please see [here](https://github.com/WinVector/rquery/blob/master/extras/PerfTest%2Emd)).
We can also stand `rquery` up on non-`DBI` sources such as [`SparkR`](https://github.com/WinVector/rquery/blob/master/extras/SparkR%2Emd)
and also [`data.table`](https://CRAN.R-project.org/package=data.table). The `data.table` adapter is being developed in the [`rqdatatable`](https://github.com/WinVector/rqdatatable) package, and can be [quite fast](http://www.win-vector.com/blog/2018/06/rqdatatable-rquery-powered-by-data-table/). Notice the examples in this mode all essentially use the same query pipeline; the user can choose where to apply it: in memory (`data.table`), in a `DBI` database (`PostgreSQL`, `sparklyr`), or even with non-`DBI` systems (`SparkR`).
# See also
For deeper dives into specific topics, please see also:
* <a href="https://github.com/WinVector/rquery/blob/master/extras/JoinController%2Emd">Join Controller</a>
* <a href="https://github.com/WinVector/rquery/blob/master/extras/DependencySorting%2Emd">Join Dependency Sorting</a>
* <a href="https://github.com/WinVector/rquery/blob/master/extras/AssigmentPartitioner%2Emd">Assignment Partitioner</a>
* <a href="https://github.com/WinVector/rquery/blob/master/extras/ExtraDBs%2Emd">Different DBs</a>
* <a href="https://github.com/WinVector/rqdatatable">rqdatatable</a>
```{r cleanup, include=FALSE}
if(use_spark) {
sparklyr::spark_disconnect(raw_connection)
} else {
DBI::dbDisconnect(raw_connection)
}
rm(list = c("raw_connection", "db"))
```
# Installing
To install `rquery` please try `install.packages("rquery")`.