---
title: "Recorded Session: working with JSON and jsonlite"
output:
  html_document:
    number_sections: true
    df_print: paged
    toc: true
    toc_float: true
    css: custom.css
    includes:
      after_body: footer.html
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
# xaringan::inf_mr()
```
<br/>
```{r, message=FALSE}
# load necessary packages
library(tidyverse)
library(jsonlite)
library(listviewer)
```
<br/>
# **What do JSON strings look like, anyway?**
Let's take a look at what a typical JSON file might look like.
```{r}
# print contents of the JSON file "students_data.json"
cat(paste0(readLines("data/students_data.json", warn=FALSE), collapse="\n"))
```
<br/>
# **Understanding how JSON files are read with `jsonlite`**
First, let's load the same JSON file with jsonlite's [`fromJSON()`](https://www.rdocumentation.org/packages/jsonlite/versions/1.7.2/topics/toJSON,%20fromJSON) function.
As you will see, the output features different types of data, such as data frames, vectors, matrices and lists.
```{r}
json_data <- fromJSON("data/students_data.json")
json_data
```
<br/>
## Why the different data types? The answer is *simple*
JSON data usually gets converted into **lists** when it is read by `fromJSON()`.
However, **this is not always the case**. Depending on how a JSON array is structured, `fromJSON()` automatically converts it to a more specific R class.
This process where JSON arrays automatically get converted from a list into a more specific R class is called **simplification**.
There are **3 types** of JSON arrays that get converted to different R classes by default:
- arrays of **primitives**
- arrays of **objects**
- arrays of **arrays**
*Hint: you can identify `arrays` by their square brackets `[]`.*
Let's take a closer look at these.
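As a quick check of the three cases listed above, the sketch below parses a small inline JSON string and reports the class of each element. The string is pieced together from the fragments quoted in this document, so the real `students_data.json` may well contain more fields; `mini_json`, `mini_data` and `first_class` are illustrative names only.

```{r}
library(jsonlite)

# inline JSON assembled from the fragments quoted in this document
# (assumption: the real file may hold additional fields)
mini_json <- '{
  "teachers": ["Simon Munzer", "Lisa Oswald"],
  "students": [
    {"id": "014789", "name": "Francesco", "lastname": "Danovi"},
    {"id": "023657", "name": "Gabriel", "lastname": "da Silva Zech"}
  ],
  "ages_students_teachers": [[23, 24], [27, 33]]
}'

mini_data <- fromJSON(mini_json)

# first class of each top-level element after simplification
first_class <- vapply(mini_data, function(x) class(x)[1], character(1))
first_class
```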
<br/>
### R `vectors` are created from JSON `arrays of primitives` (strings, numbers, booleans or null)
<br/>
The `teachers` element in the JSON file is an **array of primitives**. "Primitives" are either `strings`, `numbers`, `booleans` or `null` values (they are named so because they are the simplest elements available in a programming language).
> "teachers": ["Simon Munzer", "Lisa Oswald"]
Why? It has
- `Primitives`, i.e. strings separated by commas `,`
- all within square brackets `[]`
These get converted to an R `vector` by default. See again:
```{r}
json_data[["teachers"]] # [[ ]] extracts the element itself: an R vector
```
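Since `null` also counts as a primitive, it may help to see what happens to it during simplification. The following is a minimal sketch with a made-up array (the `null` entry is not part of the original data, and `teachers_with_gap` is an illustrative name):

```{r}
library(jsonlite)

# a JSON null inside an array of primitives becomes NA in the resulting vector
teachers_with_gap <- fromJSON('["Simon Munzer", null, "Lisa Oswald"]')
teachers_with_gap

# still a plain character vector, with the missing entry flagged as NA
is.character(teachers_with_gap)
is.na(teachers_with_gap)
```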
<br/>
### R `data frames` are created from JSON `arrays of objects` (key-value pairs)
<br/>
The `students` element in the JSON file is an **array of objects**.
> "students": [
{
"id":"014789",
"name": "Francesco",
"lastname": "Danovi"
},
{
"id":"023657",
"name": "Gabriel",
"lastname": "da Silva Zech"
}
]
Why? It has
- `Objects`, i.e. key-value pairs separated by commas `,`
- within curly brackets `{}` that are also separated by commas `,`
- all within square brackets `[]`.
These get converted to an R `data frame` by default. See again:
```{r}
json_data[["students"]] # [[ ]] extracts the element itself: an R data frame
```
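Real-world arrays of objects do not always share exactly the same keys. As a hedged illustration, the sketch below drops the `lastname` key from the second record (a made-up variation of the example above, stored under the illustrative name `students_gap`) to show that the result is still a data frame, with the gap filled by `NA`:

```{r}
library(jsonlite)

# the second record has no "lastname" key (made-up variation of the example above)
students_gap <- fromJSON('[
  {"id": "014789", "name": "Francesco", "lastname": "Danovi"},
  {"id": "023657", "name": "Gabriel"}
]')

students_gap
class(students_gap)

# the missing key shows up as NA in that row
is.na(students_gap$lastname)
```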
<br/>
### R `matrices` are created from JSON `arrays of arrays` (equal-length sub-arrays)
<br/>
The `ages_students_teachers` element in the JSON file is an **array of arrays**.
> "ages_students_teachers": [[23, 24],
[27, 33]]
Why? It has
- `Arrays`, i.e. numbers separated by commas `,` within square brackets `[]`
- next to other **equal-length** `arrays` separated by commas `,`
- all within square brackets `[]`.
These get converted to an R `matrix` by default. See again:
```{r}
json_data[["ages_students_teachers"]] # [[ ]] extracts the element itself: an R matrix
```
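The "equal length" condition matters. The sketch below contrasts the array from the example with a made-up ragged variant (`ages_equal` and `ages_ragged` are illustrative names) to show that sub-arrays of different lengths are not turned into a matrix:

```{r}
library(jsonlite)

# equal-length sub-arrays: simplified to a matrix
ages_equal <- fromJSON('[[23, 24], [27, 33]]')
is.matrix(ages_equal)
dim(ages_equal)

# unequal-length sub-arrays (made-up example): returned as a list of vectors instead
ages_ragged <- fromJSON('[[23, 24], [27, 33, 41]]')
is.matrix(ages_ragged)
is.list(ages_ragged)
```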
<br/>
## How and why to prevent automatic simplification
It is possible, however, to disable this automatic simplification. This can be done by passing `simplifyVector = FALSE` to the `fromJSON()` function.
This will make sure all JSON arrays and objects are returned as **lists**.
```{r}
# array of primitives
fromJSON('["Simon Munzer", "Lisa Oswald"]')
fromJSON('["Simon Munzer", "Lisa Oswald"]', simplifyVector = FALSE)
```
<br/>
```{r}
# array of objects
fromJSON('[{ "id":"023657", "name": "Gabriel", "lastname": "da Silva Zech" }]')
fromJSON('[{ "id":"023657", "name": "Gabriel", "lastname": "da Silva Zech" }]', simplifyVector = FALSE)
```
<br/>
```{r}
# array of arrays
fromJSON('[[23, 24], [27, 33]]')
fromJSON('[[23, 24], [27, 33]]', simplifyVector = FALSE)
```
<br/>
**Why is knowing this relevant?**
Having all your data organised in lists, instead of a mixture of lists, vectors, data frames and matrices, is beneficial because it allows you to **work more consistently** with a small number of helper functions.
Don't just take my word for it: Hadley Wickham, the developer behind the tidyverse, has been [reported](https://themockup.blog/posts/2020-05-22-parsing-json-in-r-with-jsonlite/) to prefer reading JSON files without simplification when he works with the jsonlite package.
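To make the "few helper functions" point a little more concrete, here is a minimal sketch that reads the students array without simplification and then queries it with two `purrr` helpers. The choice of `map_chr()` and `pluck()` is ours for illustration, not something prescribed by jsonlite, and `students_list` and `student_names` are illustrative names.

```{r}
library(jsonlite)
library(purrr)

# parse without simplification: every array and object becomes a list
students_list <- fromJSON(
  '[{"id": "014789", "name": "Francesco", "lastname": "Danovi"},
    {"id": "023657", "name": "Gabriel", "lastname": "da Silva Zech"}]',
  simplifyVector = FALSE
)

# extract the "name" component from every record
student_names <- map_chr(students_list, "name")
student_names

# pluck() reaches a single value by position and name
pluck(students_list, 2, "lastname")
```

Because everything is a list, the same two helpers keep working no matter how deeply a record is nested.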
<br/>
## Reading newline-delimited JSON files
A slight variation of the JSON format is the [newline-delimited JSON (aka ndjson)](http://ndjson.org/) format.
In this format, each line is a JSON element in itself. This allows one to read individual elements without having to parse the entire file, which is helpful in certain use cases such as reading log files.
![Comparison of the JSON and ndjson formats](presentation_files/images/comparison_json_ndjson.png)
<br/>
This is important to know because `fromJSON()` will throw the error `parse error: trailing garbage` when trying to read an ndjson file.
> fromJSON("data/yelp_academic_dataset_business.json")
>
> Error in parse_con(txt, bigint_as_char) : parse error: trailing garbage: true}, "type": "business"} {"business_id": "UsFtqoBl7naz8A (right here) ------^
This is where jsonlite's [`stream_in()`](https://www.rdocumentation.org/packages/jsonlite/versions/1.7.2/topics/stream_in,%20stream_out) function for reading ndjson comes in. It is especially useful because it allows us to read specific lines of an ndjson file without having to parse the entire document.
Let's use it to read the first 500 lines of a dataset containing information on businesses on the Yelp platform ([source](https://www.kaggle.com/yelp-dataset/yelp-dataset)).
```{r}
# read the first 500 lines of the ndjson file
yelp_list <- stream_in(textConnection(readLines("data/yelp_academic_dataset_business.json", n=500)), simplifyVector = FALSE)
# alternatively, the code below reads the entire file:
# yelp_list <- stream_in(file("data/yelp_academic_dataset_business.json"), simplifyVector = FALSE)
# print the information for the first business
str(yelp_list[[1]])
```
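If you do not have the Yelp file at hand, the difference between the two readers can be reproduced with a couple of in-memory lines. This is only a sketch: the two records are made up, and `ndjson_lines`, `from_json_result`, `con` and `ndjson_df` are illustrative names.

```{r}
library(jsonlite)

# two made-up records, one complete JSON object per line
ndjson_lines <- c(
  '{"business_id": "a1", "name": "Cafe One"}',
  '{"business_id": "b2", "name": "Cafe Two"}'
)

# fromJSON() expects a single JSON value, so the second line triggers an error
from_json_result <- tryCatch(
  fromJSON(paste(ndjson_lines, collapse = "\n")),
  error = function(e) "parse error"
)
from_json_result

# stream_in() parses each line as its own record
con <- textConnection(ndjson_lines)
ndjson_df <- stream_in(con, verbose = FALSE)
close(con)
ndjson_df
```

Note that `simplifyVector = FALSE` is left out here, so `stream_in()` returns a data frame rather than a list.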
<br/>
# **Transforming JSON data into tidy data sets**
<br/>
As explained in [this very helpful tidyr article](https://tidyr.tidyverse.org/articles/rectangle.html), "rectangling is the art and craft of taking a deeply nested list (often sourced from wild caught JSON or XML) and taming it into a tidy data set of rows and columns".
The most important functions for rectangling are:
- `unnest_longer()` takes each element of a list-column and makes a new **row**
- `unnest_wider()` takes each element of a list-column and makes a new **column**
- `hoist()` is similar to `unnest_wider()` but only plucks out selected components, and can reach down multiple levels
We will go over these in the upcoming live coding session. For now, let's just run these without further explanation:
```{r}
yelp_df <- tibble(business_info = yelp_list) %>%
  hoist(business_info,
        "name",
        "city",
        "neighborhood" = list("neighborhoods", 1),
        "hours") %>%
  unnest_longer(hours) %>%
  rename("day" = hours_id) %>%
  unnest_wider(hours) %>%
  select(-business_info)

yelp_df
```
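The three rectangling functions listed earlier can also be tried one at a time on a tiny hand-made list. The sketch below assumes nothing about the Yelp data: `toy_list` and all of its fields are invented for illustration.

```{r}
library(tibble)
library(tidyr)

# a made-up nested list, one record per element
toy_list <- list(
  list(name = "Cafe One", tags = list("coffee", "cake"),
       address = list(city = "Berlin", zip = "10117")),
  list(name = "Cafe Two", tags = list("tea"),
       address = list(city = "Hamburg", zip = "20095"))
)
toy_df <- tibble(info = toy_list)

# unnest_wider(): one new column per component of each list element
wide_df <- unnest_wider(toy_df, info)
wide_df

# unnest_longer(): one new row per element of the "tags" list-column
long_df <- unnest_longer(wide_df, tags)
long_df

# hoist(): pluck selected components only, reaching down two levels for "city"
hoisted_df <- hoist(toy_df, info, "name", city = list("address", "city"))
hoisted_df
```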
<br/>
# **Exporting data to a JSON file with `toJSON()`**
Exporting data to a JSON file is very simple. First we have to transform the data (e.g. the data frame) into a JSON string with [`toJSON()`](https://www.rdocumentation.org/packages/jsonlite/versions/1.7.2/topics/toJSON,%20fromJSON).
```{r}
# select first row to export
yelp_export <- head(yelp_df, 1)
# create a JSON string
toJSON(yelp_export)
```
<br/>
The `toJSON()` function has a bunch of arguments to define how the output should be encoded.
For instance, when the argument `pretty` is set to `TRUE`, it adds indentation whitespace to make JSON strings more readable by humans.
```{r}
toJSON(yelp_export, pretty = TRUE)
```
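Two other `toJSON()` arguments that tend to matter in practice are `dataframe` and `auto_unbox`. The sketch below uses a made-up one-row data frame (`export_demo`, standing in for `yelp_export`) so that it runs on its own; the variable names are illustrative.

```{r}
library(jsonlite)

# made-up one-row data frame standing in for yelp_export
export_demo <- data.frame(name = "Cafe One", city = "Berlin")

# default: data frames are written row by row, as an array of objects
rows_json <- toJSON(export_demo)
rows_json

# dataframe = "columns": one array per column instead
columns_json <- toJSON(export_demo, dataframe = "columns")
columns_json

# auto_unbox = TRUE: length-one vectors become scalars rather than one-element arrays
unboxed_json <- toJSON(list(name = "Cafe One"), auto_unbox = TRUE)
unboxed_json
```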
<br/>
Finally, to save the JSON data to a file, simply use `write()`.
```{r}
write(toJSON(yelp_export, pretty = TRUE), "data/yelp_export.json")
```
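A simple way to check an export is to read the file back and compare it with the original. The following sketch does this with a made-up data frame and a temporary file, so nothing in `data/` is touched; `roundtrip_demo`, `tmp_path` and `roundtrip_back` are illustrative names.

```{r}
library(jsonlite)

# made-up data frame and a temporary file
roundtrip_demo <- data.frame(name = c("Cafe One", "Cafe Two"),
                             stars = c(4.5, 3),
                             stringsAsFactors = FALSE)
tmp_path <- tempfile(fileext = ".json")

# same pattern as above: serialise with toJSON(), save with write()
write(toJSON(roundtrip_demo, pretty = TRUE), tmp_path)

# read the file back and compare with the original
roundtrip_back <- fromJSON(tmp_path)
all.equal(roundtrip_demo, roundtrip_back)
```

jsonlite also provides `write_json()` and `read_json()`, which combine the conversion and the file step in a single call.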
<br/>
That's it! More to follow in the live coding session :)