This repository has been archived by the owner on Feb 4, 2022. It is now read-only.

update README
jennybc committed Jun 1, 2015
1 parent 0f14528 commit 8795e78
Showing 2 changed files with 403 additions and 175 deletions.
125 changes: 86 additions & 39 deletions README.Rmd
@@ -86,9 +86,7 @@ suppressPackageStartupMessages(library("dplyr"))

### Function naming convention

*implementation not yet 100% complete ... but we'll get there soon*

All functions start with `gs_`, which plays nicely with tab completion in RStudio, for example. If the function has something to do with worksheets or tabs within a spreadsheet, it will start with `gs_ws_`.
All functions start with `gs_`, which plays nicely with tab completion in RStudio, for example. If the function has something to do with worksheets or tabs within a spreadsheet, then it will start with `gs_ws_`.

### See some spreadsheets you can access

@@ -139,82 +137,127 @@ third_party_gap <- GAP_URL %>%
# Worried that a spreadsheet's registration is out-of-date?
# Re-register it!
gap <- gap %>% gs_gs()
gap
```

The registration functions `gs_title()`, `gs_key()`, `gs_url()`, and `gs_gs()` return a registered sheet as a `googlesheet` object, which is the first argument to practically every function in this package. Likewise, almost every function returns a freshly registered `googlesheet` object, ready to be stored or piped into the next command.

*We export a utility function, `extract_key_from_url()`, to help you get and store the key from a browser URL. Registering via browser URL is fine, but registering by key is probably a better idea in the long-run.*
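
To give a feel for what a key is, here is a minimal base R sketch that pulls one out of a browser URL. This is not the package's `extract_key_from_url()`: `sketch_key_from_url()` is a hypothetical helper, the URL and key are made up, and we assume the "new" Sheets URL pattern `.../spreadsheets/d/<key>/...`.

```r
# Hypothetical helper for illustration only (NOT googlesheets' extract_key_from_url()).
# Assumes a "new" Sheets browser URL of the form .../spreadsheets/d/<key>/...
sketch_key_from_url <- function(url) {
  pattern <- "^.*/spreadsheets/d/([^/?#]+).*$"
  if (!grepl(pattern, url)) stop("URL does not look like a Sheets browser URL")
  # Keep only the captured path segment that follows /d/
  sub(pattern, "\\1", url)
}

# Made-up URL and key
fake_url <- "https://docs.google.com/spreadsheets/d/abc123_XYZ-789/edit#gid=0"
sketch_key_from_url(fake_url)
```

Storing the key rather than the full URL means your code does not depend on whatever trailing bits (`/edit`, `#gid=0`) the browser happened to show.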

### Consume data

#### Ignorance is bliss

*coming soon: a wrapper for the functions described below that just gets the data you want, while you remain blissfully ignorant of how we're doing it*
If you want to consume the data in a worksheet and get something rectangular back, use the all-purpose function `gs_read()`. By default, it reads all the data in a worksheet.

```{r}
oceania <- gap %>% gs_read(ws = "Oceania")
oceania
str(oceania)
glimpse(oceania)
```

You can target specific cells via the `range =` argument. The simplest usage is to specify an Excel-like cell range, such as `range = "D12:F15"` or `range = "R1C12:R6C15"`. The cell rectangle can be specified in various other ways, using helper functions such as `cell_rows()`, `cell_cols()`, and `cell_limits()`.

```{r}
gap %>% gs_read(ws = 2, range = "A1:D8")
gap %>% gs_read(ws = "Europe", range = cell_rows(1:4))
gap %>% gs_read(ws = "Europe", range = cell_rows(100:103), col_names = FALSE)
gap %>% gs_read(ws = "Africa", range = cell_cols(1:4))
gap %>% gs_read(ws = "Asia", range = cell_limits(c(1, 5), c(4, NA)))
```
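
To make the Excel-like notation concrete, here is a rough base R sketch of how a range such as `"D12:F15"` maps to row and column limits. It is only an illustration, not how the package does it internally; `letters_to_num()` and `a1_to_limits()` are hypothetical helpers that handle nothing beyond plain `A1:B2` ranges.

```r
# Hypothetical helpers, not part of googlesheets.
# Convert column letters to a column number: A = 1, ..., Z = 26, AA = 27, ...
letters_to_num <- function(x) {
  chars <- strsplit(toupper(x), "")[[1]]
  # Treat the letters as base-26 digits
  Reduce(function(acc, ch) acc * 26 + match(ch, LETTERS), chars, 0)
}

# Split "D12:F15" into its two corners, then into row and column parts
a1_to_limits <- function(range) {
  corners <- strsplit(range, ":", fixed = TRUE)[[1]]
  cols <- vapply(gsub("[0-9]", "", corners), letters_to_num, numeric(1),
                 USE.NAMES = FALSE)
  rows <- as.integer(gsub("[A-Za-z]", "", corners))
  list(rows = rows, cols = cols)
}

a1_to_limits("D12:F15")
```

So `"D12:F15"` covers rows 12 through 15 and columns 4 through 6.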

`gs_read()` is a wrapper that bundles together the most common methods to read data from the API and transform it for downstream use. You can refine its behavior further by passing more arguments via `...`. Read the help file for more details.

If `gs_read()` doesn't do what you need, then keep reading for the underlying functions to read and post-process data.

#### Specify the consumption method

There are three ways to consume data from a worksheet within a Google spreadsheet. The order goes from fastest-but-more-limited to slowest-but-most-flexible:

* `gs_read_csv()`: Don't let the name scare you! Nothing is written to file during this process. The name just reflects that, under the hood, we request the data via the "exportcsv" link. For cases where `gs_read_csv()` and `gs_read_listfeed()` both work, we see that `gs_read_csv()` is around __50 times faster__. Use this when your data occupies a nice rectangle in the sheet and you're willing to consume all of it. You will get a `tbl_df` back, which is basically just a `data.frame`.
* `gs_read_csv()`: Don't let the name scare you! Nothing is written to file during this process. The name just reflects that, under the hood, we request the data via the "exportcsv" link. For cases where `gs_read_csv()` and `gs_read_listfeed()` both work, we see that `gs_read_csv()` is around __50 times faster__. Use this when your data occupies a nice rectangle in the sheet and you're willing to consume all of it. You will get a `tbl_df` back, which is basically just a `data.frame`. In fact, you might want to use `gs_read_csv()` in other, less tidy scenarios and do further munging in R.
* `gs_read_listfeed()`: Gets data via the ["list feed"](https://developers.google.com/google-apps/spreadsheets/#working_with_list-based_feeds), which consumes data row-by-row. Like `gs_read_csv()`, this is appropriate when your data occupies a nice rectangle. You will again get a `tbl_df` back, but your variable names may have been mangled (by Google, not us!). Specifically, variable names will be forcefully lowercased and all non-alpha-numeric characters will be removed. Why do we even have this function? The list feed supports some query parameters for sorting and filtering the data, which we plan to support (#17).
* `gs_read_cellfeed()`: Get data via the ["cell feed"](https://developers.google.com/google-apps/spreadsheets/#working_with_cell-based_feeds), which consumes data cell-by-cell. This is appropriate when you want to consume arbitrary cells, rows, columns, and regions of the sheet. It works great for small amounts of data but can be rather slow otherwise. `gs_read_cellfeed()` returns a `tbl_df` with __one row per cell__. You can specify cell limits in `gs_read_cellfeed()` via the `range` argument. See below for demos of `gs_reshape_cellfeed()` and `gs_simplify_cellfeed()` which help with post-processing.
* `gs_read_cellfeed()`: Get data via the ["cell feed"](https://developers.google.com/google-apps/spreadsheets/#working_with_cell-based_feeds), which consumes data cell-by-cell. This is appropriate when you want to consume arbitrary cells, rows, columns, and regions of the sheet. It is invoked by `gs_read()` whenever the `range =` argument is used. It works great for modest amounts of data but can be rather slow otherwise. `gs_read_cellfeed()` returns a `tbl_df` with __one row per cell__. You can target specific cells via the `range` argument. See below for demos of `gs_reshape_cellfeed()` and `gs_simplify_cellfeed()` which help with post-processing.
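
To see what the list feed's name mangling does to your variable names, here is a small base R imitation. It is an assumption built only from the description above (lowercase everything, drop non-alphanumeric characters), not Google's actual code, and `mangle_names()` is a hypothetical helper.

```r
# Rough imitation of the list feed's name mangling (an assumption, not
# Google's code): lowercase, then drop all non-alphanumeric characters
mangle_names <- function(x) tolower(gsub("[^[:alnum:]]", "", x))

original <- c("country", "lifeExp", "gdp_Percap", "Pop (millions)")
data.frame(original, mangled = mangle_names(original), stringsAsFactors = FALSE)
```

If your downstream code relies on the exact variable names, that alone may be a reason to prefer one of the other two methods.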

```{r csv-list-and-cell-feed}
# Get the data for worksheet "Oceania": the super-fast csv way
oceania_csv <- gap %>% gs_read_csv(ws = "Oceania")
str(oceania_csv)
oceania_csv
# Get the data for worksheet "Oceania": the fast tabular way ("list feed")
# Get the data for worksheet "Oceania": the less-fast tabular way ("list feed")
oceania_list_feed <- gap %>% gs_read_listfeed(ws = "Oceania")
str(oceania_list_feed)
oceania_list_feed
# Get the data for worksheet "Oceania": the slower cell-by-cell way ("cell feed")
# Get the data for worksheet "Oceania": the slow cell-by-cell way ("cell feed")
oceania_cell_feed <- gap %>% gs_read_cellfeed(ws = "Oceania")
str(oceania_cell_feed)
oceania_cell_feed
```

#### Convenience wrappers and post-processing the data
#### Quick speed comparison

There are a few ways to limit the data you're consuming. You can put direct limits into `gs_read_cellfeed()`, ~~but there are also convenience functions to get a row (`get_row()`), a column (`get_col()`), or a range (`get_cells()`)~~. Also, when you consume data via the cell feed (which these wrappers are doing under the hood), you will often want to reshape it or simplify it (`gs_reshape_cellfeed()` and `gs_simplify_cellfeed()`).
Let's consume all the data for Africa by all 3 methods and see how long it takes.

```{r wrappers-and-post-processing}
# Reshape: instead of one row per cell, make a nice rectangular data.frame
oceania_reshaped <- oceania_cell_feed %>% gs_reshape_cellfeed()
str(oceania_reshaped)
oceania_reshaped
```{r}
jfun <- function(readfun)
  system.time(do.call(readfun, list(gs_gap(), ws = "Africa", verbose = FALSE)))
readfuns <- c("gs_read_csv", "gs_read_listfeed", "gs_read_cellfeed")
readfuns <- sapply(readfuns, get, USE.NAMES = TRUE)
sapply(readfuns, jfun)
```
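
If the timing idiom is unfamiliar, here is the same pattern with stand-in functions that run offline. The two toy readers are assumptions for illustration only; they just sleep for different amounts of time instead of talking to the API.

```r
# Stand-in "readers" (hypothetical): they only differ in how long they sleep
toy_readers <- list(
  quick  = function() Sys.sleep(0.01),
  slower = function() Sys.sleep(0.05)
)

# Time each function; sapply() binds the system.time() results into a matrix
# with one column per function and one row per timing component
timings <- sapply(toy_readers, function(f) system.time(f()))
timings["elapsed", ]
```

The `elapsed` row is usually the one to look at here, since time spent waiting on a remote API does not show up as CPU time.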

# Limit data retrieval to certain cells
#### Post-processing data from the cell feed

If you consume data from the cell feed with `gs_read_cellfeed(..., range = ...)`, you get a data.frame back with **one row per cell**. The package offers two functions to post-process this into something more useful, `gs_reshape_cellfeed()` and `gs_simplify_cellfeed()`.

To reshape into a table, use `gs_reshape_cellfeed()`. You can signal that the first row contains column names (or not) with `col_names = TRUE` (or `FALSE`). Or you can provide a character vector of names. This is inspired by the `col_names` argument of `readxl::read_excel()` and `readr::read_delim()`, which generalizes the `header` argument of `read.table()`.

```{r post-processing}
# Reshape: instead of one row per cell, make a nice rectangular data.frame
australia_cell_feed <- gap %>%
  gs_read_cellfeed(ws = "Oceania", range = "A1:F13")
str(australia_cell_feed)
australia_cell_feed
australia_reshaped <- australia_cell_feed %>% gs_reshape_cellfeed()
str(australia_reshaped)
australia_reshaped
# Example: first 3 rows
gap_3rows <- gap %>% gs_read_cellfeed("Europe", range = cell_rows(1:3))
gap_3rows %>% head()
# convert to a data.frame (first row treated as header by default)
# convert to a data.frame (by default, column names found in first row)
gap_3rows %>% gs_reshape_cellfeed()
# arbitrary cell range, column names no longer available in first row
gap %>%
  gs_read_cellfeed("Oceania", range = "D12:F15") %>%
  gs_reshape_cellfeed(col_names = FALSE)
# arbitrary cell range, direct specification of column names
gap %>%
  gs_read_cellfeed("Oceania", range = cell_limits(c(2, 5), c(1, 3))) %>%
  gs_reshape_cellfeed(col_names = paste("thing", c("one", "two", "three"),
                                        sep = "_"))
```
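
To see what reshaping amounts to, here is a base R sketch on a toy "one row per cell" data frame. It is not the implementation of `gs_reshape_cellfeed()`; the column names `row`, `col`, and `cell_text` and the data values are assumptions made up for this example.

```r
# Toy stand-in for cell feed output: one row per cell
# (column names and values are assumptions for illustration)
cells <- data.frame(
  row = c(1, 1, 2, 2, 3, 3),
  col = c(1, 2, 1, 2, 1, 2),
  cell_text = c("country", "pop", "Australia", "20434176",
                "New Zealand", "4115771"),
  stringsAsFactors = FALSE
)

# Lay the cells out on a row-by-column grid
grid <- matrix(NA_character_, nrow = max(cells$row), ncol = max(cells$col))
grid[cbind(cells$row, cells$col)] <- cells$cell_text

# Promote the first row to column names, then let R guess the column types
reshaped <- as.data.frame(grid[-1, , drop = FALSE], stringsAsFactors = FALSE)
names(reshaped) <- grid[1, ]
reshaped[] <- lapply(reshaped, type.convert, as.is = TRUE)
str(reshaped)
```

Starting from a grid filled with `NA` means that cells missing from the feed simply stay `NA` in the result.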

To extract the cell data into an atomic vector, possibly named, use `gs_simplify_cellfeed()`. You can signal that the first row contains column names (or not) with `col_names = TRUE` (or `FALSE`). There are several arguments to control conversion.

```{r}
# Example: first row only
gap_1row <- gap %>% gs_read_cellfeed("Europe", range = cell_rows(1))
gap_1row
# convert to a named character vector
gap_1row %>% gs_simplify_cellfeed()
# just 2 columns, converted to data.frame
gap %>%
  gs_read_cellfeed("Oceania", range = cell_cols(3:4)) %>%
  gs_reshape_cellfeed()
# Example: single column
gap_1col <- gap %>% gs_read_cellfeed("Europe", range = cell_cols(3))
gap_1col
# arbitrary cell range
gap %>%
  gs_read_cellfeed("Oceania", range = "D12:F15") %>%
  gs_reshape_cellfeed(col_names = FALSE)
# arbitrary cell range, alternative specification
gap %>%
  gs_read_cellfeed("Oceania", range = cell_limits(c(NA, 5), c(1, 3))) %>%
  gs_reshape_cellfeed()
# convert to an unnamed character vector and drop the variable name
gap_1col %>% gs_simplify_cellfeed(notation = "none", col_names = TRUE)
```
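
Simplifying is conceptually even easier than reshaping. Here is a base R sketch on a toy one-row excerpt of cell feed output. It is not the implementation of `gs_simplify_cellfeed()`, and the column names `cell` and `cell_text` are assumptions for this example.

```r
# Toy stand-in for one row of cell feed output (column names are assumptions)
cells <- data.frame(
  cell = c("A1", "B1", "C1"),
  cell_text = c("country", "continent", "year"),
  stringsAsFactors = FALSE
)

# Named character vector, with A1-style cell references as names
named <- setNames(cells$cell_text, cells$cell)
named

# Drop the names to get a bare vector
unname(named)
```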

### Create sheets
@@ -226,7 +269,7 @@ foo <- gs_new("foo")
foo
```

By default, there will be an empty worksheet called "Sheet1", but you can control its title, extent, and initial data with additional arguments to `gs_new()`. You can also add, rename, and delete worksheets within an existing sheet via `gs_ws_new()`, `gs_ws_rename()`, and `gs_ws_delete()`. Copy an entire spreadsheet with `gs_copy()`.
By default, there will be an empty worksheet called "Sheet1", but you can control its title, extent, and initial data with additional arguments to `gs_new()` (see `gs_edit_cells()` in the next section). You can also add, rename, and delete worksheets within an existing sheet via `gs_ws_new()`, `gs_ws_rename()`, and `gs_ws_delete()`. Copy an entire spreadsheet with `gs_copy()`.

### Edit cells

@@ -236,7 +279,7 @@ You can modify the data in sheet cells via `gs_edit_cells()`. We'll work on the
foo <- foo %>% gs_edit_cells(input = head(iris), trim = TRUE)
```

Go to [your Google Sheets home screen](https://docs.google.com/spreadsheets/u/0/), find the new sheet `foo` and look at it. You should see some iris data in the first (and only) worksheet. We'll also take a look at it here, by consuming `foo` via the list feed.
Go to [your Google Sheets home screen](https://docs.google.com/spreadsheets/u/0/), find the new sheet `foo` and look at it. You should see some iris data in the first (and only) worksheet. We'll also take a look at it here, by reading the data from `foo`.

Note how we always store the returned value from `gs_edit_cells()` (and all other sheet editing functions). That's because the registration info changes whenever we edit the sheet and we re-register it inside these functions, so this idiom will help you make sequential edits and queries to the same sheet.

@@ -261,10 +304,12 @@ If you'd rather specify sheets for deletion by title, look at `gs_grepdel()` and
Here's how we can create a new spreadsheet from a suitable local file. First, we'll write then upload a comma-delimited excerpt from the iris data.

```{r new-sheet-from-file}
iris %>% head(5) %>% write.csv("iris.csv", row.names = FALSE)
iris %>%
  head(5) %>%
  write.csv("iris.csv", row.names = FALSE)
iris_ss <- gs_upload("iris.csv")
iris_ss
iris_ss %>% gs_read_listfeed()
iris_ss %>% gs_read()
file.remove("iris.csv")
```

@@ -273,14 +318,16 @@ Now we'll upload a multi-sheet Excel workbook. Slowly.
```{r new-sheet-from-xlsx}
gap_xlsx <- gs_upload(system.file("mini-gap.xlsx", package = "googlesheets"))
gap_xlsx
gap_xlsx %>% gs_read_listfeed(ws = "Oceania")
gap_xlsx %>% gs_read(ws = "Asia")
```

And we clean up after ourselves on Google Drive.

```{r delete-moar-sheets}
gs_delete(iris_ss)
gs_delete(gap_xlsx)
gs_vecdel(c("iris", "mini-gap"))
## achieves same as:
## gs_delete(iris_ss)
## gs_delete(gap_xlsx)
```

### Download sheets as csv, pdf, or xlsx file
@@ -331,4 +378,4 @@ user_session_info

In March 2014 [Google introduced "new" Sheets](https://support.google.com/docs/answer/3541068?hl=en). "New" Sheets and "old" sheets behave quite differently with respect to access via API and present a big headache for us. Recently, we've noted that Google is forcibly converting sheets: [all "old" Sheets will be switched over to the "new" sheets during 2015](https://support.google.com/docs/answer/6082736?p=new_sheets_migrate&rd=1). However, there are still "old" sheets lying around, so we've made some effort to support them, when it's easy to do so. But keep your expectations low.

In particular, `gs_read_csv()` does not and indeed __cannot__ work for "old" sheets.
In particular, `gs_read_csv()` does not currently work for "old" sheets.