Add vignette on combine cut interaction functions #195

Open
wants to merge 18 commits into
base: master
Changes from 7 commits
2 changes: 2 additions & 0 deletions _pkgdown.yml
@@ -234,6 +234,8 @@ navbar:
href: articles/export.html
- text: Subtotals and Headings
href: articles/subtotals.html
- text: Combining Answers and Variables
href: articles/combine-cut-interact.html
- text: Crunch Internals
href: articles/crunch-internals.html
- text: Abstract Categories
1 change: 1 addition & 0 deletions inst/WORDLIST
@@ -93,6 +93,7 @@ Multitables
na
nd
NumericVariable
olds
OrderGroup
OrderGroups
PermissionCatalog
56 changes: 48 additions & 8 deletions vignette-data/make-vignette-rdata.R
@@ -1,8 +1,8 @@
library(crunch)
options(crunch.api=getOption("test.api"),
crunch.debug=FALSE,
crunch.email=getOption("test.user"),
crunch.pw=getOption("test.pw"))
# options(crunch.api=getOption("test.api"),
# crunch.debug=FALSE,
# crunch.email=getOption("test.user"),
# crunch.pw=getOption("test.pw"))
Contributor

I should have mentioned this before, Brian: we should set up your R profile with values for these options that point to a good backend. In the repository, these lines should stay uncommented.

login()

## 1. Getting started
@@ -35,6 +35,7 @@ ds$imiss <- makeArray(ds[grep("^imiss_", names(ds))], name="Issue importance")
show_imiss_subvars <- crunch:::showSubvariables(subvariables(ds$imiss))
show_imiss <- capture.output(print(ds$imiss))
names_imiss_subvars <- names(subvariables(ds$imiss))
imiss.cats <- categories(ds$imiss)

newnames <- c("The economy", "Immigration",
"The environment", "Terrorism", "Gay rights", "Education",
@@ -45,13 +46,29 @@ show_imiss_subvars2 <- crunch:::showSubvariables(subvariables(ds$imiss))
sorting <- order(names(subvariables(ds$imiss)))
subvariables(ds$imiss) <- subvariables(ds$imiss)[sorting]
show_imiss_subvars3 <- crunch:::showSubvariables(subvariables(ds$imiss))
ds$imiss_topboxes <- combine(ds$imiss, name="Issue Importance (Top Boxes)",
combinations=list(
list(name="Important", categories=c("Very Important", "Somewhat Important")),
list(name="Not Important", categories=c("Not very Important", "Unimportant"))
))
imiss_topboxes.cats <- categories(ds$imiss_topboxes)

show_boap_4 <- capture.output(print(ds$boap_4))
ds$boap <- makeMR(ds[grep("^boap_[0-9]+", names(ds))],
name="Approval of Obama on issues",
selections=c("Strongly approve", "Somewhat approve"))
show_boap_subvars <- crunch:::showSubvariables(subvariables(ds$boap))
show_boap <- c(crunch:::showCrunchVariableTitle(ds$boap),
show_boap_subvars)
ds$boap_combined <- combine(ds$boap, name="Approval of Obama on issues (Combined Subvariables)",
combinations=list(
list(name="All Others", responses=c('boap_2', 'boap_3', 'boap_4', 'boap_5', 'boap_6',
'boap_7', 'boap_8', 'boap_9', 'boap_10', 'boap_11'))
))
show_boap_combined_subvars <- crunch:::showSubvariables(subvariables(ds$boap_combined))
show_boap_combined <- c(crunch:::showCrunchVariableTitle(ds$boap_combined),
show_boap_combined_subvars)

ds$boap <- undichotomize(ds$boap)
show_boap2 <- capture.output(print(ds$boap))
ds$boap <- dichotomize(ds$boap, "Strongly approve")
@@ -130,6 +147,7 @@ exclusion(ds) <- ds$perc_skipped > 15
high_perc_skipped <- capture.output(print(exclusion(ds)))
dim.ds.excluded <- dim(ds)


message("subtotals")
sub_initial_subtotals <- subtotals(ds$manningknowledge)
subtotals(ds$manningknowledge) <- list(
@@ -151,16 +169,38 @@ sub_headings <- subtotals(ds$obamaapp)
subtotals(ds$obamaapp) <- NULL
approve_subtotals <- list(
Subtotal(name = "Approves",
categories = c("Somewhat approve", "Strongly approve"),
after = "Somewhat approve"),
categories = c("Somewhat approve", "Strongly approve"),
after = "Somewhat approve"),
Subtotal(name = "Disapprove",
categories = c("Somewhat disapprove", "Strongly disapprove"),
after = "Strongly disapprove"))
categories = c("Somewhat disapprove", "Strongly disapprove"),
after = "Strongly disapprove"))
subtotals(ds$snowdenleakapp) <- approve_subtotals
subtotals(ds$congapp) <- approve_subtotals
sub_snowdon <- subtotals(ds$snowdenleakapp)
sub_con <- subtotals(ds$congapp)
sub_crtab <- crtabs(~congapp + gender, ds)


message("10. Re-Combining Answers and Variables")
ds$age4 <- cut(ds$age, name="Age (4 categories)",
breaks=c(17,29,44,64,100), labels=c('18-29', '30-44', '45-64', '65+'))
age4.var <- ds$age4
summary.age4.var <- capture.output(print(age4.var))
age4.cats <- categories(ds$age4)
ds$age3 <- combine(ds$age4, name="Age (3 categories)",
combinations=list(
list(name="18-44", categories=c('18-29', '30-44'))
))
age3.var <- ds$age3
summary.age3.var <- capture.output(print(age3.var))
age3.cats <- categories(ds$age3)
gender.var <- ds$gender
summary.gender.var <- capture.output(print(gender.var))
ds$gender_by_age <- interactVariables(ds$gender, ds$age3, name="Gender by Age")
gender_by_age.var <- ds$gender_by_age
summary.gender_by_age.var <- capture.output(print(gender_by_age.var))
gender_by_age.cats <- categories(ds$gender_by_age)

save.image(file="../vignettes/vignettes.RData")

with_consent(delete(ds)) ## cleanup
140 changes: 140 additions & 0 deletions vignettes/combine-cut-interact.Rmd
@@ -0,0 +1,140 @@
---
title: "Combining Answers and Variables"
description: "Vignette showing you how to take existing variables and recombine their answers or other variables."
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Combining Answers and Variables}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---

[Previous: subtotals](subtotals.html)

```{r, results='hide', echo=FALSE, message=FALSE}
## Because the vignette tasks require communicating with a remote host,
## we do all the work ahead of time and save a workspace, which we load here.
## We'll then reference saved objects in that as if we had just retrieved them
## from the server
library(crunch)
load("vignettes.RData")
options(width=120)
```

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

Many common data cleaning steps involve grouping categories or values together for easier analysis. For example, you might have a continuous numeric variable that you want to transform into a set of bins. In Crunch, this is easy with the `cut()` command (just like base R!). Or, say you have a categorical variable with many categories that you want to summarize by collapsing some of them together. To do this as a separate variable, use `combine()`. If you instead want subtotals displayed alongside the original categories, use [subtotals](subtotals.html). Finally, if you have multiple categorical variables, `interactVariables()` creates a new variable from their interaction. This vignette discusses each of these in turn.

# Cutting a Numeric Variable into Categories
Say we have a numeric type variable `age`, which is in years from 18-99, and we want to place each answer into one of a few categories: 18-29, 30-44, etc. We can use the `cut()` function to do just that and _cut_ the numeric variable into a new Categorical type variable. We designed this function to match the way that base R's `cut()` function works.

```{r, eval = FALSE}
ds$age4 <- cut(ds$age, name="Age (4 categories)",
breaks=c(17,29,44,64,100), labels=c('18-29', '30-44', '45-64', '65+'))
```
```{r, eval = FALSE}
ds$age4
```
```{r, echo = FALSE}
cat(summary.age4.var, sep="\n")
```

And now we have a new Categorical type variable with the alias `age4`, the name "Age (4 categories)", and it's made of 4 categories.
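
To see how the `breaks` and `labels` arguments line up, here is a small sketch that uses base R's `cut()` on an ordinary numeric vector rather than a Crunch variable. The `age` values are made up for illustration, and the sketch assumes that the Crunch method follows base R's default interval convention (each bin excludes its lower break and includes its upper break).

```r
# Hypothetical ages, chosen to sit on both edges of every bin.
age <- c(18, 29, 30, 44, 45, 64, 65, 99)

# Same breaks and labels as the Crunch call above. With the default
# right = TRUE, the bins are (17,29], (29,44], (44,64], and (64,100].
age4 <- cut(age,
            breaks = c(17, 29, 44, 64, 100),
            labels = c("18-29", "30-44", "45-64", "65+"))

# Show each value next to the bin it landed in.
data.frame(age, age4)
```

Starting the breaks at 17 rather than 18 is what keeps 18-year-olds in the first bin; a value equal to the lowest break would fall outside every bin and become missing.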

# Re-Combining Answer Choices
Contributor

When I read re-combining at first, I found it a bit awkward. I wonder if maybe just "Combining answer choices" would be better here?

Contributor Author

Yeah. Re-combining is not great. Let's switch

## Categorical Type Variables
Sometimes we want to create subtotals (aka "nets" or "top boxes") for a Categorical type variable: we preserve all of the original categories and also show two or more of them collapsed together. For those cases, use the `subtotals()` function (for details, see [the subtotals vignette](subtotals.html)). Other times we do *not* want to preserve all the original categories and instead want to combine them into a smaller set of categories. To do that, we use the `combine()` function.

Let's take the variable "Age (4 categories)" and combine the two youngest categories to create a new variable we will call "Age (3 categories)".

```{r, eval = FALSE}
categories(ds$age4)
```
```{r, echo = FALSE}
age4.cats
```
```{r, eval = FALSE}
ds$age3 <- combine(ds$age4, name="Age (3 categories)",
combinations=list(
list(name="18-44", categories=c('18-29', '30-44'))
))
```
```{r, eval = FALSE}
categories(ds$age3)
```
```{r, echo = FALSE}
age3.cats
```
And now we have a new variable with the alias `age3`, the name "Age (3 categories)", and a category that combines 18 to 44 year olds.

Note how this created an entirely new variable that exists in the dataset now, and so we can use it just like any other variable in Crunch: as a filter, etc. If we want to hide the variable we started with—"Age (4 categories)"—because we no longer need it, that is fine and will not affect our new variable.
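
For readers who think in terms of base R factors, the closest analogue is collapsing factor levels. The sketch below is a local illustration only, not how Crunch performs the operation on the server, and the `age4` values are invented.

```r
# Hypothetical local factor with the same four categories.
age4 <- factor(c("18-29", "30-44", "45-64", "65+", "30-44"),
               levels = c("18-29", "30-44", "45-64", "65+"))

# Work on a copy so the original factor is untouched, just as combine()
# leaves the original Crunch variable in place.
age3 <- age4

# Assigning a named list to levels() maps old levels onto new ones.
# Unlike the combine() call above, every level must be listed here,
# because any level left out of the list becomes NA.
levels(age3) <- list("18-44" = c("18-29", "30-44"),
                     "45-64" = "45-64",
                     "65+"   = "65+")

# Cross-tabulate old against new to confirm the mapping.
table(age4, age3)
```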

## Categorical Array Type Variables
As with Categorical type variables, we can use the `combine()` function on a Categorical Array to collapse its answer choices. Because the subvariables of an array share one set of categories, the same combination applies to every subvariable.

Let's take the variable "Issue Importance (categorical array)", which has the alias `imiss` and 11 subvariables with 4 categories: Very Important, Somewhat Important, Not very Important, and Unimportant. We would like to create a new Categorical Array variable that collapses the two Important categories into one category and the two Not Important categories into another.

```{r, eval = FALSE}
categories(ds$imiss)
```
```{r, echo = FALSE}
imiss.cats
```
```{r, eval = FALSE}
ds$imiss_topboxes <- combine(ds$imiss, name="Issue Importance (Top Boxes)",
combinations=list(
list(name="Important", categories=c("Very Important", "Somewhat Important")),
list(name="Not Important", categories=c("Not very Important", "Unimportant"))
))
```
```{r, eval = FALSE}
categories(ds$imiss_topboxes)
```
```{r, echo = FALSE}
imiss_topboxes.cats
```
We have created a new Categorical Array type variable with the alias `imiss_topboxes`, the name "Issue Importance (Top Boxes)", and 2 categories instead of the original variable's 4.
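
As a rough local analogy, a categorical array behaves like a data frame whose columns are factors sharing one set of levels, so the same recode is applied to every column. The sketch below uses base R only; the two subvariables (`economy`, `immigration`), their values, and the `collapse()` helper are all hypothetical.

```r
cats <- c("Very Important", "Somewhat Important",
          "Not very Important", "Unimportant")

# Hypothetical two-item array: each column is one subvariable.
imiss <- data.frame(
  economy     = factor(c("Very Important", "Unimportant", "Somewhat Important"),
                       levels = cats),
  immigration = factor(c("Not very Important", "Very Important", "Unimportant"),
                       levels = cats)
)

# Illustrative helper: collapse the four levels into two.
collapse <- function(x) {
  levels(x) <- list("Important"     = cats[1:2],
                    "Not Important" = cats[3:4])
  x
}

# Apply the same recode to every subvariable.
imiss_topboxes <- as.data.frame(lapply(imiss, collapse))
imiss_topboxes
```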


## Multiple Response Type Variables
At first it might not seem that we can use the `combine()` function with Multiple Response type variables because each subvariable in the multiple response has already been reduced down to the categories that are "selected" or "not selected". However, there is an option that allows us to combine the subvariables (aka responses) in a multiple response similar to how we combined the categories in a categorical variable.

```{r, eval = FALSE}
ds$boap
```
```{r, echo = FALSE}
cat(show_boap, sep="\n")
```
```{r, eval = FALSE}
ds$boap_combined <- combine(ds$boap, name="Approval of Obama on issues (Combined Subvariables)",
combinations=list(
list(name="All Others", responses=c('boap_2', 'boap_3', 'boap_4', 'boap_5', 'boap_6',
'boap_7', 'boap_8', 'boap_9', 'boap_10', 'boap_11'))
))
```
```{r, eval = FALSE}
ds$boap_combined
```
```{r, echo = FALSE}
cat(show_boap_combined, sep="\n")
```

We have created a new Multiple Response type variable with the alias `boap_combined`, the name "Approval of Obama on issues (Combined Subvariables)", which has 4 subvariables instead of the original 13.
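
One way to picture combining responses is with a logical matrix of selections. The sketch below is base R only and rests on an assumption we have not verified against the Crunch documentation: that a combined response counts as selected whenever at least one of its constituent responses is selected. The data and the choice of which responses to combine are invented.

```r
# Hypothetical selections: one row per respondent, one column per response.
boap <- data.frame(
  boap_1 = c(TRUE,  FALSE, FALSE, TRUE),
  boap_2 = c(FALSE, TRUE,  FALSE, FALSE),
  boap_3 = c(FALSE, TRUE,  FALSE, TRUE),
  boap_4 = c(FALSE, FALSE, FALSE, FALSE)
)

to_combine <- c("boap_2", "boap_3")

# Keep the responses that are not being combined...
boap_combined <- boap[, setdiff(names(boap), to_combine), drop = FALSE]

# ...and add one response that is selected if any constituent is selected
# (assumed semantics, see above).
boap_combined[["All Others"]] <- rowSums(boap[, to_combine, drop = FALSE]) > 0

boap_combined
```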

# Re-Combining Variables
Contributor

Again, I think this might be ok just called "Combining variables"

Contributor Author

yup

Besides re-combining answer choices, we can also re-combine variables. For example, if our dataset has a Categorical type variable for gender and another for age, we can cross the two into a new variable using the `interactVariables()` function (named after the use of 'interaction terms' in regression).

Contributor

It might be nice to add a more elaborated use case in here.

```{r interact 2 cats, eval = FALSE}
ds$gender_by_age <- interactVariables(ds$gender, ds$age3, name="Gender by Age")
```
```{r, eval = FALSE}
categories(ds$gender_by_age)
```
```{r interaction var, echo = FALSE}
gender_by_age.cats
```

This generates a new Categorical type variable with a category for each possible combination of the 2 variables that fed into it, in this case Male for each Age category and Female for each Age category.
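
Base R's `interaction()` does the analogous thing for local factors, which makes it a convenient way to preview how many categories to expect. The sketch below uses invented data, and the way Crunch names the resulting categories may differ from the `sep` used here.

```r
# Hypothetical local factors standing in for the two Crunch variables.
gender <- factor(c("Male", "Female", "Female", "Male"),
                 levels = c("Male", "Female"))
age3   <- factor(c("18-44", "45-64", "65+", "65+"),
                 levels = c("18-44", "45-64", "65+"))

# One level per combination: 2 genders x 3 age groups = 6 levels,
# including combinations that no respondent falls into.
gender_by_age <- interaction(gender, age3, sep = ":")

levels(gender_by_age)
table(gender_by_age)
```

The number of categories is the product of the category counts of the inputs, so interacting more than a few variables quickly produces many sparsely populated cells.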

[Next: Crunch internals](crunch-internals.html)
2 changes: 1 addition & 1 deletion vignettes/crunch-internals.Rmd
@@ -37,4 +37,4 @@ datasets(proj) <- ds

Internally there is actually a second method `datasets<-` that takes the value on the right hand side of the `<-` operator and posts that value to the datasets attribute of the project catalog. The projects catalog will then update to reflect that a dataset belongs to a particular catalog, and that will be reflected in the web app. Similar patterns happen when you get and set attributes on objects, like "names".

[Next: Category objects](abstract-categories.html)
[Next: category objects](abstract-categories.html)
2 changes: 1 addition & 1 deletion vignettes/datasets.Rmd
@@ -8,7 +8,7 @@ vignette: >
%\VignetteEncoding{UTF-8}
---

[Previous: Getting started](getting-started.html)
[Previous: getting started](getting-started.html)

```{r, results='hide', echo=FALSE, message=FALSE}
## Because the vignette tasks require communicating with a remote host,
1 change: 1 addition & 0 deletions vignettes/getting-started.Rmd
@@ -48,5 +48,6 @@ The Crunch data store is built around datasets, which contain variables. Unlike
* [Filtering](filters.html): subsetting data, both in your R session and in the web interface
* [Downloading and exporting](export.html): how to pull data from the server, both for use in R and file export
* [Subtotals and headings](subtotals.html): how to set and get subtotals and headings for categorical variables
* [Combining answers and variables](combine-cut-interact.html): using Crunch tools to combine categories, responses, and variables
* [Crunch internals](crunch-internals.html): an introduction to the Crunch API and concepts to help you make more complex and more efficient queries
* [Category objects](abstract-categories.html): an introduction to the S4 classes that power categories and category-like representations in the package
2 changes: 1 addition & 1 deletion vignettes/subtotals.Rmd
@@ -159,4 +159,4 @@ noTransforms(sub_crtab)

This does not modify the variable---the subtotals are still defined and visible in the web app---but they are removed from the current analysis.

[Next: Crunch internals](crunch-internals.html)
[Next: combining answers and variables](combine-cut-interact.html)
Binary file modified vignettes/vignettes.RData
Binary file not shown.