Add vignette on combine cut interaction functions #195
base: master
Changes from 7 commits
f4aa631
25c35cf
6e53d2f
a95082d
29cec3a
72f0fd9
063e28d
08353ba
faad527
3cb8f3a
ad3dd7c
d3b4ad9
71cd08e
159e4a5
d7d2d9b
2da348b
596aaef
bc79493
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -93,6 +93,7 @@ Multitables | |
na | ||
nd | ||
NumericVariable | ||
olds | ||
OrderGroup | ||
OrderGroups | ||
PermissionCatalog | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,140 @@ | ||
--- | ||
title: "Combining Answers and Variables" | ||
description: "Vignette showing you how to take existing variables and recombine their answers or other variables." | ||
output: rmarkdown::html_vignette | ||
vignette: > | ||
%\VignetteIndexEntry{Combining Answers and Variables} | ||
%\VignetteEngine{knitr::rmarkdown} | ||
%\VignetteEncoding{UTF-8} | ||
--- | ||
|
||
[Previous: subtotals](subtotals.html) | ||
|
||
```{r, results='hide', echo=FALSE, message=FALSE} | ||
## Because the vignette tasks require communicating with a remote host, | ||
## we do all the work ahead of time and save a workspace, which we load here. | ||
## We'll then reference saved objects in that as if we had just retrieved them | ||
## from the server | ||
library(crunch) | ||
load("vignettes.RData") | ||
options(width=120) | ||
``` | ||
|
||
```{r setup, include=FALSE} | ||
knitr::opts_chunk$set(echo = TRUE) | ||
``` | ||
|
||
Many common data cleaning steps involve grouping categories or values together for easier analysis. For example, you might have a continuous numeric variable that you want to transform into a set of bins. In Crunch, this is easy with the `cut()` command (just like base R!). Or, say you have a categorical variable with many categories that you want to summarize by collapsing some of them together. To do this as a separate variable, use `combine()`; if you instead want subtotals displayed alongside the original categories, use [subtotals](subtotals.html). Finally, if you have multiple categorical variables, `interactVariables()` creates a new variable from the interaction of two or more of them. Each of these is discussed in this vignette. | ||
|
||
# Cutting a Numeric Variable into Categories | ||
Say we have a numeric type variable `age`, which records age in years from 18 to 99, and we want to place each answer into one of a few categories: 18-29, 30-44, etc. We can use the `cut()` function to do just that and _cut_ the numeric variable into a new Categorical type variable. We designed this function to match the way that base R's `cut()` function works. | ||
|
||
```{r, eval = FALSE} | ||
ds$age4 <- cut(ds$age, name="Age (4 categories)", | ||
    breaks=c(17,29,44,64,100), labels=c('18-29', '30-44', '45-64', '65+')) | ||
``` | ||
```{r, eval = FALSE} | ||
categories(ds$age4) | ||
``` | ||
```{r, echo = FALSE} | ||
cat(summary.age4.var, sep="\n") | ||
``` | ||
|
||
We now have a new Categorical type variable with the alias `age4`, the name "Age (4 categories)", and 4 categories. | ||
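To see how the breaks above assign ages to bins, here is a minimal base R sketch on a hypothetical plain numeric vector (not a Crunch variable). It assumes the Crunch method follows base R's default of right-closed intervals, as the text above suggests, so an age of 29 lands in '18-29' and an age of 30 lands in '30-44'.

```r
# Hypothetical ages, including values that sit exactly on a break
ages <- c(18, 29, 30, 44, 45, 64, 65, 99)

# Same breaks and labels as the Crunch call; base R's default right = TRUE
# yields the intervals (17,29], (29,44], (44,64], (64,100]
age4 <- cut(ages, breaks = c(17, 29, 44, 64, 100),
            labels = c("18-29", "30-44", "45-64", "65+"))

# Count how many ages fall in each bin
table(age4)
```

Starting the breaks at 17 rather than 18 matters: with right-closed intervals, a lower break of 18 would leave 18-year-olds outside every bin.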
|
||
# Re-Combining Answer Choices | ||
Review comment: When I read re-combining at first, I found it a bit awkward. I wonder if maybe just "Combining answer choices" would be better here?
Reply: Yeah. Re-combining is not great. Let's switch. |
||
## Categorical Type Variables | ||
Sometimes we want to create subtotals (aka "nets" or "top boxes") for a Categorical type variable: we preserve all of the original categories and also report two or more of them collapsed together. For those cases, use the `subtotals()` function (for details, see [the subtotals vignette](subtotals.html)). Other times we do *not* want to preserve all the original categories and instead want to combine them into a smaller set. To do that, we use the `combine()` function. | ||
|
||
Let's take the variable "Age (4 categories)" and combine the two youngest categories to create a new variable we will call "Age (3 categories)". | ||
|
||
```{r, eval = FALSE} | ||
categories(ds$age4) | ||
``` | ||
```{r, echo = FALSE} | ||
age4.cats | ||
``` | ||
```{r, eval = FALSE} | ||
ds$age3 <- combine(ds$age4, name="Age (3 categories)", | ||
    combinations=list( | ||
        list(name="18-44", categories=c('18-29', '30-44')) | ||
    )) | ||
``` | ||
```{r, eval = FALSE} | ||
categories(ds$age3) | ||
``` | ||
```{r, echo = FALSE} | ||
age3.cats | ||
``` | ||
And now we have a new variable with the alias `age3`, the name "Age (3 categories)", and a category that combines 18 to 44 year olds. | ||
|
||
Note how this created an entirely new variable that exists in the dataset now, and so we can use it just like any other variable in Crunch: as a filter, etc. If we want to hide the variable we started with—"Age (4 categories)"—because we no longer need it, that is fine and will not affect our new variable. | ||
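For readers who know base R factors, the same idea can be sketched locally without a Crunch dataset. This is only an analogy on hypothetical data, not how `combine()` is implemented: assigning a named list to `levels()` merges old levels into new ones, and because the result is stored in a new object, the original factor is untouched.

```r
# Hypothetical stand-in for the "Age (4 categories)" variable
age4 <- factor(c("18-29", "30-44", "45-64", "65+", "30-44"),
               levels = c("18-29", "30-44", "45-64", "65+"))

# Copy, then merge the two youngest levels; each list name is a new level
# and each value lists the old levels that feed into it
age3 <- age4
levels(age3) <- list("18-44" = c("18-29", "30-44"),
                     "45-64" = "45-64",
                     "65+"   = "65+")

# Cross-tabulate old against new to check the mapping
table(age4, age3)
```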
|
||
## Categorical Array Type Variables | ||
As with Categorical type variables, we can combine the answer choices of a Categorical Array using the `combine()` function. | ||
|
||
Let's take the variable "Issue Importance (categorical array)", which has the alias `imiss` and 11 subvariables with 4 categories: Very Important, Somewhat Important, Not very Important, and Unimportant. We would like to create a new Categorical Array variable that combines the two Important categories into one category and the two Not Important categories into another. | ||
|
||
```{r, eval = FALSE} | ||
categories(ds$imiss) | ||
``` | ||
```{r, echo = FALSE} | ||
imiss.cats | ||
``` | ||
```{r, eval = FALSE} | ||
ds$imiss_topboxes <- combine(ds$imiss, name="Issue Importance (Top Boxes)", | ||
    combinations=list( | ||
        list(name="Important", categories=c("Very Important", "Somewhat Important")), | ||
        list(name="Not Important", categories=c("Not very Important", "Unimportant")) | ||
    )) | ||
``` | ||
```{r, eval = FALSE} | ||
categories(ds$imiss_topboxes) | ||
``` | ||
```{r, echo = FALSE} | ||
imiss_topboxes.cats | ||
``` | ||
We have created a new Categorical Array type variable with the alias `imiss_topboxes`, the name "Issue Importance (Top Boxes)", and 2 categories instead of the original variable's 4. | ||
|
||
|
||
## Multiple Response Type Variables | ||
At first it might not seem that we can use the `combine()` function with Multiple Response type variables because each subvariable in the multiple response has already been reduced down to the categories that are "selected" or "not selected". However, there is an option that allows us to combine the subvariables (aka responses) in a multiple response similar to how we combined the categories in a categorical variable. | ||
|
||
```{r, eval = FALSE} | ||
ds$boap | ||
``` | ||
```{r, echo = FALSE} | ||
show_boap | ||
``` | ||
```{r, eval = FALSE} | ||
ds$boap_combined <- combine(ds$boap, name="Approval of Obama on issues (Combined Subvariables)", | ||
    combinations=list( | ||
        list(name="All Others", responses=c('boap_2', 'boap_3', 'boap_4', 'boap_5', 'boap_6', | ||
            'boap_7', 'boap_8', 'boap_9', 'boap_10', 'boap_11')) | ||
    )) | ||
``` | ||
```{r, eval = FALSE} | ||
ds$boap_combined | ||
``` | ||
```{r, echo = FALSE} | ||
show_boap_combined | ||
``` | ||
|
||
We have created a new Multiple Response type variable with the alias `boap_combined`, the name "Approval of Obama on issues (Combined Subvariables)", which has 4 subvariables instead of the original 13. | ||
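One way to picture combining responses is with plain logical vectors in base R. The sketch below uses a hypothetical three-response data frame and rests on an assumption not stated above: that a combined response counts as selected for a respondent when any of its constituent responses is selected. Missing-data handling is ignored here.

```r
# Hypothetical selected / not-selected indicators for four respondents
boap <- data.frame(
  boap_1 = c(TRUE,  FALSE, FALSE, TRUE),
  boap_2 = c(FALSE, TRUE,  FALSE, FALSE),
  boap_3 = c(FALSE, TRUE,  FALSE, TRUE)
)

# ASSUMPTION: the combined response is selected if any constituent is selected
boap_combined <- data.frame(
  boap_1     = boap$boap_1,
  all_others = boap$boap_2 | boap$boap_3
)

# Three responses have become two
ncol(boap_combined)
```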
|
||
# Re-Combining Variables | ||
Review comment: Again, I think this might be ok just called "Combining variables"
Reply: yup |
||
Besides re-combining answer choices, we can also re-combine variables. For example, if our dataset has a categorical type variable for gender and another for age, then we can cross the two into a new variable using the `interactVariables()` function (named after the use of 'interaction terms' in regression). | ||
|
||
Review comment: It might be nice to add a more elaborated use case in here. |
||
```{r interact 2 cats, eval = FALSE} | ||
ds$gender_by_age <- interactVariables(ds$gender, ds$age3, name="Gender by Age") | ||
``` | ||
```{r, eval = FALSE} | ||
categories(ds$gender_by_age) | ||
``` | ||
```{r interaction var, echo = FALSE} | ||
gender_by_age.cats | ||
``` | ||
|
||
This generates a new Categorical type variable with a category for each possible combination of the 2 variables that fed into it, in this case Male for each Age category and Female for each Age category. | ||
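Base R's `interaction()` does something comparable for factors, which makes it a convenient way to preview the result locally. The data below are hypothetical, and the category names and ordering that Crunch produces may differ from base R's.

```r
# Hypothetical stand-ins for the gender and age3 variables
gender <- factor(c("Male", "Female", "Female", "Male"))
age3   <- factor(c("18-44", "45-64", "65+", "65+"),
                 levels = c("18-44", "45-64", "65+"))

# One level for every gender x age combination, including unobserved ones
gender_by_age <- interaction(gender, age3, sep = ": ")

nlevels(gender_by_age)
levels(gender_by_age)
```

The number of categories is the product of the inputs' category counts (here 2 x 3), so interacting variables with many categories quickly produces sparse cells.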
|
||
[Next: Crunch internals](crunch-internals.html) |
Review comment: I should have mentioned this before, Brian: we should set up your R profile to have values for these options that point to a good backend. In the repository these lines should stay uncommented.