Stat21/week5-part1.Rmd at main · dr-suz/Stat21 · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
---
title:  "More Chi-squared Procedures"
author: "STAT 021 with Prof Suzy"
institute: "Swarthmore College"

output:
  xaringan::moon_reader:
 #   css: ["default", "assets/sydney-fonts.css", "assets/sydney.css"]
    self_contained: false # if true, fonts will be stored locally
#    includes:
 #     in_header: "assets/mathjax-equation-numbers.html"
    nature:
#      beforeInit: ["assets/remark-zoom.js", "https://platform.twitter.com/widgets.js"]
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
      ratio: '16:9' # alternatives '16:9' or '4:3' or others e.g. 13:9
      navigation:
        scroll: false # disable slide transitions by scrolling
includes:
  in_header: mystyles.sty
---

```{r setup_pres, include=FALSE, echo=FALSE}
#devtools::install_github("ropenscilabs/icon")
#devtools::session_info('rmarkdown')
rm(list=ls())
library('tidyverse')
library('gridExtra')
library('broom')
library('cowplot')
library("RefManageR")
library("DT")


options(htmltools.dir.version = FALSE)
knitr::opts_chunk$set(fig.path='Figs/',echo=TRUE, warning=FALSE, message=FALSE)
```

```{css, echo=FALSE}
pre {
  background: #FFBB33;
  max-width: 100%;
  overflow-x: scroll;
}

.scroll-output {
  height: 75%;
  overflow-y: scroll;
}

.scroll-small {
  height: 50%;
  overflow-y: scroll;
}

.red{color: #ce151e;}
.green{color: #26b421;}
.blue{color: #426EF0;}
```


## More Chi-square Procedures

There are actually three different chi-square tests we are going to consider. They are tests for

1. Goodness-of-fit

2. Homogeneity

3. Independence

All three of these tests apply to count data (which can be represented as cells in a contingency table) and they all use the same test statistic $$\text{Test Statistic} = \chi^2 = \sum_{\text{all cells}}\frac{(Obs - Exp)^2}{Exp}.$$

---
## More Chi-square Procedures
### Goodness-of-fit

Recall the goodness-of-fit test, helps us determine whether the distribution of counts in a one-way contingency table match some particular distribution or predicted model which defines our null hypothesis. There is one categorical variable represented in the data with $k$ different levels.

| Category      | Observations  |
|---------------|---------------|
| Level 1       | 21            |
| Level 2       | 10            |
| Level 3       | 9             |
| $\dots$       | $\dots$       |
| Level $k$     | 16            |


The expected counts come from this predicting model. The test finds a p-value from a chi-square model with $k-1$ degrees of freedom, where $k$ is the number of levels of the categorical variable.

---
## Other Chi-square Procedures
### Homogeneity and Independence

These tests are very similar to one another and the formula for the test statistics is exactly the same as for the goodness-of-fit test.

These tests only differ with one another according to the problem setting.

  - If we want to know if the distribution of counts for different levels of a single categorical variable are significantly different from one another across two or more differ groups of subjects, we use a test of homogeneity. (You can think of the group status as another binary, categorical variable.)

  - If we are interested in the relationship between two categorical variables, collected on one group of subjects, we use a test of independence.


---
## Other Chi-square Procedures
### Homogeneity

|            |         | Group A    | Group B | $\dots$ | Group C |
|------------|---------|------------|---------|---------|---------|
|            | Level 1 |      24      |   32      |         |    18     |
|            | Level 2 |      12      |   36      |         |     19    |
| **Variable 1** | Level 3 |      34      |   35      |         |     22    |
|            | $\dots$ |              |           |         |           |
|            | Level R |      10      |   28      |         |     27    |


A chi-square test for homogeneity compares the distribution of counts for two or more groups (columns above) on the same categorical variable (variable 1 above). In this test, we are looking for a change in the counts over different groups/levels.

This test finds expected counts based on the overall frequencies, adjusted for the totals in each group under the null hypothesis assumption that the distributions are the same for each group. The p-value is calculated from a chi-square model with $(R-1)\times(C-1)$ degrees of freedom.


---
## Other Chi-square Procedures
### Homogeneity

$$H_0: \text{The levels of Variable 1 have the same distribution (are homogeneous) across all C groups.} \\
H_A: \text{The levels of Variable 1 do not have the same distribution across all C groups.}$$

The test for homogeneity is a generalization of the two-sample test for a difference in proportions, only now we have more than two proportions to compare (one corresponding to each level of the null hypothesis).


**Assumptions:**

Each of the groups must be independent of one another. Also, the expected cell counts must all be at least 5.

A distinguishing feature of the test of homogeneity, is that here we do not need the random sample assumption to hold unless we want to generalize our results to a larger population.

---
## Other Chi-square Procedures
### Independence

|    |    |    | Variable 2    |    |    |
|------------|---------|------------|---------|---------|---------|
|            |         | Level A    | Level B | $\dots$ | Level C |
|            | Level 1 |      24      |   32      |         |    18     |
|            | Level 2 |      12      |   36      |         |     19    |
| **Variable 1** | Level 3 |      34      |   35      |         |     22    |
|            | $\dots$ |              |           |         |           |
|            | Level R |      10      |   28      |         |     27    |

A test of whether two categorical variables are independent examines the distribution of counts for one group of individuals classified according to two categorical variables.

This test finds expected counts by assuming that knowing the marginal totals tells us the cell frequencies, assuming the null hypothesis is true, that there is no association between the variables. The p-value is calculated from a chi-square model with $(R-1)\times(C-1)$ degrees of freedom.

---
## Other Chi-square Procedures
### Independence

$$H_0: \text{Variable 1 is independent of Variable 2.}\\
H_{A}: \text{Variable 1 and Variable 2 are not independent.}$$

If we reject the null hypothesis, that does **not** mean that one variable depends on the other. Dependence suggests a model or pattern. When two categorical variables fail a test of independence, this could be for many different reasons so we instead say that there is evidence that the variables are *associated*. Thus the chi-square test for independence is **not** a way to check an independence assumption necessary for another statistical test.


**Assumptions:**

The expected cell counts must all be at least 5 and the data must be a simple random sample from the population.

---
## More Chi-square Procedures

- Goodness-of-fit tests compare the observed distribution of a single categorical variable to an expected (or theorized) distribution.

- Tests of homogeneity compare the distribution of several different groups for the same categorical variable. (You can think of the group status as another binary, categorical variable.)

- Tests of independence examine counts from a single group for evidence of an association between two categorical variables.


---
## More Chi-square Procedures
### In R

We use the same R function to conduct each of these three chi-square tests.

```{r eval=FALSE}
## Goodness-of-fit test
chisq.test(data_vector, p=prob_vector)

## Test of homogeneity or of independence
chisq.test(data_table)  ## or alternatively
chisq.test(factor1, factor2)
```


---
## More Chi-square Procedures
### Example 1

A brokerage firm wants to see whether the type of account a customer has (Silver, Gold, or Platinum) affects the type of trades that customer makes (in person, by phone, or online). It collects a random sample of trades made for its customers over the past year and preforms a chi-square test for **independence**.

.scroll-small[
|            |         |            | Trade type   |         |
|------------|---------|------------|--------------|---------|
|            |         | In person  | By phone     | Online  |
|            | Silver  |      52     |   57        |     220 |
| **Account type** | Gold |      103 |   124       |     104 |
|           | Platinum |      132    |   24        |   7     |

```{r echo=TRUE}
data_table <- matrix( c(52, 57, 220, 103, 124,104, 132, 24, 7), byrow=TRUE, ncol=3)
chisq.test(data_table)
```
]

---
## More Chi-square Procedures
### Example 2

The same brokerage firm wants to know if the type of account affects the size of the account (in dollars). It performs a Chi-square test of **homogeneity** to see if the mean size of the account (categorized as large, medium, or small) is the same for the three account types.


|            |         |        | Account Size  |         |
|------------|---------|--------|---------------|---------|
|            |         | Large  | Medium        |  Small  |
|            | Silver  |      52     |   57        |     220 |
| **Account type** | Gold |      103 |   124       |     104 |
|           | Platinum |      132    |   24        |   7     |

---
## More Chi-square Procedures
### Example 2


.scroll-output[
```{r echo=TRUE}
data_object <- tibble(type = factor(c(rep("Silver", 52+57+220),
                               rep("Gold", 103+124+104),
                               rep("Platinum", 132+24+7))),
                      size = factor(c(rep("Large", 52), rep("Medium", 57), rep("Small",220),
                               rep("Large", 103), rep("Medium", 124), rep("Small", 104),
                               rep("Large", 132), rep("Medium", 24), rep("Small", 7))))
table(data_object)
chisq.test(data_object$type, data_object$size)
```
]