AppliedEconometrics/11_IV2.Rmd at master · yutatoyama/AppliedEconometrics · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
---
output:
  pdf_document: default
  html_document: default
---

# Instrumental Variable 2: Implementation in R

## Example 1: Wage regression

- Use dataset "Mroz", cross-sectional labor force participation data that accompany "Introductory Econometrics" by Wooldridge.
    - Original data from  *"The Sensitivity of an Empirical Model of Married Women's Hours of Work to Economic and Statistical Assumptions"* by Thomas Mroz published in *Econometrica* in 1987.
    - Detailed description of data: https://www.rdocumentation.org/packages/npsf/versions/0.4.2/topics/mroz


```{r}

library("foreign")

# You might get a message "cannot read factor labels from Stata 5 files", but you do not have to worry about it.
data <- read.dta("MROZ.DTA")


```

- Describe data
```{r}
library(stargazer)
stargazer(data, type = "text")
```

- Consider the wage regression
$$
\log(w_i) = \beta_0 + \beta_1 educ_i + \beta_2 exper_i + \beta_3 exper_i^2 + \epsilon_i
$$
- We assume that $exper_i$ is exogenous but $educ_i$ is endogenous.
- As an instrument for $educ_i$, we use the years of schooling for his or her father and mother, which we call $fathereduc_i$ and $mothereduc_i$.
- Discussion on these IVs will be later.

```{r}

library("AER")
library("dplyr")

# data cleaning
data %>%
  select(lwage, educ, exper, expersq, motheduc, fatheduc) %>%
  filter( is.na(lwage) == 0 ) -> data

# OLS regression
result_OLS <- lm( lwage ~ educ + exper + expersq, data = data)

# IV regression using fathereduc and mothereduc
result_IV <- ivreg(lwage ~ educ + exper + expersq | fatheduc + motheduc + exper + expersq, data = data)

# Robust standard errors
# gather robust standard errors in a list
rob_se <- list(sqrt(diag(vcovHC(result_OLS, type = "HC1"))),
               sqrt(diag(vcovHC(result_IV, type = "HC1"))))

# Show result
stargazer(result_OLS, result_IV, type ="text", se = rob_se)


```

- How about the first stage? You should always check this!!

```{r}

# First stage regression

result_1st <- lm(educ ~ motheduc + fatheduc + exper + expersq, data = data)

# F test
linearHypothesis(result_1st,
                 c("fatheduc = 0", "motheduc = 0" ),
                 vcov = vcovHC, type = "HC1")

```

### Discussion on IV

- Labor economists have used family background variables as IVs for education.
- Relevance: OK from the first stage regression.
- Independence: A bit suspicious. Parents' education would be correlated with child's ability through quality of nurturing at an early age.
- Still, we can see that these IVs can mitigate (though may not eliminate completely) the omitted variable bias.
- Discussion on the validity of instruments is crucial in empirical research.


## Example 2: Estimation of the Demand for Cigaretts

- Demand model is a building block in many branches of Economics.
- For example, health economics is concerned with the study of how health-affecting behavior of individuals is influenced by the health-care system and regulation policy.
- Smoking is a prominent example as it is related to many illnesses and negative externalities.
- It is plausible that cigarette consumption can be reduced by taxing cigarettes more heavily.
- Question: how much taxes must be increased to reach a certain reduction in cigarette consumption? -> Need to know **price elasticity of demand** for cigaretts.

- Use `CigarrettesSW` in the package `AER`.
- a panel data set that contains observations on cigarette consumption and several economic indicators for all 48 continental federal states of the U.S. from 1985 to 1995.
- What is **panel data**? The data involves both time series and cross-sectional information.
    - The variable is denoted as $y_{it}$, which indexed by individual $i$ and time $t$.
    - Cross section data $y_i$: information for a particular individual $i$ (e.g., income for person $i$).
    - Time series data $y_t$: information for a particular time period (e.g., GDP in year $y$)
    - Panel data $y_{it}$: income of person $i$ in year $t$.
- We will see more on panel data later in this course. For now, we use the panel data as just  cross-sectional data (**pooled cross-sections**)


```{r}
# load the data set and get an overview
library(AER)
data("CigarettesSW")
summary(CigarettesSW)
```

- Consider the following model
$$
\log (Q_{it}) = \beta_0 + \beta_1 \log (P_{it}) + \beta_2 \log(income_{it}) + u_{it}
$$
where
    - $Q_{it}$ is the number of packs per capita in state $i$ in year $t$,
    - $P_{it}$ is the after-tax average real price per pack of cigarettes, and
    - $income_{it}$ is the real income per capita. This is demand shifter.
- As an IV for the price, we use the followings:
    - $SalesTax_{it}$: the proportion of taxes on cigarettes arising from the general sales tax.
        - Relevant as it is included in the after-tax price
        - Exogenous(indepndent) since the sales tax does not influence demand directly, but indirectly through the price.
    - $CigTax_{it}$: the cigarett-specific taxes


```{r}
library(dplyr)
CigarettesSW %>%
  mutate( rincome = (income / population) / cpi) %>%
  mutate( rprice  = price / cpi ) %>%
  mutate( salestax = (taxs - tax) / cpi ) %>%
  mutate( cigtax = tax/cpi ) -> Cigdata

```

- Let's run the regressions

```{r}

cig_ols <- lm(log(packs) ~ log(rprice) + log(rincome) ,  data = Cigdata)
#coeftest(cig_ols, vcov = vcovHC, type = "HC1")


cig_ivreg <- ivreg(log(packs) ~ log(rprice) + log(rincome)  |
                    log(rincome) +  salestax +  cigtax, data = Cigdata)
#coeftest(cig_ivreg, vcov = vcovHC, type = "HC1")

# Robust standard errors
rob_se <- list(sqrt(diag(vcovHC(cig_ols, type = "HC1"))),
               sqrt(diag(vcovHC(cig_ivreg, type = "HC1"))))

# Show result
stargazer(cig_ols, cig_ivreg, type ="text", se = rob_se)

```


- The first stage regression


```{r}

# First stage regression

result_1st <- lm(log(rprice) ~ log(rincome) + log(rincome) + salestax + cigtax , data= Cigdata)

# F test
linearHypothesis(result_1st,
                 c("salestax = 0", "cigtax = 0" ),
                 vcov = vcovHC, type = "HC1")

```


## Example 3: Effects of Turnout on Partisan Voting

- THOMAS G. HANSFORD and BRAD T. GOMEZ "Estimating the Electoral Effects of Voter Turnout" The American Political Science Review Vol. 104, No. 2 (May 2010), pp. 268-288
    - Link: https://www.cambridge.org/core/journals/american-political-science-review/article/estimating-the-electoral-effects-of-voter-turnout/8A880C28E79BE770A5CA1A9BB6CF933C
- Here, we will see a simplified version of their analysis.
- The dataset is [here](HansfordGomez_Data.csv)

```{r}
library(readr)

HGdata <- read_csv("HansfordGomez_Data.csv")

stargazer::stargazer(as.data.frame(HGdata), type="text")

```

- Data description:

| Name | Description|
|---|-------------------------------------------------------------------|
| Year | Election Year|
| FIPS_County | FIPS County Code|
| Turnout | Turnout as Pcnt VAP|
| Closing2 | Days between registration closing date and election|
| Literacy | Literacy Test|
| PollTax | Poll Tax|
| Motor | Motor Voter|
| GubElection | Gubernatorial Election in State|
| SenElection | U.S. Senate Election in State|
| GOP_Inc | Republican Incumbent|
| Yr52 | 1952 Dummy|
| Yr56 | 1956 Dummy|
| Yr60 | 1960 Dummy|
| Yr64 | 1964 Dummy|
| Yr68 | 1968 Dummy|
| Yr72 | 1972 Dummy|
| Yr76 | 1976 Dummy|
| Yr80 | 1980 Dummy|
| Yr84 | 1984 Dummy|
| Yr88 | 1988 Dummy|
| Yr92 | 1992 Dummy|
| Yr96 | 1996 Dummy|
| Yr2000 | 2000 Dummy|
| DNormPrcp_KRIG | Election day rainfall - differenced from normal rain for the day|
| GOPIT | Turnout x Republican Incumbent|
| DemVoteShare2_3MA | Partisan composition measure = 3 election moving avg. of Dem Vote Share|
| DemVoteShare2 | Democratic Pres Candidate's Vote Share|
| RainGOPI | Rainfall measure x Republican Incumbent|
| TO_DVS23MA | Turnout x Partisan Composition measure|
| Rain_DVS23MA | Rainfall measure x Partisan composition measure|
| dph | =1 if home state of Dem pres candidate|
| dvph | =1 if home state of Dem vice pres candidate|
| rph | =1 if home state of Rep pres candidate|
| rvph | =1 if home state of Rep vice pres candidate|
| state_del | avg common space score for the House delegation|
| dph_StateVAP | = dph*State voting age population|
| dvph_StateVAP | = dvph*State voting age population|
| rph_StateVAP | = rph*State voting age population|
| rvph_StateVAP | = rvph*State voting age population|
| State_DVS_lag | State-wide Dem vote share, lagged one election|
| State_DVS_lag2 | State_DVS_lag squared|

 - Consider the following regression
 $$
DemoShare_{it} = \beta_0 + \beta_1 Turnout_{it} + u_t + u_{it}
 $$
where
    - $Demoshare_{it}$: Two-party vote share for Democrat candidate in county $i$ in the presidential election in year $t$
    - $Turnout_{it}$: Turnout rate in county $i$ in the presidential election in year $t$
    - $u_t$: **Year fixed effects**. Time dummies for each presidential election year
- As an IV, we use the rainfall measure denoted by `DNormPrcp_KRIG`


```{r}


# You can do this, but it is tedious.
hg_ols <- lm( DemVoteShare2 ~ Turnout + Yr52 + Yr56 + Yr60 + Yr64 + Yr68 +
                Yr72 + Yr76 + Yr80 + Yr84 + Yr88 + Yr92 + Yr96 + Yr2000,  data = HGdata)
#coeftest(hg_ols, vcov = vcovHC, type = "HC1")

# By using "factor(Year)" as an explanatory variable, the regression automatically incorporates the dummies for each value.
hg_ols <- lm( DemVoteShare2 ~ Turnout + factor(Year)   ,  data = HGdata)
#coeftest(hg_ols, vcov = vcovHC, type = "HC1")

# Iv regression
hg_ivreg <- ivreg( DemVoteShare2 ~ Turnout + factor(Year) |
                    factor(Year) + DNormPrcp_KRIG, data = HGdata)
#coeftest(hg_ivreg, vcov = vcovHC, type = "HC1")

# Robust standard errors
rob_se <- list(sqrt(diag(vcovHC(hg_ols, type = "HC1"))),
               sqrt(diag(vcovHC(hg_ivreg, type = "HC1"))))

# Show result
stargazer(hg_ols, hg_ivreg, type ="text", se = rob_se)

# First stage regression
hg_1st <- lm(Turnout ~  factor(Year) + DNormPrcp_KRIG, data= HGdata)

# F test
linearHypothesis(hg_1st,
                 c("DNormPrcp_KRIG = 0" ),
                 vcov = vcovHC, type = "HC1")

```

- We will see this paper more in excercise 4 (due on July 9th).