Merge branch 'bl-ch1'
benslack19 committed May 19, 2024
2 parents 3b3bf7c + 734f61c commit 25ca4e1
Showing 6 changed files with 388 additions and 35 deletions.
4 changes: 2 additions & 2 deletions _posts/2024-05-12-experimental-survey-designs.md
@@ -21,9 +21,9 @@ Notes for Chapter 2 of [Causal Inference with Survey Data](https://www.linkedin.
| ----- | ----- | ------- |
| Simple randomization | Assigns equal probability to treatment and control groups |- Easy to implement<br>- Can lead to unequal group sizes in small trials |
| Block randomization | Achieves similar group sizes by dividing subjects into blocks of a predetermined size (often a multiple of the number of groups, like 4, 8, etc. for two groups). Within each block, participants are then randomly assigned to treatment groups. |- Requires choosing the total number of subjects in each block <br>- Can often avoid imbalance in small trials seen with simple randomization |
| Stratified randomization | Balance based on characteristics/covariates (like age, sex, etc.) before randomizing within these strata |<ul><li>Ensures balance in important covariates between groups |
| Stratified randomization | Balance based on characteristics/covariates (like age, sex, etc.) before randomizing within these strata |- Ensures balance in important covariates between groups |
| Cluster randomization | Randomizes entire groups (like schools or hospitals) |- Suitable for group-level interventions or when individual assignment is impractical |
| Covariate adaptive randomization | Increases the probability of being assigned to a group to address a deficit of a particular characteristic within the group |<ul><li>Effective in trials with small sample sizes or multiple important covariates |
| Covariate adaptive randomization | Increases the probability of being assigned to a group to address a deficit of a particular characteristic within the group |- Effective in trials with small sample sizes or multiple important covariates |

- Before randomization, the required number of observations is typically calculated from a target effect size (power analysis). You need to consider a measure of variability (standard deviation), the significance level, and the desired power.
- You can make a graph of sample size on y-axis and effect size on x-axis to see the relationship.
197 changes: 197 additions & 0 deletions _posts/2024-05-16-cross-sectional-survey-designs.md
@@ -0,0 +1,197 @@
---
title: "Cross-Sectional Survey Designs"
mathjax: true
toc: true
toc_sticky: true
categories: [data science, statistics]
---

Notes for Chapter 3 of [Causal Inference with Survey Data](https://www.linkedin.com/learning/causal-inference-with-survey-data/surveys-with-cross-sectional-data?autoSkip=true&resume=false&u=185169545) on LinkedIn Learning, given by Franz Buscha. I'm using this series of posts to take some notes.


```python
import graphviz as gr
```


```python
def draw_causal_graph(
    edge_list, node_props=None, edge_props=None, graph_direction="UD"
):
    """Utility to draw a causal (directed) graph
    Taken from: https://github.com/dustinstansbury/statistical-rethinking-2023/blob/a0f4f2d15a06b33355cf3065597dcb43ef829991/utils.py#L52-L66
    """
    g = gr.Digraph(graph_attr={"rankdir": graph_direction})

    # Add each edge, applying any per-edge styling that was supplied
    edge_props = {} if edge_props is None else edge_props
    for e in edge_list:
        props = edge_props[e] if e in edge_props else {}
        g.edge(e[0], e[1], **props)

    # Optionally style individual nodes
    if node_props is not None:
        for name, props in node_props.items():
            g.node(name=name, **props)
    return g
```

# Cross-sectional survey designs

- It's a snapshot in time, capturing information from many subjects.
- Most common type of survey.

**Examples**
1. Census surveys. Provides a snapshot of a country's population (e.g. US Census done every 10 years).
1. Expenditure surveys. Information on buying habits (e.g. annual Consumer Expenditure Survey).
1. Labor force surveys. Collect data on employment (e.g. UK Labour Force Survey, conducted quarterly).

**Advantages**
- Availability
- Cheap to conduct
- Versatility in topics

**Disadvantages**
- Lack of temporal data
- Sampling, selection, and response bias
- Lack of depth (limited data on complex issues)

**Statistical Framework**
- A key to working with cross-sectional data is the $i$ subscript, such as in the form:

$$ Y_i = \beta_0 + \beta_1X1_i + \beta_2X2_i + \dots + \epsilon_i $$

- The $i$ denotes different observations in the data (e.g. subjects or entities at a single point in time)
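
To make the $i$ subscript concrete, here is a minimal sketch that simulates a cross-sectional data set (one row per respondent $i$) and fits the equation above by ordinary least squares. The sample size, the coefficients in `beta_true`, and the use of NumPy are assumptions made for illustration; none of them come from the course.

```python
import numpy as np

# Hypothetical simulated survey: one row per respondent i, single time point.
rng = np.random.default_rng(0)
n = 5_000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
eps = rng.normal(size=n)  # error term epsilon_i

# Illustrative "true" coefficients (beta_0, beta_1, beta_2); assumed values.
beta_true = np.array([1.0, 2.0, -0.5])
y = beta_true[0] + beta_true[1] * x1 + beta_true[2] * x2 + eps

# Design matrix: a column of ones for the intercept, then X1 and X2.
X = np.column_stack([np.ones(n), x1, x2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)
```

Each element of `y`, `x1`, and `x2` corresponds to one observation $i$, and the estimates in `beta_hat` should land close to `beta_true` because the simulated error is independent of the regressors.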

**Conclusion**
- Broad applicability across many themes.
- Explanatory variables must be used in innovative ways for cause-and-effect analysis.

# Regression analysis

- A fundamental statistical method
- A powerful tool for controlling observable factors
- Mainstay of causal analysis

**DAG: Controlling for Observable Factors**

- A regression model can answer this question: What is the causal effect of X on Y?




```python
draw_causal_graph(
    edge_list=[("X1i", "Yi"), ("&#x03B5;", "Yi")],
    edge_props={("&#x03B5;", "Yi"): {"style": "dashed", "label": "&beta;"}},
    graph_direction="LR",
)
```





![svg](/assets/2024-05-16-cross-sectional-survey-designs_files/2024-05-16-cross-sectional-survey-designs_6_0.svg)




Other factors that are not seen in the survey data are summed up in the hidden error term.

$ Y_i = \beta_0 + \beta_1X1_i + \epsilon_i $

- Regression can control for many observable factors
- Each effect estimated in a regression model holds the other variables in the model constant
- Causal inference relies on there being no confounders (exogeneity assumption)
- Variables that involve no choice are often exogenous (sex, age, parents' birthplace, etc.). These are variables that are "hard to influence".
- The exogeneity assumption can be difficult to defend. Many factors may drive both Y and X1, which creates a backdoor pathway.



```python
# `&#x03B5;` is unicode for epsilon since `&epsilon;` fails to render
draw_causal_graph(
    edge_list=[("X1i", "Yi"), ("&#x03B5;", "Yi"), ("&#x03B5;", "X1i")],
    edge_props={
        ("X1i", "Yi"): {"label": "&beta;1"},
        ("&#x03B5;", "Yi"): {"style": "dashed"},
        ("&#x03B5;", "X1i"): {"style": "dashed", "label": "backdoor"},
    },
    graph_direction="LR",
)
```





![svg](/assets/2024-05-16-cross-sectional-survey-designs_files/2024-05-16-cross-sectional-survey-designs_8_0.svg)




If the backdoor is present, then the estimate of $\beta_1$ will not be correct.
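
The size of the error can be written down with the standard omitted-variable-bias result. As a sketch, assume the hidden error term contains one unobserved factor $X2_i$, so that $\epsilon_i = \beta_2X2_i + u_i$ with $u_i$ unrelated to $X1_i$ and $X2_i$ (the symbols $X2_i$, $\beta_2$, $u_i$, and $\tilde{\beta}_1$ are introduced here for illustration). Regressing $Y_i$ on $X1_i$ alone then recovers, in large samples:

$$ \tilde{\beta}_1 = \frac{\mathrm{Cov}(X1_i, Y_i)}{\mathrm{Var}(X1_i)} = \beta_1 + \beta_2 \frac{\mathrm{Cov}(X1_i, X2_i)}{\mathrm{Var}(X1_i)} $$

The second term is the bias. It vanishes only if the hidden factor has no effect on $Y_i$ ($\beta_2 = 0$) or is uncorrelated with $X1_i$, which is the exogeneity assumption.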

But imagine that $X2_i$, previously hidden in the error term, can be observed. A new DAG might look like this.


```python
draw_causal_graph(
    edge_list=[("X1i", "Yi"), ("&#x03B5;", "Yi"), ("X2i", "X1i"), ("X2i", "Yi")],
    edge_props={
        ("X1i", "Yi"): {"label": "&beta;1"},
        ("X2i", "Yi"): {"label": "&beta;2"},
        ("&#x03B5;", "Yi"): {"style": "dashed"},
    },
    graph_direction="LR",
)
```





![svg](/assets/2024-05-16-cross-sectional-survey-designs_files/2024-05-16-cross-sectional-survey-designs_10_0.svg)




$ Y_i = \beta_0 + \beta_1X1_i + \beta_2X2_i + \epsilon_i $

$X2$ is now specifically controlled for.
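
A small simulation can illustrate what controlling for $X2$ buys. This is a sketch under assumed values: the data-generating process, the coefficients, and the helper `ols` are all hypothetical and chosen only so that $X2$ drives both $X1$ and $Y$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

# Hypothetical data-generating process: X2 drives both X1 and Y.
x2 = rng.normal(size=n)
x1 = 0.8 * x2 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)  # assumed true beta_1 = 2.0


def ols(y, *regressors):
    """Return OLS coefficients, intercept first (illustrative helper)."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


beta_short = ols(y, x1)     # X2 left in the error term: backdoor open
beta_long = ols(y, x1, x2)  # X2 included as a regressor: backdoor closed

print(f"beta_1 without X2: {beta_short[1]:.2f}")
print(f"beta_1 with X2:    {beta_long[1]:.2f}")
```

With these assumed values, the estimate that omits `x2` should sit well above 2.0, while the estimate that includes it should be close to 2.0.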

**Triangular Tables**

A way to observe how a regression model's estimates change as variables are added incrementally. Be careful of overfitting, since every added variable makes the model more flexible. Knowing which variables to include requires some domain knowledge.
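
The sketch below builds such a table, assuming that "triangular" refers to nested models shown side by side, each adding one more variable, so that the blank cells form a triangle. The simulated variables, their coefficients, and the table layout are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000

# Hypothetical covariates: X2 and X3 both influence X1 and Y.
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
x1 = 0.5 * x2 + 0.5 * x3 + rng.normal(size=n)
y = 1.0 * x1 + 2.0 * x2 - 1.0 * x3 + rng.normal(size=n)

names = ["X1", "X2", "X3"]
columns = [x1, x2, x3]

# Fit nested models: (X1), then (X1, X2), then (X1, X2, X3).
table = np.full((len(names), len(names)), np.nan)
for k in range(1, len(names) + 1):
    X = np.column_stack([np.ones(n), *columns[:k]])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    table[:k, k - 1] = coef[1:]  # store slopes only, drop the intercept

# Rows are variables, columns are models; empty cells form the triangle.
header = "".join(f"{'Model ' + str(k):>10}" for k in range(1, len(names) + 1))
print(f"{'':<6}{header}")
for name, row in zip(names, table):
    cells = "".join(
        f"{'' if np.isnan(v) else format(v, '.2f'):>10}" for v in row
    )
    print(f"{name:<6}{cells}")
```

Reading across the `X1` row shows how its coefficient moves as each additional control enters the model.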

**Advantages**
- Flexibility in variables
- Many different forms for different data
- Easy to understand

**Disadvantages**
- Often too simple
- Cannot control for unobserved confounders

**Conclusion**
- Don't dismiss basic regression
- Underpins more complex models
- Works well with large surveys and many variables


```python
%load_ext watermark
%watermark -n -u -v -iv -w
```

Last updated: Sun May 19 2024

Python implementation: CPython
Python version : 3.11.7
IPython version : 8.21.0

graphviz: 0.20.1

Watermark: 2.4.3

