addressr/README.Rmd at main · EvictionLab/addressr · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
---
output: github_document
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

# addressr

<!-- badges: start -->
<!-- badges: end -->

The goal of addressr is to standardize address cleaning for various datasets used by the Eviction Lab.

## Installation

You can install the development version of addressr from [GitHub](https://github.com/) with:

``` r
# install.packages("pak")
pak::pak("EvictionLab/addressr")
```

## Usage

Addresses can be difficult to work with and messy in many different ways.

Here is an example of a relatively clean table:

```{r message=FALSE}
library(dplyr)

address_table <- dplyr::tribble(
  ~address, ~city, ~state,
  "456 Jersey Avenue #102", "Montclair", "NJ",
  "123-125 N Street Rd", "Cincinnati", "OH",
  "789 Pirate Cv East DBA ELAB LLC", "Memphis", "TN",
  "3548 1ST ST FL 1", "St. Louis", "MO",
)
```

Geocoders can be thrown off by various things, such as address ranges, unit numbers, directionals, and street endings.

The `clean_address()` function streamlines address cleaning and outputs data into a standard format

```{r message=FALSE}
library(addressr)

cleaned_addresses <- address_table |> clean_address(address) |> janitor::remove_empty("cols")

cleaned_addresses
```

By default, the function returns all address components. You can also select the output:

```{r}
address_table |> clean_address(address, output = c("clean_address", "short_address", "street_number", "unit", "extra"))
```

`clean_address` will return the street number, pre-direction, street name, street suffix and post-direction. `short_address` will return the street number and street name.

The package can also separate rows with a street range or multiple addresses:

```{r}
address_table <- address_table |>
  add_row(address = "928-928 S Montgomery Ave 1500 Jefferson Rd", city = "New York", state = "NY") |>
  add_row(address = "1500-1502 1550 Ptree Rd", city = "Atlanta", state = "GA")

address_table |> clean_address(address, separate_street_range = TRUE, separate_multi_address = TRUE)
```

If there is a common pattern that is not removed with the function, you can use `extract_remove_squish()`, to pre-clean the data.

```{r, message=FALSE}
address_table |>
  add_row(address = "246 S Bend St Unit 530 3 Bedroom") |>
  extract_remove_squish(address, other, "\\d Bedroom")
```

To switch a column from the abbreviated spelling to the long format, there are two functions: `switch_abbreviation` and `str_replace_names`

- `switch_abbreviation()` for abbreviations included in the `address_abbreviations` dataset: `r stringr::str_flatten_comma((unique(address_abbreviations$type)), last = ", and ")`.
  - `all_street_suffixes` has a one-to-many relationship and should only be used `long-to-short` to standardize spellings into the same official abbreviation (AV, AVE, AVN to AVE)
  - `official_street_suffixes` has a one-to-one relationship between short and long spellings. It can be used either `short-to-long` or `long-to-short` (assuming all endings are in the official format).

```{r}
address_table |>
  clean_address(address) |>
  mutate(
    street_suffix_short = switch_abbreviation(street_suffix, "official_street_suffixes", "long-to-short"),
    .keep = "used"
    )
```

- `str_replace_names()` to replace any vector with another vector of the same length

```{r}
address_table |>
  mutate(state = str_replace_names(state, state.abb, state.name))
```