# Overview of surveys {#c02-overview-surveys}
## Introduction
Developing surveys to gather accurate information about populations involves an intricate and time-intensive process. Researchers can spend months, or even years, developing the study design, questions, and other methods for a single survey to ensure high-quality data is collected.
\index{Research topic|(} \index{Question of interest|see {Research topic}} \index{Research question|see {Research topic}} \index{Burden|(} \index{Respondent burden|see {Burden}} \index{Survey burden|see {Burden}} \index{Survey life cycle|(}
Before analyzing survey data, we recommend understanding the entire survey life cycle. This understanding can provide better insight into what types of analyses should be conducted on the data. The survey life cycle consists of the necessary stages to execute a survey project successfully. Each stage influences the survey's timing, costs, and feasibility, consequently impacting the data collected and how we should analyze them. Figure \@ref(fig:overview-diag) shows a high-level overview of the survey process.
```{r}
#| label: overview-diag
#| echo: false
#| fig.cap: "Overview of the survey process"
#| fig.alt: "Diagram of survey process beginning with survey concept (first level), then sampling design, questionnaire design, and data collection planning (second level), data collection (third level), post-survey processing (fourth level), analysis (fifth level), and reporting (sixth level)"
library(DiagrammeR)
mermaid("
graph TD
A[Survey Concept]-->B[Sampling Design]
A-->C[Questionnaire Design]
A-->D[Data Collection Planning]
B-->E[Data Collection]
C-->E
D-->E
E-->F[Post-Survey Processing]
F-->G[Analysis]
G-->H[Reporting]
style A fill: #bfd7ea, stroke: #0b3954
style B fill: #bfd7ea, stroke: #0b3954
style C fill: #bfd7ea, stroke: #0b3954
style D fill: #bfd7ea, stroke: #0b3954
style E fill: #bfd7ea, stroke: #0b3954
style F fill: #bfd7ea, stroke: #0b3954
style G fill: #bfd7ea, stroke: #0b3954
style H fill: #bfd7ea, stroke: #0b3954
")
```
\index{Questionnaire|(}
The survey life cycle starts with a research topic or question of interest (e.g., the impact that childhood trauma has on health outcomes later in life). Drawing from available resources can result in a reduced burden on respondents, lower costs, and faster research outcomes. Therefore, we recommend reviewing existing data sources to determine if data that can address this question are already available. However, if existing data cannot answer the nuances of the research question, we can capture the exact data we need through a questionnaire, or a set of questions.\index{Research topic|)} \index{Burden|)} \index{Survey life cycle|)} \index{Questionnaire|)}
To gain a deeper understanding of survey design and implementation, we recommend reviewing several pieces of existing literature in detail [e.g., @biemer2003survqual; @Bradburn2004; @dillman2014mode; @groves2009survey; @Tourangeau2000psych; @valliant2013practical].
## Searching for public-use survey data
Throughout this book, we use public-use datasets from different surveys, including the American National Election Studies (ANES), the Residential Energy Consumption Survey (RECS), the National Crime Victimization Survey (NCVS), and the AmericasBarometer surveys.
\index{Research topic|(}As mentioned above, we should look for existing data that can provide insights into our research questions before embarking on a new survey. One of the greatest sources of data is the government. For example, in the U.S., we can get data directly from the various statistical agencies such as the U.S. Energy Information Administration or Bureau of Justice Statistics. Other countries often have data available through official statistics offices, such as the Office for National Statistics in the United Kingdom.\index{Research topic|)}
In addition to government data, many researchers make their data publicly available through repositories such as the [Inter-university Consortium for Political and Social Research (ICPSR)](https://www.icpsr.umich.edu/web/pages/ICPSR/ssvd/) or the [Odum Institute Data Archive](https://odum.unc.edu/archive/). Searching these repositories or other compiled lists (e.g., [Analyze Survey Data for Free](https://asdfree.com)) can be an efficient way to identify surveys with questions related to our research topic.
## Pre-survey planning {#pre-survey-planning}
There are multiple things to consider when starting a survey, and many of them revolve around minimizing error. Errors are the differences between the true values of the variables being studied and the values obtained through the survey. Each step and decision made before the launch of the survey impacts the types of errors introduced into the data, which in turn impacts how to interpret the results.
\index{Representation|(}\index{Total survey error|(}Generally, survey researchers consider there to be seven main sources of error that fall under either Representation or \index{Measurement}Measurement [@groves2009survey]:
- Representation
- \index{Coverage error|(}Coverage Error: A mismatch between the \index{Population of interest|(}\index{Target population|see {Population of interest}}population of interest\index{Population of interest|)} and \index{Sampling frame|(}the sampling frame, the list from which the sample is drawn.\index{Coverage error|)}\index{Sampling frame|)}
- \index{Sampling error|(}Sampling Error: \index{Sampling frame|(}\index{Sample|(}Error produced when selecting a sample, the subset of the population, from the sampling frame.\index{Sampling frame|)} This error is due to randomization, and we discuss how to quantify it in Chapter \@ref(c10-sample-designs-replicate-weights). There is no sampling error in a census, as there is no randomization. Sampling error reflects the variation among all potential samples that could be drawn under the same sampling method.\index{Sampling error|)}\index{Sample|)}
- \index{Nonresponse error|(}Nonresponse Error: Differences between those who responded and \index{Unit nonresponse|(}did not respond to the survey (unit nonresponse)\index{Unit nonresponse|)} or \index{Item nonresponse|(}a given question (item nonresponse).\index{Nonresponse error|)}\index{Item nonresponse|)}
- \index{Adjustment error|(}Adjustment Error: Error introduced during post-survey statistical adjustments. \index{Representation|)}
- \index{Measurement|(}Measurement\index{Adjustment error|)}
- \index{Validity|(}Validity: A mismatch between the research topic and the question(s) used to collect that information.\index{Validity|)}
- \index{Measurement error|(}Measurement Error: A mismatch between what the researcher asked and how the respondent answered.\index{Measurement error|)}
- \index{Processing error|(}Processing Error: Edits by the researcher to responses provided by the respondent (e.g., adjustments to data based on illogical responses).\index{Measurement|)}\index{Processing error|)}
Almost every survey has errors. Researchers attempt to conduct a survey that reduces the total survey error, or the accumulation of all errors that may arise throughout the survey life cycle. By assessing these different types of errors together, researchers can seek strategies to maximize the overall survey quality and improve the reliability and validity of results [@tse-doc]. However, attempts to reduce individual error sources (and therefore total survey error) come at the price of time and money. For example:
- \index{Coverage error|(}\index{Sampling frame|(}Coverage Error Tradeoff: Researchers can search for or create more accurate and updated sampling frames, but they can be difficult to construct or obtain.\index{Coverage error|)}\index{Sampling frame|)}
- \index{Sampling error|(}Sampling Error Tradeoff: Researchers can increase the sample size to reduce sampling error; however, larger samples can be expensive and time-consuming to field.\index{Sampling error|)}
- \index{Nonresponse error|(}Nonresponse Error Tradeoff: Researchers can increase or diversify efforts to improve survey participation, but this may be resource-intensive while not entirely removing nonresponse bias.\index{Nonresponse error|)}
- \index{Weighting|(}\index{Weights|see {Weighting}}Adjustment Error Tradeoff: Weighting is a statistical technique used to adjust the contribution of individual survey responses to the final survey estimates. It is typically done to make the sample more representative of the population of interest. However, if researchers do not carefully execute the adjustments or base them on inaccurate information, they can introduce new biases, leading to less accurate estimates.\index{Weighting|)}
- \index{Validity|(}Validity Error Tradeoff: Researchers can increase validity through a variety of ways, such as using established scales or collaborating with a psychometrician during survey design to pilot and evaluate questions. However, doing so increases the amount of time and resources needed to complete survey design.\index{Validity|)}
- \index{Measurement error|(}\index{Questionnaire testing|(}\index{Piloting|see {Questionnaire testing}} \index{Cognitive interview|(}Measurement Error Tradeoff: Researchers can use techniques such as questionnaire testing and cognitive interviewing to ensure respondents are answering questions as expected. However, these activities require time and resources to complete.\index{Measurement error|)} \index{Questionnaire testing|)} \index{Cognitive interview|)}
- \index{Processing error|(}Processing Error Tradeoff: Researchers can impose rigorous data cleaning and validation processes. However, this requires supervision, training, and time.\index{Processing error|)}
The challenge for survey researchers is to find the optimal tradeoffs among these errors. They must carefully consider ways to reduce each error source and total survey error while balancing their study's objectives and resources.
For survey analysts, understanding the decisions researchers made to minimize these error sources can inform how results are interpreted. The remainder of this chapter explores critical considerations for survey development. We explore how to consider each of these error sources and how they can inform the interpretation of the data.\index{Total survey error|)}
## Study design {#overview-design}
\index{Survey life cycle|(} \index{Sampling frame|(} \index{Study design|(}From formulating methodologies to choosing an appropriate sampling frame, the study design phase is where the blueprint for a successful survey takes shape. \index{Population of interest|(}Study design encompasses multiple parts of the survey life cycle, including decisions on the population of interest, \index{Population of interest|)} \index{Mode|(}\index{Survey mode|see {Mode}}survey mode (the format through which a survey is administered to respondents)\index{Mode|)}, timeline, and questionnaire design. Deciding whom to survey and how to survey them depends on the study's goals and the feasibility of implementation. This section explores the strategic planning that lays the foundation for a survey.\index{Sampling frame|)} \index{Survey life cycle|)}
### Sampling design {#overview-design-sampdesign}
\index{Population of interest|(}The set or group we want to survey is known as the population of interest or the target population. The population of interest could be broad, such as “all adults age 18+ living in the U.S.” or a specific population based on a particular characteristic or location. For example, we may want to know about "adults aged 18--24 who live in North Carolina" or "eligible voters living in Illinois." \index{Population of interest|)}
\index{Sampling frame|(}However, a sampling frame with contact information is needed to survey individuals in these populations of interest. If we are looking at eligible voters, the sampling frame could be the voting registry for a given state or area. For broader populations of interest, like all adults in the United States, the sampling frame is likely imperfect; no full list of individuals in the United States is available to serve as a sampling frame. Instead, we may choose to use a sampling frame of mailing addresses and send the survey to households, or we may choose to use random digit dialing (RDD) and call random phone numbers (that may or may not be assigned, connected, and working).
\index{Coverage error|(}These imperfect sampling frames can result in coverage error where there is a mismatch between the population of interest and the list of individuals we can select. For example, if we are looking to obtain estimates for "all adults aged 18+ living in the U.S.," a sampling frame of mailing addresses will miss specific types of individuals, such as the homeless, transient populations, and incarcerated individuals. Additionally, many households have more than one adult resident, so we would need to consider how to get a specific individual to fill out the survey (called within household selection) or adjust the population of interest to report on "U.S. households" instead of "individuals."\index{Coverage error|)}
Once we have selected the sampling frame, the next step is determining how to select individuals for the survey. In rare cases, we may conduct a census and survey everyone on the sampling frame. However, implementing a questionnaire at that scale is something only a few can do (e.g., government censuses). \index{Weighting|(}Instead, we typically choose to sample individuals and use weights to estimate numbers in the population of interest. We can use a variety of different sampling methods, and more information on these can be found in Chapter \@ref(c10-sample-designs-replicate-weights). \index{Sampling error|(}The choice of sampling method impacts sampling error and can be accounted for in weighting.\index{Sampling error|)}\index{Sampling frame|)}\index{Weighting|)}
#### Example: Number of pets in a household {.unnumbered #overview-design-sampdesign-ex}
Let's use a simple example where we are interested in the average number of pets in a household. We need to consider the population of interest for this study. Specifically, are we interested in all households in a given country or households in a more local area (e.g., city or state)? Let's assume we are interested in the number of pets in a U.S. household with at least one adult (18 years or older). \index{Coverage error|(}\index{Sampling frame|(}In this case, a sampling frame of mailing addresses would introduce only a small amount of coverage error as the frame would closely match our population of interest.\index{Coverage error|)} Specifically, we would likely want to use the Computerized Delivery Sequence File (CDSF), which is a file of mailing addresses that the United States Postal Service (USPS) creates and covers nearly 100% of U.S. households [@harter2016address]. To sample these households, for simplicity, \index{Stratified sampling|(}we use a stratified simple random sample design (see Chapter \@ref(c10-sample-designs-replicate-weights) for more information on sample designs), where we randomly sample households within each state (i.e., we stratify by state).\index{Stratified sampling|)}\index{Sampling frame|)}
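As a concrete illustration, below is a minimal sketch of drawing a stratified simple random sample in base R. The `frame` object is a small hypothetical stand-in for an address-based frame like the CDSF, and the per-state sample size is invented for the example.
```{r}
#| label: overview-sampdesign-sketch
#| eval: false
# Hypothetical address frame: one row per household, with its state
set.seed(2023)
frame <- data.frame(
  address_id = 1:1000,
  state = sample(c("NC", "SC", "VA"), size = 1000, replace = TRUE)
)
# Stratified SRS: independently sample n_h = 50 households within each state
n_h <- 50
sampled <- do.call(rbind, lapply(split(frame, frame$state), function(stratum) {
  stratum[sample(nrow(stratum), size = n_h), ]
}))
```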
Throughout this chapter, we build on this example research question to plan a survey.
### Data collection planning {#overview-design-dcplanning}
\index{Mode|(} \index{Data collection|(}
With the sampling design decided, researchers can then decide how to survey these individuals. Specifically, the modes used for contacting and surveying the sample, how frequently to send reminders and follow-ups, and the overall timeline of the study are some of the major data collection determinations. Traditionally, survey researchers have considered there to be four main modes^[Other modes such as using mobile apps or text messaging can also be considered, but at the time of publication, they have smaller reach or are better for longitudinal studies (i.e., surveying the same individuals over many time periods of a single study).]:
- Computer-Assisted Personal Interview (CAPI; also known as face-to-face or in-person interviewing)
- Computer-Assisted Telephone Interview (CATI; also known as phone or telephone interviewing)
- Computer-Assisted Web Interview (CAWI; also known as web or online interviewing)
- Paper and Pencil Interview (PAPI)
We can use a single mode to collect data or multiple modes (also called mixed-modes). Using mixed-modes can allow for broader reach and increase response rates depending on the population of interest [@biemer_choiceplus; @deLeeuw2005; @DeLeeuw_2018]. For example, we could both call households to conduct a CATI survey and send mail with a PAPI survey to the household. By using both modes, we could gain participation through the mail from individuals who do not pick up the phone to unknown numbers or through the phone from individuals who do not open all of their mail. However, mode effects (where responses differ based on the mode of response) can be present in the data and may need to be considered during analysis.
\index{Sampling frame|(}When selecting which mode, or modes, to use, understanding the unique aspects of the chosen population of interest and sampling frame provides insight into how they can best be reached and engaged. For example, if we plan to survey adults aged 18--24 who live in North Carolina, asking them to complete a survey using CATI (i.e., over the phone) would likely not be as successful as other modes like the web. This age group does not talk on the phone as much as other generations and often does not answer phone calls from unknown numbers. Additionally, the mode for contacting respondents relies on what information is available in the sampling frame. For example, if our sampling frame includes an email address, we could email our selected sample members to convince them to complete a survey. Alternatively, if the sampling frame is a list of mailing addresses, we could contact sample members with a letter.
It is important to note that there can be a difference between the contact and survey modes. For example, if we have a sampling frame with addresses, we can send a letter to our sample members and provide information on completing a web survey.\index{Sampling frame|)} Another option is using mixed-mode surveys by mailing sample members a paper and pencil survey but also including instructions to complete the survey online. \index{Nonresponse error|(}\index{Unit nonresponse|(}Combining different contact modes and different survey modes can be helpful in reducing unit nonresponse error--where the entire unit (e.g., a household) does not respond to the survey at all--as different sample members may respond better to different contact and survey modes. \index{Burden|(} However, when considering which modes to use, it is important to make access to the survey as easy as possible for sample members to reduce burden and unit nonresponse.\index{Mode|)} \index{Burden|)}
Another way to reduce unit nonresponse error is by varying the language of the contact materials [@dillman2014mode]. People are motivated by different things, so constantly repeating the same message may not be helpful. Instead, mixing up the messaging and the type of contact material the sample member receives can increase response rates and reduce the unit nonresponse error. For example, instead of only sending standard letters, we could consider sending mailings that invoke "urgent" or "important" thoughts by sending priority letters or using other delivery services like FedEx, UPS, or DHL.\index{Nonresponse error|)}\index{Unit nonresponse|)}
A study timeline may also determine the number and types of contacts. If the timeline is long, there is plentiful time for follow-ups and diversified messages in contact materials. If the timeline is short, then fewer follow-ups can be implemented. Many studies start with the tailored design method put forth by @dillman2014mode and implement five contacts:
* Pre-notification (Pre-notice) to let sample members know the survey is coming
* Invitation to complete the survey
* Reminder that also thanks respondents who have already completed the survey
* Reminder (with a replacement paper survey if needed)
* Final reminder
This method is easily adaptable based on the study timeline and needs but provides a starting point for most studies.
#### Example: Number of pets in a household {.unnumbered #overview-design-dcplanning-ex}
Let's return to our example of the average number of pets in a household. \index{Nonresponse error|(}\index{Sampling frame|(}\index{Unit nonresponse|(}We are using a sampling frame of mailing addresses, so we recommend starting our data collection with letters mailed to households; later in data collection, we want to send interviewers to households to conduct in-person (or CAPI) interviews to decrease unit nonresponse error.\index{Nonresponse error|)}\index{Sampling frame|)}\index{Unit nonresponse|)} This means we have two contact modes (paper and in-person). \index{Mode|(}As mentioned above, the survey mode does not have to be the same as the contact mode, so we recommend a mixed-mode study with both web and CAPI modes. Let's assume we have 6 months for data collection, so we could recommend the protocol in Table \@ref(tab:prot-examp):
Table: (\#tab:prot-examp) Protocol example for 6-month web and CAPI data collection
| Week | Contact Mode | Contact Message | Survey Mode Offered |
|:----:|-----------|------------------|---------------|
| 1 | Mail: Letter | Pre-notice | --- |
| 2 | Mail: Letter | Invitation | Web |
| 3 | Mail: Postcard | Thank You/Reminder | Web |
| 6 | Mail: Letter in large envelope | Animal Welfare Discussion | Web |
| 10 | Mail: Postcard | Inform Upcoming In-Person Visit | Web |
| 14 | In-Person Visit | --- | CAPI |
| 16 | Mail: Letter | Reminder of In-Person Visit | Web, but includes a number to call to schedule CAPI |
| 20 | In-Person Visit | --- | CAPI |
| 25 | Mail: Letter in large envelope | Survey Closing Notice | Web, but includes a number to call to schedule CAPI |
This is just one possible protocol that we can use that starts respondents with the web (typically done to reduce costs). However, we could begin in-person data collection earlier during the data collection period or ask interviewers to attempt more than two visits with a household.\index{Mode|)} \index{Data collection|)}
### Questionnaire design {#overview-design-questionnaire}
\index{Research topic|(} \index{Burden|(} When developing the questionnaire, it can be helpful to first outline the topics to be asked and note why each question or topic is important to the research question(s). This can help us better tailor the questionnaire and reduce the number of questions (and thus the burden on the respondent) if topics are deemed irrelevant to the research question.\index{Burden|)} \index{Weighting|(}When making these decisions, we should also consider questions needed for weighting.\index{Research topic|)} While we would love to have everyone in our population of interest answer our survey, this rarely happens. \index{Nonresponse error|(}\index{Item nonresponse|(}Thus, including questions about demographics in the survey can assist with weighting for nonresponse errors (both unit and item nonresponse).\index{Nonresponse error|)}\index{Item nonresponse|)}\index{Weighting|)} \index{Coverage error|(}\index{Sampling error|(}Knowing the details of the sampling plan and what may impact coverage error and sampling error can help us determine what types of demographics to include. Thus, questionnaire design is typically done in conjunction with sampling design.\index{Coverage error|)}\index{Sampling error|)}
We can benefit from the work of others by using questions from other surveys. Demographic questions, such as those on race, ethnicity, or education, are often borrowed from a government census or other official surveys. Question banks such as the [ICPSR variable search](https://www.icpsr.umich.edu/web/pages/ICPSR/ssvd/) can provide additional potential questions.
\index{Research topic|(} \index{Questionnaire testing|(}
If a question does not exist in a question bank, we can craft our own. When developing survey questions, we should start with the research topic and attempt to write questions that match the concept. \index{Validity|(}The closer the question is to the overall concept, the better the validity. For example, if we want to know how people consume T.V. series and movies but only ask a question about how many T.V.s are in the house, then we would be missing other ways that people watch T.V. series and movies, such as on other devices or at places outside of the home. As mentioned above, we can employ techniques to increase the validity of questionnaires. For example, questionnaire testing involves piloting the survey instrument to identify and fix potential issues before conducting the main survey. \index{Cognitive interview|(}Additionally, we could conduct cognitive interviews -- a technique where we walk through the survey with participants, encouraging them to speak their thoughts out loud to uncover how they interpret and understand survey questions. \index{Research topic|)}\index{Validity|)} \index{Questionnaire testing|)} \index{Cognitive interview|)}
\index{Mode|(}Additionally, when designing questions, we should consider the mode for the survey and adjust the language appropriately.\index{Mode|)} In self-administered surveys (e.g., web or mail), respondents can see all the questions and response options, but that is not the case in interviewer-administered surveys (e.g., CATI or CAPI). With interviewer-administered surveys, the response options must be read aloud to the respondents, so the question may need to be adjusted to create a better flow to the interview. \index{Measurement error|(}Additionally, with self-administered surveys, because the respondents are viewing the questionnaire, the formatting of the questions is even more critical to ensure accurate measurement. Incorrect formatting or wording can result in measurement error, so following best practices or using existing validated questions can reduce error. \index{Mode|(}There are multiple resources to help researchers draft questions for different modes [e.g., @Bradburn2004; @dillman2014mode; @Fowler1989; @Tourangeau2004spacing].\index{Measurement error|)}\index{Mode|)}
#### Example: Number of pets in a household {.unnumbered #overview-design-questionnaire-ex}
As part of our survey on the average number of pets in a household, we may want to know what animal most people prefer to have as a pet. Let's say we have a question in our survey as displayed in Figure \@ref(fig:overview-pet-examp1).
```{r}
#| label: overview-pet-examp1
#| echo: false
#| fig.cap: Example question asking pet preference type
#| fig.alt: Example question asking "What animal do you prefer to have as a pet?" with response options of Dogs and Cats.
#| out.width: 70%
#| fig.align: center
knitr::include_graphics(path="images/PetExample1.png")
```
\index{Validity|(}This question may have validity issues, as it only provides the options of "dogs" and "cats" to respondents, and the interpretation of the data could be incorrect. For example, if we had 100 respondents who answered the question and 50 selected dogs, then the results of this question cannot be "50% of the population prefers to have a dog as a pet," as only two response options were provided.\index{Validity|)} \index{Measurement error|(}If a respondent taking our survey prefers turtles, they could either be forced to choose a response between these two (i.e., interpret the question as "between dogs and cats, which do you prefer?" and result in measurement error)\index{Measurement error|)}, or \index{Nonresponse error|(}\index{Item nonresponse|(}they may not answer the question (which results in item nonresponse error).\index{Nonresponse error|)}\index{Item nonresponse|)} Based on this, the interpretation of this question should be, "When given a choice between dogs and cats, 50% of respondents preferred to have a dog as a pet."
To avoid this issue, we should consider these possibilities and adjust the question accordingly. One simple way could be to add an "other" response option to give respondents a chance to provide a different response. The "other" response option could then include a way for respondents to write their other preference. For example, we could rewrite this question as displayed in Figure \@ref(fig:overview-pet-examp2).
```{r}
#| label: overview-pet-examp2
#| echo: false
#| fig.cap: Example question asking pet preference type with other specify option
#| fig.alt: Example question asking "What animal do you prefer to have as a pet?" with response options of Dogs, Cats, and Other. The other option includes an open-ended box after for write in responses.
#| out.width: 70%
#| fig.align: center
knitr::include_graphics(path="images/PetExample2.png")
```
We can then code the responses from the open-ended box and get a better understanding of the respondent's choice of preferred pet. Interpreting this question becomes easier as researchers no longer need to qualify the results with the choices provided.
This is a simple example of how the presentation of the question and options can impact the findings. For more complex topics and questions, we must thoroughly consider how to mitigate any impacts from the presentation, formatting, wording, and other aspects. For survey analysts, reviewing not only the data but also the wording of the questions is crucial to ensure the results are presented in a manner consistent with the question asked. Chapter \@ref(c03-survey-data-documentation) provides further details on how to review existing survey documentation to inform our analyses, and Chapter \@ref(c08-communicating-results) goes into more details on communicating results.
\index{Study design|)}
## Data collection {#overview-datacollection}
\index{Data collection|(}
Once data collection starts, we try to stick to the data collection protocol designed during pre-survey planning. However, effective researchers also prepare to adjust their plans and adapt as needed to the current progress of data collection [@Schouten2018]. Extreme examples include natural disasters that prevent mailings or keep interviewers from reaching sample members; in such cases, an in-person survey may need to pivot quickly to a self-administered survey, or the field period may be delayed. Other adjustments can be smaller: if something newsworthy occurs that is connected to the survey, we could choose to highlight it in communication materials. In addition to these external factors, there can be factors unique to the survey, such as lower response rates for a specific subgroup, in which case the data collection protocol may need to be adjusted to improve response rates for that group.
\index{Data collection|)}
## Post-survey processing {#overview-post}
After data collection, various activities need to be completed before we can analyze the survey. \index{Weighting|(}Multiple decisions made during this post-survey phase can assist us in reducing different error sources, such as weighting to account for the sample selection. Knowing the decisions made in creating the final analytic data can impact how we use the data and interpret the results.\index{Weighting|)}
### Data cleaning and imputation {#overview-post-cleaning}
\index{Imputation|(}Post-survey cleaning is one of the first steps we take to turn survey responses into an analytic dataset. Data cleaning can consist of correcting inconsistent data (e.g., resolving skip-pattern errors or ensuring that related questions throughout the survey are consistent with each other), editing numeric entries or open-ended responses for grammar and consistency, or recoding open-ended questions into categories for analysis. There is no universal set of fixed rules that every survey must adhere to. Instead, each survey or research study should establish its own guidelines and procedures for handling various cleaning scenarios based on its specific objectives.
\index{Processing error|(}We should use our best judgment to ensure data integrity, and all decisions should be documented and available to those using the data in the analysis. Each decision we make impacts processing error, so often, multiple people review these rules or recode open-ended data and adjudicate any differences in an attempt to reduce this error. \index{Processing error|)}
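As a small illustration of what such a rule might look like in practice, here is a minimal sketch that flags a hypothetical skip-pattern inconsistency for review: a respondent who reported having no pets but still provided a pet count. All data and variable names are invented for this example.
```{r}
#| label: overview-cleaning-sketch
#| eval: false
library(dplyr)
# Hypothetical raw responses: num_pets should be skipped when has_pets == "No"
raw <- tibble::tribble(
  ~resp_id, ~has_pets, ~num_pets,
  1,        "Yes",     2,
  2,        "No",      NA,
  3,        "No",      1  # inconsistent: a count was entered despite "No"
)
# Flag inconsistent records for review and adjudication rather than editing silently
raw %>%
  mutate(flag_skip_error = has_pets == "No" & !is.na(num_pets))
```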
\index{Nonresponse error|(}\index{Item nonresponse|(} \index{Missing data|(}Another crucial step in post-survey processing is imputation. Often, there is item nonresponse where respondents do not answer specific questions. If the questions are crucial to analysis efforts or the research question, we may implement imputation to reduce item nonresponse error. Imputation is a technique for replacing missing or incomplete data values with estimated values. \index{Processing error|(}However, as imputation is a way of assigning values to missing data based on an algorithm or model, it can also introduce processing error, so we should consider the overall implications of imputing data compared to having item nonresponse.\index{Processing error|)}\index{Item nonresponse|)} There are multiple ways to impute data. We recommend reviewing other resources like @Kim2021 for more information. \index{Imputation|)}\index{Nonresponse error|)} \index{Missing data|)}
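For a flavor of what imputation can look like in code, here is a deliberately naive sketch that fills in missing pet counts with a within-state mean. Real studies typically use model-based or multiple imputation approaches such as those described in @Kim2021, and all data and variable names here are hypothetical.
```{r}
#| label: overview-imputation-sketch
#| eval: false
library(dplyr)
# Hypothetical respondent data with item nonresponse on num_pets
resp <- tibble::tibble(
  state    = c("NC", "NC", "NC", "VA", "VA"),
  num_pets = c(2, NA, 1, 0, NA)
)
# Naive within-state mean imputation (illustration only)
resp %>%
  group_by(state) %>%
  mutate(num_pets_imp = if_else(is.na(num_pets),
                                mean(num_pets, na.rm = TRUE),
                                num_pets)) %>%
  ungroup()
```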
#### Example: Number of pets in a household {.unnumbered #overview-post-cleaning-ex}
Let's return to the question we created to ask about [animal preference](#overview-design-questionnaire-ex). The "other specify" invites respondents to specify the type of animal they prefer to have as a pet. If respondents entered answers such as "puppy," "turtle," "rabit," "rabbit," "bunny," "ant farm," "snake," "Mr. Purr," then we may wish to categorize these write-in responses to help with analysis. In this example, "puppy" could be assumed to be a reference to a "Dog" and could be recoded there. The misspelling of "rabit" could be coded along with "rabbit" and "bunny" into a single category of "Bunny or Rabbit." These are relatively standard decisions that we can make. The remaining write-in responses could be categorized in a few different ways. "Mr. Purr," which may be someone's reference to their own cat, could be recoded as "Cat," or it could remain as "Other" or some category that is "Unknown." Depending on the number of responses related to each of the others, they could all be combined into a single "Other" category, or maybe categories such as "Reptiles" or "Insects" could be created. Each of these decisions may impact the interpretation of the data, so we should document the types of responses that fall into each of the new categories and any decisions made.
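One way these recoding rules might be written is with dplyr's `case_when()`, as in the sketch below. The categories mirror the decisions discussed above, and whatever rules are chosen should be documented alongside the data.
```{r}
#| label: overview-recode-sketch
#| eval: false
library(dplyr)
other_responses <- c("puppy", "turtle", "rabit", "rabbit", "bunny",
                     "ant farm", "snake", "Mr. Purr")
# Map write-in responses to analysis categories; unmatched values stay "Other"
recoded <- case_when(
  tolower(other_responses) == "puppy" ~ "Dog",
  tolower(other_responses) %in% c("rabit", "rabbit", "bunny") ~ "Bunny or Rabbit",
  tolower(other_responses) %in% c("turtle", "snake") ~ "Reptiles",
  tolower(other_responses) == "ant farm" ~ "Insects",
  TRUE ~ "Other"  # e.g., "Mr. Purr" stays "Other" pending review
)
table(recoded)
```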
### Weighting {#overview-post-weighting}
\index{Weighting|(}We can address some error sources identified in the previous sections using weighting. During the weighting process, weights are created for each respondent record. These weights allow the survey responses to generalize to the population. A weight, generally, reflects how many units in the population each respondent represents. Often, the weight is constructed such that the sum of the weights is the size of the population.
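To make that property concrete, here is a minimal sketch that computes base weights for a stratified simple random sample as N_h / n_h (the stratum population size divided by the stratum sample size), assuming for simplicity that everyone sampled responds; all counts are hypothetical.
```{r}
#| label: overview-weights-sketch
#| eval: false
# Hypothetical stratum counts: N_h households on the frame, n_h sampled
strata <- data.frame(
  state = c("NC", "SC", "VA"),
  N_h   = c(4000, 2500, 3500),
  n_h   = c(50, 50, 50)
)
# Base weight: each respondent represents N_h / n_h households in their state
strata$base_weight <- strata$N_h / strata$n_h
# Summing the weight over all sampled units recovers the population size
sum(strata$base_weight * strata$n_h)  # 10000, the total households on the frame
```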
\index{Coverage error|(}\index{Adjustment error|(}Weights can address coverage, sampling, and nonresponse errors.\index{Coverage error|)}\index{Adjustment error|)} Many published surveys include an "analysis weight" variable that combines these adjustments. However, weighting itself can also introduce adjustment error, so we need to balance which types of errors should be corrected with weighting. The construction of weights is outside the scope of this book; we recommend referencing other materials if interested in weight construction [@Valliant2018weights]. Instead, this book assumes the survey has been completed, weights are constructed, and data are available to users.
#### Example: Number of pets in a household {.unnumbered #overview-post-weighting-ex}
In the simple example of our survey, we decided to obtain a random sample from each state to select our sample members. Knowing this sampling design, we can include selection weights for analysis that account for how the sample members were selected for the survey. \index{Research topic|(}\index{Sampling frame|(}Additionally, the sampling frame may have the type of building associated with each address, so we could include the building type as a potential nonresponse weighting variable, along with some interviewer observations that may be related to our research topic of the average number of pets in a household.\index{Sampling frame|)} Combining these weights, we can create an analytic weight that analysts need to use when analyzing the data.\index{Research topic|)}\index{Weighting|)}
### Disclosure {#overview-post-disclosure}
\index{Research topic|(}Before data are released publicly, we need to ensure that individual respondents cannot be identified by the data when confidentiality is required. A variety of methods can be used; here, we describe a few of the most common:
- Data swapping: We may swap specific data values across different respondents so that it does not impact insights from the data but ensures that specific individuals cannot be identified.
- Top/bottom coding: We may choose top or bottom coding to mask extreme values. For example, we may top-code income values such that households with income greater than \$500,000 are coded as "\$500,000 or more," with other incomes presented as integers between \$0 and \$499,999. This can impact analyses at the tails of the distribution (see the code sketch after this list).
- Coarsening: We may use coarsening to mask unique values. For example, a survey question may ask for a precise income but the public data may include income as a categorical variable. Another example commonly used in survey practice is to coarsen geographic variables. Data collectors likely know the precise address of sample members, but the public data may only include the state or even region of respondents.
- Perturbation: We may add random noise to outcomes. As with swapping, this is done so that it does not impact insights from the data but ensures that specific individuals cannot be identified.
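As a brief sketch of what top-coding and coarsening might look like in code, the example below applies both to a handful of hypothetical income values, using the \$500,000 threshold from the example above.
```{r}
#| label: overview-disclosure-sketch
#| eval: false
income <- c(42000, 125000, 650000, 87000, 1200000)  # hypothetical values
# Top-coding: mask values at or above the $500,000 threshold
income_topcoded <- ifelse(income >= 500000, "$500,000 or more",
                          as.character(income))
# Coarsening: release income only as broad categories
income_coarse <- cut(income,
                     breaks = c(0, 50000, 150000, 500000, Inf),
                     labels = c("Under $50,000", "$50,000 to under $150,000",
                                "$150,000 to under $500,000",
                                "$500,000 or more"),
                     right = FALSE)
```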
There is as much art as there is science to the methods used for disclosure. Only high-level comments about the disclosure are provided in the survey documentation, not specific details. This ensures nobody can reverse the disclosure and thus identify individuals. For more information on different disclosure methods, please see @Skinner2009 and the [AAPOR Standards](https://aapor.org/standards-and-ethics/disclosure-standards/).
### Documentation {#overview-post-documentation}
Documentation is a critical step of the survey life cycle. We should systematically record all the details, decisions, procedures, and methodologies to ensure transparency, reproducibility, and the overall quality of survey research.
Proper documentation allows analysts to understand, reproduce, and evaluate the study's methods and findings. Chapter \@ref(c03-survey-data-documentation) dives into how analysts should use survey data documentation.
## Post-survey data analysis and reporting
After completing the survey life cycle, the data are ready for analysts. Chapter \@ref(c04-getting-started) continues from this point. For more information on the survey life cycle, please explore the references cited throughout this chapter.