diff --git a/episodes/clean-data.Rmd b/episodes/clean-data.Rmd
index d95c75f3..6f108a5b 100644
--- a/episodes/clean-data.Rmd
+++ b/episodes/clean-data.Rmd
@@ -1,17 +1,18 @@
---
title: 'Clean case data'
-teaching: 30
-exercises: 10
+teaching: 45
+exercises: 15
---

-:::::::::::::::::::::::::::::::::::::: questions
+:::::::::::::::::::::::::::::::::::::: questions

- How to clean and standardize case data?
+
::::::::::::::::::::::::::::::::::::::::::::::::

::::::::::::::::::::::::::::::::::::: objectives

-- Explain how to clean, curate, and standardize case data using `{cleanepi}` package
+- Explain how to clean, curate, and standardize case data using the `{cleanepi}` package.
- Perform essential data-cleaning operations on a real case dataset.

::::::::::::::::::::::::::::::::::::::::::::::::

@@ -20,24 +21,53 @@ exercises: 10

In this episode, we will use a simulated Ebola dataset that can be:

-- Download the [simulated_ebola_2.csv](https://epiverse-trace.github.io/tutorials-early/data/simulated_ebola_2.csv)
-- Save it in the `data/` folder. Follow instructions in Setup to [configure an RStudio Project and folder](../learners/setup.md#setup-an-rstudio-project-and-folder)
+- Download the [simulated\_ebola\_2.csv](https://epiverse-trace.github.io/tutorials-early/data/simulated_ebola_2.csv)
+- Save it in the `data/` folder.
+- Follow instructions in Setup to [configure an RStudio Project and folder](../learners/setup.md#setup-an-rstudio-project-and-folder)

:::::::::::::::::::::

## Introduction

-In the process of analyzing outbreak data, it's essential to ensure that the dataset is clean, curated, standardized, and validated. This will ensure that analysis is accurate (i.e. you are analysing what you think you are analysing) and reproducible (i.e. if someone wants to go back and repeat your analysis steps with your code, you can be confident they will get the same results).
-This episode focuses on cleaning epidemics and outbreaks data using the
-[cleanepi](https://epiverse-trace.github.io/cleanepi/) package,
-For demonstration purposes, we'll work with a simulated dataset of Ebola cases.
-Let's start by loading the package `{rio}` to read data and the package `{cleanepi}`
-to clean it. We'll use the pipe `%>%` to connect some of their functions, including others from
-the package `{dplyr}`, so let's also call to the tidyverse package:
+In the process of analyzing outbreak data, as in other disciplines of data science, it's essential to ensure that the dataset is clean, curated, standardized, and validated.
+This will facilitate accurate (i.e., you are analysing what you think you are analysing) and reproducible (i.e., if someone wants to go back and repeat your analysis steps with your code, you can be confident they will get the same results) analysis.
+This episode focuses on cleaning epidemic and outbreak data using the [`{cleanepi}`](https://epiverse-trace.github.io/cleanepi/) package.
+For demonstration purposes, we'll work with a simulated dataset of Ebola cases.
+
+### Set Up
+
+In addition to the `{cleanepi}` package, we will use the following R packages in this data cleaning workflow:
+
+- `{here}` for easy file referencing,
+- `{rio}` to import the data into R,
+- `{dplyr}` to perform some data processing operations,
+- `{magrittr}` to use its **pipe operator (`%>%`)**.
+
+We encourage users with recent versions of R (version > 4.4.1) to use the base R pipe operator (`|>`) instead of `%>%`.
+
+We also encourage using the `{pak}` package when installing R packages as shown below.
+You can refer to the [{pak} reference document](https://pak.r-lib.org/reference/features.html) for more details about the advantages of using this.
+
+```{r, eval=TRUE, message=FALSE, warning=FALSE}
+# Check if a package is already installed and install it if not
+
+# nolint start
+if (!require("pak")) install.packages("pak")
+if (!require("here")) pak::pak("here")
+if (!require("rio")) pak::pak("rio")
+if (!require("dplyr")) pak::pak("dplyr")
+if (!require("magrittr")) pak::pak("magrittr")
+if (!require("cleanepi")) pak::pak("cleanepi")
+# nolint end
-```{r,eval=TRUE,message=FALSE,warning=FALSE}
# Load packages
-library(tidyverse) # for {dplyr} functions and the pipe %>%
+library(dplyr) # for {dplyr} functions and the pipe %>%
library(rio) # for importing data
library(here) # for easy file referencing
library(cleanepi)
@@ -47,24 +77,22 @@ library(cleanepi)

### The double-colon (`::`) operator

-The `::` in R lets you access functions or objects from a specific package without attaching the entire package to the search path. It offers several important
-advantages including the followings:
+The `::` in R lets you access functions or objects from a specific package without attaching the entire package to the search path.
+It offers several important advantages, including the following:

-* Telling explicitly which package a function comes from, reducing ambiguity and potential conflicts when several packages have functions with the same name.
-* Allowing to call a function from a package without loading the whole package
-with library().
+- Stating explicitly which package a function comes from, reducing ambiguity and potential conflicts when several packages have functions with the same name.
+- Allowing you to call a function from a package without loading the whole package
+  with `library()`.

-For example, the command `dplyr::filter(data, condition)` means we are calling
-the `filter()` function from the `{dplyr}` package.
+For example, the command `dplyr::filter(data, condition)` means we are calling the `filter()` function from the `{dplyr}` package.
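+As a quick, self-contained illustration of this point (our own example, using R's built-in `mtcars` dataset rather than the Ebola data), the chunk below calls a `{dplyr}` function through `::` without attaching the package with `library()`:

```{r}
# Call a function from {dplyr} explicitly via `::`;
# no library(dplyr) call is needed for this to work
dplyr::filter(mtcars, cyl == 6)
```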
:::::::::::::::::::

+The first step is to import the dataset into the working environment.
+This can be done by following the guidelines outlined in the [Read case data](../episodes/read-cases.Rmd) episode.
+It involves loading the dataset into the `R` environment and viewing its structure and content.
-The first step is to import the dataset into working environment. This can be
-done by following the guidelines outlined in the [Read case data](../episodes/read-cases.Rmd) episode. It involves loading the dataset into
-`R` environment and view its structure and content.
-
-```{r,eval=FALSE,echo=TRUE,message=FALSE}
+```{r, eval=FALSE, echo=TRUE, message=FALSE}
# Read data
# e.g.: if path to file is data/simulated_ebola_2.csv then:
raw_ebola_data <- rio::import(
@@ -73,7 +101,7 @@ raw_ebola_data <- rio::import(
dplyr::as_tibble() # for a simple data frame output
```

-```{r,eval=TRUE,echo=FALSE,message=FALSE}
+```{r, eval=TRUE, echo=FALSE, message=FALSE}
# Read data
raw_ebola_data <- rio::import(
file.path("data", "simulated_ebola_2.csv")
@@ -88,7 +116,8 @@ raw_ebola_data

::::::::::::::::: discussion

-Let's first **diagnose** the data frame. List all the characteristics in the data frame above that are problematic for data analysis.
+Let's first **diagnose** the data frame.
+List all the characteristics in the data frame above that are problematic for data analysis.

Are any of those characteristics familiar from any previous data analysis you have performed?

@@ -96,15 +125,15 @@ Are any of those characteristics familiar from any previous data analysis you ha

::::::::::::::::::: instructor

-Lead a short discussion to relate the diagnosed characteristics with required cleaning operations.
+Lead a short discussion to relate the diagnosed characteristics to the required cleaning operations.
-You can use the following terms to **diagnose characteristics**: +You can use the following terms to **diagnose characteristics**: - *Codification*, like the codification of values in columns like 'gender' and 'age' using numbers, letters, and words. Also the presence of multiple dates -formats ("dd/mm/yyyy", "yyyy/mm/dd", etc) in the same column like in -'date_onset'. Less visible, but also the column names. + formats ("dd/mm/yyyy", "yyyy/mm/dd", etc) in the same column like in + 'date\_onset'. Less visible, but also the column names. - *Missing*, how to interpret an entry like "" in the 'status' column or "-99" -in other circumstances? Do we have a data dictionary from the data collection process? + in other circumstances? Do we have a data dictionary from the data collection process? - *Inconsistencies*, like having a date of sample before the date of onset. - *Non-plausible values*, like observations where some dates values are outside of the expected timeframe. - *Duplicates*, are all observations unique? @@ -114,25 +143,25 @@ You can use these terms to relate to **cleaning operations**: - Standardize column name - Standardize categorical variables like 'gender' - Standardize date columns -- Convert character values into numeric +- Convert character values into numeric - Check the sequence of dated events :::::::::::::::::::::::::::::: -## A quick inspection +## A quick inspection -Quick exploration and inspection of the dataset are crucial to identify -potential data issues before diving into any analysis tasks. The `{cleanepi}` -package simplifies this process with the `scan_data()` function. Let's take a look at how you can use it: +Quick exploration and inspection of the dataset are crucial to identify potential data issues before diving into any analysis tasks. +The `{cleanepi}` package simplifies this process with the `scan_data()` function. 
+Let's take a look at how you can use it:

```{r}
-cleanepi::scan_data(raw_ebola_data)
+cleanepi::scan_data(raw_ebola_data, format = "percentage")
```

-The results provide an overview of the content of all character columns, including column names, and the percent of some data types within them.
-You can see that the column names in the dataset are descriptive but lack consistency. Some are composed of multiple words separated by white spaces. Additionally, some columns contain more than one data type, and there are
-missing values in the form of an empty string in others.
+You can see that the column names in the dataset are descriptive but lack consistency.
+Some are composed of multiple words separated by white spaces.
+Additionally, some columns contain more than one data type, and there are missing values in the form of an empty string in others.

## Common operations

This section demonstrates how to perform some common data cleaning operations using `{cleanepi}`.

### Standardizing column names

-For this example dataset, standardizing column names typically involves removing with spaces and connecting different words with “_”. This practice helps
-maintain consistency and readability in the dataset. However, the function used for standardizing column names offers more options. Type `?cleanepi::standardize_column_names` in the console for more details.
+For this example dataset, standardizing column names typically involves removing spaces and connecting different words with "\_".
+This practice helps maintain consistency and readability in the dataset.
+However, the function used for standardizing column names offers more options.
+Type `?cleanepi::standardize_column_names` in the console for more details.
```{r}
sim_ebola_data <- cleanepi::standardize_column_names(raw_ebola_data)
names(sim_ebola_data)
```

-If you want to maintain certain column names without subjecting them to the standardization process, you can utilize the `keep` argument of the function `cleanepi::standardize_column_names()`. This argument accepts a vector of
-column names that are intended to be kept unchanged.
+If you want to maintain certain column names without subjecting them to the standardization process, you can utilize the `keep` argument of the function `cleanepi::standardize_column_names()`.
+This argument accepts a vector of column names that are intended to be kept unchanged.

::::::::::::::::::::::::::::::::::::: challenge

@@ -167,8 +198,9 @@ You can try `cleanepi::standardize_column_names(data = raw_ebola_data, keep = "V1")`

### Removing irregularities

-Raw data may contain fields that don't add any variability to the data such as **empty** rows and columns, or **constant** columns (where all entries have the same value). It can also contain **duplicated** rows. Functions from
-`{cleanepi}` like `remove_duplicates()` and `remove_constants()` remove such irregularities as demonstrated in the below code chunk.
+Raw data may contain fields that don't add any variability to the data, such as **empty** rows and columns, or **constant** columns (where all entries have the same value).
+It can also contain **duplicated** rows.
+Functions from `{cleanepi}` like `remove_duplicates()` and `remove_constants()` remove such irregularities, as demonstrated in the code chunk below.

```{r}
# Remove constants
@@ -189,12 +221,12 @@ columns. -->

#### How many rows did you remove? Which rows were removed?

-You can get the number and location of the duplicated rows that where found. Run `cleanepi::print_report()`, wait for the report to open in your browser, and
-find the "Duplicates" tab.
+You can get the number and location of the duplicated rows that were found.
+Run `cleanepi::print_report()`, wait for the report to open in your browser, and find the "Duplicates" tab.

To use this information within R, you can print data frames with specific sections of the report in the console using the argument `what`.

-```{r,eval=FALSE,echo=TRUE}
+```{r, eval=FALSE, echo=TRUE}
# Print a report of found duplicates
cleanepi::print_report(data = sim_ebola_data,
what = "found_duplicates")
@@ -208,7 +240,7 @@ cleanepi::print_report(data = sim_ebola_data, what = "removed_duplicates")

In the following data frame:

-```{r,echo=FALSE,eval=TRUE}
+```{r, echo=FALSE, eval=TRUE}
library(tidyverse)

#create dataset
@@ -255,13 +287,15 @@ Point out to learners that they create a different set of constant data after re

df <- df %>% cleanepi::remove_constants(cutoff = 0.5)
```

-
:::::::::::::::

-
### Replacing missing values

-In addition to the irregularities, raw data may contain missing values, and these may be encoded by different strings (e.g. `"NA"`, `""`, `character(0)`). To ensure robust analysis, it is a good practice to replace all missing values by `NA` in the entire dataset. Below is a code snippet demonstrating how you can achieve this in `{cleanepi}` for missing entries represented by an empty string `""`:
+In addition to the irregularities, raw data may contain missing values, and these may be encoded by different strings (e.g. `"NA"`, `""`, `character(0)`).
+To ensure robust analysis, it is a good practice to replace all missing values by `NA` in the entire dataset.
+Below is a code snippet demonstrating how you can achieve this in `{cleanepi}` for missing entries represented by an empty string `""`:

```{r}
sim_ebola_data <- cleanepi::replace_missing_values(
@@ -272,11 +306,34 @@ sim_ebola_data <- cleanepi::replace_missing_values(

sim_ebola_data
```

-### Validating subject IDs
+::: callout
+
+By default, `{cleanepi}` supports a wide range of missing value formats, as listed by the code chunk below:
+
+```{r}
+cleanepi::common_na_strings
+
+tibble::tribble(
+  ~case_id, ~outcome, ~gender, ~hospital,
+  "d1fafd", "NA", "f", "Military Hospital",
+  "53371b", "nan", "m", "Connaught Hospital",
+  "f5c3d8", "Recover", "f", "other",
+  "6c286a", "Death", "null", "na",
+  "0f58c4", "Recover", "f", "other"
+) %>%
+  cleanepi::replace_missing_values()
+
+```
+:::

-Each entry in the dataset represents a subject (e.g. a disease case or study participant) and should be distinguishable by a specific ID formatted in a
-particular way. These IDs can contain numbers falling within a specific range, a prefix and/or suffix, and might be written such that they contain a specific number of characters. The `{cleanepi}` package offers the function `check_subject_ids()` designed precisely for this task as shown in the below code chunk. This function checks whether the IDs are unique and meet the required criteria specified by the user.
+### Validating subject IDs
+
+Each entry in the dataset represents a subject (e.g. a disease case or study participant) and should be distinguishable by a specific ID formatted in a particular way.
+These IDs can contain numbers falling within a specific range, a prefix and/or suffix, and might be written such that they contain a specific number of characters.
+The `{cleanepi}` package offers the function `check_subject_ids()` designed precisely for this task, as shown in the code chunk below.
+This function checks whether the IDs are unique and meet the required criteria specified by the user.
```{r}
# check if the subject IDs in the 'case_id' column contains numbers ranging
@@ -294,8 +351,8 @@ Note that our simulated dataset contains duplicated subject IDs.

#### How to correct the subject IDs?

-Let's print a preliminary report with `cleanepi::print_report(sim_ebola_data)`. Focus on the "Unexpected subject ids"
-tab to identify what IDs require an extra treatment.
+Let's print a preliminary report with `cleanepi::print_report(sim_ebola_data)`.
+Focus on the "Unexpected subject ids" tab to identify which IDs require extra treatment.

In the console, you can print:

```{r, eval=FALSE}
print_report(data = sim_ebola_data, "incorrect_subject_id")
```

-After finishing this tutorial, we invite you to explore the package reference guide of [cleanepi::check_subject_ids()](https://epiverse-trace.github.io/cleanepi/reference/check_subject_ids.html) to find the
-function that can fix this situation.
+After finishing this tutorial, we invite you to explore the package reference guide of [cleanepi::check\_subject\_ids()](https://epiverse-trace.github.io/cleanepi/reference/check_subject_ids.html) to find the function that can fix this situation.

:::::::::::::::::::::::::

### Standardizing dates

-An epidemic dataset typically contains date columns for different events, such as the date of infection, date of symptoms onset, etc. These dates can come in different date formats, and it is good practice to standardize them to benefit from the powerful R functionalities designed to handle date values in downstream analyses. The `{cleanepi}` package provides functionality for converting date columns of epidemic datasets into ISO8601 format, ensuring consistency across the different date columns. Here's how you can use it on our simulated dataset:
+An epidemic dataset typically contains date columns for different events, such as the date of infection, date of symptoms onset, etc.
+These dates can come in different date formats, and it is good practice to standardize them to benefit from the powerful R functionalities designed to handle date values in downstream analyses. +The `{cleanepi}` package provides functionality for converting date columns of epidemic datasets into ISO8601 format, ensuring consistency across the different date columns. +Here's how you can use it on our simulated dataset: ```{r} sim_ebola_data <- cleanepi::standardize_dates( @@ -327,19 +386,20 @@ This function converts the values in the target columns into the **YYYY-mm-dd** #### How is this possible? -We invite you to find the key package that makes this standardisation possible inside `{cleanepi}` by reading the “Details” section of the -[Standardize date variables reference manual](https://epiverse-trace.github.io/cleanepi/reference/standardize_dates.html#details)! +We invite you to find the key package that makes this standardisation possible inside `{cleanepi}` by reading the "Details" section of the [Standardize date variables reference manual](https://epiverse-trace.github.io/cleanepi/reference/standardize_dates.html#details)! -Also, check how to use the `orders` argument if you want to target US format character strings. You can explore [this reproducible example](https://github.com/epiverse-trace/cleanepi/discussions/262). +Also, check how to use the `orders` argument if you want to target US format character strings. +You can explore [this reproducible example](https://github.com/epiverse-trace/cleanepi/discussions/262). ::::::::::::::::::: ### Converting to numeric values -In the raw dataset, some columns can come with mixture of character and numerical values, and you will often want to convert -character values for numbers explicitly into numeric values (e.g. `"seven"` to `7`). For example, in our simulated data set, in the age column some entries are -written in words. 
In `{cleanepi}` the function `convert_to_numeric()` does such conversion as illustrated in the below
-code chunk.
+In the raw dataset, some columns can come with a mixture of character and numerical values, and you will often want to convert character values for numbers explicitly into numeric values (e.g. `"seven"` to `7`).
+For example, in our simulated data set, some entries in the age column are written in words.
+In `{cleanepi}` the function `convert_to_numeric()` performs this conversion, as illustrated in the code chunk below.

```{r}
sim_ebola_data <- cleanepi::convert_to_numeric(
@@ -360,15 +420,17 @@ Thanks to the `{numberize}` package, we can convert numbers written in English,

## Epidemiology related operations

-In addition to common data cleansing tasks, such as those discussed in the above section, the `{cleanepi}` package offers additional functionalities tailored specifically for processing and analyzing outbreak and epidemic data. This section covers some of these specialized tasks.
+In addition to common data cleansing tasks, such as those discussed in the above section, the `{cleanepi}` package offers additional functionalities tailored specifically for processing and analyzing outbreak and epidemic data.
+This section covers some of these specialized tasks.

### Checking sequence of dated-events

-Ensuring the correct order and sequence of dated events is crucial in epidemiological data analysis, especially when analyzing infectious diseases where the timing of events like symptom onset and sample collection is essential. The `{cleanepi}` package provides a helpful function called `check_date_sequence()` designed for this purpose.
+Ensuring the correct order and sequence of dated events is crucial in epidemiological data analysis, especially when analyzing infectious diseases where the timing of events like symptom onset and sample collection is essential.
+The `{cleanepi}` package provides a helpful function called `check_date_sequence()` designed for this purpose. Here's an example of a code chunk demonstrating the usage of the function `check_date_sequence()` in the first 100 records of our simulated Ebola dataset. -```{r, warning=FALSE, results = 'hide'} +```{r, warning=FALSE, results="hide"} cleanepi::check_date_sequence( data = sim_ebola_data[1:100, ], target_columns = c("date_onset", "date_sample") @@ -377,12 +439,35 @@ cleanepi::check_date_sequence( This functionality is crucial for ensuring data integrity and accuracy in epidemiological analyses, as it helps identify any inconsistencies or errors in the chronological order of events, allowing you to address them appropriately. +::::::::::::::::::::: spoiler + +#### How to remove inconsistent observations? + +The `{cleanepi}` package does not automatically remove inconsistent observations; it only identifies them and reports their indices. +To remove them, use the code below: + +```{r, eval=TRUE, echo=TRUE} +# get the indices of incorrect row from the output of the above code chunk +obs_incorrect <- c(8, 15, 18, 20, 21, 23, 26, 28, 29, 32, 34, 35, 37, 38, + 40, 43, 46, 49, 52, 54, 56, 58, 60, 63) + +# drop inconsistent observations +dat <- sim_ebola_data[1:100, ] %>% + slice(-obs_incorrect) +dat +``` + +::::::::::::::::::::: + ### Dictionary-based substitution -In the realm of data pre-processing, it's common to encounter scenarios where certain columns in a dataset, such as the “gender” column in our simulated Ebola dataset, are expected to have specific values or factors. -However, it's also common for unexpected or erroneous values to appear in these columns, which need to be replaced with the appropriate values. The `{cleanepi}` package offers support for dictionary-based substitution, a method that allows you to replace values in specific columns based on mappings defined in a data dictionary. 
This approach ensures consistency and accuracy in data cleaning.
+In the realm of data pre-processing, it's common to encounter scenarios where certain columns in a dataset, such as the "gender" column in our simulated Ebola dataset, are expected to have specific values or factors.
+However, it's also common for unexpected or erroneous values to appear in these columns, which need to be replaced with the appropriate values.
+The `{cleanepi}` package offers support for dictionary-based substitution, a method that allows you to replace values in specific columns based on mappings defined in a data dictionary.
+This approach ensures consistency and accuracy in data cleaning.

-Moreover, `{cleanepi}` provides a built-in dictionary specifically tailored for epidemiological data. The example dictionary below includes mappings for the “gender” column.
+Moreover, `{cleanepi}` provides a built-in dictionary specifically tailored for epidemiological data.
+The example dictionary below includes mappings for the "gender" column.

```{r}
test_dict <- base::readRDS(
@@ -393,7 +478,8 @@ test_dict <- base::readRDS(

test_dict
```

-Now, we can use this dictionary to standardize values of the the “gender” column according to predefined categories. Below is an example code chunk demonstrating how to perform this using the `clean_using_dictionary()` function from the {cleanepi} package.
+Now, we can use this dictionary to standardize values of the "gender" column according to predefined categories.
+Below is an example code chunk demonstrating how to perform this using the `clean_using_dictionary()` function from the `{cleanepi}` package.

```{r}
sim_ebola_data <- cleanepi::clean_using_dictionary(
@@ -406,13 +492,13 @@ sim_ebola_data

This approach simplifies the data cleaning process, ensuring that categorical variables in epidemiological datasets are accurately categorized and ready for further analysis.

-
:::::::::::::::::::::::::: spoiler

#### How to create your own data dictionary?
-Note that, when a column in the dataset contains values that are not in the dictionary, the function `cleanepi::clean_using_dictionary()` will raise an error.
-You can start a custom dictionary with a data frame inside or outside R and use the function `cleanepi::add_to_dictionary()` to include new elements in the dictionary. For example:
+Note that, when a column in the dataset contains values that are not in the dictionary, the function `cleanepi::clean_using_dictionary()` will raise an error.
+You can start a custom dictionary with a data frame inside or outside R and use the function `cleanepi::add_to_dictionary()` to include new elements in the dictionary.
+For example:

```{r}
new_dictionary <- tibble::tibble(
@@ -431,24 +517,48 @@ new_dictionary <- tibble::tibble(

new_dictionary
```

-You can have more details in the section about "Dictionary-based data substituting" in the package
-[vignette](https://epiverse-trace.github.io/cleanepi/articles/cleanepi.html#dictionary-based-data-substituting).
+You can find more details in the section about "Dictionary-based data substituting" in the package [vignette](https://epiverse-trace.github.io/cleanepi/articles/cleanepi.html#dictionary-based-data-substituting).

::::::::::::::::::::::::::

-
### Calculating time span between different date events

-In epidemiological data analysis, it is also useful to track and analyze time-dependent events, such as the progression of a disease outbreak (i.e., the time elapsed between today and the date the first case was reported) or the duration between date of sample collection and analysis (i.e., the time difference between today and the sample collection date). The most common example is to calculate the age of all the subjects given their dates of birth (i.e., the time difference between today and their date of birth).
+In epidemiological data analysis, it is also useful to track and analyze time-dependent events, such as the progression of a disease outbreak (i.e., the time elapsed between today and the date the first case was reported) or the duration between date of sample collection and analysis (i.e., the time difference between today and the sample collection date).
+A common example is to calculate the age of all the subjects given their dates of birth (i.e., the time difference between today and their date of birth).
+
+The `{cleanepi}` package offers a convenient function for calculating the time elapsed between two dated events at different time scales.
+For example, the below code snippet utilizes the function `cleanepi::timespan()` to compute the **reporting delay**, i.e. the time between the date of symptom onset (`date_onset`) and the date of case confirmation (`date_sample`):
+
+```{r}
+sim_ebola_data <- cleanepi::timespan(data = sim_ebola_data,
+                                     target_column = "date_onset",
+                                     end_date = "date_sample",
+                                     span_unit = "days",
+                                     span_column_name = "reporting_delay")
+sim_ebola_data %>%
+  dplyr::select(case_id, date_sample, reporting_delay)
+```
+
+After executing the function `cleanepi::timespan()`, a new column named `reporting_delay` is added to the **sim\_ebola\_data** dataset.
+For each case, this column represents the number of days elapsed between the date of symptom onset and the date of sample collection.
+
+::::::::::::::::::::::::::::::::::::::::::::::: challenge
+
+1- Calculate the time elapsed since the date of sampling of the identified cases until the $3^{rd}$ of January 2025 (`"2025-01-03"`).
+
+:::::::::::::::::::::::::: solution

-The `{cleanepi}` package offers a convenient function for calculating the time elapsed between two dated events at different time scales. For example, the below code snippet utilizes the function `cleanepi::timespan()` to compute the
-time elapsed since the date of sampling of the identified cases until the $3^{rd}$ of January 2025 (`"2025-01-03"`).
```{r}
sim_ebola_data <- cleanepi::timespan(
data = sim_ebola_data,
target_column = "date_sample",
- end_date = as.Date("2025-01-03"),
+ end_date = lubridate::ymd("2025-01-03"),
span_unit = "years",
span_column_name = "years_since_collection",
span_remainder_unit = "months"
@@ -458,11 +568,14 @@ sim_ebola_data %>%
dplyr::select(case_id, date_sample, years_since_collection, remainder_months)
```

-After executing the function `cleanepi::timespan()`, two new columns named `years_since_collection` and `remainder_months` are added to the **sim_ebola_data** dataset. For each case, these columns respectively represent the calculated time elapsed since the date of sample collection measured in years, and the remaining time measured in months.
+::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::::::::::

-::::::::::::::::::::::::::::::::::::::::::::::: challenge
+::::::::::::::::::::::::::::::::::::::::::::::: discussion

-Age data is useful in many downstream analysis. You can categorize it to generate stratified estimates.
+Age data is useful in many downstream analyses.
+You can categorize it to generate stratified estimates.

Read the `test_df.RDS` data frame within the `{cleanepi}` package:

```{r}
dat <- readRDS(
@@ -473,13 +586,14 @@ dat <- readRDS(
dplyr::as_tibble()
```

-Calculate the age in years __until today's date__ of the subjects from their date of birth, and the remainder time in months. Clean and standardize the required elements to get this done.
+Calculate the age in years **until today's date** of the subjects from their date of birth, and the remainder time in months.
+Clean and standardize the required elements to get this done.

:::::::::::::::::::::::::::: hint

Before calculating the age, you may need to:

-- standardize column names
+- standardize column names
- standardize dates columns
- replace missing value strings with NA

@@ -517,7 +631,10 @@ Now, How would you categorize a numerical variable?
:::::::::::::::::::::::::: solution

-The simplest alternative is using `Hmisc::cut2()`. You can also use `dplyr::case_when()`. However, this requires more lines of code and is more appropriate for custom categorization. Here we provide one solution using `base::cut()`:
+The simplest alternative is using `Hmisc::cut2()`.
+You can also use `dplyr::case_when()`.
+However, this requires more lines of code and is more appropriate for custom categorization.
+Here we provide one solution using `base::cut()`:

```{r}
dat_clean <- dat_clean %>%
@@ -540,8 +657,8 @@ dat_clean <- dat_clean %>%
  )
```

-You can investigate the maximum values of variables from the summary made from the `skimr::skim()` function. Instead of `base::cut()` you can also use
-`Hmisc::cut2(x = age_in_years, cuts = c(20,35,60))`, which gives the maximum value and do not require more arguments.
+You can investigate the maximum values of variables in the summary produced by the `skimr::skim()` function.
+Instead of `base::cut()` you can also use `Hmisc::cut2(x = age_in_years, cuts = c(20,35,60))`, which includes the maximum value and does not require more arguments.

::::::::::::::::::::::::::

@@ -549,20 +666,19 @@ You can investigate the maximum values of variables from the summary made from t

## Multiple operations at once

-Performing data cleaning operations individually can be time-consuming and error-prone. The `{cleanepi}` package simplifies this process by offering a convenient wrapper function called `clean_data()`, which allows you to perform
-multiple operations at once.
+Performing data cleaning operations individually can be time-consuming and error-prone.
+The `{cleanepi}` package simplifies this process by offering a convenient wrapper function called `clean_data()`, which allows you to perform multiple operations at once.

-When no cleaning operation is specified, the `clean_data()` function automatically applies a series of data cleaning operations to the input dataset.
Here's an example code chunk illustrating how to use `clean_data()` on a raw simulated Ebola dataset:
+When no cleaning operation is specified, the `clean_data()` function automatically applies a series of data cleaning operations to the input dataset.
+Here's an example code chunk illustrating how to use `clean_data()` on a raw simulated Ebola dataset:

```{r}
cleaned_data <- cleanepi::clean_data(raw_ebola_data)
```

+Furthermore, you can combine multiple data cleaning tasks via the base R pipe (`|>`) or the {magrittr} pipe (`%>%`) operator, as shown in the below code snippet.

-Further more, you can combine multiple data cleaning tasks via the base R pipe (`%>%`) or the {magrittr} pipe (`%>%`) operator, as shown in the below code
-snippet.
-
-```{r,warning = FALSE, message = FALSE}
+```{r, warning=FALSE, message=FALSE}
# Perform the cleaning operations using the pipe (%>%) operator
cleaned_data <- raw_ebola_data %>%
  cleanepi::standardize_column_names() %>%
@@ -590,7 +706,7 @@ cleaned_data <- raw_ebola_data %>%
  )
```

-```{r,echo=FALSE,eval=TRUE}
+```{r, echo=FALSE, eval=TRUE}
cleaned_data %>%
  write_csv(file = file.path("data", "cleaned_data.csv"))
```

@@ -603,23 +719,28 @@ To identify both groups:

- On a piece of paper, write the names of each function under the corresponding column:

-| **Diagnose** cleaning status | **Perform** cleaning action |
-|---|---|
-| ... | ... |
+| **Diagnose** cleaning status | **Perform** cleaning action |
+| ---------------- | ---------------- |
+| ... | ... |

::::::::::::::

:::::::::::::: instructor

-Notice that `{cleanepi}` contains a set of functions to **diagnose** the cleaning status (e.g., `check_subject_ids()` and `check_date_sequence()` in the chunk above) and another set to **perform** a cleaning action (the complementary functions from the chunk above).
+Notice that `{cleanepi}` contains a set of functions to **diagnose** the cleaning status (e.g., `check_subject_ids()` and `check_date_sequence()` in the chunk above) and another set to **perform** a cleaning action (the complementary functions from the chunk above).

::::::::::::::

## Cleaning report

-The `{cleanepi}` package generates a comprehensive report detailing the findings and actions of all data cleansing operations conducted during the analysis. This report is presented as a HTML file that automatically opens in your browser with. Each section corresponds to a specific data cleansing operation, and clicking on each section allows you to access the results of that particular operation. This interactive approach enables users to efficiently review and analyze the effects of individual cleansing steps within the broader data cleansing process.
+The `{cleanepi}` package generates a comprehensive report detailing the findings and actions of all data cleansing operations conducted during the analysis.
+This report is presented as an HTML file that automatically opens in your browser.
+Each section corresponds to a specific data cleansing operation, and clicking on each section allows you to access the results of that particular operation.
+This interactive approach enables users to efficiently review and analyze the effects of individual cleansing steps within the broader data cleansing process.

-You can view the report using: 
+You can view the report using:

```r
cleanepi::print_report(data = cleaned_data)
```

@@ -634,13 +755,11 @@ cleanepi::print_report(data = cleaned_data)

-
-
-::::::::::::::::::::::::::::::::::::: keypoints
+::::::::::::::::::::::::::::::::::::: keypoints

- Use `{cleanepi}` package to clean and standardize epidemiological-related data
- Understand how to use `{cleanepi}` to perform common data cleansing tasks
-- View the data cleaning report in a browser, consult it and make decisions.
+- View the data cleaning report in a browser, consult it and make decisions.
::::::::::::::::::::::::::::::::::::: diff --git a/episodes/describe-cases.Rmd b/episodes/describe-cases.Rmd index 44a7d299..5934cba7 100644 --- a/episodes/describe-cases.Rmd +++ b/episodes/describe-cases.Rmd @@ -4,9 +4,9 @@ teaching: 20 exercises: 10 --- -:::::::::::::::::::::::::::::::::::::: questions +:::::::::::::::::::::::::::::::::::::: questions -- How to aggregate and summarise case data? +- How to aggregate and summarise case data? - How to visualize aggregated data? - What is distribution of cases across time, space, gender, and age? @@ -17,19 +17,25 @@ exercises: 10 - Simulate synthetic outbreak data - Convert linelist data into incidence over time - Create epidemic curves from incidence data + :::::::::::::::::::::::::::::::::::::::::::::::: ## Introduction -In an analytic pipeline, exploratory data analysis (EDA) is an important step before formal modelling. EDA helps determine relationships between variables and summarize their main characteristics, often by means of data visualization. +In an analytic pipeline, exploratory data analysis (EDA) is an important step before formal modelling. +EDA helps determine relationships between variables and summarize their main characteristics, often by means of data visualization. -This episode focuses on EDA of outbreak data using R packages. -A key aspect of EDA in epidemic analysis is 'person, place and time'. It is useful to identify how observed events - such as confirmed cases, hospitalizations, deaths, and recoveries - change over time, and how these vary across different locations and demographic factors, including gender, age, and more. +This episode focuses on EDA of outbreak data using R packages. +A key aspect of EDA in epidemic analysis is 'person, place and time'. +It is useful to identify how observed events - such as confirmed cases, hospitalizations, deaths, and recoveries - change over time, and how these vary across different locations and demographic factors, including gender, age, and more. 
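The 'time' dimension of that person-place-time triad can be sketched in a few lines of base R before any dedicated package is introduced. This is a toy illustration (hypothetical dates, not data from this lesson): counting events per date is the essence of converting a linelist into incidence data.

```r
# Toy example: each element is one case's date of symptom onset
onset_dates <- as.Date(c("2023-01-01", "2023-01-01", "2023-01-02",
                         "2023-01-04", "2023-01-04", "2023-01-04"))

# Tally the number of events per day; names are the dates, values are counts
daily_counts <- table(onset_dates)
daily_counts
```

The `{incidence2}` functions used in this episode formalize exactly this counting step, while also handling grouping factors, time intervals, and date completion.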
-Let's start by loading the `{incidence2}` package to aggregate the linelist data according to specific characteristics, and visualize the resulting epidemic curves (epicurves) that plot the number of new events (i.e. case incidence over time).
-We'll use the `{simulist}` package to simulate the outbreak data to analyse, and `{tracetheme}` for figure formatting. We'll use the pipe operator (`%>%`) to connect some of their functions, including others from the `{dplyr}` and `{ggplot2}` packages, so let's also call to the {tidyverse} package.
+Let's start by loading the `{incidence2}` package to aggregate the linelist data according to specific characteristics, and visualize the resulting epidemic curves (epicurves) that plot the number of new events (i.e. case incidence over time).
+We'll use the `{simulist}` package to simulate the outbreak data to analyse, and `{tracetheme}` for figure formatting.
+We'll use the pipe operator (`%>%`) to connect some of their functions, including others from the `{dplyr}` and `{ggplot2}` packages, so let's also load the {tidyverse} package.

-```{r,eval=TRUE,message=FALSE,warning=FALSE}
+```{r, eval=TRUE, message=FALSE, warning=FALSE}
# Load packages
library(incidence2) # For aggregating and visualising
library(simulist) # For simulating linelist data
@@ -37,10 +43,11 @@ library(tracetheme) # For formatting figures
library(tidyverse) # For {dplyr} and {ggplot2} functions and the pipe |>
```

-
## Synthetic outbreak data

-To illustrate the process of conducting EDA on outbreak data, we will generate a line list for a hypothetical disease outbreak utilizing the `{simulist}` package.
+`{simulist}` generates simulated data for an outbreak according to a given configuration.
+Its minimal configuration can generate a linelist, as shown in the below code chunk:

```{r}
# Simulate linelist data for an outbreak with size between 1000 and 1500
@@ -58,17 +65,20 @@ This linelist dataset has simulated entries on individual-level events during an

## Additional Resources on Outbreak Data

-The above is the default configuration of `{simulist}`. It includes a number of assumptions about the transmissibility and severity of the pathogen. If you want to know more about the `simulist::sim_linelist()` function and other functionalities check the [documentation website](https://epiverse-trace.github.io/simulist/).
+The above is the default configuration of `{simulist}`.
+It includes a number of assumptions about the transmissibility and severity of the pathogen.
+If you want to know more about the `simulist::sim_linelist()` function and other functionalities, check the [documentation website](https://epiverse-trace.github.io/simulist/).

You can also find data sets from past real outbreaks within the [`{outbreaks}`](https://www.reconverse.org/outbreaks/) R package.

:::::::::::::::::::

-
-
## Aggregating the data

-Often we want to analyse and visualise the number of events that occur on a particular day or week, rather than focusing on individual cases.
+This requires grouping the linelist data into incidence data.
+The [{incidence2}](https://www.reconverse.org/incidence2/articles/incidence2.html){.external target="_blank"} package offers a useful function called `incidence2::incidence()` for grouping case data, usually based around dated events and/or other characteristics.
+The code chunk provided below demonstrates the creation of an `incidence2` class object from the simulated Ebola `linelist` data based on the date of onset.

```{r}
# Create an incidence object by aggregating case data based on the date of onset
@@ -82,7 +92,10 @@ daily_incidence <- incidence2::incidence(
daily_incidence
```

-With the `{incidence2}` package, you can specify the desired interval (e.g. day, week) and categorize cases by one or more factors. Below is a code snippet demonstrating weekly cases grouped by the date of onset, sex, and type of case.
+With the `{incidence2}` package, you can specify the desired interval (e.g., day, week) and categorize cases by one or more factors.
+Below is a code snippet demonstrating weekly cases grouped by the date of onset, sex, and type of case.

```{r}
# Group incidence data by week, accounting for sex and case type
@@ -99,10 +112,12 @@ weekly_incidence

::::::::::::::::::::::::::::::::::::: callout

-## Dates Completion 
+## Dates Completion

+When cases are grouped by different factors, it's possible that the events involving these groups may have different date ranges in the resulting `incidence2` object.
+The `{incidence2}` package provides a function called `incidence2::complete_dates()` to ensure that an incidence object has the same range of dates for each group.
+By default, missing counts for a particular group will be filled with 0 for that date.

-When cases are grouped by different factors, it's possible that the events involving these groups may have different date ranges in the resulting `incidence2` object.
The `{incidence2}` package provides a function called `incidence2::complete_dates()` to ensure that an incidence object has the same range of dates for each group. By default, missing counts for a particular group will be filled with 0 for that date. - This functionality is also available within the `incidence2::incidence()` function by setting the value of the `complete_dates` to `TRUE`. ```{r} @@ -116,7 +131,7 @@ daily_incidence_2 <- incidence2::incidence( ) ``` -```{r,echo=FALSE,eval=FALSE} +```{r, echo=FALSE, eval=FALSE} daily_incidence_2_complete <- incidence2::complete_dates( x = daily_incidence_2, expand = TRUE, # Expand to fill in missing dates @@ -126,22 +141,21 @@ daily_incidence_2_complete <- incidence2::complete_dates( ) ``` - :::::::::::::::::::::::::::::::::::::::::::::::: - -::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::: challenge ## Challenge 1: Can you do it? - - **Task**: Calculate the __biweekly__ incidence of cases from the `sim_data` linelist based on their admission date and outcome. Save the result in an object called `biweekly_incidence`. +- **Task**: Calculate the **biweekly** incidence of cases from the `sim_data` linelist based on their admission date and outcome. Save the result in an object called `biweekly_incidence`. :::::::::::::::::::::::::::::::::::::::::::::::: ## Visualization -The `incidence2` objects can be visualized using the `plot()` function from the base R package. -The resulting graph is referred to as an epidemic curve, or epi-curve for short. The following code snippets generate epi-curves for the `daily_incidence` and `weekly_incidence` incidence objects mentioned above. +The `incidence2` objects can be visualized using the `plot()` function from the base R package. +The resulting graph is referred to as an epidemic curve, or epi-curve for short. +The following code snippets generate epi-curves for the `daily_incidence` and `weekly_incidence` incidence objects mentioned above. 
```{r}
# Plot daily incidence data
base::plot(daily_incidence) +
@@ -151,8 +165,7 @@ base::plot(daily_incidence) +
    y = "Daily cases" # y-axis label
  ) +
  theme_bw()
-```
-
+```

```{r}
# Plot weekly incidence data
@@ -162,27 +175,29 @@ base::plot(weekly_incidence) +
    y = "Weekly cases" # y-axis label
  ) +
  theme_bw()
-```
+```

:::::::::::::::::::::::: callout

#### Easy aesthetics

-We invite you to take a look at the `{incidence2}` [package vignette](https://www.reconverse.org/incidence2/articles/incidence2.html). Find how you can use the arguments within the `plot()` function to provide aesthetics to your incidence2 class objects.
+We invite you to take a look at the `{incidence2}` [package vignette](https://www.reconverse.org/incidence2/articles/incidence2.html).
+Find how you can use the arguments within the `plot()` function to provide aesthetics to your incidence2 class objects.

```{r}
base::plot(weekly_incidence, fill = "sex")
```

-Some of them include `show_cases = TRUE`, `angle = 45`, and `n_breaks = 5`. Try them and see how they impact on the resulting plot.
+Some of them include `show_cases = TRUE`, `angle = 45`, and `n_breaks = 5`.
+Try them and see how they impact the resulting plot.

::::::::::::::::::::::::

-::::::::::::::::::::::::::::::::::::: challenge
+::::::::::::::::::::::::::::::::::::: challenge

## Challenge 2: Can you do it?

- - **Task**: Visualize the `biweekly_incidence` object.
+- **Task**: Visualize the `biweekly_incidence` object.

::::::::::::::::::::::::::::::::::::::::::::::::

@@ -203,19 +218,24 @@ base::plot(cum_df) +
  theme_bw()
```

-Note that this function preserves grouping, i.e., if the `incidence2` object contains groups, it will accumulate the cases accordingly.
-
+Note that this function preserves grouping, i.e., if the `incidence2` object contains groups, it will accumulate the cases accordingly.

-::::::::::::::::::::::::::::::::::::: challenge
+::::::::::::::::::::::::::::::::::::: challenge

## Challenge 3: Can you do it?
- - **Task**: Visulaize the cumulative cases from the `biweekly_incidence` object.
+
+- **Task**: Visualize the cumulative cases from the `biweekly_incidence` object.

::::::::::::::::::::::::::::::::::::::::::::::::

-## Peak time estimation 
+## Peak time estimation

-You can estimate the peak -- the time with the highest number of recorded cases -- using the `incidence2::estimate_peak()` function from the {incidence2} package. This function uses a bootstrapping method to determine the peak time (i.e. by resampling dates with replacement, resulting in a distribution of estimated peak times).
+You can estimate the peak -- the time with the highest number of recorded cases -- using the `incidence2::estimate_peak()` function from the {incidence2} package.
+This function uses a bootstrapping method to determine the peak time (i.e. by resampling dates with replacement, resulting in a distribution of estimated peak times).

```{r}
# Estimate the peak of the daily incidence data
peak <- incidence2::estimate_peak(
@@ -231,21 +251,22 @@ print(peak)
```

-This example demonstrates how to estimate the peak time using the `incidence2::estimate_peak()` function at $95%$ confidence interval and using 100 bootstrap samples.
+This example demonstrates how to estimate the peak time using the `incidence2::estimate_peak()` function at a 95% confidence interval and using 100 bootstrap samples.

-::::::::::::::::::::::::::::::::::::: challenge
+::::::::::::::::::::::::::::::::::::: challenge

## Challenge 4: Can you do it?

- - **Task**: Estimate the peak time from the `biweekly_incidence` object.
-::::::::::::::::::::::::::::::::::::::::::::::::
+- **Task**: Estimate the peak time from the `biweekly_incidence` object.
+::::::::::::::::::::::::::::::::::::::::::::::::

## Visualization with ggplot2

-
-`{incidence2}` produces basic plots for epicurves, but additional work is required to create well-annotated graphs.
However, using the `{ggplot2}` package, you can generate more sophisticated epicurves, with more flexibility in annotation. -`{ggplot2}` is a comprehensive package with many functionalities. However, we will focus on three key elements for producing epicurves: histogram plots, scaling date axes and their labels, and general plot theme annotation. +`{incidence2}` produces basic plots for epicurves, but additional work is required to create well-annotated graphs. +However, using the `{ggplot2}` package, you can generate more sophisticated epicurves, with more flexibility in annotation. +`{ggplot2}` is a comprehensive package with many functionalities. +However, we will focus on three key elements for producing epicurves: histogram plots, scaling date axes and their labels, and general plot theme annotation. The example below demonstrates how to configure these three elements for a simple `{incidence2}` object. ```{r} @@ -291,7 +312,8 @@ ggplot2::ggplot(data = daily_incidence) + ) ``` -Use the `group` option in the mapping function to visualize an epicurve with different groups. If there is more than one grouping factor, use the `facet_wrap()` option, as demonstrated in the example below: +Use the `group` option in the mapping function to visualize an epicurve with different groups. +If there is more than one grouping factor, use the `facet_wrap()` option, as demonstrated in the example below: ```{r} # Plot daily incidence faceted by sex @@ -331,19 +353,18 @@ ggplot2::ggplot(data = daily_incidence_2) + "lightpink")) # custom fill colors for sex ``` - -::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::: challenge ## Challenge 5: Can you do it? - - **Task**: Produce an annotated figure for the `biweekly_incidence` object using the `{ggplot2}` package. +- **Task**: Produce an annotated figure for the `biweekly_incidence` object using the `{ggplot2}` package. 
:::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::: keypoints +::::::::::::::::::::::::::::::::::::: keypoints - Use `{simulist}` package to generate synthetic outbreak data -- Use `{incidence2}` package to aggregate case data based on a date event, and other variables to produce epidemic curves. -- Use `{ggplot2}` package to produce better annotated epicurves. +- Use `{incidence2}` package to aggregate case data based on a date event, and other variables to produce epidemic curves. +- Use `{ggplot2}` package to produce better annotated epicurves. :::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/episodes/simple-analysis.Rmd b/episodes/simple-analysis.Rmd index c4a36b54..bbcbb970 100644 --- a/episodes/simple-analysis.Rmd +++ b/episodes/simple-analysis.Rmd @@ -4,7 +4,7 @@ teaching: 20 exercises: 10 --- -:::::::::::::::::::::::::::::::::::::: questions +:::::::::::::::::::::::::::::::::::::: questions - What is the growth rate of an epidemic? - How to identify the peak time of an outbreak? @@ -21,15 +21,24 @@ exercises: 10 ## Introduction -Getting trends from surveillance data is crucial for understanding epidemic drivers and dynamics. This may include forecasting disease burden, planning future public health interventions, and assessing the effectiveness of past control measures. By analyzing trends, policymakers and public health experts can make informed decisions to mitigate the spread of diseases and preserve the public health. This episode focuses on how to perform a simple early analysis on incidence data. It uses the same dataset of **Covid-19 case data from England** that was utilized it in [Aggregate and visualize](../episodes/describe-cases.Rmd) episode. - +Getting trends from surveillance data is crucial for understanding epidemic drivers and dynamics. +This may include forecasting disease burden, planning future public health interventions, and assessing the effectiveness of past control measures. 
+By analyzing trends, policymakers and public health experts can make informed decisions to mitigate the spread of diseases and preserve public health.
+This episode focuses on how to perform a simple early analysis on incidence data.
+It uses the same dataset of **Covid-19 case data from England** that was utilized in the [Aggregate and visualize](../episodes/describe-cases.Rmd) episode.

## Simple model

-Aggregated case data over specific time units (i.e. incidence data), typically represent the number of cases that occur within that time frame. We can think of these cases as noisy observations generated by the underlying epidemic process (which we cannot directly observe). To account for randomness in the observations, we can assume that observed cases follow either `Poisson distribution` (if cases are reported at a constant average rate over time) or a `negative binomial (NB) distribution` (if there is potential variability in reporting over time). When analyzing such data, one common approach is to examine the trend over time by computing the rate of change, which can indicate whether there is exponential growth or decay in the number of cases. Exponential growth implies that the number of cases is increasing at an accelerating rate over time, while exponential decay suggests that the number of cases is decreasing at a decelerating rate.
-
-The `incidence2` package can interoperate with methods for modelling the trend in case data, calculating moving averages, and exponential growth or decay rate. The code chunk below computes the Covid-19 trend in UK within first 3 months using negative binomial distribution.
+Aggregated case data over specific time units (i.e., incidence data) typically represent the number of cases that occur within that time frame.
+We can think of these cases as noisy observations generated by the underlying epidemic process (which we cannot directly observe).
+To account for randomness in the observations, we can assume that observed cases follow either a Poisson distribution (if cases are reported at a constant average rate over time) or a negative binomial (NB) distribution (if there is potential variability in reporting over time).
+When analyzing such data, one common approach is to examine the trend over time by computing the rate of change, which can indicate whether there is exponential growth or decay in the number of cases.
+Exponential growth implies that the number of cases is increasing at an accelerating rate over time, while exponential decay suggests that the number of cases is decreasing at a decelerating rate.
+
+The `incidence2` package can interoperate with methods for modelling the trend in case data, calculating moving averages, and exponential growth or decay rate.
+The code chunk below computes the Covid-19 trend in the UK within the first 3 months using a negative binomial distribution.

```{r, warning=FALSE, message=FALSE}
# load packages which provide methods for modeling
@@ -88,14 +97,13 @@ plot(df_incid, angle = 45) +
  ggplot2::labs(x = "Date", y = "Cases")
```

-
-::::::::::::::::::::::::::::::::::::: challenge
+::::::::::::::::::::::::::::::::::::: challenge

## Challenge 1: Poisson distribution

Repeat the above analysis using a Poisson distribution.

-:::::::::::::::::::::::: solution 
+:::::::::::::::::::::::: solution

```{r, warning=FALSE, message=FALSE}
fitted_curve_poisson <-
@@ -138,9 +146,15 @@ plot(df_incid, angle = 45) +

## Exponential growth or decay rate

-The exponential growth or decay rate, denoted as $r$, serves as an indicator for the trend in cases, indicating whether they are increasing (growth) or decreasing (decay) on an exponential scale. This rate is computed using the so-called **renewal equation** [(Wallinga et al. 2006)](https://royalsocietypublishing.org/doi/10.1098/rspb.2006.3754), which mechanistically links the reproductive number $R$ of new cases (i.e.
the average number of people that a typical case infects) to the generation interval of the disease (i.e. the average delay from one infection to the next in a chain of transmission). This computational method is interoperable with the `{incidence2}` package.
+The exponential growth or decay rate, denoted as $r$, serves as an indicator for the trend in cases, indicating whether they are increasing (growth) or decreasing (decay) on an exponential scale.
+This rate is computed using the so-called **renewal equation** [(Wallinga et al. 2006)](https://royalsocietypublishing.org/doi/10.1098/rspb.2006.3754), which mechanistically links the reproductive number $R$ of new cases (i.e. the average number of people that a typical case infects) to the generation interval of the disease (i.e. the average delay from one infection to the next in a chain of transmission).
+This computational method is interoperable with the `{incidence2}` package.

-Below is a code snippet demonstrating how to extract the growth/decay rate from the above **negative binomial**-fitted curve using the `growth_rate()` function:
+Below is a code snippet demonstrating how to extract the growth/decay rate from the above **negative binomial**-fitted curve using the `growth_rate()` function:

```{r, message=FALSE, warning=FALSE}
growth_rates <-
@@ -159,18 +173,18 @@ growth_rates <-
growth_rates
```

+::::::::::::::::::::::::::::::::::::: challenge

-::::::::::::::::::::::::::::::::::::: challenge
+## Challenge 2: Growth rates from a **Poisson**-fitted curve

-## Challenge 2: Growth rates from a **Poisson**-fitted curve
-
-Extract growth rates from the **Poisson**-fitted curve of **Challenge 1**?
+Extract the growth rates from the **Poisson**-fitted curve of **Challenge 1**.

::::::::::::::::::::::::::::::::::::::::::::::::

## Peak time

-The **peak time** is the time at which the highest number of cases is observed in the aggregated data.
It can be estimated using the `incidence2::estimate_peak()` function as shown in the below code chunk, which identify peak time from the `incidenc2` object `df_incid`. +The **peak time** is the time at which the highest number of cases is observed in the aggregated data. +It can be estimated using the `incidence2::estimate_peak()` function as shown in the below code chunk, which identify peak time from the `incidenc2` object `df_incid`. ```{r, message=FALSE, warning=FALSE} peaks_nb <- incidence2::estimate_peak(df_incid, progress = FALSE) %>% @@ -179,10 +193,11 @@ peaks_nb <- incidence2::estimate_peak(df_incid, progress = FALSE) %>% base::print(peaks_nb) ``` - ## Moving average -A moving or rolling average calculates the average number of cases within a specified time period. This can be achieved by utilizing the `frollmean()` function from the `{data.table}` package on an `incidence2 object`. The following code chunk demonstrates the computation of the weekly average number of cases from the `incidence2` object `df_incid`, followed by visualization. +A moving or rolling average calculates the average number of cases within a specified time period. +This can be achieved by utilizing the `frollmean()` function from the `{data.table}` package on an `incidence2 object`. +The following code chunk demonstrates the computation of the weekly average number of cases from the `incidence2` object `df_incid`, followed by visualization. ```{r, warning=FALSE, message=FALSE} library(ggplot2) @@ -196,13 +211,13 @@ df_incid %>% ggplot2::labs(x = "Date", y = "Cases") ``` -::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::: challenge ## Challenge 3: Monthly moving average -Compute and visualize the monthly moving average of cases on `df_incid`? +Compute and visualize the monthly moving average of cases on `df_incid`? 
-:::::::::::::::::::::::: solution +:::::::::::::::::::::::: solution ```{r, warning=FALSE, message=FALSE} df_incid %>% @@ -217,13 +232,12 @@ df_incid %>% ::::::::::::::::::::::::::::::::: :::::::::::::::::::::::::::::::::::::::::::::::: - -::::::::::::::::::::::::::::::::::::: keypoints +::::::::::::::::::::::::::::::::::::: keypoints - Use `{incidence2}` interoperability with other packages to: - - fit epi-curve using either **Poisson** or **negative binomial** distributions, - - calculate exponential growth or decline of the number of cases, - - find peak time of the outbreak, and + - fit epi-curve using either **Poisson** or **negative binomial** distributions, + - calculate exponential growth or decline of the number of cases, + - find peak time of the outbreak, and - compute the moving average of cases in specified time period. :::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/episodes/validate.Rmd b/episodes/validate.Rmd index 3aaa2a2b..60904ed1 100644 --- a/episodes/validate.Rmd +++ b/episodes/validate.Rmd @@ -4,8 +4,7 @@ teaching: 10 exercises: 2 --- - -:::::::::::::::::::::::::::::::::::::: questions +:::::::::::::::::::::::::::::::::::::: questions - How to convert a raw dataset into a `linelist` object? @@ -16,32 +15,33 @@ exercises: 2 - Demonstrate how to covert case data into `linelist` data - Demonstrate how to tag and validate data to make analysis more reliable - :::::::::::::::::::::::::::::::::::::::::::::::: ::::::::::::::::::::: prereq This episode requires you to: -- Download the [cleaned_data.csv](https://epiverse-trace.github.io/tutorials-early/data/cleaned_data.csv) file +- Download the [cleaned\_data.csv](https://epiverse-trace.github.io/tutorials-early/data/cleaned_data.csv) file - and save it in the `data/` folder. 
:::::::::::::::::::::

## Introduction

-In outbreak analysis, once you have completed the initial steps of reading and cleaning the case data, it's essential to establish an additional fundamental layer to ensure the integrity and reliability of subsequent analyses. Otherwise you might encounter issues during the analysis process due to creation or removal of specific variables, changes in their underlying data types (like `` or ``), etc. Specifically, this additional step involves:
+In outbreak analysis, once you have completed the initial steps of reading and cleaning the case data, it's essential to establish an additional fundamental layer to ensure the integrity and reliability of subsequent analyses.
+Otherwise, you might encounter issues during the analysis process due to the creation or removal of specific variables, changes in their underlying data types (like `` or ``), etc.
+Specifically, this additional step involves:

1. Verifying the presence and correct data type of certain columns within
-your dataset, a process commonly referred to as **tagging**;
+   your dataset, a process commonly referred to as **tagging**;
2. Implementing measures to make sure that these tagged columns are not inadvertently deleted during further data processing steps, known as **validation**.

+This episode focuses on tagging and validating outbreak data using the [linelist](https://epiverse-trace.github.io/linelist/) package.
+Let's start by loading the package `{rio}` to read data and the `{linelist}` package to create a linelist object.
+We'll use the pipe operator (`%>%`) to connect some of their functions, including others from the package `{dplyr}`.
+For this reason, we will also load the `{tidyverse}` package.

-This episode focuses on tagging and validating outbreak data using the [linelist](https://epiverse-trace.github.io/linelist/) package. Let's start by loading the package `{rio}` to read data and the `{linelist}` package
-to create a linelist object.
We'll use the pipe operator (`%>%`) to connect some of their functions, including others from the package `{dplyr}`. For this reason, we will also load the {tidyverse} package.
-
-
-```{r,eval=TRUE,message=FALSE,warning=FALSE}
+```{r, eval=TRUE, message=FALSE, warning=FALSE}
# Load packages
library(tidyverse) # to access {dplyr} functions and the pipe %>% operator
# from {magrittr}
@@ -54,19 +54,19 @@ library(linelist) # for tagging and validating

### The double-colon (`::`) operator

-The`::`in R lets you access functions or objects from a specific package without attaching the entire package to the search path. It offers several important
-advantages including the followings:
+The `::` operator in R lets you access functions or objects from a specific package without attaching the entire package to the search path.
+It offers several important advantages, including the following:

-* Telling explicitly which package a function comes from, reducing ambiguity and potential conflicts when several packages have functions with the same name.
-* Allowing to call a function from a package without loading the whole package
-with library().
+- Stating explicitly which package a function comes from, reducing ambiguity and potential conflicts when several packages have functions with the same name.
+- Allowing you to call a function from a package without loading the whole package
+  with `library()`.

-For example, the command `dplyr::filter(data, condition)` means we are calling
-the `filter()` function from the `{dplyr}` package.
+For example, the command `dplyr::filter(data, condition)` means we are calling the `filter()` function from the `{dplyr}` package.

:::::::::::::::::::

-Import the dataset following the guidelines outlined in the [Read case data](../episodes/read-cases.Rmd) episode. This involves loading the dataset into the working environment and view its structure and content.
+Import the dataset following the guidelines outlined in the [Read case data](../episodes/read-cases.Rmd) episode.
+This involves loading the dataset into the working environment and viewing its structure and content.

```{r, eval=FALSE}
# Read data
@@ -94,7 +94,10 @@ cleaned_data

### An unexpected change

-You are in an emergency response situation. You need to generate daily situation reports. You automated your analysis to read data directly from the online server :grin:. However, the people in charge of the data collection/administration needed to **remove/rename/reformat** one variable you found helpful :disappointed:!
+You are in an emergency response situation.
+You need to generate daily situation reports.
+You automated your analysis to read data directly from the online server :grin:.
+However, the people in charge of the data collection/administration needed to **remove/rename/reformat** one variable you found helpful :disappointed:!

How can you detect if the input data is **still valid** to replicate the analysis code you wrote the day before?

@@ -104,7 +107,8 @@ How can you detect if the input data is **still valid** to replicate the analysi

If learners do not have an experience to share, we as instructors can share one.

-A scenario like this usually happens when the institution doing the analysis is not the same as the institution collecting the data. The later can make decisions about the data structure that can affect downstream processes, impacting the time or the accuracy of the analysis results.
+A scenario like this usually happens when the institution doing the analysis is not the same as the institution collecting the data.
+The latter can make decisions about the data structure that can affect downstream processes, impacting the time or the accuracy of the analysis results.
::::::::::::::::::::::::

@@ -125,34 +129,37 @@ linelist_data <- linelist::make_linelist(
linelist_data
```

-The `{linelist}` package supplies tags for common epidemiological variables
-and a set of appropriate data types for each. You can view the list of available tags by the variable name and their acceptable data types using the `linelist::tags_types()` function.
+The `{linelist}` package supplies tags for common epidemiological variables and a set of appropriate data types for each.
+You can view the list of available tags, with their variable names and acceptable data types, using the `linelist::tags_types()` function.

::::::::::::::::::::::::::::::::::::: challenge

-Let's **tag** more variables. In some datasets, it is possible to encounter variable names that are different from the available tag names. In such cases, we can associate them based on how variables were defined for data collection.
+Let's **tag** more variables.
+In some datasets, it is possible to encounter variable names that are different from the available tag names.
+In such cases, we can associate them based on how variables were defined for data collection.

Now:

--**Explore** the available tag names in `{linelist}`.
--**Find** what other variables in the input dataset can be associated with any of these available tags.
--**Tag** those variables as shown above using the `linelist::make_linelist()`
-function.
+- **Explore** the available tag names in `{linelist}`.
+- **Find** what other variables in the input dataset can be associated with any of these available tags.
+- **Tag** those variables as shown above using the `linelist::make_linelist()` function.
:::::::::::::::::::: hint

-Your can get access to the list of available tag names in `{linelist}` using:
+You can access the list of available tag names in `{linelist}` using:

-```{r, eval = FALSE}
+```{r, eval=FALSE}
# Get a list of available tags names and data types
linelist::tags_types()

# Get a list of names only
linelist::tags_names()
```
+
:::::::::::::::::::

::::::::::::::::::::: solution

-```{r, eval = FALSE}
+```{r, eval=FALSE}
linelist::make_linelist(
  x = cleaned_data,
  id = "case_id",
@@ -164,33 +171,33 @@ linelist::make_linelist(
)
```

-
Are these additional tags visible in the output?

-< !--Do you want to see a display of available and tagged variables? You can explore the function `linelist::tags()` and read its [reference documentation](https://epiverse-trace.github.io/linelist/reference/tags.html).- ->
+<!-- Do you want to see a display of available and tagged variables? You can explore the function `linelist::tags()` and read its [reference documentation](https://epiverse-trace.github.io/linelist/reference/tags.html). -->

:::::::::::::::::::::

:::::::::::::::::::::::::::::::::::::

-
## Validation

-To ensure that all tagged variables are standardized and have the correct data
-types, use the `linelist::validate_linelist()` function, as shown in the example below:
+To ensure that all tagged variables are standardized and have the correct data types, use the `linelist::validate_linelist()` function, as shown in the example below:

```{r}
linelist::validate_linelist(linelist_data)
```

-If your dataset requires a new tag other than those defined in the `{linelist}`
-package, use `allow_extra = TRUE` when creating the linelist object with its
-corresponding datatype using the `linelist::make_linelist()` function.
+If your dataset requires a new tag other than those defined in the `{linelist}` package, use `allow_extra = TRUE` when creating the linelist object with its corresponding data type using the `linelist::make_linelist()` function.
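+As a minimal sketch of this, you can pass the extra tag as an additional named argument to `linelist::make_linelist()` together with `allow_extra = TRUE`.
+Note that the `vaccination_status` column below is hypothetical and used only for illustration; replace it with a column that actually exists in your data:
+
+```{r, eval=FALSE}
+# Tag a column that has no predefined tag in {linelist} by passing it
+# as an extra named tag and setting allow_extra = TRUE
+linelist::make_linelist(
+  x = cleaned_data,
+  id = "case_id",
+  date_onset = "date_onset",
+  gender = "gender",
+  vaccination_status = "vaccination_status", # hypothetical extra tag
+  allow_extra = TRUE
+)
+```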
::::::::::::::::::::::::: challenge -Let's assume the following scenario during an ongoing outbreak. You notice at some point that the data stream you have been relying on has a set of new entries (i.e., rows or observations), and the data type of one variable has changed. +Let's assume the following scenario during an ongoing outbreak. +You notice at some point that the data stream you have been relying on has a set of new entries (i. +e. +, rows or observations), and the data type of one variable has changed. Let's consider the example where the type `age` variable has changed from a double (``) to character (``). @@ -204,11 +211,12 @@ To simulate this situation: Describe how `linelist::validate_linelist()` reacts when there is a change in the data type of one variable of the input data. - :::::::::::::::::::::::::: hint -We can use `dplyr::mutate()` to change the variable type before tagging for validation. For example: -```{r, eval = FALSE} +We can use `dplyr::mutate()` to change the variable type before tagging for validation. +For example: + +```{r, eval=FALSE} # nolint start cleaned_data %>% @@ -222,7 +230,8 @@ cleaned_data %>% # nolint end ``` -Please run the code line by line, focusing only on the parts before the pipe (`%>%`). After each step, observe the output before moving to the next line. +Please run the code line by line, focusing only on the parts before the pipe (`%>%`). +After each step, observe the output before moving to the next line. ```{r} cleaned_data %>% @@ -238,21 +247,25 @@ cleaned_data %>% Why are we getting an `Error` message? -Should we have a `Warning` message instead? Explain why. +Should we have a `Warning` message instead? +Explain why. Explore other situations to understand this behavior by converting:-`date_onset` from `` to character (``), -`gender` character (``) to integer (``). -Then tag them into a linelist for validation. Does the `Error` message suggest a fix to the issue? +Then tag them into a linelist for validation. 
+Does the `Error` message suggest a fix to the issue?

::::::::::::::::::::::::: solution

-```{r, eval = FALSE}
+```{r, eval=FALSE}
# Change 2
# Run this code line by line to identify changes
cleaned_data %>%
@@ -264,8 +277,7 @@ cleaned_data %>%
  linelist::validate_linelist()
```

-
-```{r, eval = FALSE}
+```{r, eval=FALSE}
# Change 3
# Run this code line by line to identify changes
cleaned_data %>%
@@ -280,7 +292,8 @@ cleaned_data %>%

-We get `Error` messages because the default type of these variable in `linelist::tags_types()` is different from the one we set them at.
+We get `Error` messages because the default type of these variables in `linelist::tags_types()` is different from the one we set them to.

-The `Error` message inform us that in order to **validate** our linelist, we must fix the input variable type to fit the expected tag type. In a data analysis script, we can do this by adding one cleaning step into the pipeline.
+The `Error` message informs us that, in order to **validate** our linelist, we must fix the input variable type to fit the expected tag type.
+In a data analysis script, we can do this by adding one cleaning step into the pipeline.

:::::::::::::::::::::::::
::::::::::::::::::::::::::::: challenge
@@ -307,25 +320,26 @@ cleaned_data %>%

::::::::::::::::::::::::

-
## Safeguarding

-Safeguarding is implicitly built into the linelist objects. If you try to drop any of the tagged columns, you will receive an error or warning message, as shown in the example below.
+Safeguarding is implicitly built into the linelist objects.
+If you try to drop any of the tagged columns, you will receive an error or warning message, as shown in the example below.

```{r, warning=TRUE}
new_df <- linelist_data %>%
  dplyr::select(case_id, gender)
```

-This `Warning` message above is the default output option when we lose tags in a `linelist` object. However, it can be changed to an `Error` message using the `linelist::lost_tags_action()` function.
+The `Warning` message above is the default output option when we lose tags in a `linelist` object.
+However, it can be changed to an `Error` message using the `linelist::lost_tags_action()` function.

::::::::::::::::::::::::::::::::::::: challenge

Let's test the implications of changing the **safeguarding** configuration from a `Warning` to an `Error` message.

- First, run this code to count the frequency of each category within a categorical variable:
+
+```{r, eval=FALSE}
linelist_data %>%
  dplyr::select(case_id, gender) %>%
  dplyr::count(gender)
@@ -333,7 +347,7 @@ linelist_data %>%

- Set the behavior for lost tags in a `linelist` to "error" as follows:

-```{r, eval = FALSE}
+```{r, eval=FALSE}
# set behavior to "error"
linelist::lost_tags_action(action = "error")
```
@@ -344,15 +358,20 @@ Identify:

- What is the difference in the output between a `Warning` and an `Error`?

-- What could be the implications of this change for your daily data analysis pipeline during an outbreak response?
+- What could be the implications of this change for your daily data analysis pipeline during an outbreak response?

:::::::::::::::::::::::: solution

-Deciding between `Warning` or `Error` message will depend on the level of attention or flexibility you need when losing tags. One will alert you about a change but will continue running the code downstream. The other will stop your analysis pipeline and the rest will not be executed.
+Deciding between a `Warning` or an `Error` message will depend on the level of attention or flexibility you need when losing tags.
+One will alert you about a change but will continue running the code downstream.
+The other will stop your analysis pipeline and the rest will not be executed.

-A data reading, cleaning and validation script may require a more stable or fixed pipeline. An exploratory data analysis may require a more flexible approach. These two processes can be isolated in different scripts or repositories to adjust the safeguarding according to your needs.
+A data reading, cleaning, and validation script may require a more stable or fixed pipeline.
+An exploratory data analysis may require a more flexible approach.
+These two processes can be isolated in different scripts or repositories to adjust the safeguarding according to your needs.

Before you continue, set the configuration back again to the default option of `Warning`:

+
```{r}
# set behavior to the default option: "warning"
linelist::lost_tags_action()
```
@@ -360,13 +379,13 @@ linelist::lost_tags_action()

::::::::::::::::::::::::

-:::::::::::::::::::::::::::::::::::::
+:::::::::::::::::::::::::::::::::::::
+
+A `linelist` object resembles a data frame but offers richer features and functionalities.
+Packages that are linelist-aware can leverage these features.
+For example, you can extract a data frame of only the tagged columns using the `linelist::tags_df()` function, as shown below:

-A `linelist` object resembles a data frame but offers richer features
-and functionalities. Packages that are linelist - aware can leverage these
-features. For example, you can extract a data frame of only the tagged columns
-using the `linelist::tags_df()` function, as shown below:
-```{r, warning = FALSE}
+```{r, warning=FALSE}
linelist::tags_df(linelist_data)
```

@@ -376,18 +395,21 @@ This allows for the use of tagged variables only in downstream analysis, which w

### When should I use `{linelist}`?
-Data analysis during an outbreak response or mass - gathering surveillance demands a different set of "data safeguards" if compared to usual research situations. For example, your data will change or be updated over time (e.g. new entries, new variables, renamed variables).
+Data analysis during an outbreak response or mass-gathering surveillance demands a different set of "data safeguards" compared to usual research situations.
+For example, your data will change or be updated over time (e.g., new entries, new variables, renamed variables).

-`{linelist}` is more appropriate for this type of ongoing or long - lasting analysis. Check the "Get started" vignette section about
-[When I should consider using `{linelist}`? ](https://epiverse-trace.github.io/linelist/articles/linelist.html#should-i-use-linelist) for more information.
+`{linelist}` is more appropriate for this type of ongoing or long-lasting analysis.
+Check the "Get started" vignette section [When should I consider using `{linelist}`?](https://epiverse-trace.github.io/linelist/articles/linelist.html#should-i-use-linelist) for more information.

::::::::::::::::::::::::::::::::::::

:::::::::::::::::::::::::::::::::::::::::::: keypoints

- Use the `{linelist}` package to tag,
-validate,
-and prepare case data for downstream analysis.
+  validate, and prepare case data for downstream analysis.

::::::::::::::::::::::::::::::::::::::::::::::::