Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue #388: Refactor preprocessing functionality #390

Merged
merged 33 commits into from
Nov 13, 2024
Merged

Conversation

athowes
Copy link
Collaborator

@athowes athowes commented Oct 16, 2024

Description

Important

Current review ask ready for review

This PR will close #388.

Summary of what has been done:

  • Add epidist_linelist class
  • Add function as_epidist_linelist as date based entry point to epidist_linelist class
  • Added ad-hoc function as_epidist_linelist_time as temporary time based entry point to epidist_linelist class
  • Changed validation approach to use epidist_validate_data for data classes (like epidist_linelist) and epidist_validate_model for model classes (like latent_individual)

Summary of remains to do:

  • Update vignettes to run
    • Ebola vignette
    • Approximate inference vignette
    • Get started vignette
    • FAQ
  • Update text in vignettes
    • Ebola vignette
    • Approximate inference vignette
    • Get started vignette
    • FAQ

Summary of new issues / extensions:

  1. Handling of datetimes in as_epidist_linelist #415 as_epidist_linelist could accept a user not passing an upper bound (and could assume that no upper bound means daily censoring)
  2. Add aggregate data support #412 Think about other classes like epidist_linelist
  3. Further entrypoints to linelist data class #414 Provide more entry pathways to epidist_linelist including a better thought out time entry point
  4. Handling of case column in linelist data class #413 Right now epidist_linelist doesn't include a column for the "case" -- should we enforce that users provide this? It could be useful to have to produce plots and things
  5. Refactor creating new columns in as_latent_individual #416 RIght now as_latent_individual does some making new columns. Can we refactor these into helpers as was original plan?
  6. Update pass through model constructor to use epidist_linelist #402 In the class constructor for direct_model they are called ptime and stime and use a data.frame without going via epidist_linelist -- need to sort this out to go via direct_model

Checklist

  • My PR is based on a package issue and I have explicitly linked it.
  • I have included the target issue or issues in the PR title in the for Issue(s) issue-numbers: PR title
  • I have read the contribution guidelines.
  • I have tested my changes locally.
  • I have added or updated unit tests where necessary.
  • I have updated the documentation if required.
  • My code follows the established coding standards.
  • I have added a news item linked to this PR.
  • I have reviewed CI checks for this PR and addressed them as far as I am able.

R/preprocess.R Outdated Show resolved Hide resolved
@athowes athowes force-pushed the preprocess-refactor branch from 08240f0 to 0e809ce Compare October 21, 2024 11:07
R/preprocess.R Outdated Show resolved Hide resolved
@athowes

This comment was marked as outdated.

@athowes athowes force-pushed the preprocess-refactor branch 2 times, most recently from 044a9bd to 93a4d61 Compare October 22, 2024 08:42
@seabbs

This comment was marked as off-topic.

@athowes athowes force-pushed the preprocess-refactor branch 2 times, most recently from 6ad4d9c to 5385ca5 Compare October 28, 2024 13:15
@athowes
Copy link
Collaborator Author

athowes commented Oct 28, 2024

Note that actually row_id is used in the latent_indvidual model (though of course this could be created more local to the function below, which is epidist_stancode.latent_individual):

brms::stanvar(
  block = "data",
  scode = "array[N - wN] int noverlap;",
  x = filter(data, woverlap == 0)$row_id,
  name = "noverlap"
) +

@seabbs
Copy link
Contributor

seabbs commented Oct 31, 2024

As it's a transformed variable I think we can create it in latent individual. As it's an internal variable we might want to use .row_id to avoid closes with user variables

@athowes
Copy link
Collaborator Author

athowes commented Oct 31, 2024

Finding a pain point here is changing all the tests to use a date version of simulated data (knowing that probably we will need to revert this as we add functionality to have non-date data).

Makes me think of having some temporary way to class data that's already in the right format with epidist_linelist (rather than the date entry point). Can then also run it through epidist_validate_data.epidist_linelist to make sure it's OK. We will need to work out this out properly in a future issue but I think it's the right way to get things moving.

I think solution is just a function like:

as_epidist_linelist_time <- function(data) {
  class(data) <- c("epidist_linelist", class(data))
  epidist_validate_data(data)
  return(data)
}

@athowes athowes force-pushed the preprocess-refactor branch from 7c67a82 to 3728970 Compare October 31, 2024 11:16
@athowes athowes requested a review from seabbs October 31, 2024 11:31
@athowes athowes marked this pull request as ready for review October 31, 2024 11:32
@seabbs
Copy link
Contributor

seabbs commented Oct 31, 2024

Sure but just as a helper function in test that?

@athowes
Copy link
Collaborator Author

athowes commented Oct 31, 2024

Sure but just as a helper function in test that?

It's currently a regular packaged.

It'll need to be used in some of the vignettes as well so would prefer to keep it as a packaged function.

We could also consider putting it as an internal function and using epidist::: in the vignettes.

@athowes
Copy link
Collaborator Author

athowes commented Oct 31, 2024

Sure but just as a helper function in test that?

Note that I think we are going to need this functionality eventually anyway, so if this is something we want to think about and implement here I could be open to it.

@seabbs
Copy link
Contributor

seabbs commented Nov 1, 2024

We can't use an internal function in a vignette as it isn't CRAN complaint or good practice.

My suggestion to use as a testing function is that we can implement it here quickly and think about the design in another issue as I don't think we want to rush it.

My preference would be this is part of the simulation workflow and likely works via adding dates to simulated times but as above I think it needs some thought

@athowes
Copy link
Collaborator Author

athowes commented Nov 1, 2024

In that case I'm in favour of adding it as an exported function (as is currently) so that it can be used in the vignettes.

Then that would give us chance to:

[implement] here quickly and think about the design in another issue as I don't think we want to rush it.

R/preprocess.R Outdated Show resolved Hide resolved
NAMESPACE Outdated Show resolved Hide resolved
@seabbs seabbs force-pushed the preprocess-refactor branch from 47056eb to 5ae1962 Compare November 13, 2024 12:43
@seabbs seabbs enabled auto-merge (squash) November 13, 2024 12:43
Copy link
Contributor

@seabbs seabbs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good.

I think I would have just a single validate function and dispatch into it from models and data. In the future we should also update to use modification validation vs entry to use validation. We should also add a new_epidist_linelist method in the future as well.

@seabbs seabbs disabled auto-merge November 13, 2024 14:09
@seabbs seabbs merged commit db76be3 into main Nov 13, 2024
7 checks passed
@seabbs seabbs deleted the preprocess-refactor branch November 13, 2024 14:09
@athowes athowes mentioned this pull request Nov 13, 2024
seabbs pushed a commit that referenced this pull request Jan 10, 2025
* Sketching out refactor

* Some restructure for clarity

* Refactor along new date approach

* Clear tests

* Refactor lines to save col_names

* Refactor validation functionality

* Redocument

* Remove epidist_validate and move to _model and _data approach plus some linting

* Add documentation of as_epidist_linelist arguments

* Move assert_class into imports and use in place of "check" class

* Documentation for epidist_validate_data.epidist_linelist

* Clear up the direct model file a bit

* Add creating the row_id back in to as_latent_individual

* Passing test-direct_model

* Start working to make data use dates

* Add start of unit tests and bug fix for datetime class check

* Use .row_id rather than row_id

* Use as_epidist_linelist_time function so that tests work with time data

* Fixes to tests

* Group into preprocessing functions

* Update FAQ vignette to run

* Update get started vignette to run

* Update ebola vignette to run

* Update approximate inference vignette to run

* Add documentation

* Methods consistency

* Document ...

* Again on ...

* Remove comment moved to issue

* Include as_epidist_linelist_time ad-hoc

* Add test for datetime column

* Update text in vignettes and add note about the ad-hoc function being included in package soon

* Refactor .rename_columns

Former-commit-id: c573ba836b76170c03d5c493cbb378781db5fa23 [formerly bac50e38d758dfe0fdcfd98722dc50a5a98c0357]
Former-commit-id: 4e9d3ee55e7e0c90bd35366990f38aa4894b3439
seabbs pushed a commit that referenced this pull request Jan 21, 2025
* Sketching out refactor

* Some restructure for clarity

* Refactor along new date approach

* Clear tests

* Refactor lines to save col_names

* Refactor validation functionality

* Redocument

* Remove epidist_validate and move to _model and _data approach plus some linting

* Add documentation of as_epidist_linelist arguments

* Move assert_class into imports and use in place of "check" class

* Documentation for epidist_validate_data.epidist_linelist

* Clear up the direct model file a bit

* Add creating the row_id back in to as_latent_individual

* Passing test-direct_model

* Start working to make data use dates

* Add start of unit tests and bug fix for datetime class check

* Use .row_id rather than row_id

* Use as_epidist_linelist_time function so that tests work with time data

* Fixes to tests

* Group into preprocessing functions

* Update FAQ vignette to run

* Update get started vignette to run

* Update ebola vignette to run

* Update approximate inference vignette to run

* Add documentation

* Methods consistency

* Document ...

* Again on ...

* Remove comment moved to issue

* Include as_epidist_linelist_time ad-hoc

* Add test for datetime column

* Update text in vignettes and add note about the ad-hoc function being included in package soon

* Refactor .rename_columns

Former-commit-id: db76be3
Former-commit-id: a92a3d606c2c51fcb2ae0fb6a4c5a4db630e668b
seabbs pushed a commit that referenced this pull request Jan 21, 2025
* Sketching out refactor

* Some restructure for clarity

* Refactor along new date approach

* Clear tests

* Refactor lines to save col_names

* Refactor validation functionality

* Redocument

* Remove epidist_validate and move to _model and _data approach plus some linting

* Add documentation of as_epidist_linelist arguments

* Move assert_class into imports and use in place of "check" class

* Documentation for epidist_validate_data.epidist_linelist

* Clear up the direct model file a bit

* Add creating the row_id back in to as_latent_individual

* Passing test-direct_model

* Start working to make data use dates

* Add start of unit tests and bug fix for datetime class check

* Use .row_id rather than row_id

* Use as_epidist_linelist_time function so that tests work with time data

* Fixes to tests

* Group into preprocessing functions

* Update FAQ vignette to run

* Update get started vignette to run

* Update ebola vignette to run

* Update approximate inference vignette to run

* Add documentation

* Methods consistency

* Document ...

* Again on ...

* Remove comment moved to issue

* Include as_epidist_linelist_time ad-hoc

* Add test for datetime column

* Update text in vignettes and add note about the ad-hoc function being included in package soon

* Refactor .rename_columns

Former-commit-id: db76be3
Former-commit-id: a92a3d606c2c51fcb2ae0fb6a4c5a4db630e668b
seabbs pushed a commit that referenced this pull request Jan 21, 2025
* Sketching out refactor

* Some restructure for clarity

* Refactor along new date approach

* Clear tests

* Refactor lines to save col_names

* Refactor validation functionality

* Redocument

* Remove epidist_validate and move to _model and _data approach plus some linting

* Add documentation of as_epidist_linelist arguments

* Move assert_class into imports and use in place of "check" class

* Documentation for epidist_validate_data.epidist_linelist

* Clear up the direct model file a bit

* Add creating the row_id back in to as_latent_individual

* Passing test-direct_model

* Start working to make data use dates

* Add start of unit tests and bug fix for datetime class check

* Use .row_id rather than row_id

* Use as_epidist_linelist_time function so that tests work with time data

* Fixes to tests

* Group into preprocessing functions

* Update FAQ vignette to run

* Update get started vignette to run

* Update ebola vignette to run

* Update approximate inference vignette to run

* Add documentation

* Methods consistency

* Document ...

* Again on ...

* Remove comment moved to issue

* Include as_epidist_linelist_time ad-hoc

* Add test for datetime column

* Update text in vignettes and add note about the ad-hoc function being included in package soon

* Refactor .rename_columns

Former-commit-id: db76be3
Former-commit-id: a92a3d606c2c51fcb2ae0fb6a4c5a4db630e668b
seabbs pushed a commit that referenced this pull request Jan 21, 2025
* Sketching out refactor

* Some restructure for clarity

* Refactor along new date approach

* Clear tests

* Refactor lines to save col_names

* Refactor validation functionality

* Redocument

* Remove epidist_validate and move to _model and _data approach plus some linting

* Add documentation of as_epidist_linelist arguments

* Move assert_class into imports and use in place of "check" class

* Documentation for epidist_validate_data.epidist_linelist

* Clear up the direct model file a bit

* Add creating the row_id back in to as_latent_individual

* Passing test-direct_model

* Start working to make data use dates

* Add start of unit tests and bug fix for datetime class check

* Use .row_id rather than row_id

* Use as_epidist_linelist_time function so that tests work with time data

* Fixes to tests

* Group into preprocessing functions

* Update FAQ vignette to run

* Update get started vignette to run

* Update ebola vignette to run

* Update approximate inference vignette to run

* Add documentation

* Methods consistency

* Document ...

* Again on ...

* Remove comment moved to issue

* Include as_epidist_linelist_time ad-hoc

* Add test for datetime column

* Update text in vignettes and add note about the ad-hoc function being included in package soon

* Refactor .rename_columns

Former-commit-id: c573ba836b76170c03d5c493cbb378781db5fa23 [formerly bac50e38d758dfe0fdcfd98722dc50a5a98c0357]
Former-commit-id: 4e9d3ee55e7e0c90bd35366990f38aa4894b3439
seabbs pushed a commit that referenced this pull request Jan 21, 2025
* Sketching out refactor

* Some restructure for clarity

* Refactor along new date approach

* Clear tests

* Refactor lines to save col_names

* Refactor validation functionality

* Redocument

* Remove epidist_validate and move to _model and _data approach plus some linting

* Add documentation of as_epidist_linelist arguments

* Move assert_class into imports and use in place of "check" class

* Documentation for epidist_validate_data.epidist_linelist

* Clear up the direct model file a bit

* Add creating the row_id back in to as_latent_individual

* Passing test-direct_model

* Start working to make data use dates

* Add start of unit tests and bug fix for datetime class check

* Use .row_id rather than row_id

* Use as_epidist_linelist_time function so that tests work with time data

* Fixes to tests

* Group into preprocessing functions

* Update FAQ vignette to run

* Update get started vignette to run

* Update ebola vignette to run

* Update approximate inference vignette to run

* Add documentation

* Methods consistency

* Document ...

* Again on ...

* Remove comment moved to issue

* Include as_epidist_linelist_time ad-hoc

* Add test for datetime column

* Update text in vignettes and add note about the ad-hoc function being included in package soon

* Refactor .rename_columns

Former-commit-id: c573ba836b76170c03d5c493cbb378781db5fa23 [formerly bac50e38d758dfe0fdcfd98722dc50a5a98c0357]
Former-commit-id: 4e9d3ee55e7e0c90bd35366990f38aa4894b3439
Former-commit-id: 84a5299
seabbs pushed a commit that referenced this pull request Jan 21, 2025
* Sketching out refactor

* Some restructure for clarity

* Refactor along new date approach

* Clear tests

* Refactor lines to save col_names

* Refactor validation functionality

* Redocument

* Remove epidist_validate and move to _model and _data approach plus some linting

* Add documentation of as_epidist_linelist arguments

* Move assert_class into imports and use in place of "check" class

* Documentation for epidist_validate_data.epidist_linelist

* Clear up the direct model file a bit

* Add creating the row_id back in to as_latent_individual

* Passing test-direct_model

* Start working to make data use dates

* Add start of unit tests and bug fix for datetime class check

* Use .row_id rather than row_id

* Use as_epidist_linelist_time function so that tests work with time data

* Fixes to tests

* Group into preprocessing functions

* Update FAQ vignette to run

* Update get started vignette to run

* Update ebola vignette to run

* Update approximate inference vignette to run

* Add documentation

* Methods consistency

* Document ...

* Again on ...

* Remove comment moved to issue

* Include as_epidist_linelist_time ad-hoc

* Add test for datetime column

* Update text in vignettes and add note about the ad-hoc function being included in package soon

* Refactor .rename_columns

Former-commit-id: db76be3
Former-commit-id: a92a3d606c2c51fcb2ae0fb6a4c5a4db630e668b
Former-commit-id: 4172d205511eeb9368959a38bf52395faa2ad703 [formerly 7296830]
Former-commit-id: 90bee30a3d61abb6c565551f0fd66b09146aa790
seabbs pushed a commit that referenced this pull request Jan 21, 2025
* Sketching out refactor

* Some restructure for clarity

* Refactor along new date approach

* Clear tests

* Refactor lines to save col_names

* Refactor validation functionality

* Redocument

* Remove epidist_validate and move to _model and _data approach plus some linting

* Add documentation of as_epidist_linelist arguments

* Move assert_class into imports and use in place of "check" class

* Documentation for epidist_validate_data.epidist_linelist

* Clear up the direct model file a bit

* Add creating the row_id back in to as_latent_individual

* Passing test-direct_model

* Start working to make data use dates

* Add start of unit tests and bug fix for datetime class check

* Use .row_id rather than row_id

* Use as_epidist_linelist_time function so that tests work with time data

* Fixes to tests

* Group into preprocessing functions

* Update FAQ vignette to run

* Update get started vignette to run

* Update ebola vignette to run

* Update approximate inference vignette to run

* Add documentation

* Methods consistency

* Document ...

* Again on ...

* Remove comment moved to issue

* Include as_epidist_linelist_time ad-hoc

* Add test for datetime column

* Update text in vignettes and add note about the ad-hoc function being included in package soon

* Refactor .rename_columns

Former-commit-id: db76be3
Former-commit-id: a92a3d606c2c51fcb2ae0fb6a4c5a4db630e668b
Former-commit-id: 4172d205511eeb9368959a38bf52395faa2ad703 [formerly 7296830]
Former-commit-id: 90bee30a3d61abb6c565551f0fd66b09146aa790
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Simplifications and refactor of add_event_vars
2 participants