
Add aggregate data support #412

Open
athowes opened this issue Nov 1, 2024 · 13 comments
Labels
high Required for next release

Comments

@athowes (Collaborator) commented Nov 1, 2024

In PR #390 we added a class for epidist_linelist data at the individual level.

We should also add a class for aggregate linelist data.

It could be aggregated over delays:

| Delay (days) | Count |
| --- | --- |
| 1 | 20 |
| 2 | 35 |
| 3 | 50 |
| 4 | 40 |
| 5 | 60 |
| ... | ... |

Or aggregated over aspects of the observation window, like:

| Primary Event (date) | Secondary Event (date) | Count |
| --- | --- | --- |
| 2024-10-21 | 2024-10-23 | 15 |
| 2024-10-21 | 2024-10-25 | 20 |
| 2024-10-22 | 2024-10-24 | 25 |
| 2024-10-22 | 2024-10-26 | 30 |
| 2024-10-23 | 2024-10-25 | 35 |
| 2024-10-23 | 2024-10-27 | 40 |
| ... | ... | ... |

Note that this type of data is appearing in our work on estimating right truncation delays. We also need models for this type of data soon.
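For illustration, aggregating an individual-level linelist into the second format above is a simple count over event-date pairs. Column names here (`primary_event`, `secondary_event`, `n`) are illustrative, not the `epidist_linelist` schema:

```r
library(dplyr)

# Hypothetical individual-level linelist: one row per case.
linelist <- data.frame(
  primary_event = as.Date(
    c("2024-10-21", "2024-10-21", "2024-10-21", "2024-10-22")
  ),
  secondary_event = as.Date(
    c("2024-10-23", "2024-10-23", "2024-10-25", "2024-10-24")
  )
)

# Aggregate to one row per (primary, secondary) event pair with a count column.
aggregated <- linelist |>
  count(primary_event, secondary_event, name = "n")

nrow(aggregated)  # 3 unique event pairs, counts summing to 4 cases
```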

@athowes athowes added the medium Nice to have for next release label Nov 13, 2024
@seabbs (Contributor) commented Nov 24, 2024

Noting I still think we want to do this, but also that for #221 we will need to aggregate within the epidist call (the optimal aggregation depends on the formula given for the model, and this will be less error prone if we do it for the user).

@athowes (Collaborator, Author) commented Jan 13, 2025

Note that in the parameter estimates work (cfa-parameter-estimates) we have aggregate data with around 100K rows. Expanding it to a linelist (so that we can then aggregate within the epidist call) would turn it into e.g. 10,000K to 100,000K rows. As such, it would be good to have a way to provide the data to epidist already aggregated. (I might look into pulling out parts and writing a hacky version of this in the first instance.)

@seabbs (Contributor) commented Jan 14, 2025

The marginal model internally reweights assuming the presence of an `n` column. Regardless of any data structure considerations, it needs to do this for efficiency. At the moment the linelist data converter always creates this column and assigns it a value of 1. A simple change is to make this conditional, i.e. it doesn't overwrite an existing `n` column, and to document this.

There are upsides and downsides to this approach. The main downside is that the latent or other models then need to be coded to do something with the `n` variable. The alternative is to create a wrapper converter or a different data class.
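A minimal sketch of the suggested converter change, with an assumed function name (not the actual epidist internals): only create the `n` column when it is absent, so a user-supplied aggregate count survives conversion.

```r
# Sketch only: create a count column `n` with value 1 when absent,
# but never overwrite a user-supplied `n` (hypothetical helper name).
add_default_n <- function(data) {
  if (!"n" %in% names(data)) {
    data$n <- 1
  }
  data
}

# Individual-level data gets n = 1 per row; pre-aggregated data is untouched.
add_default_n(data.frame(delay = c(1, 2)))$n                 # c(1, 1)
add_default_n(data.frame(delay = c(1, 2), n = c(20, 35)))$n  # c(20, 35)
```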

@kgostic (Collaborator) commented Jan 14, 2025

@seabbs -- is this the preprocessing issue you were referring to when we just talked face to face? If so, I do think this is high-priority.

For CFA, I think the immediate priority is to be able to pass the large data that Adam is describing into the marginal model, and run it.

To me, it seems like an epidist_aggregate class that can be passed into the epidist call, or the change to the n column that you just suggested, are the easiest solutions that help us meet this goal.

Re:

> Noting I still think we want to do this but also that for the #221 we will need to aggregate within the epidist call (as the optimal aggregation is related to the formula given for the model and this is something that will be less error prone if we do it for the user)

I think this could be nice to have for a general audience, but I don't know that it's necessary for our current use case if there is some kind of validator in the dispatch process that is sufficiently flexible, and that throws a helpful error if manual aggregation is done wrong. I am a little uncomfortable requiring the aggregation to be part of the epidist call -- this feels like a separation of concerns trap that could bite us later. Maybe this is more of a decision point if the decision is to create an entirely new class than if you decide to modify the `n` column (which I agree could be more elegant and would require less preprocessing code).

Please chime in after having a think if you have a proposed path forward, and please let us know what your timeline looks like on this!
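For concreteness, the kind of validator being discussed could look like the following sketch (function and column names are assumed, not epidist's actual API): fail fast with a helpful error when a manually aggregated dataset has a malformed count column.

```r
# Hypothetical validator: check a user-supplied count column before fitting.
assert_valid_counts <- function(data, col = "n") {
  if (!col %in% names(data)) {
    stop("Aggregate data must contain a `", col, "` count column.",
         call. = FALSE)
  }
  n <- data[[col]]
  if (any(is.na(n)) || any(n < 1) || any(n != round(n))) {
    stop("`", col, "` must contain positive whole-number counts.",
         call. = FALSE)
  }
  invisible(data)
}
```

A check like this lets pre-aggregated data flow straight into the model call while still catching the most common manual-aggregation mistakes (missing, negative, fractional, or NA counts).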

@seabbs (Contributor) commented Jan 14, 2025

> is this the preprocessing issue you were referring to when we just talked face to face? If so, I do think this is high-priority.

Yes

@seabbs (Contributor) commented Jan 14, 2025

> I think this could be nice to have for a general audience, but I don't know that it's necessary for our current use case if there is some kind of validator in the dispatch process that is sufficiently flexible, and that throws a helpful error if manual aggregation is done wrong

This is already implemented/in place, with a message and some warnings built around it. If you don't do it this way, experimenting with multiple models becomes very painful, as you need to update the data each time. I've tested a fair few edge cases and have yet to find an issue, but I'm very open to the idea that there are some.

@seabbs (Contributor) commented Jan 14, 2025

This is the messaging for the internal aggregation: see `epidist_transform_data_model.epidist_marginal_model()`.

@seabbs (Contributor) commented Jan 14, 2025

This is the change required to take in user-specified weights:

As you can see, since this sits in the marginal model rather than the data class, we can overload the linelist class with an extra optional feature if we wish. This might make it harder to use, but is probably worth a go in the first instance to avoid class bloat.

> Maybe this is more of a decision point if the decision is to create an entirely new class than if you decide to modify the n column (which I agree could be more elegant and would require less preprocessing code).

I think this is the safe option for defining an aggregate linelist, but I am not sure how much safer it really is (i.e. whether it is worth the technical debt). As above, I don't think there is an issue in terms of the formula aggregation, as this is a modelling detail. Note that not doing this correctly has potential performance implications of several orders of magnitude, which I think is a strong argument for not making it optional.

I will take a look at adding this in the morning.

@kgostic kgostic added high Required for next release and removed medium Nice to have for next release labels Jan 15, 2025
@kgostic (Collaborator) commented Jan 15, 2025

> As above I don't think there is an issue in terms of the formula aggregation as this is a modelling detail. Noting that not doing this correctly has potential performance implications of several orders of magnitude which is I think a strong argument for not making it optional.

I'm not totally following this, but looking forward to hearing more about your thinking on implementation.

@seabbs seabbs changed the title Add aggregate data class Add aggregate data support Jan 22, 2025
@seabbs (Contributor) commented Jan 28, 2025

Flagging #512 as a potential sub-issue here. As #507 is now closed, native use of counts is now possible in the marginal model with no performance penalty. The remaining work to close this issue (pending #512) is to make an aggregate data wrapper, as discussed here and fleshed out in #508. The main upside to this is user safety and clarity (i.e. via docs and a single pathway for a given data source). It is possible that #508 will do enough along these lines to close #512.

@seabbs (Contributor) commented Jan 28, 2025

Work to wrap this up is here: #513

Note I may change the recently introduced `weight` argument to `weights`, to be in line with `tidyr::uncount()`. As this is a new and experimental feature I would do this without deprecation, so anyone at the bleeding edge may see a breaking change.
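For context on the proposed naming: `tidyr::uncount()` performs the inverse operation (expanding aggregate counts back into one row per observation), and its count column is passed via an argument named `weights`, hence the rename suggested above.

```r
library(tidyr)

# Aggregate delay data with a count column.
counts <- data.frame(delay = c(1, 2), n = c(2, 3))

# tidyr::uncount() expands counts to one row per observation; the count
# column is supplied through its `weights` argument (and dropped by default).
expanded <- uncount(counts, weights = n)
nrow(expanded)  # 5
```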

@athowes (Collaborator, Author) commented Jan 29, 2025

Copy of findings from testing this on real data. Topline: same results with both duplicating rows and using the `weights` argument, so this seems good to me.

- Note: for the time taken, I did have to click "install tools" for both, which could have changed things a little (so I think the run times are really about the same)

For the manual row duplication

```
> difftime(end, start, units = "secs")
Time difference of 132.9298 secs
>
> summary(fit_manual)
 Family: marginal_lognormal
  Links: mu = identity; sigma = log
Formula: delay_lwr | weights(n) + vreal(relative_obs_time, pwindow, swindow, delay_upr) ~ 1
         sigma ~ 1
   Data: transformed_data (Number of observations: 371)
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
                Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept           0.32      0.02     0.29     0.36 1.00     2659     2411
sigma_Intercept    -0.06      0.01    -0.09    -0.04 1.00     3486     2607

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
```

For the `weights` argument

```
> difftime(end, start, units = "secs")
Time difference of 125.7883 secs
>
> summary(fit)
 Family: marginal_lognormal
  Links: mu = identity; sigma = log
Formula: delay_lwr | weights(n) + vreal(relative_obs_time, pwindow, swindow, delay_upr) ~ 1
         sigma ~ 1
   Data: transformed_data (Number of observations: 371)
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
                Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept           0.32      0.02     0.29     0.36 1.00     4076     2636
sigma_Intercept    -0.06      0.01    -0.09    -0.03 1.00     3525     2616

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
```

@seabbs (Contributor) commented Jan 29, 2025

Nice, thanks for the sanity check @athowes
