
Add aggregate data support #412

Open
athowes opened this issue Nov 1, 2024 · 13 comments
Labels
high Required for next release

Comments

@athowes (Collaborator) commented Nov 1, 2024

In PR #390 we added a class for epidist_linelist data at the individual level.

We should also add a class for aggregate linelist data.

It could be aggregated over delays:

| Delay (days) | Count |
| --- | --- |
| 1 | 20 |
| 2 | 35 |
| 3 | 50 |
| 4 | 40 |
| 5 | 60 |
| ... | ... |

Or aggregated over aspects of the observation window, like:

| Primary Event (date) | Secondary Event (date) | Count |
| --- | --- | --- |
| 2024-10-21 | 2024-10-23 | 15 |
| 2024-10-21 | 2024-10-25 | 20 |
| 2024-10-22 | 2024-10-24 | 25 |
| 2024-10-22 | 2024-10-26 | 30 |
| 2024-10-23 | 2024-10-25 | 35 |
| 2024-10-23 | 2024-10-27 | 40 |
| ... | ... | ... |

Note that this type of data is appearing in our work on estimating right truncation delays. We also need models for this type of data soon.
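For illustration, aggregating an individual-level linelist into the second format above is a simple count over event-date pairs. Column names here (`primary_event`, `secondary_event`, `n`) are illustrative, not the `epidist_linelist` schema:

```r
library(dplyr)

# Hypothetical individual-level linelist: one row per case.
linelist <- data.frame(
  primary_event = as.Date(
    c("2024-10-21", "2024-10-21", "2024-10-21", "2024-10-22")
  ),
  secondary_event = as.Date(
    c("2024-10-23", "2024-10-23", "2024-10-25", "2024-10-24")
  )
)

# Aggregate to one row per (primary, secondary) event pair with a count column.
aggregated <- linelist |>
  count(primary_event, secondary_event, name = "n")

nrow(aggregated)  # 3 unique event pairs, counts summing to 4 cases
```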

@athowes athowes added the medium Nice to have for next release label Nov 13, 2024
@seabbs (Contributor) commented Nov 24, 2024

Noting I still think we want to do this, but also that for #221 we will need to aggregate within the epidist call (the optimal aggregation depends on the formula given for the model, and this will be less error prone if we do it for the user).

@athowes (Collaborator, Author) commented Jan 13, 2025

Note that in the parameter estimates work (cfa-parameter-estimates) we have aggregate data with around 100K rows. Expanding it to a linelist (so that we can then aggregate within the epidist call) would turn it into e.g. 10,000K to 100,000K rows. As such, it would be good to have a way to provide the data to epidist already aggregated. (I might look into pulling out parts and writing a hacky version of this in the first instance.)

@seabbs (Contributor) commented Jan 14, 2025

The marginal model internally reweights assuming the presence of an `n` column. Regardless of any data structure considerations, it needs to do this for efficiency. At the moment the linelist data converter always creates this column and assigns it a value of 1. A simple change is to make this conditional, i.e. it doesn't overwrite an existing `n` column, and to document this.

There are upsides and downsides to this approach. The main downside is that the latent or other models then need to be coded to do something with the `n` variable. The alternative is to create a wrapper converter or a different data class.
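A minimal sketch of the suggested converter change, with an assumed function name (not the actual epidist internals): only create the `n` column when it is absent, so a user-supplied aggregate count survives conversion.

```r
# Sketch only: create a count column `n` with value 1 when absent,
# but never overwrite a user-supplied `n` (hypothetical helper name).
add_default_n <- function(data) {
  if (!"n" %in% names(data)) {
    data$n <- 1
  }
  data
}

# Individual-level data gets n = 1 per row; pre-aggregated data is untouched.
add_default_n(data.frame(delay = c(1, 2)))$n                 # c(1, 1)
add_default_n(data.frame(delay = c(1, 2), n = c(20, 35)))$n  # c(20, 35)
```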

@kgostic (Collaborator) commented Jan 14, 2025

@seabbs -- is this the preprocessing issue you were referring to when we just talked face to face? If so, I do think this is high-priority.

For CFA, I think the immediate priority is to be able to pass the large data that Adam is describing into the marginal model, and run it.

To me, it seems like an epidist_aggregate class that can be passed into the epidist call, or the change to the n column that you just suggested, are the easiest solutions that help us meet this goal.

Re:

> Noting I still think we want to do this but also that for the #221 we will need to aggregate within the epidist call (as the optimal aggregation is related to the formula given for the model and this is something that will be less error prone if we do it for the user)

I think this could be nice to have for a general audience, but I don't know that it's necessary for our current use case if there is some kind of validator in the dispatch process that is sufficiently flexible, and that throws a helpful error if manual aggregation is done wrong. I am a little uncomfortable requiring the aggregation to be part of the epidist call -- this feels like a separation of concerns trap that could bite us later. Maybe this is more of a decision point if the decision is to create an entirely new class than if you decide to modify the `n` column (which I agree could be more elegant and would require less preprocessing code).

Please chime in after having a think if you have a proposed path forward, and please let us know what your timeline looks like on this!
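For concreteness, the kind of validator being discussed could look like the following sketch (function and column names are assumed, not epidist's actual API): fail fast with a helpful error when a manually aggregated dataset has a malformed count column.

```r
# Hypothetical validator: check a user-supplied count column before fitting.
assert_valid_counts <- function(data, col = "n") {
  if (!col %in% names(data)) {
    stop("Aggregate data must contain a `", col, "` count column.",
         call. = FALSE)
  }
  n <- data[[col]]
  if (any(is.na(n)) || any(n < 1) || any(n != round(n))) {
    stop("`", col, "` must contain positive whole-number counts.",
         call. = FALSE)
  }
  invisible(data)
}
```

A check like this lets pre-aggregated data flow straight into the model call while still catching the most common manual-aggregation mistakes (missing, negative, fractional, or NA counts).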

@seabbs (Contributor) commented Jan 14, 2025

> is this the preprocessing issue you were referring to when we just talked face to face? If so, I do think this is high-priority.

Yes

@seabbs (Contributor) commented Jan 14, 2025

> I think this could be nice to have for a general audience, but I don't know that it's necessary for our current use case if there is some kind of validator in the dispatch process that is sufficiently flexible, and that throws a helpful error if manual aggregation is done wrong

This is already implemented/in place, with a message and some warnings built around it. If you don't do it this way, experimenting with multiple models becomes very painful, as you need to update the data each time. I've tested a fair few edge cases and have yet to find an issue, but I'm very open to the idea that there are some.

@seabbs (Contributor) commented Jan 14, 2025

This is the messaging for the internal aggregation: see `epidist_transform_data_model.epidist_marginal_model()`.

@seabbs (Contributor) commented Jan 14, 2025

This is the change required to take in user-specified weights:

As you can see, since this sits in the marginal model rather than the data class, we can overload the linelist class with an extra optional feature if we wish. This might make it harder to use, but is probably worth a go in the first instance to avoid class bloat.

> Maybe this is more of a decision point if the decision is to create an entirely new class than if you decide to modify the n column (which I agree could be more elegant and would require less preprocessing code).

I think this is the safe option for defining an aggregate linelist, but I am not sure how much safer it really is (i.e. whether it is worth the technical debt). As above, I don't think there is an issue in terms of the formula aggregation, as this is a modelling detail. Note that not doing this correctly has potential performance implications of several orders of magnitude, which I think is a strong argument for not making it optional.

I will take a look at adding this in the morning.

@kgostic kgostic added high Required for next release and removed medium Nice to have for next release labels Jan 15, 2025
@kgostic (Collaborator) commented Jan 15, 2025

> As above I don't think there is an issue in terms of the formula aggregation as this is a modelling detail. Noting that not doing this correctly has potential performance implications of several orders of magnitude which is I think a strong argument for not making it optional.

I'm not totally following this, but looking forward to hearing more about your thinking on implementation.

@seabbs seabbs changed the title Add aggregate data class Add aggregate data support Jan 22, 2025
@seabbs (Contributor) commented Jan 28, 2025

Flagging #512 as a potential sub-issue here. As #507 is now closed, native use of counts is now possible in the marginal model with no performance penalty. The remaining work to close this issue (pending #512) is to make an aggregate data wrapper, as discussed here and fleshed out in #508. The main upside to this is user safety and clarity (i.e. via docs and a single pathway for a given data source). It is possible that #508 will do enough along these lines to close #512.

@seabbs (Contributor) commented Jan 28, 2025

Work to wrap this up is here: #513

Note I may change the recently introduced `weight` argument to `weights`, to be in line with `tidyr::uncount()`. As this is a new and experimental feature I would do this without deprecation, so anyone at the bleeding edge may see a breaking change.
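For context on the proposed naming: `tidyr::uncount()` performs the inverse operation (expanding aggregate counts back into one row per observation), and its count column is passed via an argument named `weights`, hence the rename suggested above.

```r
library(tidyr)

# Aggregate delay data with a count column.
counts <- data.frame(delay = c(1, 2), n = c(2, 3))

# tidyr::uncount() expands counts to one row per observation; the count
# column is supplied through its `weights` argument (and dropped by default).
expanded <- uncount(counts, weights = n)
nrow(expanded)  # 5
```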

@athowes (Collaborator, Author) commented Jan 29, 2025

Copy of findings from testing this on real data. Topline: same results with both duplicating rows and using the `weights` argument, so this seems good to me.

- Note: for the time taken, I did have to click "install tools" for both, which could have changed things a little (so I think the run times are really about the same)

For the manual row duplication

```
> difftime(end, start, units = "secs")
Time difference of 132.9298 secs
>
> summary(fit_manual)
 Family: marginal_lognormal
  Links: mu = identity; sigma = log
Formula: delay_lwr | weights(n) + vreal(relative_obs_time, pwindow, swindow, delay_upr) ~ 1
         sigma ~ 1
   Data: transformed_data (Number of observations: 371)
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
                Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept           0.32      0.02     0.29     0.36 1.00     2659     2411
sigma_Intercept    -0.06      0.01    -0.09    -0.04 1.00     3486     2607

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
```

For the `weights` argument

```
> difftime(end, start, units = "secs")
Time difference of 125.7883 secs
>
> summary(fit)
 Family: marginal_lognormal
  Links: mu = identity; sigma = log
Formula: delay_lwr | weights(n) + vreal(relative_obs_time, pwindow, swindow, delay_upr) ~ 1
         sigma ~ 1
   Data: transformed_data (Number of observations: 371)
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
                Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept           0.32      0.02     0.29     0.36 1.00     4076     2636
sigma_Intercept    -0.06      0.01    -0.09    -0.03 1.00     3525     2616

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
```

@seabbs (Contributor) commented Jan 29, 2025

Nice, thanks for the sanity check @athowes
