-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add aggregate data support #412
Comments
Noting I still think we want to do this but also that for the #221 we will need to aggregate within the |
Note that in the parameter estimates work ( |
The marginal model internally reweights assuming the presence of a n column. Regardless of any data structure considerations it needs to do this for efficiency purposes. At the moment the linelist data converter always creates this column and assigns it a value of 1. A simple change is to make this not be the case i.e. it doesn't overwrite an existing n column and doc this. There are upsides and downsides to this approach. The main downside to this is the latent or other models which then need to be coded to do something with the n variable. The alternative is to create a wrapper converter or a different data class. |
@seabbs -- is this the preprocessing issue you were referring to when we just talked face to face? If so, I do think this is high-priority. For CFA, I think the immediate priority is to be able to pass the large data that Adam is describing into the marginal model, and run it. To me, it seems like an epidist_aggregate class that can be passed into the epidist call, or the change to the n column that you just suggested, are the easiest solutions that help us meet this goal. Re:
I think this could be nice to have for a general audience, but I don't know that it's necessary for our current use case if there is some kind of validator in the dispatch process that is sufficiently flexible, and that throws an helpful error if manual aggregation is done wrong. I am a little uncomfortable requiring the aggregation to be part of the epidist call -- this feels like a separation of concerns trap that could bite us later. Maybe this is more of a decision point if the decision is to create an entirely new class than if you decide to modify the n column (which I agree could be more elegant and would require less preprocessing code). Please chime in after having a think if you have a proposed path forward, and please let us know what your timeline looks like on this! |
Yes |
This is already implemented/in place, has a message, and some warnings built around it. If you don't do it this way experimenting with multiple models becomes very painful as you need to update the data each time. I've tested a fair few edge cases and yet to find an issue but very open to the idea there are some. |
This is the messaging: Line 159 in e18a77e
for the internal aggregation. |
This is the change required to in take user specified weights: Line 36 in e18a77e
as you can see as this is in the marginal vs data class we can overload the linelist class with an extra optional feature if we wish. This might make it hard to use but is probably worth a go in the first instance to avoid class bloat.
I think this is the safe option for defining a aggregate linelist but I am not sure how much safer it really is (i.e. is it worth the technical debt). As above I don't think there is an issue in terms of the formula aggregation as this is a modelling detail. Noting that not doing this correctly has potential performance implications of several orders of magnitude which is I think a strong argument for not making it optional. I will take a look at adding this in the morning. |
I'm not totally following this, but looking forward to hearing more about your thinking on implementation. |
Flagging #512 as a potential sub issue here. As #507 is now closed native use of counts is now possible in the marginal model with no performance penalty. Remaining work to close this issue (pending #512) is to make a aggregate data wrapper as discussed here and fleshed out in #508. The main upside to this is user safety and clarity (i.e. via docs and a single pathway for a given data source. It is possible that #508 will do enough along these lines to close #512. |
Work to wrap this up is here: #513 Note I may change the recently introduce |
Nice thanks for the sanity check @athowes |
In PR #390 we added a class for
epidist_linelist
data at the individual level.We should also add class for aggregate linelist data.
It could be aggregated over delays:
Or aggregated over aspects of the window like:
Note that this type of data is appearing in our work on estimating right truncation delays. We also need models for this type of data soon.
The text was updated successfully, but these errors were encountered: