query for newest vintage of each variable #2
Comments
Yep. I think this is a good idea -- you're right that most people won't care about the vintages.
I agree! I think we should probably have the current endpoints be renamed to
This is handled in valorumdata/cmdc-tools#17
Live!
Thank you for quickly rolling out this improvement :-)
On closer inspection this didn't quite work out as I expected. I installed https://github.com/valorumdata/cmdc.py/archive/ba6f0c56f84049f3ec4b016d08b9975b74531965.zip and then did this:
I was expecting that each <fips, dt> pair would appear no more than once, but fips=6037 appears twice. Looking at one day only I see:
Hey @TomGoBravo I think the problem here is that some variables come from different sources. The sources may differ in when their data becomes available, so the vintages can potentially differ. If you run your example right now you should see that, for each variable, only one of the rows has a non-null entry. This happens when we pivot the data from long form (vintage, dt, fips, variable, value) to wide form (each variable becomes its own column). I don't know right now how to resolve this, other than dropping the vintage column. Given that we state this is the "most recent covid data" we have, perhaps that is the right answer? I'm not totally satisfied with that answer. Any ideas @TomGoBravo and @cc7768?
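The pivot problem described above can be reproduced on toy data. This is a minimal sketch with made-up values and the column names stated in the comment (vintage, dt, fips, variable, value): two sources report different variables for the same <fips, dt> but with different vintages, so the wide form ends up with two rows, each non-null only for its own source's variables.

```python
import pandas as pd

# Hypothetical long-form records: two sources cover the same (fips, dt)
# pair, but each source has its own vintage and its own variable.
long_df = pd.DataFrame({
    "vintage": ["2020-06-01", "2020-06-02"],
    "dt": ["2020-05-30", "2020-05-30"],
    "fips": [6037, 6037],
    "variable": ["deaths_total", "tests_total"],
    "value": [2384.0, 1_000_000.0],
})

# Pivot to wide form with vintage kept in the row key: because the two
# vintages differ, the single (fips, dt) pair splits into two rows.
wide = long_df.pivot_table(
    index=["vintage", "dt", "fips"], columns="variable", values="value"
).reset_index()

print(wide)  # two rows for fips=6037, each NaN in the other source's column
```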
Agreed, this is a difficult question of normalization and of the granularity at which updates are bundled :-) I think it makes sense to emit one row for each <fips, dt> that appears in the data, where that row contains the latest value for each variable, then drop the vintage column because the row may contain a mix of vintages. Or maybe include vintage when all values share it, otherwise make it NaN. Or represent vintage as two columns holding the oldest and newest vintage.
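The suggestion above (one row per <fips, dt>, latest non-null value per variable, vintage replaced by oldest/newest columns) could be sketched as follows. Column names and values are assumed for illustration; within each group, pandas' grouped `last` takes the last non-null entry per column after sorting by vintage.

```python
import pandas as pd

# Assumed wide-form input: one (fips, dt) pair split across two vintages,
# each row non-null only for the variables its source reported.
wide = pd.DataFrame({
    "vintage": ["2020-06-01", "2020-06-02"],
    "dt": ["2020-05-30", "2020-05-30"],
    "fips": [6037, 6037],
    "deaths_total": [2384.0, None],
    "tests_total": [None, 1_000_000.0],
})

# Sort so the newest vintage comes last, then collapse each (fips, dt)
# group: "last" keeps the most recent non-null value of each variable,
# and the vintage column is summarized as an oldest/newest pair.
collapsed = (
    wide.sort_values("vintage")
    .groupby(["fips", "dt"])
    .agg(
        vintage_min=("vintage", "min"),
        vintage_max=("vintage", "max"),
        deaths_total=("deaths_total", "last"),
        tests_total=("tests_total", "last"),
    )
    .reset_index()
)
```

One row per <fips, dt> survives, with values drawn from different vintages; the vintage_min/vintage_max pair records that mix instead of a single misleading vintage.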
@TomGoBravo would it work for you if the

The data user in me squirms a bit at that thought because I'm very picky about knowing exactly what data I have and where it came from. However, ultimately I'm not sure whether you would avoid using a data point that is one day old, for example.
I have a hunch that https://github.com/covid-projections/covid-data-public/blob/master/scripts/update_cmdc.py#L94-L96 doesn't merge the rows with different vintages correctly. The current wide form returned by the endpoint is a table where there may be more than one cell for each <dt, fips, variable>, each with a different vintage. I agree that it is nice to have the source and vintage info, but that is less important than easy access to the best value. In other words: yes, please remove the vintage as a key in the wide form. Perhaps return the long form if you'd like to include the vintage per cell.
The most picky data consumers only trust append-only logs. :-P
@TomGoBravo this should be resolved now with the new endpoint. There are two differences between the endpoints:
Once the CAN team has had a chance to migrate, we'll remove the
Thank you for gathering and sharing this data in an easy-to-use form.
Taking a look at:
I see that different fips have different vintages of data available. For example in the data I fetched yesterday the state fips and CA/06 counties have 4 vintages 2020-05-29 to 2020-06-01 but AK/02 counties have only 2020-05-30. I get that it is handy to keep past versions of the data for debugging but suspect most clients only want the latest and greatest version of each timeseries.
I'm guessing you expect data consumers to group by ['fips', 'dt'], sort by 'vintage' then keep only the newest row. How does
```python
df.sort_values('vintage').groupby(['fips', 'dt']).last()
```
look to you? Can you make fetching old vintages optional?
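The proposed snippet can be checked on toy data (column names and values assumed for illustration): after sorting by vintage, the grouped `last` keeps, for each <fips, dt>, the values from the newest vintage.

```python
import pandas as pd

# Hypothetical frame with several vintages of the same (fips, dt) rows.
df = pd.DataFrame({
    "fips": [6037, 6037, 2013, 6037],
    "dt": ["2020-05-28"] * 4,
    "vintage": ["2020-05-29", "2020-06-01", "2020-05-30", "2020-05-30"],
    "deaths_total": [2380.0, 2384.0, 1.0, 2382.0],
})

# Sort so newest vintage is last within each group, then keep it.
latest = df.sort_values("vintage").groupby(["fips", "dt"]).last()
```

One caveat: grouped `last` takes the last non-null entry per column, so if the newest vintage has a NaN for some variable, that cell falls back to an older vintage, and a single output row can mix vintages -- exactly the subtlety discussed above.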