This repository was archived by the owner on Aug 1, 2020. It is now read-only.

query for newest vintage of each variable #2

Open
TomGoBravo opened this issue Jun 2, 2020 · 12 comments
@TomGoBravo

Thank you for gathering and sharing this data in an easy-to-use form.

Taking a look at:

import cmdc

c = cmdc.Client()
c.covid()
df = c.fetch()
df.groupby(['fips', 'vintage']).size()

I see that different fips have different vintages of data available. For example, in the data I fetched yesterday, the state fips and CA/06 counties have 4 vintages (2020-05-29 to 2020-06-01), but AK/02 counties have only 2020-05-30. I get that it is handy to keep past versions of the data for debugging, but I suspect most clients only want the latest and greatest version of each timeseries.

I'm guessing you expect data consumers to group by ['fips', 'dt'], sort by 'vintage' then keep only the newest row. How does df.sort_values('vintage').groupby(['fips', 'dt']).last() look to you? Can you make fetching old vintages optional?
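A minimal sketch of what that expression does, using a toy wide-form frame with made-up values (the column names `cases_total` and `deaths_total` are assumptions for illustration, not taken from the real endpoint). One detail worth knowing: pandas `GroupBy.last()` skips nulls per column by default, so the result can mix values from different vintages, whereas `drop_duplicates(keep="last")` keeps the newest row as a whole.

```python
import numpy as np
import pandas as pd

# Toy wide-form data (made-up values): fips 6 has two vintages for one dt.
df = pd.DataFrame({
    "vintage": pd.to_datetime(["2020-06-01", "2020-05-31", "2020-06-01"]),
    "dt": pd.to_datetime(["2020-05-15", "2020-05-15", "2020-05-15"]),
    "fips": [6, 6, 2],
    "cases_total": [np.nan, 100.0, 7.0],
    "deaths_total": [5.0, 4.0, np.nan],
})

# Newest non-null value per column within each <fips, dt> group.
latest = df.sort_values("vintage").groupby(["fips", "dt"]).last()

# Alternative: keep the newest row intact, nulls included.
strict = df.sort_values("vintage").drop_duplicates(["fips", "dt"], keep="last")

print(latest)
print(strict)
```

In this toy data, `latest` fills `cases_total` for fips 6 from the older vintage, while `strict` leaves it null because the newest row has no value there.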

@cc7768 cc7768 self-assigned this Jun 3, 2020
@cc7768
Contributor

cc7768 commented Jun 3, 2020

Yep. I think this is a good idea -- You're right that most people won't care about the vintages.

@sglyon
Member

sglyon commented Jun 3, 2020

I agree! I think we should probably rename the current endpoints to covid_history or covid_historical and have the covid endpoint return only the most recent data.

@sglyon
Member

sglyon commented Jun 3, 2020

This is handled in valorumdata/cmdc-tools#17

@sglyon
Member

sglyon commented Jun 3, 2020

Live!

@sglyon sglyon closed this as completed Jun 3, 2020
@TomGoBravo
Author

Thank you for quickly rolling out this improvement :-)

@TomGoBravo
Author

On closer inspection this didn't quite work out as I expected. I installed https://github.com/valorumdata/cmdc.py/archive/ba6f0c56f84049f3ec4b016d08b9975b74531965.zip and then did this:

c = cmdc.Client()
c.covid()
df = c.fetch()

df.sort_values('vintage').groupby(['fips', 'dt']).size().loc[lambda df: df > 1]
fips  dt        
6037  2020-04-01    2
      2020-04-02    2
      2020-04-03    2
      2020-04-04    2
      2020-04-05    2
                   ..
      2020-05-27    2
      2020-05-28    2
      2020-05-29    2
      2020-05-30    2
      2020-05-31    2
Length: 61, dtype: int64

I expected each <fips, dt> pair to appear at most once, but fips=6037 appears twice for each date. Looking at one day only I see:

>>> df.loc[(df.fips==6037) & (df.dt == "2020-05-15")]
variable    vintage         dt  fips  active_total  ...  negative_tests_total  positive_tests_total  recovered_total  ventilators_in_use_covid_total
2257     2020-06-02 2020-05-15  6037           NaN  ...              405323.0               39019.0              NaN                             NaN
8531     2020-06-03 2020-05-15  6037           NaN  ...                   NaN                   NaN              NaN                             NaN

@cc7768 cc7768 reopened this Jun 3, 2020
@sglyon
Member

sglyon commented Jun 4, 2020

Hey @TomGoBravo I think the problem here is that some variables come from different sources. The sources may have a difference in when the data is available - thus the vintage can potentially be different.

If you run your example right now you should see that for each variable only one of the rows has a non-null entry.

This happens when we pivot the data from long form (vintage, dt, fips, variable, value) to wide form (where each variable becomes its own column).
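A small sketch of how that pivot can produce two rows for one <fips, dt>, assuming the long form has exactly the five columns named above. The two test totals are the values shown earlier in this thread; `cases_total` and its value are made up to stand in for a variable that arrives with a later vintage.

```python
import pandas as pd

# Long form: one row per (vintage, dt, fips, variable).
long_df = pd.DataFrame({
    "vintage": pd.to_datetime(["2020-06-02", "2020-06-02", "2020-06-03"]),
    "dt": pd.to_datetime(["2020-05-15"] * 3),
    "fips": [6037] * 3,
    "variable": ["negative_tests_total", "positive_tests_total", "cases_total"],
    "value": [405323.0, 39019.0, 1.0],  # cases_total value is hypothetical
})

# Pivot to wide form while keeping vintage as part of the row key.
wide = (
    long_df.set_index(["vintage", "dt", "fips", "variable"])["value"]
    .unstack("variable")
    .reset_index()
)
print(wide)
```

Because vintage stays in the row key, each source's variables land on their own row and every other column in that row is null.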

I don't know right now how to resolve this, other than dropping the vintage column. Given that we state this is "most recent covid data" we have, perhaps that is the right answer? I'm not totally satisfied with that answer.

Any ideas @TomGoBravo and @cc7768 ?

@TomGoBravo
Author

Agreed, this is a difficult question of normalization and the granularity at which updates are bundled :-) I think it makes sense to make one row for each <fips, dt> that appears in your data, where that row contains the latest value for each variable, then drop the vintage column because the row may contain a mix of vintage values. Or maybe include vintage when all are equal, otherwise make it NaN. Or represent vintage as two columns with the oldest and newest vintage.
For CovidActNow I don't think vintage is as much a concern as keeping immutable snapshots of what our pipeline has ingested so anybody can reproduce our results. Long term I think a model where each source produces an append-only log that represents a gradually evolving snapshot makes a lot of sense.
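The one-row-per-<fips, dt> idea with oldest and newest vintage columns could look roughly like the sketch below. It assumes long-form input with columns (vintage, dt, fips, variable, value); the column names `vintage_oldest` and `vintage_newest` and the older `positive_tests_total` value are invented for illustration.

```python
import pandas as pd

long_df = pd.DataFrame({
    "vintage": pd.to_datetime(["2020-06-02", "2020-06-03", "2020-06-02"]),
    "dt": pd.to_datetime(["2020-05-15"] * 3),
    "fips": [6037] * 3,
    "variable": ["positive_tests_total", "positive_tests_total",
                 "negative_tests_total"],
    "value": [39000.0, 39019.0, 405323.0],  # 39000.0 is a made-up older value
})

# Keep only the newest vintage of each <fips, dt, variable> cell.
newest = long_df.sort_values("vintage").drop_duplicates(
    ["fips", "dt", "variable"], keep="last"
)

# One row per <fips, dt>, one column per variable.
wide = newest.set_index(["fips", "dt", "variable"])["value"].unstack("variable")

# Record the range of vintages that contributed to each row.
vintages = newest.groupby(["fips", "dt"])["vintage"].agg(["min", "max"])
vintages.columns = ["vintage_oldest", "vintage_newest"]

result = wide.join(vintages).reset_index()
print(result)
```

When `vintage_oldest` equals `vintage_newest`, the whole row comes from a single vintage; otherwise the row mixes sources.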

@sglyon
Member

sglyon commented Jun 11, 2020

@TomGoBravo would it work for you if the covid endpoint did not return a vintage?

The data user in me squirms a bit at that thought because I'm very picky about knowing exactly what data I have and where it came from. Ultimately, though, I'm not sure you would avoid using a data point just because it is one day old, for example.

@TomGoBravo
Author

I have a hunch that https://github.com/covid-projections/covid-data-public/blob/master/scripts/update_cmdc.py#L94-L96 doesn't merge the rows with different vintages correctly. The current wide form returned by the endpoint is a table where there may be more than one cell for each <dt, fips, variable>, each with a different vintage. I agree that it is nice to have the source and vintage info but that is less important than easy access to the best value. In other words: yes, please remove the vintage as a key in the wide form. Perhaps return the long form if you'd like to include the vintage per cell.

@TomGoBravo
Author

The most picky data consumers only trust append-only logs. :-P

@sglyon
Member

sglyon commented Jun 17, 2020

@TomGoBravo this should be resolved now with the new endpoint covid_us. Please migrate your code to use that endpoint instead of covid.

There are two differences between the endpoints:

  1. covid_us does not have a vintage column -- the latest available data is returned
  2. The fips column from the covid endpoint is now named location in covid_us (in preparation for international data)
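For downstream code that still expects a `fips` column, a thin adapter may ease the migration. This is only a sketch: the frame below is toy data shaped according to the two differences listed above, the helper name `covid_us_to_legacy` is hypothetical, and nothing here calls the cmdc client itself.

```python
import pandas as pd


def covid_us_to_legacy(df: pd.DataFrame) -> pd.DataFrame:
    """Rename `location` back to `fips` so older code keeps working.

    Hypothetical helper; there is no vintage column to restore.
    """
    return df.rename(columns={"location": "fips"})


# Toy frame in the assumed covid_us shape: no vintage, `location` not `fips`.
new_style = pd.DataFrame({
    "dt": pd.to_datetime(["2020-05-15"]),
    "location": [6037],
    "positive_tests_total": [39019.0],
})

legacy = covid_us_to_legacy(new_style)
print(legacy.columns.tolist())
```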

Once the CAN team has had a chance to migrate, we'll remove the covid endpoint and close this issue as well as #16
