query for newest vintage of each variable #2

TomGoBravo · 2020-06-02T21:32:18Z

Thank you for gathering and sharing this data in an easy-to-use form.

Taking a look at:

import cmdc

c = cmdc.Client()
c.covid()  # request the covid dataset
df = c.fetch()
df.groupby(['fips', 'vintage']).size()  # row count per (fips, vintage)

I see that different fips have different vintages of data available. For example, in the data I fetched yesterday, the state fips and CA/06 counties have 4 vintages (2020-05-29 to 2020-06-01), but AK/02 counties have only 2020-05-30. I get that it is handy to keep past versions of the data for debugging, but I suspect most clients only want the latest version of each timeseries.

I'm guessing you expect data consumers to group by ['fips', 'dt'], sort by 'vintage' then keep only the newest row. How does df.sort_values('vintage').groupby(['fips', 'dt']).last() look to you? Can you make fetching old vintages optional?
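
A minimal sketch of that dedup on a hand-made wide-form frame. The values and the `cases_total` column name are invented for illustration and are not taken from the CMDC data; only the `fips`, `dt`, and `vintage` column names come from the thread.

```python
import pandas as pd

# Toy wide-form frame; the values and `cases_total` are made up.
df = pd.DataFrame({
    "vintage": pd.to_datetime(["2020-05-31", "2020-06-01", "2020-05-30"]),
    "dt": pd.to_datetime(["2020-05-29", "2020-05-29", "2020-05-29"]),
    "fips": [6, 6, 2],
    "cases_total": [100.0, 110.0, 5.0],
})

# Sort so the newest vintage comes last in each (fips, dt) group, then keep it.
latest = df.sort_values("vintage").groupby(["fips", "dt"]).last()
print(latest)
```

One thing to keep in mind: pandas `GroupBy.last()` skips nulls by default, so each column gets its last non-null value, which may come from different vintages within the same group.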


cc7768 · 2020-06-03T02:58:10Z

Yep. I think this is a good idea -- You're right that most people won't care about the vintages.

sglyon · 2020-06-03T12:19:21Z

I agree! I think we should probably rename the current endpoints to covid_history or covid_historical and then have the covid endpoint return only the most recent data.

sglyon · 2020-06-03T13:33:03Z

This is handled in valorumdata/cmdc-tools#17

sglyon · 2020-06-03T15:21:19Z

Live!

TomGoBravo · 2020-06-03T16:55:19Z

Thank you for quickly rolling out this improvement :-)

TomGoBravo · 2020-06-03T19:22:48Z

On closer inspection this didn't quite work out as I expected. I installed https://github.com/valorumdata/cmdc.py/archive/ba6f0c56f84049f3ec4b016d08b9975b74531965.zip and then did this:

c = cmdc.Client()
c.covid()
df = c.fetch()

df.sort_values('vintage').groupby(['fips', 'dt']).size().loc[lambda df: df > 1]
fips  dt        
6037  2020-04-01    2
      2020-04-02    2
      2020-04-03    2
      2020-04-04    2
      2020-04-05    2
                   ..
      2020-05-27    2
      2020-05-28    2
      2020-05-29    2
      2020-05-30    2
      2020-05-31    2
Length: 61, dtype: int64

I was expecting each <fips, dt> pair to appear at most once, but fips=6037 appears twice for each of 61 dates. Looking at one day only I see:

>>> df.loc[(df.fips==6037) & (df.dt == "2020-05-15")]
variable    vintage         dt  fips  active_total  ...  negative_tests_total  positive_tests_total  recovered_total  ventilators_in_use_covid_total
2257     2020-06-02 2020-05-15  6037           NaN  ...              405323.0               39019.0              NaN                             NaN
8531     2020-06-03 2020-05-15  6037           NaN  ...                   NaN                   NaN              NaN                             NaN

sglyon · 2020-06-04T12:04:31Z

Hey @TomGoBravo, I think the problem here is that some variables come from different sources. The sources may differ in when their data becomes available, so the vintage can differ from one variable to the next.

If you run your example right now you should see that for each variable only one of the rows has a non-null entry.

This happens when we pivot the data from long form (vintage, dt, fips, variable, value) to wide form (each variable becomes its own column).
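
A small sketch of how that pivot can produce the half-empty rows shown above. The frame is hand-made: `positive_tests_total` and its value come from the output earlier in the thread, while `deaths_total` and its value are invented, and this is not the actual server-side pivot code.

```python
import pandas as pd

# Toy long-form rows: two variables for the same (dt, fips), different vintages.
long_df = pd.DataFrame({
    "vintage": pd.to_datetime(["2020-06-02", "2020-06-03"]),
    "dt": pd.to_datetime(["2020-05-15", "2020-05-15"]),
    "fips": [6037, 6037],
    "variable": ["positive_tests_total", "deaths_total"],  # second name is invented
    "value": [39019.0, 1700.0],
})

# Keeping vintage in the index yields one wide row per vintage, padded with NaN.
wide = long_df.pivot_table(
    index=["vintage", "dt", "fips"], columns="variable", values="value"
).reset_index()
print(wide)
```

Each variable is non-null in exactly one of the two rows, because each row only holds the variables that share its vintage.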

I don't know right now how to resolve this, other than dropping the vintage column. Given that we state this is "most recent covid data" we have, perhaps that is the right answer? I'm not totally satisfied with that answer.

Any ideas @TomGoBravo and @cc7768 ?

TomGoBravo · 2020-06-04T16:20:45Z

Agreed, this is a difficult question of normalization and of the granularity at which updates are bundled :-) I think it makes sense to make one row for each <fips, dt> that appears in your data, where that row contains the latest value for each variable, then drop the vintage column because the row may contain a mix of vintage values. Or maybe include vintage when all are equal, otherwise make it NaN. Or represent vintage as two columns with the oldest and newest vintage.
For CovidActNow I don't think vintage is as much a concern as keeping immutable snapshots of what our pipeline has ingested so anybody can reproduce our results. Long term I think a model where each source produces an append-only log that represents a gradually evolving snapshot makes a lot of sense.
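
One way the "latest value per variable" row could be built from the long form, sketched on invented data. The column names `vintage_oldest` and `vintage_newest` are hypothetical, as are `deaths_total` and the 39100.0 and 1700.0 values; this is an illustration of the idea, not the CMDC implementation.

```python
import pandas as pd

long_df = pd.DataFrame({
    "vintage": pd.to_datetime(["2020-06-02", "2020-06-03", "2020-06-02"]),
    "dt": pd.to_datetime(["2020-05-15"] * 3),
    "fips": [6037, 6037, 6037],
    "variable": ["positive_tests_total", "positive_tests_total", "deaths_total"],
    "value": [39019.0, 39100.0, 1700.0],
})

keys = ["fips", "dt", "variable"]
# Keep only the newest vintage of each (fips, dt, variable) cell.
newest = long_df.sort_values("vintage").drop_duplicates(keys, keep="last")

# One row per (fips, dt); vintage is no longer part of the key.
wide = newest.pivot(index=["fips", "dt"], columns="variable", values="value")

# Optional: record the oldest and newest vintage that contributed to each row.
span = newest.groupby(["fips", "dt"])["vintage"].agg(
    vintage_oldest="min", vintage_newest="max"
)
result = wide.join(span).reset_index()
print(result)
```

When `vintage_oldest` equals `vintage_newest` the row comes from a single vintage, which covers the "include vintage when all are equal" variant as well.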

sglyon · 2020-06-11T14:44:31Z

@TomGoBravo would it work for you if the covid endpoint did not return a vintage?

The data user in me squirms a bit at that thought because I'm very picky about knowing exactly what data I have and where it came from. Ultimately, though, I'm not sure you would avoid using a data point just because it is, say, one day old.

TomGoBravo · 2020-06-11T17:13:47Z

I have a hunch that https://github.com/covid-projections/covid-data-public/blob/master/scripts/update_cmdc.py#L94-L96 doesn't merge the rows with different vintages correctly. The current wide form returned by the endpoint is a table where there may be more than one cell for each <dt, fips, variable>, each with a different vintage. I agree that it is nice to have the source and vintage info but that is less important than easy access to the best value. In other words: yes, please remove the vintage as a key in the wide form. Perhaps return the long form if you'd like to include the vintage per cell.
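
For the "vintage per cell" option, a rough sketch of turning a wide frame back into long form with pandas `melt`. The frame is hand-made, and `deaths_total` and its value are invented; it only illustrates the shape, not how the endpoint would do it.

```python
import numpy as np
import pandas as pd

# Toy wide-form frame with the vintage-split rows described above.
wide = pd.DataFrame({
    "vintage": pd.to_datetime(["2020-06-02", "2020-06-03"]),
    "dt": pd.to_datetime(["2020-05-15", "2020-05-15"]),
    "fips": [6037, 6037],
    "positive_tests_total": [39019.0, np.nan],
    "deaths_total": [np.nan, 1700.0],  # invented variable and value
})

# Unpivot, then drop the NaN padding so every remaining row is one real cell
# that carries its own vintage.
long_df = (
    wide.melt(id_vars=["vintage", "dt", "fips"], var_name="variable", value_name="value")
    .dropna(subset=["value"])
    .reset_index(drop=True)
)
print(long_df)
```

In this form there is at most one row per <dt, fips, variable> for the toy data, and the vintage stays attached to the value it describes.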

TomGoBravo · 2020-06-11T17:15:08Z

The most picky data consumers only trust append-only logs. :-P

sglyon · 2020-06-17T14:59:26Z

@TomGoBravo this should be resolved now with the new endpoint covid_us. Please migrate your code to use that endpoint instead of covid.

There are two differences between the endpoints:

- covid_us does not have a vintage column -- the latest available data is returned
- The fips column from the covid endpoint is now named location in covid_us (in preparation for international data)

Once the CAN team has had a chance to migrate, we'll remove the covid endpoint and close this issue as well as #16
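
For downstream code that still expects the old column names, a small compatibility shim might look like the sketch below. The helper name `to_legacy_columns` is hypothetical, the frame is a hand-made stand-in for covid_us output, and nothing here calls the cmdc client, since the thread does not show the client call for the new endpoint.

```python
import pandas as pd

def to_legacy_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Rename covid_us-style columns to the names older code expects."""
    out = df.rename(columns={"location": "fips"})
    # covid_us has no vintage column; drop one defensively if it is present.
    return out.drop(columns=["vintage"], errors="ignore")

# Stand-in for a covid_us result; the value column is invented.
covid_us_df = pd.DataFrame({
    "dt": pd.to_datetime(["2020-05-15"]),
    "location": [6037],
    "positive_tests_total": [39019.0],
})
legacy = to_legacy_columns(covid_us_df)
print(legacy.columns.tolist())
```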

cc7768 self-assigned this Jun 3, 2020

sglyon closed this as completed Jun 3, 2020

cc7768 reopened this Jun 3, 2020

sglyon mentioned this issue Jun 17, 2020

Change examples to use location instead of fips #16

Open
