Add StetsonJ statistics to LSST science pipeline#271
Conversation
|
@troyraen this module is now up to date. Ready for your review. |
troyraen
left a comment
There was a problem hiding this comment.
I know you're still making a few changes due to the schema and schema map issues. Just leaving a couple quick comments in the meantime.
| TOPIC.publish( | ||
| pittgoogle.Alert.from_dict( | ||
| {**alert_lite.dict, "variability": _calculate_stetsonJ_statistics(alert_lite)}, | ||
| attributes={**alert_lite.attributes, "value_added": "variability"}, |
There was a problem hiding this comment.
What's the reason for "value_added": "variability"? Would it be more helpful to put one or two of the actual results into the attributes?
Also, in the message attributes, I'm much more inclined to just put all the key/value pairs at the top level. My sense is that adding another level (like "value_added") will cause much more headache (when writing the stream filters) than it's worth. If we want to differentiate between original alert data and our value added stuff, maybe we can do that by adding a prefix to the keys of all our value-added attributes. Maybe just "pg_" to keep it simple? Something like:
attributes={**alert_lite.attributes, "pg_stetson_j": "<value>", "pg_stetson_mean": "<value>"},What do you think?
There was a problem hiding this comment.
What's the reason for
"value_added": "variability"?
@troyraen The intention is to have a single Pub/Sub topic that all the "value added" modules publish to. We can then set up BigQuery subscriptions to filter messages published to that topic by the type of value added (in this case, it lets users know that the value added was information pertaining to variability).
Would it be more helpful to put one or two of the actual results into the attributes?
Not for this module. The StetsonJ values are floats--it would be difficult to create a filter for these results.
What do you think?
I think this is a good idea. Given what I mentioned above, what do you think about:
attributes={**alert_lite.attributes, "pg_bq_table": "variability"}
There was a problem hiding this comment.
The intention is to have a single Pub/Sub topic that all the "value added" modules publish to.
I'm concerned about this. It seems both more confusing and more costly. What is the benefit?
The StetsonJ values are floats--it would be difficult to create a filter for these results.
I agree that we shouldn't put floats in the attributes without thinking it through first. Think about how to obtain some summary value(s) describing the results of this module that will be both feasible and informative enough for folks to filter on.
Binned floats might be really valuable to put in the attributes in some cases (e.g., for fluxes/magnitudes). Those will still be floats (in principle, though the actual types will of course be strings), but the binning will drastically reduce the total number of possible values it can take on, making it much more tractable for filtering. In other cases, we might want to reduce the values to a small number of categories. Here, that could be something like: key = 'pg_variable', value = one of ['likely', 'reasonably likely', 'unlikely'].
There was a problem hiding this comment.
What is the benefit?
I believe this was something we talked about in the past few weeks, but after some thought, I no longer think it will be useful. And you're right, it will be unnecessarily costly. I've updated the description in this PR to reflect new changes.
I've also updated the new attribute that outgoing messages will contain. The thresholds set in main.py to determine the value of this attribute was set after having a conversation with one of my collaborators who is an expert on variable stars. She is going to ask for additional input from the community.
Co-authored-by: Troy Raen <troy.raen@pitt.edu>
| pittgoogle.Alert.from_dict( | ||
| {"alert_lite": alert_lite.dict, "variability": stetsonj_stats}, | ||
| attributes={**alert_lite.attributes, **pg_variable}, | ||
| schema_name="default", | ||
| ) |
There was a problem hiding this comment.
Capturing a thought for myself for mwvgroup/pittgoogle-client#88. FYI @hernandezc1 , nothing to do for this PR, but if you have thoughts/opinions about my thoughts/plans here, let me know.
I keep looking at these outgoing messages that we have to construct with Alert.from_dict() and wanting to make it possible to just update the incoming alert with the new results. But I've been concerned about provenance and the messiness of trying to handle all the possibilities, for example, Alert.dict falling out of sync with Alert.dataframe. Here's a thought that might solve this and also some schema map issues I've been needing to work out for the above pr. Basically, Alert should keep the unique output of each module separate, only combing them when that's what's really wanted for user convenience, and then probably only as a "view".
- Make a simple container (python class) to hold the new output produced by a module. It should have
.dictand.attributesproperties, at least. - Add a new property with the above type to
Alertfor each of our broker modules. Name it using the name of the stream that is produced by the module, minus the survey name. - Probably need to make it possible for the user to add a new "module" that is not one of our modules that
Alertalready knows above. Alert.dictshould probably take you straight to the original alert data (or alert-lite). To get the value-added data, one would do (eg)Alert.variability.dict.- Need to think about
Alert.attributes. Currently it is settable (user can add dict items). I wonder if it should not be, but instead just gather and return the attributes from the modules'attributeproperties (which are settable only if data for that module is not present in the incomingAlert.msg.data?).
There was a problem hiding this comment.
@troyraen This is really well thought out and I like the approach! I've added some comments/questions below:
Probably need to make it possible for the user to add a new "module" that is not one of our modules that Alert already knows above.
Yes, this will be important.
Need to think about Alert.attributes. Currently it is settable (user can add dict items). I wonder if it should not be, but instead just gather and return the attributes from the modules' attribute properties (which are settable only if data for that module is not present in the incoming Alert.msg.data?).
Is there a reason the user shouldn't be able to set them?
- What are your thoughts on using nested-pandas (developed by the lincc-frameworks team) to create a combined dataframe object (for
alert.dataframe) where the nested data is the value added information by Pitt-Google's modules?
There was a problem hiding this comment.
Is there a reason the user shouldn't be able to set them?
If we make Alert.attributes return the collection of all modules' attributes, then I think it should not be settable (it should just be a view, with no actual data of its own). Maybe the modules' attributes can always be settable though. But, I'm not quite sure about making Alert.attributes return a view into all modules while Alert.dict does not -- currently thinking that the latter should return only the consumer or lite module's data, ie the survey's data. Maybe that inconsistency will make for confusing usage? Will see more clearly once I start actually updating code.
What are your thoughts on using nested-pandas (developed by the lincc-frameworks team) to create a combined dataframe object (for alert.dataframe) where the nested data is the value added information by Pitt-Google's modules?
nested-pandas is interesting. I've just begun to try it on a couple different use cases in the past week, so I definitely don't know the ins and outs yet. But I don't quite see how it would work for this case. Our Alert.dataframe has one row per light-curve point (sourceId, or similar) but our value-added data is (so far) associated with the alert as a whole rather than individual points, so it's not clear which row(s) the value-added info should belong to. Lmk if you have a specific use case in mind, now or in the future. That really helps when thinking through the details.
I can see us using nested-pandas for something like Object dataframes (probably independent of Alert, but maybe part of pittgoogle.bigquery?). I can see a dataframe with one row per object and sets of nested columns for things like light-curve data and value-added data.
This PR introduces a new module, variability, to the broker's LSST science pipeline that calculates StetsonJ indexes for LSST alerts.
Summary of changes:
lsst-litestream triggers a Cloud Run service (lsst-variability) that calculates the StetsonJ indexes on the DIA point source fluxes for each band. If no data is available for a particular band, the output isNaN.lsst-variability.alert_liteandvariability.alert_litevalue contains the original fields of the lite alert packet; thevariabilityvalue contains the StetsonJ statistics for each band as well as the diaObjectId, diaSourceId, and number of previous detections associated with the alert packet.lsst-variabilitytopic writes thevariabilitydata to a BigQuery table:variability(dataset:lsst_value_added).