Add StetsonJ statistics to LSST science pipeline #271

Merged
hernandezc1 merged 23 commits into develop from u/ch/lsst/stetsonj on Jun 14, 2025

Conversation


hernandezc1 commented Jun 3, 2025

This PR introduces a new module, variability, to the broker's LSST science pipeline. The module calculates StetsonJ indexes for LSST alerts.

Summary of changes:

  • The lsst-lite stream triggers a Cloud Run service (lsst-variability) that calculates the StetsonJ indexes from the DIA point-source fluxes in each band. If no data is available for a particular band, the output is NaN.
  • Outgoing alerts are published to a Pub/Sub topic: lsst-variability.
    • Messages published to this topic are organized using two top-level keys: alert_lite and variability (see the sketch below).
    • The alert_lite value contains the original fields of the lite alert packet; the variability value contains the StetsonJ statistics for each band as well as the diaObjectId, diaSourceId, and the number of previous detections associated with the alert packet.
  • A BigQuery subscription attached to the lsst-variability topic writes the variability data to a BigQuery table: variability (dataset: lsst_value_added).
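
For concreteness, here is a minimal sketch (not the module's actual code) of what one message published to lsst-variability might look like. Every nested field name other than alert_lite, variability, diaObjectId, and diaSourceId is an illustrative assumption rather than the real schema:

example_message = {
    "alert_lite": {
        # original fields of the lite alert packet (contents abbreviated)
        "diaObject": {"diaObjectId": 111},
        "diaSource": {"diaSourceId": 222, "band": "r", "psfFlux": 1523.4},
    },
    "variability": {
        "diaObjectId": 111,
        "diaSourceId": 222,
        "n_previous_detections": 42,  # hypothetical field name
        # StetsonJ statistic per band; NaN where a band has no data
        "stetson_j": {"u": float("nan"), "g": 0.87, "r": 1.02,
                      "i": 0.95, "z": float("nan"), "y": float("nan")},
    },
}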

hernandezc1 self-assigned this on Jun 3, 2025
hernandezc1 added the Enhancement (New feature or request) and Pipeline: Science (Components producing science output) labels on Jun 3, 2025
hernandezc1 requested a review from troyraen on June 4, 2025, 14:14
hernandezc1 (Collaborator, Author) commented:

@troyraen this module is now up to date. Ready for your review.


troyraen (Collaborator) left a comment:


I know you're still making a few changes due to the schema and schema map issues. Just leaving a couple quick comments in the meantime.

TOPIC.publish(
    pittgoogle.Alert.from_dict(
        {**alert_lite.dict, "variability": _calculate_stetsonJ_statistics(alert_lite)},
        attributes={**alert_lite.attributes, "value_added": "variability"},
    )
)
troyraen (Collaborator) commented:


What's the reason for "value_added": "variability"? Would it be more helpful to put one or two of the actual results into the attributes?

Also, in the message attributes, I'm much more inclined to just put all the key/value pairs at the top level. My sense is that adding another level (like "value_added") will cause much more headache (when writing the stream filters) than it's worth. If we want to differentiate between original alert data and our value added stuff, maybe we can do that by adding a prefix to the keys of all our value-added attributes. Maybe just "pg_" to keep it simple? Something like:

attributes={**alert_lite.attributes, "pg_stetson_j": "<value>", "pg_stetson_mean": "<value>"},

What do you think?

hernandezc1 (Collaborator, Author) commented:


What's the reason for "value_added": "variability"?

@troyraen The intention is to have a single Pub/Sub topic that all the "value added" modules publish to. We can then set up BigQuery subscriptions to filter messages published to that topic by the type of value added (in this case, it lets users know that the value added was information pertaining to variability).

Would it be more helpful to put one or two of the actual results into the attributes?

Not for this module. The StetsonJ values are floats--it would be difficult to create a filter for these results.

What do you think?

I think this is a good idea. Given what I mentioned above, what do you think about:

attributes={**alert_lite.attributes, "pg_bq_table": "variability"}
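
To illustrate how a flat attribute like this would be used downstream, here is a hedged sketch (project, subscription, and topic names are placeholders, and the broker may well create its subscriptions another way) of creating a filtered subscription with the google-cloud-pubsub client:

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
topic_path = subscriber.topic_path("PROJECT_ID", "lsst-variability")
subscription_path = subscriber.subscription_path("PROJECT_ID", "variability-to-bigquery")

# Pub/Sub filters operate on the (string-valued) message attributes, which is
# why a small, categorical attribute is much easier to filter on than a raw float.
subscriber.create_subscription(
    request={
        "name": subscription_path,
        "topic": topic_path,
        "filter": 'attributes.pg_bq_table = "variability"',
    }
)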

troyraen (Collaborator) commented:


The intention is to have a single Pub/Sub topic that all the "value added" modules publish to.

I'm concerned about this. It seems both more confusing and more costly. What is the benefit?

The StetsonJ values are floats--it would be difficult to create a filter for these results.

I agree that we shouldn't put floats in the attributes without thinking it through first. Think about how to obtain some summary value(s) describing the results of this module that will be both feasible and informative enough for folks to filter on.

Binned floats might be really valuable to put in the attributes in some cases (e.g., for fluxes/magnitudes). Those will still be floats (in principle, though the actual types will of course be strings), but the binning will drastically reduce the total number of possible values it can take on, making it much more tractable for filtering. In other cases, we might want to reduce the values to a small number of categories. Here, that could be something like: key = 'pg_variable', value = one of ['likely', 'reasonably likely', 'unlikely'].
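
As a purely illustrative sketch of the binning idea (the key name, bin width, and example values below are hypothetical, not anything implemented in this PR):

import math

def binned(value: float, width: float = 0.5) -> str:
    # Coarsely bin a float and return it as a string suitable for a message attribute.
    if value is None or math.isnan(value):
        return "nan"
    return f"{width * round(value / width):.1f}"

binned(1.37)  # -> "1.5"; a filter like attributes.pg_stetson_j = "1.5" is now tractable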

hernandezc1 (Collaborator, Author) commented:


What is the benefit?

I believe this was something we talked about in the past few weeks, but after some thought, I no longer think it will be useful. And you're right, it will be unnecessarily costly. I've updated the description in this PR to reflect new changes.

I've also updated the new attribute that outgoing messages will contain. The thresholds in main.py that determine the value of this attribute were set after a conversation with one of my collaborators, who is an expert on variable stars. She is going to ask for additional input from the community.
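
For readers following along, the attribute logic is roughly of the following shape; the thresholds and category strings here are hypothetical placeholders, since the real values live in main.py and are still being refined with community input:

def pg_variable_attribute(stetson_j: float) -> dict:
    # placeholder thresholds -- NOT the values used in main.py
    if stetson_j > 1.0:
        category = "likely"
    elif stetson_j > 0.5:
        category = "reasonably likely"
    else:
        category = "unlikely"
    return {"pg_variable": category}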

hernandezc1 changed the title from "Add StetsonJ statistics" to "Add StetsonJ statistics to LSST science pipeline" on Jun 10, 2025
hernandezc1 requested a review from troyraen on June 10, 2025, 19:55

troyraen (Collaborator) left a comment:


Great!

Comment on lines +77 to +81
pittgoogle.Alert.from_dict(
    {"alert_lite": alert_lite.dict, "variability": stetsonj_stats},
    attributes={**alert_lite.attributes, **pg_variable},
    schema_name="default",
)
troyraen (Collaborator) commented:


Capturing a thought for myself for mwvgroup/pittgoogle-client#88. FYI @hernandezc1, nothing to do for this PR, but if you have thoughts/opinions about my thoughts/plans here, let me know.

I keep looking at these outgoing messages that we have to construct with Alert.from_dict() and wanting to make it possible to just update the incoming alert with the new results. But I've been concerned about provenance and the messiness of trying to handle all the possibilities, for example, Alert.dict falling out of sync with Alert.dataframe. Here's a thought that might solve this and also some schema map issues I've been needing to work out for the above PR. Basically, Alert should keep the unique output of each module separate, only combining them when that's what's really wanted for user convenience, and then probably only as a "view" (rough sketch after the list below).

  • Make a simple container (python class) to hold the new output produced by a module. It should have .dict and .attributes properties, at least.
  • Add a new property with the above type to Alert for each of our broker modules. Name it using the name of the stream that is produced by the module, minus the survey name.
  • Probably need to make it possible for the user to add a new "module" that is not one of our modules that Alert already knows about.
  • Alert.dict should probably take you straight to the original alert data (or alert-lite). To get the value-added data, one would do (eg) Alert.variability.dict.
  • Need to think about Alert.attributes. Currently it is settable (user can add dict items). I wonder if it should not be, but instead just gather and return the attributes from the modules' attribute properties (which are settable only if data for that module is not present in the incoming Alert.msg.data?).
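
Here is a rough, speculative sketch of the container idea; the names and details are placeholders for the pittgoogle-client issue, not settled API:

from dataclasses import dataclass, field

@dataclass
class ModuleOutput:
    # Holds the unique output of one broker module.
    dict: dict = field(default_factory=dict)        # the module's payload, e.g. the StetsonJ stats
    attributes: dict = field(default_factory=dict)  # the module's contribution to the message attributes

# Usage sketch: Alert would expose one such property per module, named after the
# stream the module produces (minus the survey name), e.g.
#   alert.variability.dict        -> {"stetson_j": {...}, ...}
#   alert.variability.attributes  -> {"pg_variable": "likely"}
#   alert.dict                    -> original (lite) alert data only
# while Alert.attributes could become a read-only view merging the modules' attributes.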

hernandezc1 (Collaborator, Author) commented:


@troyraen This is really well thought out and I like the approach! I've added some comments/questions below:


Probably need to make it possible for the user to add a new "module" that is not one of our modules that Alert already knows above.

Yes, this will be important.

Need to think about Alert.attributes. Currently it is settable (user can add dict items). I wonder if it should not be, but instead just gather and return the attributes from the modules' attribute properties (which are settable only if data for that module is not present in the incoming Alert.msg.data?).

Is there a reason the user shouldn't be able to set them?


  • What are your thoughts on using nested-pandas (developed by the lincc-frameworks team) to create a combined dataframe object (for alert.dataframe) where the nested data is the value added information by Pitt-Google's modules?

troyraen (Collaborator) commented:


Is there a reason the user shouldn't be able to set them?

If we make Alert.attributes return the collection of all modules' attributes, then I think it should not be settable (it should just be a view, with no actual data of its own). Maybe the modules' attributes can always be settable, though. But I'm not quite sure about making Alert.attributes return a view into all modules while Alert.dict does not -- I'm currently thinking that the latter should return only the consumer or lite module's data, i.e., the survey's data. Maybe that inconsistency will make for confusing usage? I'll see more clearly once I start actually updating the code.

What are your thoughts on using nested-pandas (developed by the lincc-frameworks team) to create a combined dataframe object (for alert.dataframe) where the nested data is the value added information by Pitt-Google's modules?

nested-pandas is interesting. I've just begun to try it on a couple different use cases in the past week, so I definitely don't know the ins and outs yet. But I don't quite see how it would work for this case. Our Alert.dataframe has one row per light-curve point (sourceId, or similar) but our value-added data is (so far) associated with the alert as a whole rather than individual points, so it's not clear which row(s) the value-added info should belong to. Lmk if you have a specific use case in mind, now or in the future. That really helps when thinking through the details.

I can see us using nested-pandas for something like Object dataframes (probably independent of Alert, but maybe part of pittgoogle.bigquery?). I can see a dataframe with one row per object and sets of nested columns for things like light-curve data and value-added data.
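
As a sketch of that last idea (written from memory of the nested-pandas docs, so treat the exact calls, and all column names, as assumptions):

import pandas as pd
from nested_pandas import NestedFrame

# one row per object
objects = NestedFrame(
    {"ra": [150.1, 150.2], "dec": [2.1, 2.2]},
    index=pd.Index([111, 222], name="diaObjectId"),
)

# many rows per object, sharing the diaObjectId index
lightcurves = pd.DataFrame(
    {"mjd": [60100.1, 60101.1, 60100.2], "psfFlux": [1500.0, 1480.0, 900.0], "band": ["r", "r", "g"]},
    index=pd.Index([111, 111, 222], name="diaObjectId"),
)
variability = pd.DataFrame(
    {"stetson_j_r": [1.02, float("nan")]},
    index=pd.Index([111, 222], name="diaObjectId"),
)

# pack the per-object data into nested columns on the object table
objects = objects.add_nested(lightcurves, "lightcurve").add_nested(variability, "variability")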

hernandezc1 merged commit 399e5f8 into develop on Jun 14, 2025
4 checks passed
hernandezc1 deleted the u/ch/lsst/stetsonj branch on June 14, 2025, 15:12