Add StetsonJ statistics to LSST science pipeline #271

Merged
hernandezc1 merged 23 commits into develop from u/ch/lsst/stetsonj on Jun 14, 2025

Conversation


hernandezc1 commented Jun 3, 2025

This PR introduces a new module, variability, to the broker's LSST science pipeline. The module calculates StetsonJ indexes for LSST alerts.

Summary of changes:

  • The lsst-lite stream triggers a Cloud Run service (lsst-variability) that calculates the StetsonJ indexes from the DIA point-source fluxes in each band. If no data is available for a particular band, the output is NaN.
  • Outgoing alerts are published to a Pub/Sub topic: lsst-variability.
    • Messages published to this topic are organized using two top-level keys: alert_lite and variability (see the sketch below).
    • The alert_lite value contains the original fields of the lite alert packet; the variability value contains the StetsonJ statistics for each band as well as the diaObjectId, diaSourceId, and the number of previous detections associated with the alert packet.
  • A BigQuery subscription attached to the lsst-variability topic writes the variability data to a BigQuery table: variability (dataset: lsst_value_added).
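
For concreteness, here is a minimal sketch (not the module's actual code) of what one message published to lsst-variability might look like. Every nested field name other than alert_lite, variability, diaObjectId, and diaSourceId is an illustrative assumption rather than the real schema:

example_message = {
    "alert_lite": {
        # original fields of the lite alert packet (contents abbreviated)
        "diaObject": {"diaObjectId": 111},
        "diaSource": {"diaSourceId": 222, "band": "r", "psfFlux": 1523.4},
    },
    "variability": {
        "diaObjectId": 111,
        "diaSourceId": 222,
        "n_previous_detections": 42,  # hypothetical field name
        # StetsonJ statistic per band; NaN where a band has no data
        "stetson_j": {"u": float("nan"), "g": 0.87, "r": 1.02,
                      "i": 0.95, "z": float("nan"), "y": float("nan")},
    },
}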

hernandezc1 self-assigned this on Jun 3, 2025
hernandezc1 added the Enhancement (New feature or request) and Pipeline: Science (Components producing science output) labels on Jun 3, 2025
hernandezc1 requested a review from troyraen on June 4, 2025, 14:14
hernandezc1 (Collaborator, Author) commented:

@troyraen this module is now up to date. Ready for your review.


troyraen (Collaborator) left a comment:


I know you're still making a few changes due to the schema and schema map issues. Just leaving a couple quick comments in the meantime.

TOPIC.publish(
    pittgoogle.Alert.from_dict(
        {**alert_lite.dict, "variability": _calculate_stetsonJ_statistics(alert_lite)},
        attributes={**alert_lite.attributes, "value_added": "variability"},
    )
)
troyraen (Collaborator) commented:


What's the reason for "value_added": "variability"? Would it be more helpful to put one or two of the actual results into the attributes?

Also, in the message attributes, I'm much more inclined to just put all the key/value pairs at the top level. My sense is that adding another level (like "value_added") will cause much more headache (when writing the stream filters) than it's worth. If we want to differentiate between original alert data and our value added stuff, maybe we can do that by adding a prefix to the keys of all our value-added attributes. Maybe just "pg_" to keep it simple? Something like:

attributes={**alert_lite.attributes, "pg_stetson_j": "<value>", "pg_stetson_mean": "<value>"},

What do you think?

hernandezc1 (Collaborator, Author) commented:


What's the reason for "value_added": "variability"?

@troyraen The intention is to have a single Pub/Sub topic that all the "value added" modules publish to. We can then set up BigQuery subscriptions to filter messages published to that topic by the type of value added (in this case, it lets users know that the value added was information pertaining to variability).

Would it be more helpful to put one or two of the actual results into the attributes?

Not for this module. The StetsonJ values are floats--it would be difficult to create a filter for these results.

What do you think?

I think this is a good idea. Given what I mentioned above, what do you think about:

attributes={**alert_lite.attributes, "pg_bq_table": "variability"}
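
To illustrate how a flat attribute like this would be used downstream, here is a hedged sketch (project, subscription, and topic names are placeholders, and the broker may well create its subscriptions another way) of creating a filtered subscription with the google-cloud-pubsub client:

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
topic_path = subscriber.topic_path("PROJECT_ID", "lsst-variability")
subscription_path = subscriber.subscription_path("PROJECT_ID", "variability-to-bigquery")

# Pub/Sub filters operate on the (string-valued) message attributes, which is
# why a small, categorical attribute is much easier to filter on than a raw float.
subscriber.create_subscription(
    request={
        "name": subscription_path,
        "topic": topic_path,
        "filter": 'attributes.pg_bq_table = "variability"',
    }
)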

troyraen (Collaborator) commented:


The intention is to have a single Pub/Sub topic that all the "value added" modules publish to.

I'm concerned about this. It seems both more confusing and more costly. What is the benefit?

The StetsonJ values are floats--it would be difficult to create a filter for these results.

I agree that we shouldn't put floats in the attributes without thinking it through first. Think about how to obtain some summary value(s) describing the results of this module that will be both feasible and informative enough for folks to filter on.

Binned floats might be really valuable to put in the attributes in some cases (e.g., for fluxes/magnitudes). Those will still be floats (in principle, though the actual types will of course be strings), but the binning will drastically reduce the total number of possible values it can take on, making it much more tractable for filtering. In other cases, we might want to reduce the values to a small number of categories. Here, that could be something like: key = 'pg_variable', value = one of ['likely', 'reasonably likely', 'unlikely'].
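
As a purely illustrative sketch of the binning idea (the key name, bin width, and example values below are hypothetical, not anything implemented in this PR):

import math

def binned(value: float, width: float = 0.5) -> str:
    # Coarsely bin a float and return it as a string suitable for a message attribute.
    if value is None or math.isnan(value):
        return "nan"
    return f"{width * round(value / width):.1f}"

binned(1.37)  # -> "1.5"; a filter like attributes.pg_stetson_j = "1.5" is now tractable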

hernandezc1 (Collaborator, Author) commented:


What is the benefit?

I believe this was something we talked about in the past few weeks, but after some thought, I no longer think it will be useful. And you're right, it will be unnecessarily costly. I've updated the description in this PR to reflect new changes.

I've also updated the new attribute that outgoing messages will contain. The thresholds in main.py that determine the value of this attribute were set after a conversation with one of my collaborators, who is an expert on variable stars. She is going to ask for additional input from the community.
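
For readers following along, the attribute logic is roughly of the following shape; the thresholds and category strings here are hypothetical placeholders, since the real values live in main.py and are still being refined with community input:

def pg_variable_attribute(stetson_j: float) -> dict:
    # placeholder thresholds -- NOT the values used in main.py
    if stetson_j > 1.0:
        category = "likely"
    elif stetson_j > 0.5:
        category = "reasonably likely"
    else:
        category = "unlikely"
    return {"pg_variable": category}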

hernandezc1 changed the title from "Add StetsonJ statistics" to "Add StetsonJ statistics to LSST science pipeline" on Jun 10, 2025
hernandezc1 requested a review from troyraen on June 10, 2025, 19:55

troyraen (Collaborator) left a comment:


Great!

Comment on lines +77 to +81
pittgoogle.Alert.from_dict(
    {"alert_lite": alert_lite.dict, "variability": stetsonj_stats},
    attributes={**alert_lite.attributes, **pg_variable},
    schema_name="default",
)
troyraen (Collaborator) commented:


Capturing a thought for myself for mwvgroup/pittgoogle-client#88. FYI @hernandezc1, nothing to do for this PR, but if you have thoughts/opinions about my thoughts/plans here, let me know.

I keep looking at these outgoing messages that we have to construct with Alert.from_dict() and wanting to make it possible to just update the incoming alert with the new results. But I've been concerned about provenance and the messiness of trying to handle all the possibilities, for example, Alert.dict falling out of sync with Alert.dataframe. Here's a thought that might solve this and also some schema map issues I've been needing to work out for the above PR. Basically, Alert should keep the unique output of each module separate, only combining them when that's what's really wanted for user convenience, and then probably only as a "view" (rough sketch after the list below).

  • Make a simple container (python class) to hold the new output produced by a module. It should have .dict and .attributes properties, at least.
  • Add a new property with the above type to Alert for each of our broker modules. Name it using the name of the stream that is produced by the module, minus the survey name.
  • Probably need to make it possible for the user to add a new "module" that is not one of our modules that Alert already knows about.
  • Alert.dict should probably take you straight to the original alert data (or alert-lite). To get the value-added data, one would do (eg) Alert.variability.dict.
  • Need to think about Alert.attributes. Currently it is settable (user can add dict items). I wonder if it should not be, but instead just gather and return the attributes from the modules' attribute properties (which are settable only if data for that module is not present in the incoming Alert.msg.data?).
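
Here is a rough, speculative sketch of the container idea; the names and details are placeholders for the pittgoogle-client issue, not settled API:

from dataclasses import dataclass, field

@dataclass
class ModuleOutput:
    # Holds the unique output of one broker module.
    dict: dict = field(default_factory=dict)        # the module's payload, e.g. the StetsonJ stats
    attributes: dict = field(default_factory=dict)  # the module's contribution to the message attributes

# Usage sketch: Alert would expose one such property per module, named after the
# stream the module produces (minus the survey name), e.g.
#   alert.variability.dict        -> {"stetson_j": {...}, ...}
#   alert.variability.attributes  -> {"pg_variable": "likely"}
#   alert.dict                    -> original (lite) alert data only
# while Alert.attributes could become a read-only view merging the modules' attributes.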

hernandezc1 (Collaborator, Author) commented:


@troyraen This is really well thought out and I like the approach! I've added some comments/questions below:


Probably need to make it possible for the user to add a new "module" that is not one of our modules that Alert already knows above.

Yes, this will be important.

Need to think about Alert.attributes. Currently it is settable (user can add dict items). I wonder if it should not be, but instead just gather and return the attributes from the modules' attribute properties (which are settable only if data for that module is not present in the incoming Alert.msg.data?).

Is there a reason the user shouldn't be able to set them?


  • What are your thoughts on using nested-pandas (developed by the lincc-frameworks team) to create a combined dataframe object (for alert.dataframe) where the nested data is the value added information by Pitt-Google's modules?

troyraen (Collaborator) commented:


Is there a reason the user shouldn't be able to set them?

If we make Alert.attributes return the collection of all modules' attributes, then I think it should not be settable (it should just be a view, with no actual data of its own). Maybe the modules' attributes can always be settable, though. But I'm not quite sure about making Alert.attributes return a view into all modules while Alert.dict does not -- I'm currently thinking that the latter should return only the consumer or lite module's data, i.e., the survey's data. Maybe that inconsistency will make for confusing usage? I'll see more clearly once I start actually updating the code.

What are your thoughts on using nested-pandas (developed by the lincc-frameworks team) to create a combined dataframe object (for alert.dataframe) where the nested data is the value added information by Pitt-Google's modules?

nested-pandas is interesting. I've just begun to try it on a couple different use cases in the past week, so I definitely don't know the ins and outs yet. But I don't quite see how it would work for this case. Our Alert.dataframe has one row per light-curve point (sourceId, or similar) but our value-added data is (so far) associated with the alert as a whole rather than individual points, so it's not clear which row(s) the value-added info should belong to. Lmk if you have a specific use case in mind, now or in the future. That really helps when thinking through the details.

I can see us using nested-pandas for something like Object dataframes (probably independent of Alert, but maybe part of pittgoogle.bigquery?). I can see a dataframe with one row per object and sets of nested columns for things like light-curve data and value-added data.
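
As a sketch of that last idea (written from memory of the nested-pandas docs, so treat the exact calls, and all column names, as assumptions):

import pandas as pd
from nested_pandas import NestedFrame

# one row per object
objects = NestedFrame(
    {"ra": [150.1, 150.2], "dec": [2.1, 2.2]},
    index=pd.Index([111, 222], name="diaObjectId"),
)

# many rows per object, sharing the diaObjectId index
lightcurves = pd.DataFrame(
    {"mjd": [60100.1, 60101.1, 60100.2], "psfFlux": [1500.0, 1480.0, 900.0], "band": ["r", "r", "g"]},
    index=pd.Index([111, 111, 222], name="diaObjectId"),
)
variability = pd.DataFrame(
    {"stetson_j_r": [1.02, float("nan")]},
    index=pd.Index([111, 222], name="diaObjectId"),
)

# pack the per-object data into nested columns on the object table
objects = objects.add_nested(lightcurves, "lightcurve").add_nested(variability, "variability")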

hernandezc1 merged commit 399e5f8 into develop on Jun 14, 2025
4 checks passed
hernandezc1 deleted the u/ch/lsst/stetsonj branch on June 14, 2025, 15:12