Skip to content

Conversation

@juaristi22
Copy link
Collaborator

@juaristi22 juaristi22 commented Jul 29, 2025

Fix #20
Fix #13
Fix #14
Fix #18
Fix #1

This matrix creation will now integrate geography hierarchical information as long as the database contains ucgid_str as its variable and "in" as the constraint operation @baogorek

@codecov
Copy link

codecov bot commented Jul 29, 2025

Codecov Report

❌ Patch coverage is 85.60606% with 19 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@37690e7). Learn more about missing BASE report.

Files with missing lines Patch % Lines
...engine_data/calibration/metrics_matrix_creation.py 85.49% 19 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main      #17   +/-   ##
=======================================
  Coverage        ?   42.85%           
=======================================
  Files           ?       10           
  Lines           ?      679           
  Branches        ?        0           
=======================================
  Hits            ?      291           
  Misses          ?      388           
  Partials        ?        0           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@juaristi22 juaristi22 requested a review from baogorek July 29, 2025 17:07
Copy link
Collaborator

@baogorek baogorek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I won't call this a review yet, just some comments. The code feels very performant, which is great. I didn't fully go through all the logic. Some of my comments from the scaling PR will apply here too, so perhaps it would make sense to handle that one first and then return.

@juaristi22 juaristi22 requested a review from baogorek July 30, 2025 11:55
Copy link
Collaborator

@baogorek baogorek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm really excited to be in this repo. Great stuff here, especially the functions for handling the database constraints.

As I'm trying to understand the code, I'm looking for one end-to-end example that incorporates the state stacking and uses it. I'm in the Jupyter notebook and perhaps getting close with:

sim = Microsimulation(dataset = national_level_calibrated_dataset)
np.unique(sim.calculate('ucgid_str').values, return_counts=True)

But I still see that California has more samples, and I thought that state stacking meant we'd sample the same amount from every state.

Might want to turn down the logger in the python notebook:

import logging

# Option 1: just lower that logger
logger = logging.getLogger("microcalibrate.calibration")
logger.setLevel(logging.ERROR)

Also, I think this was triggered by running the notebook code. Any idea what happened?

baogorek@T14-Gen6:~/devl/policyengine-data$ ls                                                                                                            
 Alabama_calibration.csv                        Iowa_calibration_sparse.csv             'North Dakota_calibration_sparse.csv'                             
 Alabama_calibration_sparse.csv                 Kansas_calibration.csv                   Ohio_calibration.csv                                             
 Alaska_calibration.csv                         Kansas_calibration_sparse.csv            Ohio_calibration_sparse.csv                                      
 Alaska_calibration_sparse.csv                  Kentucky_calibration.csv                 Oklahoma_calibration.csv                                         
 Arizona_calibration.csv                        Kentucky_calibration_sparse.csv          Oklahoma_calibration_sparse.csv                                  
 Arizona_calibration_sparse.csv                 LICENSE                                  Oregon_calibration.csv                                           
 Arkansas_calibration.csv                       Louisiana_calibration.csv                Oregon_calibration_sparse.csv                                    
 Arkansas_calibration_sparse.csv                Louisiana_calibration_sparse.csv         Pennsylvania_calibration.csv                                     
 California_calibration.csv                     Maine_calibration.csv                    Pennsylvania_calibration_sparse.csv                              
 California_calibration_sparse.csv              Maine_calibration_sparse.csv             pyproject.toml
 changelog_entry.yaml                           Makefile                                 README.md
 CHANGELOG.md                                   Maryland_calibration.csv                'Rhode Island_calibration.csv'
 changelog.yaml                                 Maryland_calibration_sparse.csv         'Rhode Island_calibration_sparse.csv'
 Colorado_calibration.csv                       Massachusetts_calibration.csv           'South Carolina_calibration.csv'
 Colorado_calibration_sparse.csv                Massachusetts_calibration_sparse.csv    'South Carolina_calibration_sparse.csv'
 Connecticut_calibration.csv                    Michigan_calibration.csv                'South Dakota_calibration.csv'
 Connecticut_calibration_sparse.csv             Michigan_calibration_sparse.csv         'South Dakota_calibration_sparse.csv'
 Dataset_stacked.h5                             Minnesota_calibration.csv                src
 Dataset_state_level.h5                         Minnesota_calibration_sparse.csv         Tennessee_calibration.csv
 Delaware_calibration.csv                       Mississippi_calibration.csv              Tennessee_calibration_sparse.csv
 Delaware_calibration_sparse.csv                Mississippi_calibration_sparse.csv       tests
'District of Columbia_calibration.csv'          Missouri_calibration.csv                 Texas_calibration.csv
'District of Columbia_calibration_sparse.csv'   Missouri_calibration_sparse.csv          Texas_calibration_sparse.csv
 docs                                           Montana_calibration.csv                 'United States_calibration.csv'
 download                                       Montana_calibration_sparse.csv           Untitled.ipynb
 Florida_calibration.csv                        Nebraska_calibration.csv                 Utah_calibration.csv
 Florida_calibration_sparse.csv                 Nebraska_calibration_sparse.csv          Utah_calibration_sparse.csv
 full_calibration.csv                           Nevada_calibration.csv                   uv.lock
 full_calibration_sparse.csv                    Nevada_calibration_sparse.csv            Vermont_calibration.csv
 Georgia_calibration.csv                       'New Hampshire_calibration.csv'           Vermont_calibration_sparse.csv
 Georgia_calibration_sparse.csv                'New Hampshire_calibration_sparse.csv'    Virginia_calibration.csv
 Hawaii_calibration.csv                        'New Jersey_calibration.csv'              Virginia_calibration_sparse.csv
 Hawaii_calibration_sparse.csv                 'New Jersey_calibration_sparse.csv'       Washington_calibration.csv
 Idaho_calibration.csv                         'New Mexico_calibration.csv'              Washington_calibration_sparse.csv
 Idaho_calibration_sparse.csv                  'New Mexico_calibration_sparse.csv'      'West Virginia_calibration.csv'
 Illinois_calibration.csv                      'New York_calibration.csv'               'West Virginia_calibration_sparse.csv'
 Illinois_calibration_sparse.csv               'New York_calibration_sparse.csv'         Wisconsin_calibration.csv
 Indiana_calibration.csv                       'North Carolina_calibration.csv'          Wisconsin_calibration_sparse.csv
 Indiana_calibration_sparse.csv                'North Carolina_calibration_sparse.csv'   Wyoming_calibration.csv
 Iowa_calibration.csv                          'North Dakota_calibration.csv'            Wyoming_calibration_sparse.csv

ucgid_values = sim.calculate("ucgid").values
# The system returns enum names as strings, so compare with the name
assert all(val == california_ucgid.name for val in ucgid_values)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It just feels wierd to calculate ucgid and have the value be "CA" onscreen.

In [77]: sim.calculate('ucgid')
Out[77]: 
      value       weight
0        CA  4709.080566
1        CA  4709.080566
2        CA  4709.080566

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed! I thought making ucgid an Enum was a great idea but maybe it simply introduces unnecessary complexity... maybe the alternative geography identifying variables should be simple strings instead of Enums?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's run this by Max tomorrow. I know he wanted us to move away from in and I imagine just a string will suffice. However, I'm currently using the Enum to build the initial hierarchy for the strata (which I thought was a nice source of truth), so I'll have to get it somewhere else if it's no longer an Enum.

@juaristi22
Copy link
Collaborator Author

juaristi22 commented Aug 18, 2025

Any idea what happened?

Yep, microcalibrate's Calibration function has a parameter called csv_path which, when is not None, saves the calibration log to it. If regularization with L0 is enabled, it will save and additional _sparse.csv file. Because we iterate through every state, with 51, will save 102 csvs. I was doing this to explore how each state was calibrating with the dashboard, but I get it can be a lot. I disabled it by default so your directory doesnt clutter.

@juaristi22 juaristi22 requested a review from baogorek August 18, 2025 19:06
Copy link
Collaborator

@baogorek baogorek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Jupyter notebook is working really nicely now. There are still a few outstanding issues that we need to figure out, but I see no reason to block the merge. Nice work!

ucgid_values = sim.calculate("ucgid").values
# The system returns enum names as strings, so compare with the name
assert all(val == california_ucgid.name for val in ucgid_values)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's run this by Max tomorrow. I know he wanted us to move away from in and I imagine just a string will suffice. However, I'm currently using the Enum to build the initial hierarchy for the strata (which I thought was a nice source of truth), so I'll have to get it somewhere else if it's no longer an Enum.

@juaristi22 juaristi22 merged commit ea40502 into main Aug 20, 2025
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

4 participants