Skip to content

Improve Abstraction Strategy for Cross-Country Code Sharing #24

@baogorek

Description

@baogorek

Improve Abstraction Strategy for Cross-Country Code Sharing

Summary

The policyengine-data package aims to share calibration functionality between country-specific implementations (US, UK, etc.). However, the current approach creates tight coupling with US-specific implementations while attempting to appear generic. This issue proposes refactoring to achieve true abstraction that will scale gracefully as more countries are added.

Current Situation

The metrics_matrix_creation.py module is intended to be country-agnostic but contains several abstraction issues:

  1. Database schema assumptions - SQL queries assume specific table structures and column names
  2. Entity hierarchy assumptions - Code assumes household is a universal top-level entity
  3. Variable validation - The database models use a USVariable enum that validates against US-specific variables
  4. Cross-repository dependencies - Generic code imports from and depends on US-specific implementations

Types of Abstraction Issues

Leaky Abstraction

Implementation details "leak through" the abstraction boundary. For example:

  • The "generic" function assumes all countries have a household entity
  • SQL queries hard-code specific column names that may not exist in other countries
  • The assumption that reform_id = 0 means "baseline" across all countries

False Abstraction

Code that appears abstract but actually only works with one concrete implementation:

  • A "generic" package that imports US-specific database models
  • Functions that take a microsimulation_class parameter but still assume US-specific structure

Premature Abstraction

Extracting shared code before understanding what's truly common:

  • Moving code to a shared package before implementing multiple countries
  • Guessing at what will be common rather than discovering it empirically

Proposed Solutions

Option 1: Strategy Pattern with Country Adapters (Recommended)

Create a clear protocol that each country implements:

# In policyengine-data
from typing import Protocol

class CountryAdapter(Protocol):
    def fetch_targets(self, engine, period: int, **filters) -> pd.DataFrame:
        """Fetch targets in a standardized format"""
        ...
    
    def get_entity_hierarchy(self) -> List[str]:
        """Return country's entity hierarchy"""
        ...
    
    def apply_constraints(self, sim, constraints, target_entity: str) -> np.ndarray:
        """Apply country-specific constraint logic"""
        ...

Each country provides its own adapter, and the generic code works only with the abstract interface.

Option 2: Share Only True Commonalities

Identify what's actually generic (likely just the mathematical optimization) and share only that:

# In policyengine-data - pure math, no country assumptions
def calibrate_weights(
    metrics_matrix: np.ndarray,
    target_values: np.ndarray,
    initial_weights: np.ndarray
) -> np.ndarray:
    """Pure mathematical optimization"""
    ...

Let each country handle its own data preparation and matrix construction.

Option 3: Delayed Abstraction

Consider moving this code back to policyengine-us-data for now, and extract truly generic parts only after implementing UK calibration. This would:

  • Eliminate cross-repo dependencies
  • Make the code easier to understand and test
  • Allow natural abstraction patterns to emerge

Benefits of Refactoring

  1. Cleaner separation - Country-specific code stays in country repos
  2. Easier testing - Can test generic code with mock implementations
  3. Better scalability - Adding new countries won't require modifying "generic" code
  4. Clearer ownership - Each country team owns their full implementation

Next Steps

  1. Audit current code to identify truly generic components (likely just mathematical operations)
  2. Design adapter interface based on actual country differences
  3. Refactor in stages:
    • First, move country-specific code back to country repos
    • Then, extract truly generic mathematical functions
    • Finally, implement adapter pattern if needed

How We Can Help

I'm happy to help with:

  • Analyzing the codebase to identify true commonalities
  • Designing the adapter interfaces
  • Creating a migration plan that doesn't break existing functionality
  • Implementing the refactoring in manageable chunks

This refactoring will make the codebase more maintainable and prepare it for international expansion. The key insight is that good abstraction comes from understanding multiple concrete implementations, not from trying to predict future needs.

Discussion Questions

  1. What functionality is already implemented for UK calibration?
  2. Are there specific requirements or constraints we should consider?
  3. Would the team prefer gradual refactoring or a clean-slate redesign?

Let's make this codebase as clean and maintainable as the important work it supports deserves!

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions