# Improve Abstraction Strategy for Cross-Country Code Sharing

## Summary

The `policyengine-data` package aims to share calibration functionality between country-specific implementations (US, UK, etc.). However, the current approach creates tight coupling with US-specific implementations while attempting to appear generic. This issue proposes refactoring to achieve true abstraction that will scale gracefully as more countries are added.

## Current Situation

The `metrics_matrix_creation.py` module is intended to be country-agnostic but contains several abstraction issues:

1. **Database schema assumptions** - SQL queries assume specific table structures and column names
2. **Entity hierarchy assumptions** - Code assumes `household` is a universal top-level entity
3. **Variable validation** - The database models use a `USVariable` enum that validates against US-specific variables
4. **Cross-repository dependencies** - Generic code imports from and depends on US-specific implementations
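As a rough illustration of item 3, validation could take the set of allowed variables as an argument instead of importing a US enum. This is only a sketch: `ExampleUSVariable`, `ExampleUKVariable`, their members, and `validate_variable` are placeholders invented here, not the real contents of `USVariable` or any existing function.

```python
from enum import Enum
from typing import Type


class ExampleUSVariable(str, Enum):
    # Placeholder members; the real USVariable enum is not reproduced here.
    INCOME_TAX = "income_tax"
    SNAP = "snap"


class ExampleUKVariable(str, Enum):
    # Placeholder members for a hypothetical second country.
    INCOME_TAX = "income_tax"
    UNIVERSAL_CREDIT = "universal_credit"


def validate_variable(name: str, allowed: Type[Enum]) -> str:
    """Check `name` against whichever enum the calling country supplies."""
    valid_names = {member.value for member in allowed}
    if name not in valid_names:
        raise ValueError(f"{name!r} is not a valid {allowed.__name__} value")
    return name


# The same generic function serves both countries.
us_ok = validate_variable("snap", ExampleUSVariable)
uk_ok = validate_variable("universal_credit", ExampleUKVariable)
```

With this shape, the shared package never needs to know which variables exist in any one country.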

## Types of Abstraction Issues

### Leaky Abstraction
Implementation details "leak through" the abstraction boundary. For example:
- The "generic" function assumes all countries have a `household` entity
- SQL queries hard-code specific column names that may not exist in other countries
- Code assumes that `reform_id = 0` means "baseline" in every country
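One way to stop these details leaking would be for each country to describe its own schema and baseline convention, with the generic query built from that description. The sketch below uses an in-memory SQLite database; the `TargetSchema` class, `fetch_baseline_targets`, and the table and column names are all hypothetical and do not reflect the actual database layout.

```python
import re
import sqlite3
from dataclasses import dataclass
from typing import List, Tuple

# Identifiers cannot be bound as SQL parameters, so they are validated instead.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


@dataclass(frozen=True)
class TargetSchema:
    """Country-supplied description of where targets live (hypothetical)."""
    table: str
    variable_column: str
    value_column: str
    period_column: str
    reform_column: str
    baseline_reform_id: int

    def __post_init__(self) -> None:
        for name in (self.table, self.variable_column, self.value_column,
                     self.period_column, self.reform_column):
            if not _IDENTIFIER.match(name):
                raise ValueError(f"Unsafe SQL identifier: {name!r}")


def fetch_baseline_targets(conn: sqlite3.Connection, schema: TargetSchema,
                           period: int) -> List[Tuple[str, float]]:
    """Generic query: no column name or baseline id is hard-coded."""
    query = (
        f"SELECT {schema.variable_column}, {schema.value_column} "
        f"FROM {schema.table} "
        f"WHERE {schema.period_column} = ? AND {schema.reform_column} = ? "
        f"ORDER BY {schema.variable_column}"
    )
    return conn.execute(query, (period, schema.baseline_reform_id)).fetchall()


# Toy data for one made-up country.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE targets (variable TEXT, value REAL, period INTEGER, reform_id INTEGER)"
)
conn.executemany(
    "INSERT INTO targets VALUES (?, ?, ?, ?)",
    [
        ("income_tax", 100.0, 2024, 0),
        ("income_tax", 90.0, 2024, 1),  # reform row, excluded
        ("population", 50.0, 2023, 0),  # other period, excluded
    ],
)
example_schema = TargetSchema(
    table="targets", variable_column="variable", value_column="value",
    period_column="period", reform_column="reform_id", baseline_reform_id=0,
)
baseline_rows = fetch_baseline_targets(conn, example_schema, period=2024)
```

A country that marks its baseline differently, or names its columns differently, would only supply a different `TargetSchema`.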

### False Abstraction
Code that appears abstract but actually only works with one concrete implementation:
- A "generic" package that imports US-specific database models
- Functions that take a `microsimulation_class` parameter but still assume US-specific structure

### Premature Abstraction
Extracting shared code before understanding what's truly common:
- Moving code to a shared package before implementing multiple countries
- Guessing at what will be common rather than discovering it empirically

## Proposed Solutions

### Option 1: Strategy Pattern with Country Adapters (Recommended)

Create a clear protocol that each country implements:

```python
# In policyengine-data
from typing import List, Protocol

import numpy as np
import pandas as pd


class CountryAdapter(Protocol):
    def fetch_targets(self, engine, period: int, **filters) -> pd.DataFrame:
        """Fetch targets in a standardized format."""
        ...

    def get_entity_hierarchy(self) -> List[str]:
        """Return the country's entity hierarchy."""
        ...

    def apply_constraints(self, sim, constraints, target_entity: str) -> np.ndarray:
        """Apply country-specific constraint logic."""
        ...
```

Each country provides its own adapter, and the generic code works only with the abstract interface.
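To make this concrete, here is a minimal sketch of generic code driven by two fake adapters. It uses a reduced, stdlib-only variant of the protocol (plain lists and dicts rather than DataFrames). `SimpleCountryAdapter`, `FakeCountryA`, `FakeCountryB`, `group_targets_by_entity`, and the entity and variable names are all invented for illustration.

```python
from typing import Dict, List, Protocol, Union

Target = Dict[str, Union[str, float]]


class SimpleCountryAdapter(Protocol):
    """Reduced stand-in for CountryAdapter, for illustration only."""

    def get_entity_hierarchy(self) -> List[str]:
        ...

    def fetch_targets(self, period: int) -> List[Target]:
        ...


class FakeCountryA:
    """Hypothetical country whose top-level entity is `household`."""

    def get_entity_hierarchy(self) -> List[str]:
        return ["person", "household"]

    def fetch_targets(self, period: int) -> List[Target]:
        return [{"entity": "household", "variable": "rent", "value": 120.0}]


class FakeCountryB:
    """Hypothetical country with no `household` entity at all."""

    def get_entity_hierarchy(self) -> List[str]:
        return ["person", "family", "dwelling"]

    def fetch_targets(self, period: int) -> List[Target]:
        return [
            {"entity": "family", "variable": "child_benefit", "value": 40.0},
            {"entity": "dwelling", "variable": "rent", "value": 75.0},
        ]


def group_targets_by_entity(
    adapter: SimpleCountryAdapter, period: int
) -> Dict[str, List[Target]]:
    """Generic code: it touches only the protocol, never a country module."""
    grouped: Dict[str, List[Target]] = {
        entity: [] for entity in adapter.get_entity_hierarchy()
    }
    for target in adapter.fetch_targets(period):
        entity = str(target["entity"])
        if entity not in grouped:
            raise ValueError(f"Unknown entity {entity!r} for this country")
        grouped[entity].append(target)
    return grouped


grouped_a = group_targets_by_entity(FakeCountryA(), period=2024)
grouped_b = group_targets_by_entity(FakeCountryB(), period=2024)
```

Because `group_targets_by_entity` never names `household`, it runs unchanged for the second country. The same fakes could double as mock implementations in tests.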

### Option 2: Share Only True Commonalities

Identify what's actually generic (likely just the mathematical optimization) and share only that:

```python
# In policyengine-data - pure math, no country assumptions
import numpy as np


def calibrate_weights(
    metrics_matrix: np.ndarray,
    target_values: np.ndarray,
    initial_weights: np.ndarray,
) -> np.ndarray:
    """Pure mathematical optimization."""
    ...
```

Let each country handle its own data preparation and matrix construction.
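For a sense of how small the shared piece could be, the sketch below fills in `calibrate_weights` with plain gradient descent on log-weights, minimizing mean squared relative error. The optimizer choice, the extra `learning_rate` and `n_steps` parameters, and the `(n_records, n_targets)` matrix layout are assumptions made for this example, not a description of the package's existing calibration routine. It also assumes all targets are nonzero.

```python
import numpy as np


def calibrate_weights(
    metrics_matrix: np.ndarray,   # assumed shape: (n_records, n_targets)
    target_values: np.ndarray,    # assumed shape: (n_targets,), all nonzero
    initial_weights: np.ndarray,  # assumed shape: (n_records,), all positive
    learning_rate: float = 0.5,   # tuned for the toy example below
    n_steps: int = 2000,
) -> np.ndarray:
    """Gradient descent on log-weights, so weights stay strictly positive."""
    log_weights = np.log(np.asarray(initial_weights, dtype=float))
    targets = np.asarray(target_values, dtype=float)
    for _ in range(n_steps):
        weights = np.exp(log_weights)
        estimates = weights @ metrics_matrix
        relative_error = (estimates - targets) / targets
        # Gradient of mean(relative_error ** 2) with respect to the estimates.
        grad_estimates = 2.0 * relative_error / targets / targets.size
        # Chain rule: estimates -> weights -> log-weights.
        grad_log_weights = (metrics_matrix @ grad_estimates) * weights
        log_weights -= learning_rate * grad_log_weights
    return np.exp(log_weights)


# Toy problem: column 0 counts records, column 1 is a per-record amount.
example_matrix = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
example_targets = np.array([6.0, 8.0])
calibrated = calibrate_weights(example_matrix, example_targets, np.ones(3))
calibrated_estimates = calibrated @ example_matrix
```

Nothing in this function refers to entities, variables, or database tables, which is the point of Option 2.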

### Option 3: Delayed Abstraction

Consider moving this code back to `policyengine-us-data` for now and extracting the truly generic parts only after UK calibration has been implemented. This would:
- Eliminate cross-repo dependencies
- Make the code easier to understand and test
- Allow natural abstraction patterns to emerge

## Benefits of Refactoring

1. **Cleaner separation** - Country-specific code stays in country repos
2. **Easier testing** - Can test generic code with mock implementations
3. **Better scalability** - Adding new countries won't require modifying "generic" code
4. **Clearer ownership** - Each country team owns their full implementation

## Next Steps

1. Audit current code to identify truly generic components (likely just mathematical operations)
2. Design adapter interface based on actual country differences
3. Refactor in stages:
   - First, move country-specific code back to country repos
   - Then, extract truly generic mathematical functions
   - Finally, implement adapter pattern if needed

## How We Can Help

I'm happy to help with:
- Analyzing the codebase to identify true commonalities
- Designing the adapter interfaces
- Creating a migration plan that doesn't break existing functionality
- Implementing the refactoring in manageable chunks

This refactoring will make the codebase more maintainable and prepare it for international expansion. The key insight is that **good abstraction comes from understanding multiple concrete implementations**, not from trying to predict future needs.

## Discussion Questions

1. What functionality is already implemented for UK calibration?
2. Are there specific requirements or constraints we should consider?
3. Would the team prefer gradual refactoring or a clean-slate redesign?

Let's make this codebase as clean and maintainable as the important work it supports deserves!
