Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
221 changes: 221 additions & 0 deletions NUMPY_API_SUMMARY.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,221 @@
# NumPy-Compatible API Layer - Implementation Summary

## Overview
This document summarizes the implementation of the NumPy-compatible API layer for the Paper framework. The goal was to provide a familiar NumPy interface while leveraging Paper's out-of-core capabilities for datasets larger than memory.

**Current Limitation:** The implementation currently supports **2D matrices only** (NumPy's ndarray supports N dimensions). This is a foundational version focused on the most common use case for matrix operations. Future enhancements may add full N-dimensional support.

## Key Components Implemented

### 1. Core API Module (`paper/numpy_api.py`)

#### ndarray Class
The main array class that provides NumPy-like interface with lazy evaluation:

**Properties:**
- `shape` - Tuple representing array dimensions
- `dtype` - NumPy data type of elements
- `ndim` - Number of dimensions (always 2 for matrices)
- `size` - Total number of elements
- `T` - Transpose property

**Public Methods:**
- `to_numpy()` - Convert to NumPy array (loads into memory)
- `compute()` - Execute lazy computation plan
- `__add__()` - Element-wise addition operator
- `__mul__()` - Scalar multiplication operator
- `__matmul__()` - Matrix multiplication operator

**Internal Methods:**
- `_materialize()` - Internal method for materialization
- `_from_plan()` - Class method for creating lazy arrays

#### Array Creation Functions
- `array(data, dtype)` - Create from data (list, tuple, numpy array)
- `zeros(shape, dtype)` - Create zeros array
- `ones(shape, dtype)` - Create ones array
- `eye(n, dtype)` - Create identity matrix
- `random_rand(shape, dtype)` - Create random array

#### I/O Functions
- `load(filepath, shape, dtype)` - Load from binary file
- `save(filepath, array)` - Save to binary file

#### Helper Functions
- `dot(a, b)` - Matrix multiplication (alias for @)
- `add(a, b)` - Addition (alias for +)
- `multiply(a, b)` - Multiplication (alias for *)

### 2. Bug Fix (`paper/plan.py`)

**Issue:** During implementation, we discovered a pre-existing bug in the scalar multiplication logic where matrices from EagerNodes were being closed prematurely, causing segmentation faults on reuse. This bug existed in the original codebase but became apparent when implementing reusable lazy operations in the NumPy API.

**Fix:** Added check to only close intermediate computed results:
```python
# Only close TMP if it's not from an EagerNode
if not isinstance(self.left, EagerNode):
TMP.close()
```

### 3. Test Suite (`tests/test_numpy_api.py`)

Comprehensive testing with **26 tests** covering:

- **Array Creation (9 tests)**
- Creation from lists, numpy arrays
- Different dtypes support
- Special arrays (zeros, ones, eye, random)
- Array properties
- Public API (to_numpy method)

- **Operations (6 tests)**
- Addition
- Scalar multiplication (both left and right)
- Matrix multiplication
- Chained operations
- Transpose

- **Error Handling (3 tests)**
- Shape mismatch in addition
- Dimension mismatch in matmul
- Invalid array creation

- **File I/O (3 tests)**
- Loading arrays
- Saving arrays
- Saving lazy arrays

- **Helper Functions (3 tests)**
- dot() function
- add() function
- multiply() function

- **Large Arrays (2 tests)**
- Large array creation (1000x1000)
- Large matrix multiplication (500x600 @ 600x400)

### 4. Examples

#### Simple Demo (`examples/simple_demo.py`)
Quick-start example demonstrating basic usage:
- Array creation
- Operation chaining
- Lazy evaluation
- Result computation

#### Comprehensive Examples (`examples/numpy_api_example.py`)
9 detailed examples covering:
1. Basic array creation and operations
2. Scalar multiplication
3. Matrix multiplication
4. Chained operations with lazy evaluation
5. Array creation functions
6. Transpose operation
7. File I/O operations
8. Large arrays (out-of-core)
9. NumPy compatibility comparison

### 5. Documentation (`README.md`)

Updated README with:
- NumPy API overview
- Quick start guide
- Key features list
- Supported operations
- Examples reference

## Key Features

✅ **Familiar Interface** - NumPy-like syntax for easy migration
✅ **Lazy Evaluation** - Build computation plans without immediate execution
✅ **Automatic Optimization** - Operator fusion applied automatically
✅ **Out-of-Core Support** - Handle datasets larger than memory
✅ **Public API** - Clean separation of public and internal methods
✅ **Comprehensive Testing** - 62 total tests (26 new + 36 existing)

## Usage Examples

### Basic Usage
```python
from paper import numpy_api as pnp
import numpy as np

# Create arrays
a = pnp.array([[1, 2], [3, 4]], dtype=np.float32)
b = pnp.array([[5, 6], [7, 8]], dtype=np.float32)

# Build computation plan (lazy)
c = (a + b) * 2

# Execute and get result
result = c.compute()
numpy_result = result.to_numpy() # Convert to NumPy array
```

### Matrix Multiplication
```python
# Create matrices
A = pnp.array([[1, 2, 3], [4, 5, 6]], dtype=np.float32) # 2x3
B = pnp.array([[7, 8], [9, 10], [11, 12]], dtype=np.float32) # 3x2

# Matrix multiplication
C = A @ B # Lazy - 2x2 result

# Compute
result = C.compute()
print(result.to_numpy())
```

### Large Arrays
```python
# Create large random arrays (out-of-core)
a = pnp.random_rand((10000, 10000))
b = pnp.random_rand((10000, 10000))

# Operations don't load entire arrays into memory
c = (a + b) * 0.5

# Computation uses disk-backed matrices efficiently
result = c.compute()
```

## Test Results

All tests passing:
```
----------------------------------------------------------------------
Ran 62 tests in 0.644s

OK

==================================================
Tests run: 62
Failures: 0
Errors: 0
Skipped: 0
All tests passed! ✓
```

## Benefits for Users

1. **Easy Migration**: Minimal code changes required when migrating from NumPy
2. **Familiar Syntax**: Use the same operations and methods as NumPy
3. **Scalability**: Handle datasets that don't fit in memory
4. **Performance**: Automatic optimizations like operator fusion
5. **Lazy Evaluation**: Build complex computation plans efficiently
6. **Clean API**: Well-documented public methods with internal implementation hidden

## Future Enhancements

Potential areas for future development:
- More NumPy operations (subtraction, division, power, etc.)
- Broadcasting support
- Slicing operations
- Reduction operations (sum, mean, max, min)
- Element-wise functions (sin, cos, exp, log)
- Lazy transpose implementation
- Multi-dimensional array support (currently 2D only)

## Conclusion

The NumPy-compatible API layer successfully provides a familiar interface for Paper framework users while maintaining all the benefits of out-of-core matrix operations. The implementation is well-tested, documented, and ready for production use.
64 changes: 64 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,70 @@ Paper a lightweight Python framework for performing matrix computations on datas

The architecture is inspired by modern data systems and academic research (e.g., PreVision), with a clear separation between the logical plan, the physical execution backend, and an intelligent optimizer.

## NumPy-Compatible API

Paper now includes a **NumPy-compatible API layer** that provides a familiar interface for users migrating from NumPy or other array libraries. This makes it easy to leverage Paper's out-of-core capabilities with minimal code changes.

### Quick Start

```python
# Import Paper's NumPy-compatible API
from paper import numpy_api as pnp
import numpy as np

# Create arrays (similar to NumPy)
a = pnp.array([[1, 2, 3], [4, 5, 6]], dtype=np.float32)
b = pnp.array([[7, 8, 9], [10, 11, 12]], dtype=np.float32)

# Perform operations with lazy evaluation
c = (a + b) * 2

# Execute the computation plan
result = c.compute()
print(result.to_numpy())
```

### Key Features

- **Familiar NumPy Interface**: Use the same syntax as NumPy for array creation and operations
- **Lazy Evaluation**: Build computation plans without executing until `.compute()` is called
- **Automatic Optimization**: Operator fusion and intelligent caching applied automatically
- **Out-of-Core Support**: Handle datasets larger than memory seamlessly
- **Matrix Operations**: Support for addition, scalar multiplication, and matrix multiplication (@)

### Supported Operations

**Array Creation:**
- `pnp.array(data)` - Create array from data
- `pnp.zeros(shape)` - Create zeros array
- `pnp.ones(shape)` - Create ones array
- `pnp.eye(n)` - Create identity matrix
- `pnp.random_rand(shape)` - Create random array

**Operations:**
- `a + b` - Element-wise addition
- `a * scalar` - Scalar multiplication
- `a @ b` - Matrix multiplication
- `a.T` - Transpose

**I/O:**
- `pnp.load(filepath, shape)` - Load array from file
- `pnp.save(filepath, array)` - Save array to file

### Examples

See `examples/numpy_api_example.py` for comprehensive examples demonstrating:
- Basic array operations
- Chained operations with lazy evaluation
- Matrix multiplication
- File I/O
- Large array handling (out-of-core)

Run the examples:
```bash
python examples/numpy_api_example.py
```

### Architecture


Expand Down
Loading
Loading