Homerun Predictor: Neural Networks with Newton Trust-Region Optimization

A custom implementation of a matrix-free Newton Trust-Region optimizer with a Conjugate Gradient (Steihaug) subproblem solver for neural network training, applied to MLB home run prediction. On this task, the second-order optimizer trains 4.3× faster than SGD while reaching comparable accuracy.

Author: Ryan Lutz (Duke University, ME 555 - Numerical Optimization)

Overview

Neural networks typically rely on first-order optimization methods (SGD, Adam) that use only gradient information. This project implements a second-order Newton Trust-Region optimizer that leverages curvature information through Hessian-vector products, avoiding the O(N²) computational cost of computing the full Hessian matrix.

Key Innovation: Matrix-free Newton method using Hessian-vector products computed in O(N) time through automatic differentiation, enabling practical second-order optimization for neural networks.

Problem Statement

Task: Binary classification of batted balls in MLB games

  • Input: 10 Statcast features (launch speed, launch angle, pitch characteristics, etc.)
  • Output: Home run (1) or other outcome (0)
  • Challenge: Train a neural network using custom second-order optimization instead of standard first-order methods

Methodology

Neural Network Architecture

Structure: 4-layer feedforward network

  • Input layer: 10 features
  • Hidden layer 1: 64 neurons (ReLU activation)
  • Hidden layer 2: 32 neurons (ReLU activation)
  • Output layer: 1 neuron (Sigmoid activation)
  • Total parameters: 2,817 weights
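
A minimal PyTorch sketch of this architecture (class and variable names are illustrative, not the repository's actual code):

```python
import torch
import torch.nn as nn

class HomerunNet(nn.Module):
    """10 -> 64 -> 32 -> 1 feedforward network (2,817 parameters)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(10, 64),   # 10*64 + 64 =   704 params
            nn.ReLU(),
            nn.Linear(64, 32),   # 64*32 + 32 = 2,080 params
            nn.ReLU(),
            nn.Linear(32, 1),    # 32*1  + 1  =    33 params
            nn.Sigmoid(),        # probability of a home run
        )

    def forward(self, x):
        return self.net(x)

model = HomerunNet()
print(sum(p.numel() for p in model.parameters()))  # 2817
```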

Dataset

Source: MLB Statcast data via PyBaseball library
Period: June 2024 MLB season
Size: ~14,000 samples
Features: 10 engineered parameters

| Feature | Description |
| --- | --- |
| Release Speed | Pitch velocity (mph) |
| Batter Stance | Right- or left-handed |
| Pitcher Arm | Right- or left-handed |
| Batter Advantage | Count favorability (1 = ahead, 0 = behind/tied) |
| Launch Speed | Exit velocity off the bat (mph) |
| Launch Angle | Vertical departure angle (degrees) |
| Effective Speed | Effective velocity at contact (mph) |
| Spin Rate | Ball rotation (rpm) |
| Pitch Number | Pitch count within the at-bat |
| Pitch Type | Fastball / Breaking / Offspeed / Other |
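
For orientation, a rough sketch of how such a sample could be pulled with PyBaseball; the actual filtering and feature engineering live in Preprocessing_baseball.py and baseball_processing.py and may differ from this:

```python
from pybaseball import statcast

# Pull every pitch from June 2024; the returned DataFrame has one row per pitch.
raw = statcast(start_dt="2024-06-01", end_dt="2024-06-30")

# Keep only batted balls (Statcast marks balls put in play with type == "X").
batted = raw[raw["type"] == "X"].copy()

# Binary target: 1 if the batted ball was a home run, 0 otherwise.
batted["is_hr"] = (batted["events"] == "home_run").astype(int)

# A subset of Statcast columns corresponding to the features above.
# Categorical columns (stand, p_throws, pitch_type) still need encoding, and
# the "Batter Advantage" feature would be derived from the balls/strikes columns.
features = batted[["release_speed", "stand", "p_throws", "launch_speed",
                   "launch_angle", "effective_speed", "release_spin_rate",
                   "pitch_number", "pitch_type"]]
```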

Newton Trust-Region CG Optimizer

Core Algorithm: Trust-Region method with Conjugate Gradient subproblem solver

At each iteration k, solve the trust-region subproblem:

min_{p ∈ ℝⁿ} m_k(p) = f_k + g_k^T p + (1/2) p^T B_k p
subject to: ||p|| ≤ Δ_k

Where:

  • f_k = current loss
  • g_k = gradient
  • B_k = Hessian approximation
  • Δ_k = trust-region radius
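
Note that evaluating this model never requires B_k itself, only products B_k p; a tiny sketch (the hvp argument is a placeholder for the Hessian-vector product routine described below):

```python
import torch

def quadratic_model(f_k, g_k, hvp, p):
    """m_k(p) = f_k + g_k^T p + 0.5 * p^T B_k p, with B_k applied only
    through Hessian-vector products (never formed explicitly)."""
    return f_k + g_k.dot(p) + 0.5 * p.dot(hvp(p))
```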

Key Innovation - Hessian-Free Implementation:

Instead of forming the full Hessian B_k (which costs O(N²) to compute and store), the optimizer evaluates Hessian-vector products Hv in O(N) time using the ℛ-operator identity (Bishop, 2006):

Hv = ∇(v^T ∇E) = ℛ{∇E}

where ℛ{·} denotes the directional-derivative operator along v. This is implemented through:

  1. Forward propagation with ℛ operator (v_ji, v_kj denote the components of v paired with the weights w_ji, w_kj):

    ℛ{a_j} = Σ_i v_ji x_i
    ℛ{z_j} = h'(a_j) ℛ{a_j}
    ℛ{y_k} = Σ_j w_kj ℛ{z_j} + Σ_j v_kj z_j
    
  2. Backward propagation with ℛ operator:

    ℛ{δ_k} = ℛ{y_k}
    ℛ{δ_j} = h''(a_j) ℛ{a_j} Σ_k w_kj δ_k + h'(a_j) Σ_k v_kj δ_k + h'(a_j) Σ_k w_kj ℛ{δ_k}
    
  3. Hessian-vector product elements:

    ℛ{∂E/∂w_kj} = ℛ{δ_k} z_j + δ_k ℛ{z_j}
    ℛ{∂E/∂w_ji} = x_i ℛ{δ_j}
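
The same Hessian-vector product can also be obtained directly from PyTorch's autograd by differentiating the scalar gᵀv a second time; a minimal sketch (a generic double-backward formulation, not necessarily how the repository's ℛ-operator code is organized):

```python
import torch
from torch.nn.utils import parameters_to_vector

def hessian_vector_product(loss, params, v):
    """Return H @ v for the flattened parameter vector, at roughly the cost
    of two gradient evaluations (no N x N Hessian is ever built)."""
    # First backward pass: gradient g, kept in the graph so it can be
    # differentiated again.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    g = parameters_to_vector(grads)
    # Second backward pass: differentiate the scalar g^T v, which yields H v.
    hv = torch.autograd.grad(g.dot(v), params, retain_graph=True)
    return parameters_to_vector(hv)
```

Inside the CG loop below, something like `hvp = lambda v: hessian_vector_product(loss, list(model.parameters()), v)` plays the role of B_k.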
    

CG-Steihaug Truncation: The conjugate gradient iteration for the subproblem is truncated when either of the following occurs (see the sketch below):

  • The next CG iterate would leave the trust region: ||p_{j+1}|| ≥ Δ_k, in which case the step is cut back to the boundary
  • Negative curvature is detected along the current search direction: d_j^T B_k d_j ≤ 0
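
A condensed Python sketch of the Steihaug-CG inner loop, following the standard textbook formulation in Nocedal & Wright (function and variable names are illustrative, not the repository's):

```python
import torch

def cg_steihaug(g, hvp, delta, tol=1e-8, max_iter=250):
    """Approximately minimize m(p) = g^T p + 0.5 p^T B p s.t. ||p|| <= delta,
    using only Hessian-vector products B d = hvp(d)."""
    p = torch.zeros_like(g)
    r = g.clone()          # residual r_j = B p_j + g (starts at g since p_0 = 0)
    d = -r                 # first search direction: steepest descent
    if r.norm() < tol:
        return p
    for _ in range(max_iter):
        Bd = hvp(d)
        dBd = d.dot(Bd)
        if dBd <= 0:
            # Negative curvature: follow d out to the trust-region boundary.
            return _to_boundary(p, d, delta)
        alpha = r.dot(r) / dBd
        p_next = p + alpha * d
        if p_next.norm() >= delta:
            # Step would leave the trust region: stop on the boundary.
            return _to_boundary(p, d, delta)
        r_next = r + alpha * Bd
        if r_next.norm() < tol:
            return p_next
        beta = r_next.dot(r_next) / r.dot(r)
        p, r, d = p_next, r_next, -r_next + beta * d
    return p

def _to_boundary(p, d, delta):
    """Return p + tau*d with tau >= 0 chosen so that ||p + tau*d|| = delta."""
    a = d.dot(d)
    b = 2 * p.dot(d)
    c = p.dot(p) - delta ** 2
    tau = (-b + torch.sqrt(b * b - 4 * a * c)) / (2 * a)
    return p + tau * d
```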

Adaptive Trust-Region: Radius Δ_k is adjusted based on the ratio:

ρ_k = [f(x_k) - f(x_k + p_k)] / [m_k(0) - m_k(p_k)]

  • If ρ_k ≤ 0: reject the step, decrease Δ_k
  • If ρ_k ≈ 1: accept the step, expand Δ_k
  • Otherwise: accept the step, maintain Δ_k
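
A sketch of one outer trust-region iteration that ties the pieces together, reusing cg_steihaug from above. The 0.25/0.75 thresholds and the shrink/expand factors are conventional textbook defaults, not necessarily the values used in this project, and the closure argument is a hypothetical helper returning the loss, flat gradient, and an HVP callable at a given flat parameter vector:

```python
def trust_region_step(closure, params_vec, delta, delta_max=10.0):
    """One outer Newton TR-CG iteration on a flattened parameter vector."""
    f_k, g_k, hvp = closure(params_vec)
    p = cg_steihaug(g_k, hvp, delta)

    # Predicted vs. actual reduction.
    pred = -(g_k.dot(p) + 0.5 * p.dot(hvp(p)))   # m_k(0) - m_k(p)
    f_new, _, _ = closure(params_vec + p)
    actual = f_k - f_new
    rho = actual / pred

    if rho < 0.25:
        delta = 0.25 * delta                      # poor model fit: shrink the region
    elif rho > 0.75 and abs(p.norm() - delta) < 1e-8:
        delta = min(2.0 * delta, delta_max)       # good fit on the boundary: expand
    if rho > 0:
        params_vec = params_vec + p               # accept the step
    return params_vec, delta
```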

Results

Performance Comparison

Training Time (100 epochs):

| Optimizer | Time | Speedup |
| --- | --- | --- |
| Newton TR-CG | 4.42 seconds | 4.3× faster |
| SGD | 18.83 seconds | baseline |

Accuracy Metrics:

| Metric | Newton TR-CG | SGD | Difference |
| --- | --- | --- | --- |
| Accuracy | 95.6% | 96.2% | -0.6% |
| L2 Error | 22.432 | 19.3 | +16.2% |
| MSE | 0.0362 | 0.2685 | -86.5% |

Convergence Behavior

Newton TR-CG:

  • Rapid initial convergence (first 10 epochs)
  • Noticeable fluctuations due to adaptive trust-region sizing
  • Final loss: ~0.07

SGD:

  • Smooth, monotonic decrease
  • Consistent convergence throughout training
  • Final loss: ~0.01

Key Findings

  1. Computational Efficiency: Newton TR-CG achieves a 4.3× speedup over SGD by exploiting second-order curvature information through efficient Hessian-vector products

  2. Comparable Accuracy: Only 0.6 percentage point accuracy difference (95.6% vs 96.2%), demonstrating that second-order methods can match first-order performance

  3. Convergence Characteristics:

    • Newton TR-CG shows faster early convergence but more oscillations
    • SGD exhibits smoother, more stable convergence
    • Trust-region adaptation causes characteristic fluctuations in the Newton method

  4. Practical Tradeoff: For larger-scale problems, the 4.3× speedup may justify the slight accuracy decrease, especially when training time is a bottleneck

Implementation Details

File Structure

Homerun-Predictor/
├── NewtonTR_project.py          # Main Newton TR-CG implementation
├── NN_with_TR.py                # Neural network with Trust-Region optimizer
├── Num_Opt_SGD.py               # SGD baseline implementation
├── Preprocessing_baseball.py    # Statcast data preprocessing
├── baseball_processing.py       # Feature engineering
└── README.md

Dependencies

pip install torch numpy pandas
pip install pybaseball  # For MLB Statcast data

Usage

Train with Newton TR-CG:

python NewtonTR_project.py

Train with SGD baseline:

python Num_Opt_SGD.py

Preprocess data:

python Preprocessing_baseball.py

Technical Highlights

Mathematical Foundation

Trust-Region Formulation:

  • Quadratic model approximation of loss landscape
  • Adaptive region sizing based on model quality
  • Guaranteed convergence to local minimum

Hessian-Free Optimization:

  • O(N) Hessian-vector products vs. O(N²) full Hessian
  • Enables second-order methods for large-scale problems
  • Leverages PyTorch automatic differentiation

CG-Steihaug Solver:

  • Iteratively solves trust-region subproblem
  • Truncates at boundary or negative curvature
  • Combines benefits of Newton's method with trust-region safety

Activation Functions

ReLU (Hidden Layers):

f(x) = max(0, x)
  • Computationally efficient
  • Maintains non-linearity
  • Mitigates vanishing gradient

Sigmoid (Output Layer):

f(x) = 1 / (1 + e^(-x))
  • Maps to [0,1] range
  • Ideal for binary classification
  • Provides probabilistic interpretation

Applications

This second-order optimization approach is particularly valuable for:

When to Use Newton TR-CG

  • Limited training budget: When faster convergence matters more than final accuracy
  • Expensive forward passes: When gradient computation dominates training time
  • Small to medium networks: Where Hessian-vector products remain tractable
  • Well-conditioned problems: Where second-order information improves convergence

When to Use SGD

  • Very large networks: Where even O(N) Hessian-vector products become expensive
  • Maximum accuracy required: When that extra 0.5-1% matters
  • Noisy gradients: When mini-batch variance dominates
  • Established pipelines: When using proven, battle-tested optimizers

Extensions and Future Work

1. Quasi-Newton Methods

  • Implement L-BFGS with limited memory Hessian approximation
  • Compare computational efficiency with full Newton TR-CG
  • Evaluate accuracy-speed tradeoffs

2. Stochastic Newton Methods

  • Extend to mini-batch training with subsampled Hessians
  • Implement variance reduction techniques
  • Balance per-iteration cost with convergence speed

3. Hybrid Approaches

  • Warm start with SGD, finish with Newton TR-CG
  • Switch optimizers based on gradient norm
  • Adaptive optimizer selection

4. Larger Datasets

  • Scale to full MLB season (150,000+ samples)
  • Evaluate performance on deeper networks
  • Test computational scalability

5. Alternative Applications

  • Image classification (MNIST, CIFAR-10)
  • Regression tasks (housing prices, stock prediction)
  • Multi-class classification

Theoretical Background

Second-Order Optimization

First-order methods (SGD, Adam):

  • Use only gradient information: ∇f(x)
  • Update: x_{k+1} = x_k - α∇f(x_k)
  • Linear convergence near optimum

Second-order methods (Newton):

  • Use gradient and curvature: ∇f(x), ∇²f(x)
  • Update: x_{k+1} = x_k - [∇²f(x_k)]^(-1) ∇f(x_k)
  • Quadratic convergence near optimum
  • Traditional bottleneck: O(N³) Hessian inversion
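
For intuition, on a quadratic loss a single Newton step lands exactly on the minimizer, while gradient descent crawls along poorly scaled directions; a toy illustration (the ill-conditioned 2-D quadratic is made up for the example):

```python
import torch

# f(x) = 0.5 * x^T A x with an ill-conditioned Hessian A.
A = torch.tensor([[100.0, 0.0], [0.0, 1.0]])
x0 = torch.tensor([1.0, 1.0])

# Gradient descent: x_{k+1} = x_k - alpha * grad f(x_k)
x_gd = x0.clone()
for _ in range(100):
    x_gd = x_gd - 0.005 * (A @ x_gd)

# Newton: x_{k+1} = x_k - [hess f]^{-1} grad f(x_k); one step suffices here.
x_newton = x0 - torch.linalg.inv(A) @ (A @ x0)

print(x_gd)      # still ~0.6 along the low-curvature direction
print(x_newton)  # exactly [0., 0.]
```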

Trust-Region Newton:

  • Avoids Hessian inversion through subproblem solving
  • Adaptive step sizing via trust-region radius
  • Guaranteed descent with sufficient model quality

Why Hessian-Vector Products Work

The key insight is that we never need the full Hessian matrix B_k. We only need to compute products of the form B_k v for various vectors v during CG iterations.

Using automatic differentiation:

∇(∇f · v) = [∇²f] v = Hv

This can be computed with one forward pass and one backward pass, making it only ~2× the cost of a gradient computation instead of N× (where N is the number of parameters).
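
This identity is easy to check numerically on a small function (a throwaway example, unrelated to the project code):

```python
import torch

def f(x):
    return (x[0] ** 2) * x[1] + torch.sin(x[1]) * x[0]

x = torch.tensor([0.7, -1.3], requires_grad=True)
v = torch.tensor([0.5, 2.0])

# Hv via two backward passes: differentiate the scalar (grad f . v).
(g,) = torch.autograd.grad(f(x), x, create_graph=True)
(hv,) = torch.autograd.grad(g.dot(v), x)

# Reference: build the full 2x2 Hessian and multiply.
H = torch.autograd.functional.hessian(f, x.detach())
print(torch.allclose(hv, H @ v))  # True
```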

Limitations

Current Implementation:

  • Single dataset (MLB June 2024)
  • Fixed network architecture
  • No hyperparameter tuning for Newton method
  • Limited comparison (only vs. vanilla SGD)

Newton TR-CG Challenges:

  • More complex implementation than SGD
  • Hyperparameters (initial Δ, expansion/contraction factors) require tuning
  • CG subproblem may require many iterations
  • Memory overhead for CG solver state

References

  1. Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer New York.

    • Theoretical foundation for neural networks and optimization

  2. Nocedal, J., & Wright, S. J. (2006). Numerical Optimization. Springer.

    • Trust-region methods and conjugate gradient algorithms

  3. Cornell Optimization Wiki - Trust-Region Methods

  4. PyBaseball Library - MLB Statcast Data Access

License

This project is available under the MIT License.

Contact

Ryan Lutz
ryanjohnlutz@gmail.com
Duke University - Mechanical Engineering and Materials Science
GitHub

This work was completed as a final project for ME 555 (Numerical Optimization) at Duke University, demonstrating the practical application of advanced optimization theory to machine learning problems.


Acknowledgments: Special thanks to the ME 555 course staff for guidance on trust-region methods and the PyBaseball community for providing accessible MLB data.