Homerun Predictor: Neural Networks with Newton Trust-Region Optimization

A custom implementation of a matrix-free Newton Trust-Region optimizer with a Conjugate Gradient (Steihaug) subproblem solver for neural network training, applied to MLB home run prediction. On this task, the second-order optimizer trains 4.3× faster than SGD while reaching comparable accuracy.

Author: Ryan Lutz (Duke University, ME 555 - Numerical Optimization)

Overview

Neural networks typically rely on first-order optimization methods (SGD, Adam) that use only gradient information. This project implements a second-order Newton Trust-Region optimizer that leverages curvature information through Hessian-vector products, avoiding the O(N²) computational cost of computing the full Hessian matrix.

Key Innovation: Matrix-free Newton method using Hessian-vector products computed in O(N) time through automatic differentiation, enabling practical second-order optimization for neural networks.

Problem Statement

Task: Binary classification of batted balls in MLB games

  • Input: 10 Statcast features (launch speed, launch angle, pitch characteristics, etc.)
  • Output: Home run (1) or other outcome (0)
  • Challenge: Train a neural network using custom second-order optimization instead of standard first-order methods

Methodology

Neural Network Architecture

Structure: 4-layer feedforward network

  • Input layer: 10 features
  • Hidden layer 1: 64 neurons (ReLU activation)
  • Hidden layer 2: 32 neurons (ReLU activation)
  • Output layer: 1 neuron (Sigmoid activation)
  • Total parameters: 2,817 weights
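
A minimal PyTorch sketch of this architecture (class and variable names are illustrative, not the repository's actual code):

```python
import torch
import torch.nn as nn

class HomerunNet(nn.Module):
    """10 -> 64 -> 32 -> 1 feedforward network (2,817 parameters)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(10, 64),   # 10*64 + 64 =   704 params
            nn.ReLU(),
            nn.Linear(64, 32),   # 64*32 + 32 = 2,080 params
            nn.ReLU(),
            nn.Linear(32, 1),    # 32*1  + 1  =    33 params
            nn.Sigmoid(),        # probability of a home run
        )

    def forward(self, x):
        return self.net(x)

model = HomerunNet()
print(sum(p.numel() for p in model.parameters()))  # 2817
```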

Dataset

Source: MLB Statcast data via PyBaseball library
Period: June 2024 MLB season
Size: ~14,000 samples
Features: 10 engineered parameters

| Feature | Description |
| --- | --- |
| Release Speed | Pitch velocity (mph) |
| Batter Stance | Right- or left-handed |
| Pitcher Arm | Right- or left-handed |
| Batter Advantage | Count favorability (1 = ahead, 0 = behind/tied) |
| Launch Speed | Exit velocity off the bat (mph) |
| Launch Angle | Vertical departure angle (degrees) |
| Effective Speed | Effective velocity at contact (mph) |
| Spin Rate | Ball rotation (rpm) |
| Pitch Number | Pitch count within the at-bat |
| Pitch Type | Fastball / Breaking / Offspeed / Other |
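
For orientation, a rough sketch of how such a sample could be pulled with PyBaseball; the actual filtering and feature engineering live in Preprocessing_baseball.py and baseball_processing.py and may differ from this:

```python
from pybaseball import statcast

# Pull every pitch from June 2024; the returned DataFrame has one row per pitch.
raw = statcast(start_dt="2024-06-01", end_dt="2024-06-30")

# Keep only batted balls (Statcast marks balls put in play with type == "X").
batted = raw[raw["type"] == "X"].copy()

# Binary target: 1 if the batted ball was a home run, 0 otherwise.
batted["is_hr"] = (batted["events"] == "home_run").astype(int)

# A subset of Statcast columns corresponding to the features above.
# Categorical columns (stand, p_throws, pitch_type) still need encoding, and
# the "Batter Advantage" feature would be derived from the balls/strikes columns.
features = batted[["release_speed", "stand", "p_throws", "launch_speed",
                   "launch_angle", "effective_speed", "release_spin_rate",
                   "pitch_number", "pitch_type"]]
```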

Newton Trust-Region CG Optimizer

Core Algorithm: Trust-Region method with Conjugate Gradient subproblem solver

At each iteration k, solve the trust-region subproblem:

min_{p ∈ ℝⁿ} m_k(p) = f_k + g_k^T p + (1/2) p^T B_k p
subject to: ||p|| ≤ Δ_k

Where:

  • f_k = current loss
  • g_k = gradient
  • B_k = Hessian approximation
  • Δ_k = trust-region radius
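
Note that evaluating this model never requires B_k itself, only products B_k p; a tiny sketch (the hvp argument is a placeholder for the Hessian-vector product routine described below):

```python
import torch

def quadratic_model(f_k, g_k, hvp, p):
    """m_k(p) = f_k + g_k^T p + 0.5 * p^T B_k p, with B_k applied only
    through Hessian-vector products (never formed explicitly)."""
    return f_k + g_k.dot(p) + 0.5 * p.dot(hvp(p))
```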

Key Innovation - Hessian-Free Implementation:

Instead of forming the full Hessian B_k (which costs O(N²) to compute and store), the optimizer evaluates Hessian-vector products Hv in O(N) time using the ℛ-operator identity (Bishop, 2006):

Hv = ∇(v^T ∇E) = ℛ{∇E}

where ℛ{·} denotes the directional-derivative operator along v. This is implemented through:

  1. Forward propagation with ℛ operator (v_ji, v_kj denote the components of v paired with the weights w_ji, w_kj):

    ℛ{a_j} = Σ_i v_ji x_i
    ℛ{z_j} = h'(a_j) ℛ{a_j}
    ℛ{y_k} = Σ_j w_kj ℛ{z_j} + Σ_j v_kj z_j
    
  2. Backward propagation with ℛ operator:

    ℛ{δ_k} = ℛ{y_k}
    ℛ{δ_j} = h''(a_j) ℛ{a_j} Σ_k w_kj δ_k + h'(a_j) Σ_k v_kj δ_k + h'(a_j) Σ_k w_kj ℛ{δ_k}
    
  3. Hessian-vector product elements:

    ℛ{∂E/∂w_kj} = ℛ{δ_k} z_j + δ_k ℛ{z_j}
    ℛ{∂E/∂w_ji} = x_i ℛ{δ_j}
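
The same Hessian-vector product can also be obtained directly from PyTorch's autograd by differentiating the scalar gᵀv a second time; a minimal sketch (a generic double-backward formulation, not necessarily how the repository's ℛ-operator code is organized):

```python
import torch
from torch.nn.utils import parameters_to_vector

def hessian_vector_product(loss, params, v):
    """Return H @ v for the flattened parameter vector, at roughly the cost
    of two gradient evaluations (no N x N Hessian is ever built)."""
    # First backward pass: gradient g, kept in the graph so it can be
    # differentiated again.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    g = parameters_to_vector(grads)
    # Second backward pass: differentiate the scalar g^T v, which yields H v.
    hv = torch.autograd.grad(g.dot(v), params, retain_graph=True)
    return parameters_to_vector(hv)
```

Inside the CG loop below, something like `hvp = lambda v: hessian_vector_product(loss, list(model.parameters()), v)` plays the role of B_k.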
    

CG-Steihaug Truncation: The conjugate gradient iteration for the subproblem is truncated when either of the following occurs (see the sketch below):

  • The next CG iterate would leave the trust region: ||p_{j+1}|| ≥ Δ_k, in which case the step is cut back to the boundary
  • Negative curvature is detected along the current search direction: d_j^T B_k d_j ≤ 0
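
A condensed Python sketch of the Steihaug-CG inner loop, following the standard textbook formulation in Nocedal & Wright (function and variable names are illustrative, not the repository's):

```python
import torch

def cg_steihaug(g, hvp, delta, tol=1e-8, max_iter=250):
    """Approximately minimize m(p) = g^T p + 0.5 p^T B p s.t. ||p|| <= delta,
    using only Hessian-vector products B d = hvp(d)."""
    p = torch.zeros_like(g)
    r = g.clone()          # residual r_j = B p_j + g (starts at g since p_0 = 0)
    d = -r                 # first search direction: steepest descent
    if r.norm() < tol:
        return p
    for _ in range(max_iter):
        Bd = hvp(d)
        dBd = d.dot(Bd)
        if dBd <= 0:
            # Negative curvature: follow d out to the trust-region boundary.
            return _to_boundary(p, d, delta)
        alpha = r.dot(r) / dBd
        p_next = p + alpha * d
        if p_next.norm() >= delta:
            # Step would leave the trust region: stop on the boundary.
            return _to_boundary(p, d, delta)
        r_next = r + alpha * Bd
        if r_next.norm() < tol:
            return p_next
        beta = r_next.dot(r_next) / r.dot(r)
        p, r, d = p_next, r_next, -r_next + beta * d
    return p

def _to_boundary(p, d, delta):
    """Return p + tau*d with tau >= 0 chosen so that ||p + tau*d|| = delta."""
    a = d.dot(d)
    b = 2 * p.dot(d)
    c = p.dot(p) - delta ** 2
    tau = (-b + torch.sqrt(b * b - 4 * a * c)) / (2 * a)
    return p + tau * d
```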

Adaptive Trust-Region: Radius Δ_k is adjusted based on the ratio:

ρ_k = [f(x_k) - f(x_k + p_k)] / [m_k(0) - m_k(p_k)]

  • If ρ_k ≤ 0: reject the step, decrease Δ_k
  • If ρ_k ≈ 1: accept the step, expand Δ_k
  • Otherwise: accept the step, maintain Δ_k
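
A sketch of one outer trust-region iteration that ties the pieces together, reusing cg_steihaug from above. The 0.25/0.75 thresholds and the shrink/expand factors are conventional textbook defaults, not necessarily the values used in this project, and the closure argument is a hypothetical helper returning the loss, flat gradient, and an HVP callable at a given flat parameter vector:

```python
def trust_region_step(closure, params_vec, delta, delta_max=10.0):
    """One outer Newton TR-CG iteration on a flattened parameter vector."""
    f_k, g_k, hvp = closure(params_vec)
    p = cg_steihaug(g_k, hvp, delta)

    # Predicted vs. actual reduction.
    pred = -(g_k.dot(p) + 0.5 * p.dot(hvp(p)))   # m_k(0) - m_k(p)
    f_new, _, _ = closure(params_vec + p)
    actual = f_k - f_new
    rho = actual / pred

    if rho < 0.25:
        delta = 0.25 * delta                      # poor model fit: shrink the region
    elif rho > 0.75 and abs(p.norm() - delta) < 1e-8:
        delta = min(2.0 * delta, delta_max)       # good fit on the boundary: expand
    if rho > 0:
        params_vec = params_vec + p               # accept the step
    return params_vec, delta
```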

Results

Performance Comparison

Training Time (100 epochs):

| Optimizer | Time | Speedup |
| --- | --- | --- |
| Newton TR-CG | 4.42 seconds | 4.3× faster |
| SGD | 18.83 seconds | baseline |

Accuracy Metrics:

| Metric | Newton TR-CG | SGD | Difference |
| --- | --- | --- | --- |
| Accuracy | 95.6% | 96.2% | -0.6% |
| L2 Error | 22.432 | 19.3 | +16.2% |
| MSE | 0.0362 | 0.2685 | -86.5% |

Convergence Behavior

Newton TR-CG:

  • Rapid initial convergence (first 10 epochs)
  • Noticeable fluctuations due to adaptive trust-region sizing
  • Final loss: ~0.07

SGD:

  • Smooth, monotonic decrease
  • Consistent convergence throughout training
  • Final loss: ~0.01

Key Findings

  1. Computational Efficiency: Newton TR-CG achieves a 4.3× speedup over SGD by exploiting second-order curvature information through efficient Hessian-vector products

  2. Comparable Accuracy: Only 0.6 percentage point accuracy difference (95.6% vs 96.2%), demonstrating that second-order methods can match first-order performance

  3. Convergence Characteristics:

    • Newton TR-CG shows faster early convergence but more oscillations
    • SGD exhibits smoother, more stable convergence
    • Trust-region adaptation causes characteristic fluctuations in the Newton method

  4. Practical Tradeoff: For larger-scale problems, the 4.3× speedup may justify the slight accuracy decrease, especially when training time is a bottleneck

Implementation Details

File Structure

Homerun-Predictor/
├── NewtonTR_project.py          # Main Newton TR-CG implementation
├── NN_with_TR.py                # Neural network with Trust-Region optimizer
├── Num_Opt_SGD.py               # SGD baseline implementation
├── Preprocessing_baseball.py    # Statcast data preprocessing
├── baseball_processing.py       # Feature engineering
└── README.md

Dependencies

pip install torch numpy pandas
pip install pybaseball  # For MLB Statcast data

Usage

Train with Newton TR-CG:

python NewtonTR_project.py

Train with SGD baseline:

python Num_Opt_SGD.py

Preprocess data:

python Preprocessing_baseball.py

Technical Highlights

Mathematical Foundation

Trust-Region Formulation:

  • Quadratic model approximation of loss landscape
  • Adaptive region sizing based on model quality
  • Guaranteed convergence to local minimum

Hessian-Free Optimization:

  • O(N) Hessian-vector products vs. O(N²) full Hessian
  • Enables second-order methods for large-scale problems
  • Leverages PyTorch automatic differentiation

CG-Steihaug Solver:

  • Iteratively solves trust-region subproblem
  • Truncates at boundary or negative curvature
  • Combines benefits of Newton's method with trust-region safety

Activation Functions

ReLU (Hidden Layers):

f(x) = max(0, x)
  • Computationally efficient
  • Maintains non-linearity
  • Mitigates vanishing gradient

Sigmoid (Output Layer):

f(x) = 1 / (1 + e^(-x))
  • Maps to [0,1] range
  • Ideal for binary classification
  • Provides probabilistic interpretation

Applications

This second-order optimization approach is particularly valuable for:

When to Use Newton TR-CG

  • Limited training budget: When faster convergence matters more than final accuracy
  • Expensive forward passes: When gradient computation dominates training time
  • Small to medium networks: Where Hessian-vector products remain tractable
  • Well-conditioned problems: Where second-order information improves convergence

When to Use SGD

  • Very large networks: Where even O(N) Hessian-vector products become expensive
  • Maximum accuracy required: When that extra 0.5-1% matters
  • Noisy gradients: When mini-batch variance dominates
  • Established pipelines: When using proven, battle-tested optimizers

Extensions and Future Work

1. Quasi-Newton Methods

  • Implement L-BFGS with limited memory Hessian approximation
  • Compare computational efficiency with full Newton TR-CG
  • Evaluate accuracy-speed tradeoffs

2. Stochastic Newton Methods

  • Extend to mini-batch training with subsampled Hessians
  • Implement variance reduction techniques
  • Balance per-iteration cost with convergence speed

3. Hybrid Approaches

  • Warm start with SGD, finish with Newton TR-CG
  • Switch optimizers based on gradient norm
  • Adaptive optimizer selection

4. Larger Datasets

  • Scale to full MLB season (150,000+ samples)
  • Evaluate performance on deeper networks
  • Test computational scalability

5. Alternative Applications

  • Image classification (MNIST, CIFAR-10)
  • Regression tasks (housing prices, stock prediction)
  • Multi-class classification

Theoretical Background

Second-Order Optimization

First-order methods (SGD, Adam):

  • Use only gradient information: ∇f(x)
  • Update: x_{k+1} = x_k - α∇f(x_k)
  • Linear convergence near optimum

Second-order methods (Newton):

  • Use gradient and curvature: ∇f(x), ∇²f(x)
  • Update: x_{k+1} = x_k - [∇²f(x_k)]^(-1) ∇f(x_k)
  • Quadratic convergence near optimum
  • Traditional bottleneck: O(N³) Hessian inversion
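
For intuition, on a quadratic loss a single Newton step lands exactly on the minimizer, while gradient descent crawls along poorly scaled directions; a toy illustration (the ill-conditioned 2-D quadratic is made up for the example):

```python
import torch

# f(x) = 0.5 * x^T A x with an ill-conditioned Hessian A.
A = torch.tensor([[100.0, 0.0], [0.0, 1.0]])
x0 = torch.tensor([1.0, 1.0])

# Gradient descent: x_{k+1} = x_k - alpha * grad f(x_k)
x_gd = x0.clone()
for _ in range(100):
    x_gd = x_gd - 0.005 * (A @ x_gd)

# Newton: x_{k+1} = x_k - [hess f]^{-1} grad f(x_k); one step suffices here.
x_newton = x0 - torch.linalg.inv(A) @ (A @ x0)

print(x_gd)      # still ~0.6 along the low-curvature direction
print(x_newton)  # exactly [0., 0.]
```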

Trust-Region Newton:

  • Avoids Hessian inversion through subproblem solving
  • Adaptive step sizing via trust-region radius
  • Guaranteed descent with sufficient model quality

Why Hessian-Vector Products Work

The key insight is that we never need the full Hessian matrix B_k. We only need to compute products of the form B_k v for various vectors v during CG iterations.

Using automatic differentiation:

∇(∇f · v) = [∇²f] v = Hv

This can be computed with one forward pass and one backward pass, making it only ~2× the cost of a gradient computation instead of N× (where N is the number of parameters).
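
This identity is easy to check numerically on a small function (a throwaway example, unrelated to the project code):

```python
import torch

def f(x):
    return (x[0] ** 2) * x[1] + torch.sin(x[1]) * x[0]

x = torch.tensor([0.7, -1.3], requires_grad=True)
v = torch.tensor([0.5, 2.0])

# Hv via two backward passes: differentiate the scalar (grad f . v).
(g,) = torch.autograd.grad(f(x), x, create_graph=True)
(hv,) = torch.autograd.grad(g.dot(v), x)

# Reference: build the full 2x2 Hessian and multiply.
H = torch.autograd.functional.hessian(f, x.detach())
print(torch.allclose(hv, H @ v))  # True
```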

Limitations

Current Implementation:

  • Single dataset (MLB June 2024)
  • Fixed network architecture
  • No hyperparameter tuning for Newton method
  • Limited comparison (only vs. vanilla SGD)

Newton TR-CG Challenges:

  • More complex implementation than SGD
  • Hyperparameters (initial Δ, expansion/contraction factors) require tuning
  • CG subproblem may require many iterations
  • Memory overhead for CG solver state

References

  1. Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer New York.

    • Theoretical foundation for neural networks and optimization

  2. Nocedal, J., & Wright, S. J. (2006). Numerical Optimization. Springer.

    • Trust-region methods and conjugate gradient algorithms

  3. Cornell Optimization Wiki - Trust-Region Methods

  4. PyBaseball Library - MLB Statcast Data Access

License

This project is available under the MIT License.

Contact

Ryan Lutz
ryanjohnlutz@gmail.com
Duke University - Mechanical Engineering and Materials Science
GitHub

This work was completed as a final project for ME 555 (Numerical Optimization) at Duke University, demonstrating the practical application of advanced optimization theory to machine learning problems.


Acknowledgments: Special thanks to the ME 555 course staff for guidance on trust-region methods and the PyBaseball community for providing accessible MLB data.