FlashMP: Fast Discrete Transform-Based Solver for Preconditioning Maxwell's Equations on GPUs


📖 Overview

FlashMP is a high-performance preconditioning system designed for efficiently solving Maxwell's equations on GPUs. This project implements a novel discrete transform-based subdomain exact solver that achieves significant performance improvements for large-scale electromagnetic simulations.

πŸ† Key Achievements

  • Performance: Up to 16× reduction in iteration counts on AMD MI60 GPU clusters
  • Speedup: 2.5× to 4.9× over state-of-the-art libraries such as Hypre
  • Scalability: 84.1% parallel efficiency on 1000 GPUs

🚀 Core Features

Algorithm Innovation

  • Discrete Transform Subdomain Solver: Based on the SVD of forward difference operators
  • Low-Rank Boundary Correction: Using Woodbury formula for boundary condition handling
  • Tensor Product Optimization: Efficient 3D tensor transformation operations
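The tensor-product structure is what keeps the 3D transforms cheap: a Kronecker-product operator of size n³×n³ can be applied as three mode-wise contractions instead of being formed explicitly. A minimal NumPy sketch of this idea (illustrative only, not FlashMP's HIP implementation):

```python
import numpy as np

# Toy sketch: apply (V ⊗ V ⊗ V) to a 3D field by three mode-wise
# contractions rather than building the n^3 x n^3 Kronecker matrix.
# Cost per application drops from O(n^6) to O(n^4).
n = 8
rng = np.random.default_rng(0)
V = rng.standard_normal((n, n))    # stand-in for an SVD factor of D^f
r = rng.standard_normal((n, n, n)) # field on an n^3 grid

# Contract V against each tensor mode in turn.
t = np.einsum('ai,ijk->ajk', V, r)
t = np.einsum('bj,ajk->abk', V, t)
t = np.einsum('ck,abk->abc', V, t)

# Reference: the explicit Kronecker matrix acting on the flattened field.
K = np.kron(np.kron(V, V), V)
assert np.allclose(t.reshape(-1), K @ r.reshape(-1))
```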

GPU Acceleration

  • AMD GPU Support: Optimized for ROCm/HIP platform
  • Memory Management: Pre-allocated GPU memory to avoid allocation overhead
  • Sparse Matrix Optimization: Efficient sparse operations using hipsparse

Parallel Computing

  • PETSc Integration: Full distributed computing support
  • Domain Decomposition: Geometric domain decomposition preconditioner
  • MPI Optimization: Efficient inter-process communication
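As background on the domain-decomposition side, the additive Schwarz construction behind a geometric ASM preconditioner can be sketched on a 1-D model problem. This is a NumPy toy under simplifying assumptions (dense subdomain solves on a 1-D Laplacian), not the geoasm code, which applies the fast transform solver per subdomain:

```python
import numpy as np

# Toy one-level additive Schwarz preconditioner on a 1-D Laplacian.
n, nsub, overlap = 64, 4, 2
A = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

# Overlapping index sets for each subdomain.
size = n // nsub
doms = [np.arange(max(0, d*size - overlap), min(n, (d+1)*size + overlap))
        for d in range(nsub)]

def asm_apply(r):
    """M^{-1} r = sum_d R_d^T A_d^{-1} R_d r (additive Schwarz)."""
    z = np.zeros_like(r)
    for idx in doms:
        z[idx] += np.linalg.solve(A[np.ix_(idx, idx)], r[idx])
    return z

def pcg(A, b, M, tol=1e-10, maxit=200):
    """Preconditioned conjugate gradients with preconditioner action M."""
    x = np.zeros_like(b); r = b.copy(); z = M(r); p = z.copy(); rz = r @ z
    for _ in range(maxit):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p; r -= alpha * Ap
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            break
        z = M(r); rz_new = r @ z
        p = z + (rz_new / rz) * p; rz = rz_new
    return x

rng = np.random.default_rng(2)
b = rng.standard_normal(n)
x = pcg(A, b, asm_apply)
assert np.linalg.norm(b - A @ x) < 1e-8 * np.linalg.norm(b)
```

Increasing `overlap` improves the iteration count at the cost of redundant work, the same trade-off the `-geoasm_overlap` option exposes.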

📁 Project Structure

fdtd/
├── app/                                   # Main applications
│   ├── solveFDTD-DMDA-mg-geoAsm.cpp       # Main solver with FlashMP
│   ├── solveFDTD-DMDA.cpp                 # Simplified solver
│   ├── makefile                           # Build configuration
│   ├── subcast.sbatch                     # Job submission script
│   └── geoasm_subcast.sbatch              # Geometric ASM job script
├── inc/                                   # Header files
│   ├── fast_solve.h                       # FlashMP core algorithm
│   ├── KSPSolve_GMRES_GPU.h               # GPU GMRES solver
│   ├── KSPSolve_GMRES_CPU.h               # CPU GMRES solver
│   └── CudaTimer.h                        # GPU timer utilities
├── src/                                   # Source code
│   ├── fast_solve.cpp                     # FlashMP implementation
│   ├── KSPSolve_GMRES_GPU.cpp             # GPU GMRES implementation
│   ├── KSPSolve_GMRES_CPU.c               # CPU GMRES implementation
│   ├── geoasm.c                           # Geometric ASM preconditioner
│   ├── pbilu_fact_impl.cpp                # Point-block ILU factorization
│   ├── pre_ilu_impl.cpp                   # Pre-ILU implementation
│   ├── precond_impl.cpp                   # Preconditioner implementation
│   └── CudaTimer.cpp                      # GPU timer implementation
└── obj/                                   # Compiled object files

🛠️ Requirements

Hardware Requirements

  • GPU: AMD MI60 or compatible ROCm GPU
  • CPU: Multi-core processor (32+ cores recommended)
  • Memory: 128GB+ recommended
  • Network: High-speed interconnect for multi-node parallel execution

Software Dependencies

  • OS: Linux (CentOS 7.6+ recommended)
  • Compilers: GCC 7.0+, HIPCC (ROCm)
  • MPI: OpenMPI 4.0+
  • Math Libraries:
    • PETSc 3.14+
    • ROCm 4.0+
    • hipblas, hipsparse
    • rocblas

📦 Installation

1. Environment Setup

# Install ROCm (example for ROCm 4.0 on a Debian-based system;
# use the matching RPM packages on CentOS)
wget https://repo.radeon.com/rocm/apt/4.0/pool/main/r/rocm-dkms/rocm-dkms_4.0.0.40100-1_all.deb
sudo dpkg -i rocm-dkms_4.0.0.40100-1_all.deb

# Install PETSc
wget https://ftp.mcs.anl.gov/pub/petsc/release-snapshots/petsc-lite-3.14.0.tar.gz
tar -xzf petsc-lite-3.14.0.tar.gz
cd petsc-3.14.0
./configure --with-hip=1 --with-hipc=hipcc --with-cc=mpicc --with-cxx=mpicxx
make all

2. Build FlashMP

# Clone the repository
git clone https://github.com/MicroZHY/FlashMP.git
cd FlashMP

# Update paths in makefile
vim app/makefile
# Update PETSC_DIR, HIP_BASE_PATH, and other paths

# Compile the project
cd app
make clean
make

# Run a test
mpirun -np 4 ./solveFDTD-DMDA-mg-geoAsm -nnz 35460 -nsize 8 -nn 10 -dt 2.0 -npx 2 -npy 2 -npz 1

3. Verify Installation

# Check GPU availability
rocm-smi

# Run simple test
mpirun -np 1 ./solveFDTD-DMDA-mg-geoAsm -nsize 16 -nn 18 -dt 16.0

🎯 Usage

Basic Usage

# Single GPU execution
mpirun -np 1 ./solveFDTD-DMDA-mg-geoAsm \
    -nsize 32 -nn 34 -dt 16.0 \
    -fD D-34-boundp.txt -fx_g x-64-subd8order.txt \
    -ksp_type gmres -pc_type geoasm

# Multi-GPU parallel execution
mpirun -np 8 ./solveFDTD-DMDA-mg-geoAsm \
    -nsize 32 -nn 34 -dt 16.0 \
    -npx 2 -npy 2 -npz 2 \
    -fD D-34-boundp.txt -fx_g x-64-subd8order.txt \
    -ksp_type gmres -pc_type geoasm -geoasm_overlap 1

Parameter Description

| Parameter | Description | Example Value |
|-----------|-------------|---------------|
| -nsize | Subdomain size | 32 |
| -nn | Total grid size | 34 |
| -dt | Time step size | 16.0 |
| -npx, -npy, -npz | Process distribution | 2, 2, 2 |
| -geoasm_overlap | ASM overlap layers | 1-3 |
| -ksp_type | Solver type | gmres, bcgs |
| -pc_type | Preconditioner type | geoasm |

Advanced Configuration

# Using FlashMP preconditioner
mpirun -np 4 ./solveFDTD-DMDA-mg-geoAsm \
    -nsize 32 -nn 34 -dt 16.0 \
    -npx 2 -npy 2 -npz 1 \
    -ksp_type gmres -pc_type geoasm \
    -geoasm_overlap 2 \
    -ksp_rtol 1.E-12 \
    -ksp_monitor_true_residual

📊 Performance Benchmarks

Test Environment

  • Hardware: AMD MI60 GPU cluster
  • Software: ROCm 4.0, PETSc 3.14
  • Scale: 32³ subdomains per GPU, up to 1000 GPUs

Performance Results

| Configuration | Iterations | Speedup | Parallel Efficiency |
|---------------|------------|---------|---------------------|
| NOPRE | 193 | 1.0× | 63.4% |
| FlashMP (overlap=1) | 20 | 3.05× | 77.8% |
| FlashMP (overlap=2) | 15 | 4.06× | 81.4% |
| FlashMP (overlap=3) | 12 | 4.56× | 84.1% |

Running Benchmarks

# Weak scalability test
for np in 8 64 216 512 1000; do
    mpirun -np $np ./solveFDTD-DMDA-mg-geoAsm \
        -nsize 32 -nn 34 -dt 16.0 \
        -npx $((np/4)) -npy 2 -npz 2 \
        -ksp_type gmres -pc_type geoasm \
        -geoasm_overlap 2
done

🔬 Algorithm Principles

FlashMP Core Algorithm

FlashMP achieves efficient solving through four main steps:

  1. Component Transformation: Using the SVD of the forward difference operators

    D^f = U S V^T
    
  2. Point-wise Field Solving: Decoupling the 3n³×3n³ system into n³ small 3×3 systems

    B_ijk * [e_x, e_y, e_z]^T = [r_x, r_y, r_z]^T
    
  3. Component Inverse Transformation: Restoring original variables

  4. Boundary Error Correction: Using Woodbury formula for boundary conditions
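The four steps can be illustrated on a scalar 1-D analogue. This is a toy NumPy sketch under simplifying assumptions (a shifted Dᵀ D operator stands in for the coupled curl-curl system, and a rank-1 perturbation stands in for the boundary modification); it is not the FlashMP implementation:

```python
import numpy as np

n = 32
Df = np.eye(n, k=1) - np.eye(n)          # forward difference operator
U, s, Vt = np.linalg.svd(Df)             # step 1: D^f = U S V^T
c = 0.5                                  # shift standing in for the 3x3 coupling
A = Df.T @ Df + c * np.eye(n)            # A = V (S^2 + c I) V^T

rng = np.random.default_rng(1)
r = rng.standard_normal(n)

# Steps 1-3: transform, point-wise solve in the diagonalized space,
# inverse transform.
Ainv = lambda b: Vt.T @ ((Vt @ b) / (s**2 + c))
x = Ainv(r)
assert np.allclose(A @ x, r)

# Step 4: boundary correction. If the true subdomain matrix is A + u v^T
# (a low-rank boundary modification), the Woodbury/Sherman-Morrison formula
# yields its solve from Ainv at low extra cost.
u = np.zeros(n); u[0] = 1.0              # hypothetical boundary perturbation
v = np.zeros(n); v[0] = 1.0
Au = Ainv(u)
x_corr = x - Au * (v @ x) / (1.0 + v @ Au)
assert np.allclose((A + np.outer(u, v)) @ x_corr, r)
```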

Complexity Analysis

  • Computational Complexity: O(n⁴) vs O(n⁶) for traditional methods
  • Memory Complexity: O(n⁴) vs O(n⁶) for traditional methods
  • Measured Reduction: 128× less computation and 322× less memory
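As a rough per-subdomain accounting (assuming mode-wise transform applications against a dense factored inverse as the baseline):

```latex
% Applying (V \otimes V \otimes V) to an n^3-point field: three mode-wise
% contractions, each an (n \times n) matrix times n^2 columns.
\text{transform cost} = 3 \cdot n^2 \cdot n^2 = O(n^4),
\qquad
\text{dense inverse cost} = (n^3)^2 = O(n^6).
```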

📚 Related Papers

🤝 Contributing

We welcome contributions of all kinds!

How to Contribute

  1. Fork the project
  2. Create a feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

Reporting Issues

  • Use GitHub Issues to report bugs
  • Provide detailed reproduction steps
  • Include system information and error logs

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Thanks to the Strategic Priority Research Program of Chinese Academy of Sciences (Grant NO.XDB0500101)
  • Thanks to AMD for providing GPU hardware support
  • Thanks to the PETSc development team for technical support
  • Thanks to all contributors and users for feedback

📞 Contact

🔗 Related Links


⭐ If this project helps you, please give us a star!
