Fixed the wrong results bug in the GPU backend. #139

Open · kabicm wants to merge 1 commit into master

Conversation

kabicm (Collaborator) commented on Feb 7, 2024

As @simonpintarelli reported, some of the unit tests arising from the RPA simulation were failing with the GPU backend:

 OMP_NUM_THREADS=1 CRAY_CUDA_MPS=1  srun -u -N 1 -n 8 ./miniapp/pxgemm_miniapp -m 43417 -k 2170 -n 217  --test --transpose NN -r 1

Running PDGEMM on the following problem:
=============================
      GLOBAL MAT. SIZES
=============================
A = 43417 x 2170
B = 2170 x 217
C = 43417 x 217
=============================
        SUBMATRICES
=============================
(ia, ja) = (1, 1)
(ib, jb) = (1, 1)
(ic, jc) = (1, 1)
=============================
      SUBMATRIX SIZES
=============================
m = 43417
n = 217
k = 2170
=============================
      ADDITIONAL OPTIONS
=============================
alpha = 1
beta = 0
trans_a = N
trans_b = N
=============================
         PROC GRID
=============================
grid = 1 x 8
grid order = R
=============================
         PROC SRCS
=============================
P_SRC(A) = (0, 0)
P_SRC(B) = (0, 0)
P_SRC(C) = (0, 0)
=============================
          BLOCK SIZES
=============================
Blocks(A) = (128, 128)
Blocks(B) = (128, 128)
Blocks(C) = (128, 128)
=============================
          LEADING DIMS
=============================
lld_a = 43417
lld_b = 2170
lld_c = 43417
=============================

epsilon = 1e-06, v1 = 42.5759, which is != 528.075
epsilon = 1e-06, v1 = 43.1292, which is != 528.41
COSMA TIMES [ms] = 484
SCALAPACK TIMES [ms] = 571
Result is NOT CORRECT!

The bug occurred only when the GPU backend was used. After a careful analysis, @simonpintarelli and I realized that the problem boils down to the following local multiplications, executed multiple times:

m = 5428, n = 217, k = 2170, alpha = 1, beta = 0, copy_c_back = T, tile sizes = 5000
m = 5427, n = 217, k = 2170, alpha = 1, beta = 0, copy_c_back = T, tile sizes = 5000

The bug occurred in the GPU backend only when a matrix dimension was slightly larger than the GPU tile size, as described here.
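
For illustration, here is a minimal standalone sketch (not the actual Tiled-MM code; the helper name is ours) of how a dimension gets split into tiles. With m = 5428 and the tile size of 5000 from the reproducer above, the split yields one full tile plus a small remainder tile of 428 rows, which is exactly the "slightly larger than the tile size" case that triggered the bug:

#include <algorithm>
#include <cstdio>
#include <vector>

// Split a matrix dimension into chunks of at most tile_size elements.
std::vector<int> split_dimension(int dim, int tile_size) {
    std::vector<int> tiles;
    for (int offset = 0; offset < dim; offset += tile_size) {
        tiles.push_back(std::min(tile_size, dim - offset));
    }
    return tiles;
}

int main() {
    // Dimensions taken from the failing local multiplication above.
    int m = 5428;
    int tile_size = 5000;

    // Prints a full tile of 5000 followed by the small remainder tile of 428.
    for (int t : split_dimension(m, tile_size)) {
        std::printf("tile of size %d\n", t);
    }
}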

We fixed this bug in the GPU backend in the latest PR.

After updating the Tiled-MM submodule to the latest version, we verified that the problem is resolved:

OMP_NUM_THREADS=1 CRAY_CUDA_MPS=1  srun -u -N 1 -n 8 ./miniapp/pxgemm_miniapp -m 43417 -k 2170 -n 217  --test --transpose NN -r 1

Running PDGEMM on the following problem:
=============================
      GLOBAL MAT. SIZES
=============================
A = 43417 x 2170
B = 2170 x 217
C = 43417 x 217
=============================
        SUBMATRICES
=============================
(ia, ja) = (1, 1)
(ib, jb) = (1, 1)
(ic, jc) = (1, 1)
=============================
      SUBMATRIX SIZES
=============================
m = 43417
n = 217
k = 2170
=============================
      ADDITIONAL OPTIONS
=============================
alpha = 1
beta = 0
trans_a = N
trans_b = N
=============================
         PROC GRID
=============================
grid = 1 x 8
grid order = R
=============================
         PROC SRCS
=============================
P_SRC(A) = (0, 0)
P_SRC(B) = (0, 0)
P_SRC(C) = (0, 0)
=============================
          BLOCK SIZES
=============================
Blocks(A) = (128, 128)
Blocks(B) = (128, 128)
Blocks(C) = (128, 128)
=============================
          LEADING DIMS
=============================
lld_a = 43417
lld_b = 2170
lld_c = 43417
=============================

COSMA TIMES [ms] = 304
SCALAPACK TIMES [ms] = 444
Result is CORRECT!

This has been tested on RTX 3090 GPUs.

simonpintarelli (Member) commented on Feb 23, 2024:

cscs-ci run P100

1 similar comment from simonpintarelli:

cscs-ci run P100
