OPENBLAS error in cuda_4.3.3.sif #820

Open
xc308 opened this issue May 17, 2024 · 24 comments
Labels
help wanted, needs more info, question

Comments

xc308 commented May 17, 2024

Container image name

rocker/cuda:4.3.3

Container image digest

No response

What operating system are you seeing the problem on?

Linux

System information

  1. Linux bask-pg-login01.cluster.baskerville.ac.uk 4.18.0-513.11.1.el8_9.x86_64 #1 SMP Thu Dec 7 03:06:13 EST 2023 x86_64 x86_64 x86_64 GNU/Linux

  2. [fwzp1184@bask-pg-login01 XC_Work]$ lscpu
    Architecture: x86_64
    CPU op-mode(s): 32-bit, 64-bit
    Byte Order: Little Endian
    CPU(s): 144
    On-line CPU(s) list: 0-143
    Thread(s) per core: 2
    Core(s) per socket: 36
    Socket(s): 2
    NUMA node(s): 2
    Vendor ID: GenuineIntel
    CPU family: 6
    Model: 106
    Model name: Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz
    Stepping: 6
    CPU MHz: 2400.000
    BogoMIPS: 4800.00
    Virtualization: VT-x
    L1d cache: 48K
    L1i cache: 32K
    L2 cache: 1280K
    L3 cache: 55296K
    NUMA node0 CPU(s): 0-35,72-107
    NUMA node1 CPU(s): 36-71,108-143

  3. [fwzp1184@bask-pg-login01 XC_Work]$ cat /proc/meminfo
    MemTotal: 527954288 kB
    MemFree: 502250764 kB
    MemAvailable: 499490632 kB
    Buffers: 5284 kB
    Cached: 5043180 kB
    SwapCached: 21844 kB
    Active: 4296360 kB
    Inactive: 9620012 kB
    Active(anon): 3348860 kB
    Inactive(anon): 9000292 kB
    Active(file): 947500 kB
    Inactive(file): 619720 kB
    Unevictable: 4207544 kB
    Mlocked: 4207544 kB
    SwapTotal: 33554428 kB
    SwapFree: 32450556 kB
    Dirty: 188 kB
    Writeback: 0 kB
    AnonPages: 13041088 kB
    Mapped: 3214484 kB
    Shmem: 3476772 kB
    KReclaimable: 1094028 kB
    Slab: 2450656 kB
    SReclaimable: 1094028 kB
    SUnreclaim: 1356628 kB
    KernelStack: 62560 kB
    PageTables: 193820 kB
    NFS_Unstable: 0 kB
    Bounce: 0 kB
    WritebackTmp: 0 kB
    CommitLimit: 297531572 kB
    Committed_AS: 14419992 kB
    VmallocTotal: 13743895347199 kB
    VmallocUsed: 3079888 kB
    VmallocChunk: 0 kB
    Percpu: 372672 kB
    HardwareCorrupted: 0 kB
    AnonHugePages: 7432192 kB
    ShmemHugePages: 0 kB
    ShmemPmdMapped: 0 kB
    FileHugePages: 0 kB
    FilePmdMapped: 0 kB
    HugePages_Total: 0
    HugePages_Free: 0
    HugePages_Rsvd: 0
    HugePages_Surp: 0
    Hugepagesize: 2048 kB
    Hugetlb: 0 kB
    DirectMap4k: 4634816 kB
    DirectMap2M: 144955392 kB
    DirectMap1G: 389021696 kB

Bug description

I recently encountered a strange error when I submitted my job to the HPC cluster:
"OpenBLAS : Program is Terminated. Because you tried to allocate too many memory regions.
This library was built to support a maximum of 128 threads - either rebuild OpenBLAS
with a larger NUM_THREADS value or set the environment variable OPENBLAS_NUM_THREADS to
a sufficiently small number. This error typically occurs when the software that relies on
OpenBLAS calls BLAS functions from many threads in parallel, or when your computer has more
cpu cores than what OpenBLAS was configured to handle."

I have never encountered such an error when I use simulation data of size 200*5, and the precision matrices are 200*5 by 200*5. But I get this error when I use actual data of size around 3800*5 by 3800*5.

My code offloads large matrix multiplications to one GPU node and returns only the negative log-likelihood scalar, which is computed entirely on the GPU, to the CPU for the subsequent optimization.

After encountering such an error, I followed the instructions of the error and set the environment variable at the beginning of my R scripts. I have tried to set
Sys.setenv(OPENBLAS_NUM_THREADS = "126")
Sys.setenv(OPENBLAS_NUM_THREADS = "1")
but both gave me exactly the same error as above.
When I tried Sys.getenv("OPENBLAS_NUM_THREADS"), I got an empty result, [1] "".
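
A possible reason the in-script setting had no effect, offered as an assumption rather than a diagnosis: OpenBLAS typically reads OPENBLAS_NUM_THREADS once, when the library is loaded, and R loads its BLAS at startup, so a Sys.setenv() call inside the script may arrive too late. The sketch below uses plain shell child processes (no R or OpenBLAS involved) only to illustrate that a process sees whatever value is in its environment at launch.

```shell
#!/bin/sh
# Hypothetical demonstration with plain shell processes; it does not run R or OpenBLAS.
unset OPENBLAS_NUM_THREADS

# Case 1: nothing is set before launch, so the child sees an empty value.
before=$(sh -c 'printf "%s" "${OPENBLAS_NUM_THREADS:-}"')

# Case 2: the variable is placed in the child's environment at launch.
at_launch=$(OPENBLAS_NUM_THREADS=1 sh -c 'printf "%s" "${OPENBLAS_NUM_THREADS:-}"')

echo "without a launch-time setting: '${before}'"
echo "with a launch-time setting:    '${at_launch}'"
```

If this assumption holds, the variable has to be in place before Rscript starts rather than set from inside the script.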

So I'm wondering whether the OpenBLAS library bundled in cuda_4.3.3.sif honours the environment variable OPENBLAS_NUM_THREADS at all. It seems that OpenBLAS does not change its thread count no matter how small a value I set.

In the terminal, I typed echo $OPENBLAS_NUM_THREADS and got 120.

In my slurm job description, I set job allocation parameters as below:
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-task=1
#SBATCH --cpus-per-gpu=36
And the Rscript run command is:
apptainer exec --nv ../cuda_4.3.3.sif Rscript 064a_Optm_GPU_Lon_Strip_1.R

How to reproduce this bug?

To reproduce the error, 
the code submitted to HPC can be found here: https://github.com/xc308/XC_Work/blob/main/064a_Optm_GPU_Lon_Strip_1.R
The data used in the code is df_Lon_Strip_1_Sort.rds, and is in the repository.

The code with simulated data that has run successfully without such an error can be found here: 
https://github.com/xc308/XC_Work/blob/main/060_2D_Inf_neg_logL_CAR_GPU.R

The simulation data used is df_2D_TW_CAMS.rds, and is in the repository as well.
xc308 added the bug label May 17, 2024
Author

xc308 commented May 17, 2024

Full error output is below:

r: 2
OpenBLAS warning: precompiled NUM_THREADS exceeded, adding auxiliary array for thread metadata.
OpenBLAS : Program is Terminated. Because you tried to allocate too many memory regions.
OpenBLAS : Program is Terminated. Because you tried to allocate too many memory regions.
This library was built to support a maximum of 128 threads - either rebuild OpenBLAS
OpenBLAS : Program is Terminated. Because you tried to allocate too many memory regions.
This library was built to support a maximum of 128 threads - either rebuild OpenBLAS
OpenBLAS : Program is Terminated. Because you tried to allocate too many memory regions.
This library was built to support a maximum of 128 threads - either rebuild OpenBLAS
with a larger NUM_THREADS value or set the environment variable OPENBLAS_NUM_THREADS to
OpenBLAS : Program is Terminated. Because you tried to allocate too many memory regions.
This library was built to support a maximum of 128 threads - either rebuild OpenBLAS
with a larger NUM_THREADS value or set the environment variable OPENBLAS_NUM_THREADS to
a sufficiently small number. This error typically occurs when the software that relies on
OpenBLAS calls BLAS functions from many threads in parallel, or when your computer has more
cpu cores than what OpenBLAS was configured to handle.
a sufficiently small number. This error typically occurs when the software that relies on
with a larger NUM_THREADS value or set the environment variable OPENBLAS_NUM_THREADS to
OpenBLAS calls BLAS functions from many threads in parallel, or when your computer has more
cpu cores than what OpenBLAS was configured to handle.
OpenBLAS : Program is Terminated. Because you tried to allocate too many memory regions.
This library was built to support a maximum of 128 threads - either rebuild OpenBLAS
OpenBLAS : Program is Terminated. Because you tried to allocate too many memory regions.
This library was built to support a maximum of 128 threads - either rebuild OpenBLAS
with a larger NUM_THREADS value or set the environment variable OPENBLAS_NUM_THREADS to
a sufficiently small number. This error typically occurs when the software that relies on
OpenBLAS calls BLAS functions from many threads in parallel, or when your computer has more
cpu cores than what OpenBLAS was configured to handle.
OpenBLAS : Program is Terminated. Because you tried to allocate too many memory regions.
with a larger NUM_THREADS value or set the environment variable OPENBLAS_NUM_THREADS to
a sufficiently small number. This error typically occurs when the software that relies on
This library was built to support a maximum of 128 threads - either rebuild OpenBLAS
OpenBLAS : Program is Terminated. Because you tried to allocate too many memory regions.
This library was built to support a maximum of 128 threads - either rebuild OpenBLAS
with a larger NUM_THREADS value or set the environment variable OPENBLAS_NUM_THREADS to
a sufficiently small number. This error typically occurs when the software that relies on
OpenBLAS calls BLAS functions from many threads in parallel, or when your computer has more
cpu cores than what OpenBLAS was configured to handle.
OpenBLAS : Program is Terminated. Because you tried to allocate too many memory regions.
OpenBLAS : Program is Terminated. Because you tried to allocate too many memory regions.
This library was built to support a maximum of 128 threads - either rebuild OpenBLAS
with a larger NUM_THREADS value or set the environment variable OPENBLAS_NUM_THREADS to
a sufficiently small number. This error typically occurs when the software that relies on
OpenBLAS calls BLAS functions from many threads in parallel, or when your computer has more
cpu cores than what OpenBLAS was configured to handle.
OpenBLAS : Program is Terminated. Because you tried to allocate too many memory regions.
This library was built to support a maximum of 128 threads - either rebuild OpenBLAS
OpenBLAS : Program is Terminated. Because you tried to allocate too many memory regions.
with a larger NUM_THREADS value or set the environment variable OPENBLAS_NUM_THREADS to
a sufficiently small number. This error typically occurs when the software that relies on
OpenBLAS calls BLAS functions from many threads in parallel, or when your computer has more
cpu cores than what OpenBLAS was configured to handle.
OpenBLAS : Program is Terminated. Because you tried to allocate too many memory regions.
This library was built to support a maximum of 128 threads - either rebuild OpenBLAS
This library was built to support a maximum of 128 threads - either rebuild OpenBLAS
with a larger NUM_THREADS value or set the environment variable OPENBLAS_NUM_THREADS to
a sufficiently small number. This error typically occurs when the software that relies on
OpenBLAS calls BLAS functions from many threads in parallel, or when your computer has more
cpu cores than what OpenBLAS was configured to handle.
OpenBLAS : Program is Terminated. Because you tried to allocate too many memory regions.
This library was built to support a maximum of 128 threads - either rebuild OpenBLAS
with a larger NUM_THREADS value or set the environment variable OPENBLAS_NUM_THREADS to
a sufficiently small number. This error typically occurs when the software that relies on
OpenBLAS calls BLAS functions from many threads in parallel, or when your computer has more
cpu cores than what OpenBLAS was configured to handle.
with a larger NUM_THREADS value or set the environment variable OPENBLAS_NUM_THREADS to
a sufficiently small number. This error typically occurs when the software that relies on
OpenBLAS calls BLAS functions from many threads in parallel, or when your computer has more
cpu cores than what OpenBLAS was configured to handle.
OpenBLAS : Program is Terminated. Because you tried to allocate too many memory regions.
This library was built to support a maximum of 128 threads - either rebuild OpenBLAS
OpenBLAS : Program is Terminated. Because you tried to allocate too many memory regions.
with a larger NUM_THREADS value or set the environment variable OPENBLAS_NUM_THREADS to
a sufficiently small number. This error typically occurs when the software that relies on
OpenBLAS calls BLAS functions from many threads in parallel, or when your computer has more
with a larger NUM_THREADS value or set the environment variable OPENBLAS_NUM_THREADS to
a sufficiently small number. This error typically occurs when the software that relies on
OpenBLAS calls BLAS functions from many threads in parallel, or when your computer has more
cpu cores than what OpenBLAS was configured to handle.
OpenBLAS : Program is Terminated. Because you tried to allocate too many memory regions.
This library was built to support a maximum of 128 threads - either rebuild OpenBLAS
with a larger NUM_THREADS value or set the environment variable OPENBLAS_NUM_THREADS to
a sufficiently small number. This error typically occurs when the software that relies on
OpenBLAS calls BLAS functions from many threads in parallel, or when your computer has more
cpu cores than what OpenBLAS was configured to handle.
cpu cores than what OpenBLAS was configured to handle.
with a larger NUM_THREADS value or set the environment variable OPENBLAS_NUM_THREADS to
a sufficiently small number. This error typically occurs when the software that relies on
OpenBLAS calls BLAS functions from many threads in parallel, or when your computer has more
cpu cores than what OpenBLAS was configured to handle.
OpenBLAS : Program is Terminated. Because you tried to allocate too many memory regions.
This library was built to support a maximum of 128 threads - either rebuild OpenBLAS
with a larger NUM_THREADS value or set the environment variable OPENBLAS_NUM_THREADS to
a sufficiently small number. This error typically occurs when the software that relies on
OpenBLAS calls BLAS functions from many threads in parallel, or when your computer has more
cpu cores than what OpenBLAS was configured to handle.
This library was built to support a maximum of 128 threads - either rebuild OpenBLAS
with a larger NUM_THREADS value or set the environment variable OPENBLAS_NUM_THREADS to
a sufficiently small number. This error typically occurs when the software that relies on
OpenBLAS calls BLAS functions from many threads in parallel, or when your computer has more
cpu cores than what OpenBLAS was configured to handle.
OpenBLAS : Program is Terminated. Because you tried to allocate too many memory regions.
This library was built to support a maximum of 128 threads - either rebuild OpenBLAS
with a larger NUM_THREADS value or set the environment variable OPENBLAS_NUM_THREADS to
a sufficiently small number. This error typically occurs when the software that relies on
OpenBLAS calls BLAS functions from many threads in parallel, or when your computer has more
cpu cores than what OpenBLAS was configured to handle.
a sufficiently small number. This error typically occurs when the software that relies on
OpenBLAS calls BLAS functions from many threads in parallel, or when your computer has more
cpu cores than what OpenBLAS was configured to handle.
OpenBLAS : Program is Terminated. Because you tried to allocate too many memory regions.
This library was built to support a maximum of 128 threads - either rebuild OpenBLAS
with a larger NUM_THREADS value or set the environment variable OPENBLAS_NUM_THREADS to
a sufficiently small number. This error typically occurs when the software that relies on
OpenBLAS calls BLAS functions from many threads in parallel, or when your computer has more
cpu cores than what OpenBLAS was configured to handle.
OpenBLAS calls BLAS functions from many threads in parallel, or when your computer has more
cpu cores than what OpenBLAS was configured to handle.
OpenBLAS : Program is Terminated. Because you tried to allocate too many memory regions.
This library was built to support a maximum of 128 threads - either rebuild OpenBLAS
with a larger NUM_THREADS value or set the environment variable OPENBLAS_NUM_THREADS to
a sufficiently small number. This error typically occurs when the software that relies on
OpenBLAS calls BLAS functions from many threads in parallel, or when your computer has more
cpu cores than what OpenBLAS was configured to handle.
This library was built to support a maximum of 128 threads - either rebuild OpenBLAS
with a larger NUM_THREADS value or set the environment variable OPENBLAS_NUM_THREADS to
a sufficiently small number. This error typically occurs when the software that relies on
OpenBLAS calls BLAS functions from many threads in parallel, or when your computer has more
cpu cores than what OpenBLAS was configured to handle.
This library was built to support a maximum of 128 threads - either rebuild OpenBLAS
with a larger NUM_THREADS value or set the environment variable OPENBLAS_NUM_THREADS to
a sufficiently small number. This error typically occurs when the software that relies on
OpenBLAS calls BLAS functions from many threads in parallel, or when your computer has more
cpu cores than what OpenBLAS was configured to handle.

*** caught segfault ***
address (nil), cause 'memory not mapped'

Traceback:
1: (function (self) { .Call(_torch_cpp_torch_namespace_linalg_eig_self_Tensor, self)})(self = <pointer: 0x561f81ed45c0>)
2: do.call(fun, args)
3: do_call(f, args)
4: call_c_function(fun_name = "linalg_eig", args = args, expected_types = expected_types, nd_args = nd_args, return_types = return_types, fun_type = "namespace")
5: torch_linalg_eig(A)
6: torch::linalg_eig(x@gm)
7: .local(x)
8: eigen(cov_mat, symmetric = T, only.values = T)
9: eigen(cov_mat, symmetric = T, only.values = T)
10: check_pd_gpu(SG_inv_gpu)
11: TST12_SG_SGInv_CAR_2D_GPU(p = p, data = data_str, A_mat = all_pars_lst[[1]], dsp_lon_mat = dsp_lon_mat, dsp_lat_mat = dsp_lat_mat, dlt_lon_mat = all_pars_lst[[2]], dlt_lat_mat = all_pars_lst[[3]], b = b, phi = phi, H_adj = H_adj, sig2_mat = all_pars_lst[[4]], reg_ini = 1e-09, thres_ini = 0.001)
12: fn(par, ...)
13: (function (par) fn(par, ...))(c(0.2, 0.2, 0.2, 0.2, 0.2, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.5, 0.5, 0.5, 0.5, 0.5, 0.1, 0.1, 0.1, 0.1, 0.1))
14: optim(par = all_ini_Vals, fn = neg_logL_CAR_2D_GPU, p = p, data_str = hierarchy_data_CAMS, all_pars_lst = all_pars_lst_CAR_2D_CMS, dsp_lon_mat = DSP[, , 1], dsp_lat_mat = DSP[, , 2], b = "Tri-Wave", phi = phi, H_adj = H_adj, df = df_Lon_Strp_1_Srt, method = "L-BFGS-B", lower = lower_bound, control = list(maxit = 200, factr = 0.01/.Machine$double.eps))
An irrecoverable exception occurred. R is aborting now ...
/var/spool/slurmd/job742090/slurm_script: line 14: 3946568 Segmentation fault (core dumped) apptainer exec --nv ../cuda_4.3.3.sif Rscript 064a_Optm_GPU_Lon_Strip_1.R
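
The last sentence of the message names two possible triggers: BLAS calls from many threads in parallel, or more CPU cores than the build supports. The second can be checked with a little arithmetic on the lscpu figures from the issue description. This is only a sanity check on the numbers; the thread does not establish which trigger applies, and the job itself was allocated fewer CPUs than the whole node.

```shell
#!/bin/sh
# Values copied from the lscpu output in the issue description.
sockets=2
cores_per_socket=36
threads_per_core=2
# Compiled-in limit quoted in the OpenBLAS error message.
openblas_max_threads=128

logical_cpus=$((sockets * cores_per_socket * threads_per_core))
echo "logical CPUs on the node: ${logical_cpus}"

if [ "${logical_cpus}" -gt "${openblas_max_threads}" ]; then
    excess=$((logical_cpus - openblas_max_threads))
    echo "node has ${excess} more logical CPUs than the OpenBLAS build limit"
else
    excess=0
    echo "node fits within the OpenBLAS build limit"
fi
```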

Contributor

benz0li commented May 17, 2024

@xc308 Can you try to limit by setting OMP_NUM_THREADS?

See also https://scikit-learn.org/stable/computing/parallelism.html#lower-level-parallelism-with-openmp

Contributor

benz0li commented May 17, 2024

and search for similar issues at https://github.com/OpenMathLib/OpenBLAS/issues.

Contributor

benz0li commented May 17, 2024

ℹ️ https://github.com/OpenMathLib/OpenBLAS?#setting-the-number-of-threads-using-environment-variables

Most likely, PyTorch (see footnote 1) is using an OpenMP-enabled OpenBLAS library [which is not the system's OpenBLAS library].

Footnotes

  1. _torch_cpp_torch_namespace_linalg_eig_self_Tensor points to PyTorch

Author

xc308 commented May 17, 2024

@benz0li Hi, I use R not python. I run my Rscript using apptainer:
apptainer exec --nv ../cuda_4.3.3.sif Rscript 064a_Optm_GPU_Lon_Strip_1.R
Do you know how to change the environment variable for the apptainer?

Contributor

benz0li commented May 17, 2024

"Do you know how to change the environment variable for the apptainer?"

https://apptainer.org/docs/user/main/environment_and_metadata.html

Contributor

benz0li commented May 17, 2024

"I use R not python."

Some R packages use Python in the background, e.g. packages tensorflow, torch, etc.

What R packages are you using?

Author

xc308 commented May 17, 2024

"Do you know how to change the environment variable for the apptainer?"

https://apptainer.org/docs/user/main/environment_and_metadata.html

Yeah, this is what I was just reading, and I think I managed to solve the problem.
I set the environment variable for the container by adding the --env flag:

apptainer exec --nv --env OPENBLAS_NUM_THREADS=1 ../cuda_4.3.3.sif Rscript hello.R

The code has now been running for almost an hour with no error so far.
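
For reference, Apptainer also accepts host variables carrying the APPTAINERENV_ prefix as an alternative to the --env flag. The sketch below only prints the two equivalent command lines (a dry run, so it works without Apptainer installed); the variable names `threads`, `image`, and `script` are illustrative, and the value 1 is simply the one reported to work above.

```shell
#!/bin/sh
# Dry-run sketch: build and print two ways of passing a thread limit into the container.
threads=1
image="../cuda_4.3.3.sif"
script="hello.R"

# Option 1: the --env flag, as in the command above.
cmd_flag="apptainer exec --nv --env OPENBLAS_NUM_THREADS=${threads} ${image} Rscript ${script}"

# Option 2: a host variable with the APPTAINERENV_ prefix, which Apptainer
# strips and injects into the container environment.
cmd_prefix="APPTAINERENV_OPENBLAS_NUM_THREADS=${threads} apptainer exec --nv ${image} Rscript ${script}"

echo "${cmd_flag}"
echo "${cmd_prefix}"
```

Either form puts the variable in place before Rscript starts, unlike a Sys.setenv() call inside the script.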

Author

xc308 commented May 17, 2024

"Some R packages use Python in the background, e.g. packages tensorflow, torch, etc. What R packages are you using?"

I use cuda_4.3.3.sif, I'm not sure what R version is, should be the most updated one.

Author

xc308 commented May 17, 2024

I also tried setting OPENBLAS_NUM_THREADS to 5 and 10, but both gave the same error. Do you know why only OPENBLAS_NUM_THREADS=1 works? And what will be the impact of setting it to 1?

Contributor

benz0li commented May 18, 2024

"I use cuda_4.3.3.sif, I'm not sure what R version is, should be the most updated one."

You are using R v4.3.3, then.

But what packages are you loading with library in your R script?

Author

xc308 commented May 18, 2024

"But what packages are you loading with library in your R script?"

I load
library(Matrix)
library(torch)
library(GPUmatrix)

Contributor

benz0li commented May 18, 2024

"I load
library(Matrix)
library(torch)
library(GPUmatrix)"

library(torch), i.e. package torch, uses the 'libtorch' library.

See https://torch.mlverse.org/docs/reference/threads about setting/getting the number of threads in your R script.

Author

xc308 commented May 18, 2024

"library(torch), i.e. package torch, uses the 'libtorch' library.
See https://torch.mlverse.org/docs/reference/threads about setting/getting the number of threads in your R script."

Ah, thank you very much for this useful information!

I called torch_get_num_interop_threads() and torch_get_num_threads() and obtained 72 inter-op threads and 36 intra-op threads.

However, I don't entirely understand this given my Slurm parameter settings:
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-task=1
#SBATCH --cpus-per-gpu=36

When inter op has grabbed all the threads I requested (72 threads), why would the intra op still have 36 threads?

Author

xc308 commented May 18, 2024


Each CPU has 2 threads on my HPC by the way.

Author

xc308 commented May 18, 2024

In addition, I wonder whether the problem is the 72 inter-op threads.

Suppose my algorithm has 50 steps: the first 25 are large matrix multiplications done only on the CPU, and the remaining 25 are offloaded to the GPU.

I'm not sure whether the OpenBLAS error occurs because the inter-op pool has grabbed all 72 threads I requested, so that no threads are left for OpenBLAS to parallelize the routine matrix multiplications across CPUs.

If this understanding is correct, then instead of forcing OpenBLAS onto a single CPU with OPENBLAS_NUM_THREADS=1, which will be very slow for the first 25 steps of large matrix multiplications on the CPU, I could reduce the number of inter-op threads, since there are not many tasks to parallelize (ntasks-per-node=1), and free up more CPUs for OpenBLAS.

Please kindly advise. Thank you very much in advance!

Author

xc308 commented May 18, 2024

I decreased the number of inter-op threads to 2 (default is 72) and intra-op threads to 18 (default is 36), and set the env variable OPENBLAS_NUM_THREADS=2, but still got the same error.

Author

xc308 commented May 18, 2024

I also set OMP_NUM_THREADS=2 on top of OPENBLAS_NUM_THREADS=2, with inter-op threads = 2 and intra-op threads = 18, but still got the same error.

Author

xc308 commented May 18, 2024

I also set inter-op threads = 2, intra-op threads = 2, OMP_NUM_THREADS=2, and OPENBLAS_NUM_THREADS=2, but got the same error. FYI.

Member

eitsupi commented May 21, 2024

@cboettig Could you take a look at this?

eitsupi added the question, needs more info, and help wanted labels and removed the bug label May 21, 2024
Author

xc308 commented May 21, 2024

I also tried this:
library(RhpcBLASctl)
blas_get_num_procs() # 36
blas_set_num_threads(48)
and modified the Slurm parameters to #SBATCH --cpus-per-gpu=48, but still got the same error. FYI.

@cboettig
Member

@xc308 can you try this on rocker/rstudio or similar image from the versioned stack for comparison?

I'm unclear why you are using the cuda images here. The cuda images should indeed have support for NVBLAS if you do want to leverage the GPU (you have to opt into it, and it is not extensively tested). But unless I'm missing something it seems you are just using CPU with openblas, which should work out of the box with the standard rocker/r-ver and rocker/rstudio series.

Can you show the output of sessionInfo() as well? Also, please test if openblas is working for you on some standard linear algebra before we worry about the torch bindings.

I recommend these examples (which also indicate how to opt in for NVBLAS if you want GPU-accelerated linear algebra -- note that it is not always faster, depends on both your hardware and the overhead in copying data onto GPU...)
https://github.com/rocker-org/ml/blob/master/examples/test_blas.R

Author

xc308 commented May 21, 2024

"But unless I'm missing something it seems you are just using CPU with openblas, "

No, if my algorithm has 50 steps, the first 25 steps are done on CPU, but the rest of 25 steps are offloaded to GPU, so I do need the cuda image here.

"Can you show the output of sessionInfo() as well?"

Check Current BLAS Library
R version 4.3.3 (2024-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] RhpcBLASctl_0.23-42 torch_0.12.0 Matrix_1.6-5

loaded via a namespace (and not attached):
[1] processx_3.8.4 bit_4.0.5 compiler_4.3.3 magrittr_2.0.3 cli_3.6.2
[6] Rcpp_1.0.12 bit64_4.0.5 coro_1.0.4 grid_4.3.3 callr_3.7.6
[11] ps_1.7.6 rlang_1.1.3 lattice_0.22-5

"please test if openblas is working for you on some standard linear algebra "

I did test OpenBLAS: when a GPU node is requested, the BLAS thread count is automatically 36 (the same as the intra-op threads). In that case, I have to set the env var OPENBLAS_NUM_THREADS to 1; any other number throws the same error as reported above.

"if you want GPU-accelerated linear algebra"

The first 25 steps of the algorithm involve a few loops, so it's not ideal to offload them to the GPU; they are better left on the CPU. That's why I'm thinking of increasing the BLAS threads to try to speed up this part of the calculation.

@cboettig
Member

@xc308 thanks. I understand you are running a complex algorithm with many steps and it is not working as expected. When trying to debug code, it is helpful to try and reproduce the problem with a minimal example rather than attempt to debug a complex algorithm with many steps and interleaved CPU & GPU dispatch. Please see the simple matrix multiplication examples in the tests I linked above, and see if they are working as expected. If they are not, we can try and debug. If they are working as expected for you on both standard and cuda images, then we will need to further isolate the issue, as it is not specifically an issue with openblas configuration. If that is the case, then please proceed to identify a minimal reproducible example that we can run to generate the behavior you are seeing. Hope this helps.
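
One way to approach the minimal example suggested above is to sweep a few thread counts over a single matrix product and see where the error first appears. The sketch below is a dry run that only prints the commands; the R one-liner and the list of thread counts are illustrative assumptions, not the linked test_blas.R script.

```shell
#!/bin/sh
# Dry-run sketch: print one test command per thread count.
image="../cuda_4.3.3.sif"
# Hypothetical R one-liner: time a single 2000 x 2000 matrix product on the CPU.
r_expr='n <- 2000; m <- matrix(rnorm(n * n), n); print(system.time(m %*% m))'

count=0
for threads in 1 2 4 8; do
    echo "apptainer exec --nv --env OPENBLAS_NUM_THREADS=${threads} ${image} Rscript -e '${r_expr}'"
    count=$((count + 1))
done
echo "${count} commands printed"
```

Running the printed commands on both a standard image and the cuda image would show whether plain OpenBLAS matrix products fail on their own, or only in combination with the torch calls.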
