OPENBLAS error in cuda_4.3.3.sif #820

xc308 · 2024-05-17T15:41:55Z

Container image name

rocker/cuda:4.3.3

Container image digest

No response

What operating system are you seeing the problem on?

Linux

System information

Linux bask-pg-login01.cluster.baskerville.ac.uk 4.18.0-513.11.1.el8_9.x86_64 #1 SMP Thu Dec 7 03:06:13 EST 2023 x86_64 x86_64 x86_64 GNU/Linux
[fwzp1184@bask-pg-login01 XC_Work]$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 144
On-line CPU(s) list: 0-143
Thread(s) per core: 2
Core(s) per socket: 36
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 106
Model name: Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz
Stepping: 6
CPU MHz: 2400.000
BogoMIPS: 4800.00
Virtualization: VT-x
L1d cache: 48K
L1i cache: 32K
L2 cache: 1280K
L3 cache: 55296K
NUMA node0 CPU(s): 0-35,72-107
NUMA node1 CPU(s): 36-71,108-143
[fwzp1184@bask-pg-login01 XC_Work]$ cat /proc/meminfo
MemTotal: 527954288 kB
MemFree: 502250764 kB
MemAvailable: 499490632 kB
Buffers: 5284 kB
Cached: 5043180 kB
SwapCached: 21844 kB
Active: 4296360 kB
Inactive: 9620012 kB
Active(anon): 3348860 kB
Inactive(anon): 9000292 kB
Active(file): 947500 kB
Inactive(file): 619720 kB
Unevictable: 4207544 kB
Mlocked: 4207544 kB
SwapTotal: 33554428 kB
SwapFree: 32450556 kB
Dirty: 188 kB
Writeback: 0 kB
AnonPages: 13041088 kB
Mapped: 3214484 kB
Shmem: 3476772 kB
KReclaimable: 1094028 kB
Slab: 2450656 kB
SReclaimable: 1094028 kB
SUnreclaim: 1356628 kB
KernelStack: 62560 kB
PageTables: 193820 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 297531572 kB
Committed_AS: 14419992 kB
VmallocTotal: 13743895347199 kB
VmallocUsed: 3079888 kB
VmallocChunk: 0 kB
Percpu: 372672 kB
HardwareCorrupted: 0 kB
AnonHugePages: 7432192 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
FileHugePages: 0 kB
FilePmdMapped: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 0 kB
DirectMap4k: 4634816 kB
DirectMap2M: 144955392 kB
DirectMap1G: 389021696 kB

Bug description

I recently hit a strange error when I submitted my job to the HPC cluster:
"OpenBLAS : Program is Terminated. Because you tried to allocate too many memory regions.
This library was built to support a maximum of 128 threads - either rebuild OpenBLAS with a larger NUM_THREADS value or set the environment variable OPENBLAS_NUM_THREADS to a sufficiently small number. This error typically occurs when the software that relies on OpenBLAS calls BLAS functions from many threads in parallel, or when your computer has more CPU cores than what OpenBLAS was configured to handle."

I never saw this error with simulated data of size 200*5, where the precision matrices are 200*5 by 200*5. But I do get it with the actual data, at a size of around 3800*5 by 3800*5.

My code offloads the large matrix multiplications to one GPU node. Only the negative log-likelihood scalar, which is computed entirely on the GPU, is returned to the CPU for the subsequent optimization.

After hitting this error, I followed its instructions and set the environment variable at the beginning of my R script. I tried both
Sys.setenv(OPENBLAS_NUM_THREADS = "126")
Sys.setenv(OPENBLAS_NUM_THREADS = "1")
but each gave exactly the same error as above.
When I ran Sys.getenv("OPENBLAS_NUM_THREADS"), I got an empty result, [1] "".

So I'm wondering whether the OpenBLAS library bundled in cuda_4.3.3.sif honours the environment variable OPENBLAS_NUM_THREADS at all. It seems that OpenBLAS does not change its thread count no matter how small a value I set.

In the terminal, I typed echo $OPENBLAS_NUM_THREADS and got 120.

In my slurm job description, I set job allocation parameters as below:
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-task=1
#SBATCH --cpus-per-gpu=36
And the Rscript run command is:
apptainer exec --nv ../cuda_4.3.3.sif Rscript 064a_Optm_GPU_Lon_Strip_1.R

How to reproduce this bug?

To reproduce the error, use the code submitted to the HPC cluster: https://github.com/xc308/XC_Work/blob/main/064a_Optm_GPU_Lon_Strip_1.R
The data used in the code is df_Lon_Strip_1_Sort.rds, which is in the repository.

The code with simulated data that has run successfully without such an error can be found here: 
https://github.com/xc308/XC_Work/blob/main/060_2D_Inf_neg_logL_CAR_GPU.R

The simulation data used is df_2D_TW_CAMS.rds, and is in the repository as well.

xc308 · 2024-05-17T16:09:22Z

Full error output is below:

r: 2
OpenBLAS warning: precompiled NUM_THREADS exceeded, adding auxiliary array for thread metadata.
OpenBLAS : Program is Terminated. Because you tried to allocate too many memory regions.
OpenBLAS : Program is Terminated. Because you tried to allocate too many memory regions.
This library was built to support a maximum of 128 threads - either rebuild OpenBLAS
OpenBLAS : Program is Terminated. Because you tried to allocate too many memory regions.
This library was built to support a maximum of 128 threads - either rebuild OpenBLAS
OpenBLAS : Program is Terminated. Because you tried to allocate too many memory regions.
This library was built to support a maximum of 128 threads - either rebuild OpenBLAS
with a larger NUM_THREADS value or set the environment variable OPENBLAS_NUM_THREADS to
OpenBLAS : Program is Terminated. Because you tried to allocate too many memory regions.
This library was built to support a maximum of 128 threads - either rebuild OpenBLAS
with a larger NUM_THREADS value or set the environment variable OPENBLAS_NUM_THREADS to
a sufficiently small number. This error typically occurs when the software that relies on
OpenBLAS calls BLAS functions from many threads in parallel, or when your computer has more
cpu cores than what OpenBLAS was configured to handle.
a sufficiently small number. This error typically occurs when the software that relies on
with a larger NUM_THREADS value or set the environment variable OPENBLAS_NUM_THREADS to
OpenBLAS calls BLAS functions from many threads in parallel, or when your computer has more
cpu cores than what OpenBLAS was configured to handle.
OpenBLAS : Program is Terminated. Because you tried to allocate too many memory regions.
This library was built to support a maximum of 128 threads - either rebuild OpenBLAS
OpenBLAS : Program is Terminated. Because you tried to allocate too many memory regions.
This library was built to support a maximum of 128 threads - either rebuild OpenBLAS
with a larger NUM_THREADS value or set the environment variable OPENBLAS_NUM_THREADS to
a sufficiently small number. This error typically occurs when the software that relies on
OpenBLAS calls BLAS functions from many threads in parallel, or when your computer has more
cpu cores than what OpenBLAS was configured to handle.
OpenBLAS : Program is Terminated. Because you tried to allocate too many memory regions.
with a larger NUM_THREADS value or set the environment variable OPENBLAS_NUM_THREADS to
a sufficiently small number. This error typically occurs when the software that relies on
This library was built to support a maximum of 128 threads - either rebuild OpenBLAS
OpenBLAS : Program is Terminated. Because you tried to allocate too many memory regions.
This library was built to support a maximum of 128 threads - either rebuild OpenBLAS
with a larger NUM_THREADS value or set the environment variable OPENBLAS_NUM_THREADS to
a sufficiently small number. This error typically occurs when the software that relies on
OpenBLAS calls BLAS functions from many threads in parallel, or when your computer has more
cpu cores than what OpenBLAS was configured to handle.
OpenBLAS : Program is Terminated. Because you tried to allocate too many memory regions.
OpenBLAS : Program is Terminated. Because you tried to allocate too many memory regions.
This library was built to support a maximum of 128 threads - either rebuild OpenBLAS
with a larger NUM_THREADS value or set the environment variable OPENBLAS_NUM_THREADS to
a sufficiently small number. This error typically occurs when the software that relies on
OpenBLAS calls BLAS functions from many threads in parallel, or when your computer has more
cpu cores than what OpenBLAS was configured to handle.
OpenBLAS : Program is Terminated. Because you tried to allocate too many memory regions.
This library was built to support a maximum of 128 threads - either rebuild OpenBLAS
OpenBLAS : Program is Terminated. Because you tried to allocate too many memory regions.
with a larger NUM_THREADS value or set the environment variable OPENBLAS_NUM_THREADS to
a sufficiently small number. This error typically occurs when the software that relies on
OpenBLAS calls BLAS functions from many threads in parallel, or when your computer has more
cpu cores than what OpenBLAS was configured to handle.
OpenBLAS : Program is Terminated. Because you tried to allocate too many memory regions.
This library was built to support a maximum of 128 threads - either rebuild OpenBLAS
This library was built to support a maximum of 128 threads - either rebuild OpenBLAS
with a larger NUM_THREADS value or set the environment variable OPENBLAS_NUM_THREADS to
a sufficiently small number. This error typically occurs when the software that relies on
OpenBLAS calls BLAS functions from many threads in parallel, or when your computer has more
cpu cores than what OpenBLAS was configured to handle.
OpenBLAS : Program is Terminated. Because you tried to allocate too many memory regions.
This library was built to support a maximum of 128 threads - either rebuild OpenBLAS
with a larger NUM_THREADS value or set the environment variable OPENBLAS_NUM_THREADS to
a sufficiently small number. This error typically occurs when the software that relies on
OpenBLAS calls BLAS functions from many threads in parallel, or when your computer has more
cpu cores than what OpenBLAS was configured to handle.
with a larger NUM_THREADS value or set the environment variable OPENBLAS_NUM_THREADS to
a sufficiently small number. This error typically occurs when the software that relies on
OpenBLAS calls BLAS functions from many threads in parallel, or when your computer has more
cpu cores than what OpenBLAS was configured to handle.
OpenBLAS : Program is Terminated. Because you tried to allocate too many memory regions.
This library was built to support a maximum of 128 threads - either rebuild OpenBLAS
OpenBLAS : Program is Terminated. Because you tried to allocate too many memory regions.
with a larger NUM_THREADS value or set the environment variable OPENBLAS_NUM_THREADS to
a sufficiently small number. This error typically occurs when the software that relies on
OpenBLAS calls BLAS functions from many threads in parallel, or when your computer has more
with a larger NUM_THREADS value or set the environment variable OPENBLAS_NUM_THREADS to
a sufficiently small number. This error typically occurs when the software that relies on
OpenBLAS calls BLAS functions from many threads in parallel, or when your computer has more
cpu cores than what OpenBLAS was configured to handle.
OpenBLAS : Program is Terminated. Because you tried to allocate too many memory regions.
This library was built to support a maximum of 128 threads - either rebuild OpenBLAS
with a larger NUM_THREADS value or set the environment variable OPENBLAS_NUM_THREADS to
a sufficiently small number. This error typically occurs when the software that relies on
OpenBLAS calls BLAS functions from many threads in parallel, or when your computer has more
cpu cores than what OpenBLAS was configured to handle.
cpu cores than what OpenBLAS was configured to handle.
with a larger NUM_THREADS value or set the environment variable OPENBLAS_NUM_THREADS to
a sufficiently small number. This error typically occurs when the software that relies on
OpenBLAS calls BLAS functions from many threads in parallel, or when your computer has more
cpu cores than what OpenBLAS was configured to handle.
OpenBLAS : Program is Terminated. Because you tried to allocate too many memory regions.
This library was built to support a maximum of 128 threads - either rebuild OpenBLAS
with a larger NUM_THREADS value or set the environment variable OPENBLAS_NUM_THREADS to
a sufficiently small number. This error typically occurs when the software that relies on
OpenBLAS calls BLAS functions from many threads in parallel, or when your computer has more
cpu cores than what OpenBLAS was configured to handle.
This library was built to support a maximum of 128 threads - either rebuild OpenBLAS
with a larger NUM_THREADS value or set the environment variable OPENBLAS_NUM_THREADS to
a sufficiently small number. This error typically occurs when the software that relies on
OpenBLAS calls BLAS functions from many threads in parallel, or when your computer has more
cpu cores than what OpenBLAS was configured to handle.
OpenBLAS : Program is Terminated. Because you tried to allocate too many memory regions.
This library was built to support a maximum of 128 threads - either rebuild OpenBLAS
with a larger NUM_THREADS value or set the environment variable OPENBLAS_NUM_THREADS to
a sufficiently small number. This error typically occurs when the software that relies on
OpenBLAS calls BLAS functions from many threads in parallel, or when your computer has more
cpu cores than what OpenBLAS was configured to handle.
a sufficiently small number. This error typically occurs when the software that relies on
OpenBLAS calls BLAS functions from many threads in parallel, or when your computer has more
cpu cores than what OpenBLAS was configured to handle.
OpenBLAS : Program is Terminated. Because you tried to allocate too many memory regions.
This library was built to support a maximum of 128 threads - either rebuild OpenBLAS
with a larger NUM_THREADS value or set the environment variable OPENBLAS_NUM_THREADS to
a sufficiently small number. This error typically occurs when the software that relies on
OpenBLAS calls BLAS functions from many threads in parallel, or when your computer has more
cpu cores than what OpenBLAS was configured to handle.
OpenBLAS calls BLAS functions from many threads in parallel, or when your computer has more
cpu cores than what OpenBLAS was configured to handle.
OpenBLAS : Program is Terminated. Because you tried to allocate too many memory regions.
This library was built to support a maximum of 128 threads - either rebuild OpenBLAS
with a larger NUM_THREADS value or set the environment variable OPENBLAS_NUM_THREADS to
a sufficiently small number. This error typically occurs when the software that relies on
OpenBLAS calls BLAS functions from many threads in parallel, or when your computer has more
cpu cores than what OpenBLAS was configured to handle.
This library was built to support a maximum of 128 threads - either rebuild OpenBLAS
with a larger NUM_THREADS value or set the environment variable OPENBLAS_NUM_THREADS to
a sufficiently small number. This error typically occurs when the software that relies on
OpenBLAS calls BLAS functions from many threads in parallel, or when your computer has more
cpu cores than what OpenBLAS was configured to handle.
This library was built to support a maximum of 128 threads - either rebuild OpenBLAS
with a larger NUM_THREADS value or set the environment variable OPENBLAS_NUM_THREADS to
a sufficiently small number. This error typically occurs when the software that relies on
OpenBLAS calls BLAS functions from many threads in parallel, or when your computer has more
cpu cores than what OpenBLAS was configured to handle.

*** caught segfault ***
address (nil), cause 'memory not mapped'

Traceback:
1: (function (self) { .Call(_torch_cpp_torch_namespace_linalg_eig_self_Tensor, self)})(self = <pointer: 0x561f81ed45c0>)
2: do.call(fun, args)
3: do_call(f, args)
4: call_c_function(fun_name = "linalg_eig", args = args, expected_types = expected_types, nd_args = nd_args, return_types = return_types, fun_type = "namespace")
5: torch_linalg_eig(A)
6: torch::linalg_eig(x@gm)
7: .local(x)
8: eigen(cov_mat, symmetric = T, only.values = T)
9: eigen(cov_mat, symmetric = T, only.values = T)
10: check_pd_gpu(SG_inv_gpu)
11: TST12_SG_SGInv_CAR_2D_GPU(p = p, data = data_str, A_mat = all_pars_lst[[1]], dsp_lon_mat = dsp_lon_mat, dsp_lat_mat = dsp_lat_mat, dlt_lon_mat = all_pars_lst[[2]], dlt_lat_mat = all_pars_lst[[3]], b = b, phi = phi, H_adj = H_adj, sig2_mat = all_pars_lst[[4]], reg_ini = 1e-09, thres_ini = 0.001)
12: fn(par, ...)
13: (function (par) fn(par, ...))(c(0.2, 0.2, 0.2, 0.2, 0.2, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.5, 0.5, 0.5, 0.5, 0.5, 0.1, 0.1, 0.1, 0.1, 0.1))
14: optim(par = all_ini_Vals, fn = neg_logL_CAR_2D_GPU, p = p, data_str = hierarchy_data_CAMS, all_pars_lst = all_pars_lst_CAR_2D_CMS, dsp_lon_mat = DSP[, , 1], dsp_lat_mat = DSP[, , 2], b = "Tri-Wave", phi = phi, H_adj = H_adj, df = df_Lon_Strp_1_Srt, method = "L-BFGS-B", lower = lower_bound, control = list(maxit = 200, factr = 0.01/.Machine$double.eps))
An irrecoverable exception occurred. R is aborting now ...
/var/spool/slurmd/job742090/slurm_script: line 14: 3946568 Segmentation fault (core dumped) apptainer exec --nv ../cuda_4.3.3.sif Rscript 064a_Optm_GPU_Lon_Strip_1.R

benz0li · 2024-05-17T17:40:16Z

@xc308 Can you try to limit by setting OMP_NUM_THREADS?

See also https://scikit-learn.org/stable/computing/parallelism.html#lower-level-parallelism-with-openmp

benz0li · 2024-05-17T17:47:30Z

and search for similar issues at https://github.com/OpenMathLib/OpenBLAS/issues.

benz0li · 2024-05-17T17:50:11Z

ℹ️ https://github.com/OpenMathLib/OpenBLAS?#setting-the-number-of-threads-using-environment-variables

Most likely, PyTorch¹ is using an OpenMP-enabled OpenBLAS library [which is not the system's OpenBLAS library].

¹ _torch_cpp_torch_namespace_linalg_eig_self_Tensor points to PyTorch.

xc308 · 2024-05-17T18:17:46Z

@benz0li Hi, I use R, not Python. I run my R script with Apptainer:
apptainer exec --nv ../cuda_4.3.3.sif Rscript 064a_Optm_GPU_Lon_Strip_1.R
Do you know how to set an environment variable for the Apptainer container?

benz0li · 2024-05-17T18:19:55Z

> Do you know how to change the environment variable for the apptainer?

https://apptainer.org/docs/user/main/environment_and_metadata.html

benz0li · 2024-05-17T18:26:03Z

> I use R not python.

Some R packages use Python in the background, e.g. packages tensorflow, torch, etc.

What R packages are you using?

xc308 · 2024-05-17T19:26:23Z

> Do you know how to change the environment variable for the apptainer?
>
> https://apptainer.org/docs/user/main/environment_and_metadata.html

Yes, this is what I was just reading, and I think I have managed to solve the problem.
I set the environment variable for Apptainer by adding the --env flag:

apptainer exec --nv --env OPENBLAS_NUM_THREADS=1 ../cuda_4.3.3.sif Rscript hello.R

The code has now been running for almost 1 hour with no error so far.
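A minimal sketch of how the --env fix above might be wrapped for a batch script, so that the thread count is chosen in one place. The helper names clamp_threads and build_cmd and the BLAS_THREADS variable are hypothetical and not part of Apptainer or OpenBLAS. The cap of 128 is taken from the "maximum of 128 threads" line in the error message, and the default of 1 reflects the only value reported to work in this thread. The command is only printed here, not executed.

```shell
#!/bin/sh
# Hypothetical helper: clamp a requested thread count to the range [1, max].
# The default max of 128 mirrors the build limit quoted in the error message.
clamp_threads() {
    requested=$1
    max=${2:-128}
    case $requested in
        ''|*[!0-9]*) echo 1; return ;;   # empty or non-numeric: fall back to 1
    esac
    if [ "$requested" -lt 1 ]; then
        echo 1
    elif [ "$requested" -gt "$max" ]; then
        echo "$max"
    else
        echo "$requested"
    fi
}

# Hypothetical helper: print the apptainer command with the clamped value
# passed through --env, as in the working command above.
build_cmd() {
    threads=$(clamp_threads "$1")
    printf 'apptainer exec --nv --env OPENBLAS_NUM_THREADS=%s %s Rscript %s\n' \
        "$threads" "$2" "$3"
}

# BLAS_THREADS is an assumed variable name; unset means 1.
build_cmd "${BLAS_THREADS:-1}" ../cuda_4.3.3.sif 064a_Optm_GPU_Lon_Strip_1.R
```

Passing the value on the apptainer command line sets it before R starts inside the container, which is the one approach reported to work in this thread. Whether values larger than 1 are safe in this image is not established here.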

xc308 · 2024-05-17T19:27:09Z

> What R packages are you using?

I use cuda_4.3.3.sif. I'm not sure which R version it contains; it should be the most recent one.

xc308 · 2024-05-17T19:28:43Z

I also tried setting OPENBLAS_NUM_THREADS to 5 and 10, but both gave the same error. Do you know why only OPENBLAS_NUM_THREADS=1 works? And what is the impact of setting it to 1?

benz0li · 2024-05-18T03:56:28Z

> I use cuda_4.3.3.sif, I'm not sure what R version is, should be the most updated one.

You are using R v4.3.3, then.

But what packages are you loading with library in your R script?

xc308 · 2024-05-18T11:38:09Z

> But what packages are you loading with library in your R script?

I load
library(Matrix)
library(torch)
library(GPUmatrix)

benz0li · 2024-05-18T14:39:06Z

> I load
> library(Matrix)
> library(torch)
> library(GPUmatrix)

library(torch), i.e. package torch, uses the 'libtorch' library.

See https://torch.mlverse.org/docs/reference/threads about setting/getting the number of threads in your R script.

xc308 · 2024-05-18T21:43:50Z

> See https://torch.mlverse.org/docs/reference/threads about setting/getting the number of threads in your R script.

Ah, thank you very much for this useful information!

I called torch_get_num_interop_threads() and torch_get_num_threads() and obtained 72 inter-op threads and 36 intra-op threads.

However, I don't entirely understand this, given my Slurm settings:
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-task=1
#SBATCH --cpus-per-gpu=36

When inter-op has grabbed all the threads I requested (72 threads), why would intra-op still have 36 threads?

xc308 · 2024-05-18T21:45:06Z

> When inter op has grabbed all the threads I requested (72 threads), why would the intra op still have 36 threads?

Each CPU core has 2 threads on my HPC, by the way.

xc308 · 2024-05-18T22:03:44Z

In addition, I wonder whether the 72 inter-op threads are the problem.

Suppose my algorithm has 50 steps: the first 25 are large matrix multiplications done only on the CPU, and the remaining 25 are offloaded to the GPU.

I'm not sure whether the OpenBLAS error occurs because inter-op has grabbed all 72 threads I requested, leaving none for OpenBLAS to parallelize the routine matrix multiplications across CPUs.

If this understanding is correct, then instead of forcing OpenBLAS onto a single CPU with OPENBLAS_NUM_THREADS=1, which will be very slow for the first 25 steps of large matrix multiplications on the CPU, I could reduce the inter-op threads, since there are not many tasks to parallelize (ntasks-per-node=1), and free up more CPUs for OpenBLAS.

Please kindly advise. Thank you very much in advance!

xc308 · 2024-05-18T22:45:25Z

I decreased the number of inter-op threads to 2 (default is 72) and intra-op threads to 18 (default is 36), and set the environment variable OPENBLAS_NUM_THREADS=2, but still got the same error.

xc308 · 2024-05-18T23:03:25Z

I also set OMP_NUM_THREADS=2 on top of OPENBLAS_NUM_THREADS=2, with inter-op threads = 2 and intra-op threads = 18, but still got the same error.

xc308 · 2024-05-18T23:15:16Z

I set inter-op threads = 2, intra-op threads = 2, OMP_NUM_THREADS=2 and OPENBLAS_NUM_THREADS=2, but got the same error. FYI.

eitsupi · 2024-05-21T14:20:23Z

@cboettig Could you take a look at this?

xc308 · 2024-05-21T14:27:52Z

I also tried this:
library(RhpcBLASctl)
blas_get_num_procs() # 36
blas_set_num_threads(48)
and changed the Slurm parameter to #SBATCH --cpus-per-gpu=48, but still got the same error. FYI.

cboettig · 2024-05-21T15:47:38Z

@xc308 can you try this on rocker/rstudio or similar image from the versioned stack for comparison?

I'm unclear why you are using the cuda images here. The cuda images should indeed have support for NVBLAS if you do want to leverage the GPU (you have to opt into it, and it is not extensively tested). But unless I'm missing something, it seems you are just using the CPU with openblas, which should work out of the box on the standard rocker/r-ver, rocker/rstudio series.

Can you show the output of sessionInfo() as well? Also, please test if openblas is working for you on some standard linear algebra before we worry about the torch bindings.

I recommend these examples (which also indicate how to opt in for NVBLAS if you want GPU-accelerated linear algebra -- note that it is not always faster, depends on both your hardware and the overhead in copying data onto GPU...)
https://github.com/rocker-org/ml/blob/master/examples/test_blas.R

xc308 · 2024-05-21T22:13:37Z

"But unless I'm missing something it seems you are just using CPU with openblas, "

No. If my algorithm has 50 steps, the first 25 are done on the CPU and the remaining 25 are offloaded to the GPU, so I do need the cuda image here.

"Can you show the output of sessionInfo() as well?"

Check Current BLAS Library
R version 4.3.3 (2024-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] RhpcBLASctl_0.23-42 torch_0.12.0 Matrix_1.6-5

loaded via a namespace (and not attached):
[1] processx_3.8.4 bit_4.0.5 compiler_4.3.3 magrittr_2.0.3 cli_3.6.2
[6] Rcpp_1.0.12 bit64_4.0.5 coro_1.0.4 grid_4.3.3 callr_3.7.6
[11] ps_1.7.6 rlang_1.1.3 lattice_0.22-5

"please test if openblas is working for you on some standard linear algebra "

I did test OpenBLAS: when a GPU node is requested, the number of BLAS threads is automatically 36 (the same as the intra-op threads). In that case I have to set the environment variable OPENBLAS_NUM_THREADS to 1; any other number gives me the same error as reported above.

"if you want GPU-accelerated linear algebra"

Since the first 25 steps of the algorithm involve a few loops, it is not ideal to offload them to the GPU, so I would rather keep them on the CPU. That's why I'm thinking of increasing the BLAS threads to speed up this part of the calculation.

cboettig · 2024-05-21T22:41:52Z

@xc308 thanks. I understand you are running a complex algorithm with many steps and it is not working as expected. When trying to debug code, it is helpful to try and reproduce the problem with a minimal example rather than attempt to debug a complex algorithm with many steps and interleaved CPU & GPU dispatch. Please see the simple matrix multiplication examples in the tests I linked above, and see if they are working as expected. If they are not, we can try and debug. If they are working as expected for you on both standard and cuda images, then we will need to further isolate the issue, as it is not specifically an issue with openblas configuration. If that is the case, then please proceed to identify a minimal reproducible example that we can run to generate the behavior you are seeing. Hope this helps.
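One way to narrow this down might be to run the same small test under several thread counts and record which ones fail. The following is a rough sketch only: sweep_threads is a hypothetical helper, not part of any tool mentioned in this thread, and the command it runs here is a stand-in so that the sketch works without a container.

```shell
#!/bin/sh
# Hypothetical helper: run one command string under several
# OPENBLAS_NUM_THREADS values and report which values succeed.
sweep_threads() {
    cmd=$1
    shift
    for n in "$@"; do
        # The variable is set only in the environment of this one child shell.
        if OPENBLAS_NUM_THREADS=$n sh -c "$cmd" >/dev/null 2>&1; then
            echo "$n ok"
        else
            echo "$n failed"
        fi
    done
}

# Stand-in command so the sketch runs anywhere: it succeeds only for values <= 2.
sweep_threads 'test "$OPENBLAS_NUM_THREADS" -le 2' 1 2 4
```

To use it on the real setup, the stand-in would be replaced by a single-quoted command string such as 'apptainer exec --nv --env OPENBLAS_NUM_THREADS=$OPENBLAS_NUM_THREADS ../cuda_4.3.3.sif Rscript test_blas.R', assuming a local copy of the linked test_blas.R. Running the same sweep against a standard image and the cuda image would show whether the failure is specific to one of them.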

eitsupi · 2025-02-08T02:20:33Z

We are deciding to stop supporting anything like rocker/ml here any more (#903), so I am closing this.
Thank you for your understanding.

cboettig · 2025-02-10T04:59:55Z

Just for clarity -- by "here" I believe @eitsupi means "rocker-versioned2" and not "rocker-org" -- as per #903 thread we have a new setup in rocker/ml seeking to support these use cases. The discussion in #903 also mentions several other approaches outside the rocker project that might also be suitable.

xc308 added the bug label May 17, 2024

eitsupi added the question, needs more info, and help wanted labels and removed the bug label May 21, 2024

eitsupi closed this as not planned Feb 8, 2025
