
llama-cpp-python bindings not working for multiple GPUs  #1310

@y6t4

Description

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • [Yes] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • [Yes] I carefully followed the README.md.
  • [Yes] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • [Yes] I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

I have two RTX 2070s. llama.cpp and llama-cpp-python work when using the CPU, but I want to use both GPUs to perform inference (that is, split larger models between the two). Having followed the instructions for CUDA GPU build for llama.cpp and llama-cpp-python, and having written a Python script that explicitly enables GPU usage, I expect llama-cpp-python to use the GPUs for inference.
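Concretely, the multi-GPU setup I expect to work looks roughly like the following (a minimal sketch; the model path is a placeholder, and `n_gpu_layers`/`tensor_split`/`main_gpu` are the documented `Llama` constructor parameters for offloading and splitting layers across devices):

```python
# Sketch of the multi-GPU configuration I expect to work (model path is a placeholder).
# tensor_split gives the fraction of the model to place on each device;
# two identical RTX 2070s suggest an even 50/50 split.
tensor_split = [0.5, 0.5]

def build_llm(model_path: str):
    from llama_cpp import Llama  # deferred import so the sketch reads without the package
    return Llama(
        model_path=model_path,
        n_gpu_layers=-1,           # offload every layer to the GPUs
        tensor_split=tensor_split, # split tensors across the two devices
        main_gpu=0,                # device for scratch buffers and small tensors
        verbose=True,              # prints the CUDA/ggml init lines, useful for debugging
    )

# Usage: llm = build_llm("/path/to/model.gguf"); llm("Q: ...", max_tokens=32)
```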

Current Behavior

I have described the current behavior in detail, across several scenarios, in this llama.cpp issue. In short: no matter what I do, the GPUs are not used for inference.
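For context, the install I have been testing was built from source with CUDA enabled, along these lines (note the CMake flag name varies with the bundled llama.cpp version; older releases used `-DLLAMA_CUBLAS=on`):

```shell
# Force a from-source rebuild of llama-cpp-python with CUDA enabled.
# The flag name depends on the bundled llama.cpp version (LLAMA_CUBLAS on older ones).
CMAKE_ARGS="-DLLAMA_CUDA=on" FORCE_CMAKE=1 \
  pip install --force-reinstall --no-cache-dir llama-cpp-python
```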

Environment and Context

I have given the details of my setup in the aforementioned llama.cpp issue, including environment variables, terminal commands for setup, the Python script being used, full outputs, etc.
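One sanity check that may narrow this down (the helper name here is mine; `llama_supports_gpu_offload` is, as far as I can tell, the binding llama-cpp-python exposes for the corresponding llama.cpp C function) is to ask the installed wheel whether it was compiled with GPU offload at all:

```python
# Hypothetical helper: returns True if the installed llama-cpp-python wheel
# was compiled with GPU offload support, False for a CPU-only build.
def gpu_offload_available() -> bool:
    import llama_cpp  # deferred import so this file parses without the package
    return bool(llama_cpp.llama_supports_gpu_offload())

# Usage: print(gpu_offload_available())
# False would explain why the GPUs stay idle regardless of n_gpu_layers.
```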

$ lscpu

Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         43 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  16
  On-line CPU(s) list:   0-15
Vendor ID:               AuthenticAMD
  Model name:            AMD Ryzen 7 2700X Eight-Core Processor
    CPU family:          23
    Model:               8
    Thread(s) per core:  2
    Core(s) per socket:  8
    Socket(s):           1
    Stepping:            2
    Frequency boost:     enabled
    CPU max MHz:         3700.0000
    CPU min MHz:         2200.0000
    BogoMIPS:            7385.27
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate ssbd ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca sev sev_es
Virtualization features: 
  Virtualisation:        AMD-V
Caches (sum of all):     
  L1d:                   256 KiB (8 instances)
  L1i:                   512 KiB (8 instances)
  L2:                    4 MiB (8 instances)
  L3:                    16 MiB (2 instances)
NUMA:                    
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-15
Vulnerabilities:         
  Gather data sampling:  Not affected
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Mitigation; untrained return thunk; SMT vulnerable
  Spec rstack overflow:  Mitigation; Safe RET
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines, IBPB conditional, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected

$ uname -a

Linux me-System-Product-Name 6.5.0-26-generic #26~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Mar 12 10:22:43 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

  • SDK version, e.g. for Linux:

$ python3 --version
Python 3.10.12

$ make --version
GNU Make 4.3
Built for x86_64-pc-linux-gnu
Copyright (C) 1988-2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

$ g++ --version
g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

In the aforementioned llama.cpp issue we have so far been unable to fix the problem, and one of the respondents recommended that I open an issue here (llama-cpp-python). Any help would be greatly appreciated.

Labels: question (further information is requested)