GPU config help #1058
thebrahman
started this conversation in General
Replies: 1 comment 2 replies
-
That yaml file would not work; it needs to be formatted like https://localai.io/howtos/easy-model-import-downloaded/
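For illustration, a minimal model YAML in that format could look like the sketch below. The field names follow the LocalAI model-config convention from the linked howto; the gpt-3.5-turbo alias, the file name, and the gpu_layers value are assumptions for illustration, not taken from the original post.

# models/llama-2-7b-chat.yaml -- hypothetical example, values are illustrative
name: gpt-3.5-turbo        # alias used in API requests (assumed)
backend: llama
context_size: 512
f16: true                  # enable 16-bit memory
gpu_layers: 35             # layers to offload to the GPU (35 total per the debug log in the original post)
parameters:
  model: llama-2-7b-chat.ggmlv3.q4_K_M.bin   # file name from the debug log

If a config like this is picked up, the "Loading model with options" GRPC line in the debug output should show NGPULayers reflecting the configured value rather than 0.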
-
I am struggling to get models to run on my 4090. My OS is Windows and I am running Docker. It recognises my GPU, but doesn't offload any layers.
I followed this guide to set up:
https://localai.io/howtos/easy-setup-docker-gpu/
I have this yaml file in the models folder:
Terminal output:
2023-09-15 00:05:22 localai-api-1 | 2:05PM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:llama-2-7b-chat.ggmlv3.q4_K_M.bin ContextSize:512 Seed:0 NBatch:512 F16Memory:false MLock:false MMap:false VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:0 MainGPU: TensorSplit: Threads:2 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models/llama-2-7b-chat.ggmlv3.q4_K_M.bin Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 Tokenizer: LoraBase: LoraAdapter: NoMulMatQ:false AudioPath:}
2023-09-15 00:05:22 localai-api-1 | 2:05PM DBG GRPC(llama-2-7b-chat.ggmlv3.q4_K_M.bin-127.0.0.1:43703): stderr ggml_init_cublas: found 1 CUDA devices:
2023-09-15 00:05:22 localai-api-1 | 2:05PM DBG GRPC(llama-2-7b-chat.ggmlv3.q4_K_M.bin-127.0.0.1:43703): stderr Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9
03): stderr llama_model_load_internal: ggml ctx size = 3891.33 MB
2023-09-15 00:05:24 localai-api-1 | 2:05PM DBG GRPC(llama-2-7b-chat.ggmlv3.q4_K_M.bin-127.0.0.1:43703): stderr WARNING: failed to allocate 3891.33 MB of pinned memory: out of memory
2023-09-15 00:05:24 localai-api-1 | 2:05PM DBG GRPC(llama-2-7b-chat.ggmlv3.q4_K_M.bin-127.0.0.1:43703): stderr llama_model_load_internal: using CUDA for GPU acceleration
2023-09-15 00:05:24 localai-api-1 | 2:05PM DBG GRPC(llama-2-7b-chat.ggmlv3.q4_K_M.bin-127.0.0.1:43703): stderr llama_model_load_internal: mem required = 4193.33 MB (+ 512.00 MB per state)
2023-09-15 00:05:24 localai-api-1 | 2:05PM DBG GRPC(llama-2-7b-chat.ggmlv3.q4_K_M.bin-127.0.0.1:43703): stderr llama_model_load_internal: offloading 0 repeating layers to GPU
2023-09-15 00:05:24 localai-api-1 | 2:05PM DBG GRPC(llama-2-7b-chat.ggmlv3.q4_K_M.bin-127.0.0.1:43703): stderr llama_model_load_internal: offloaded 0/35 layers to GPU
2023-09-15 00:05:24 localai-api-1 | 2:05PM DBG GRPC(llama-2-7b-chat.ggmlv3.q4_K_M.bin-127.0.0.1:43703): stderr llama_model_load_internal: total VRAM used: 288 MB
2023-09-15 00:05:29 localai-api-1 | 2:05PM DBG GRPC(llama-2-7b-chat.ggmlv3.q4_K_M.bin-127.0.0.1:43703): stderr llama_new_context_with_model: kv self size = 512.00 MB
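On the Docker side, the ggml_init_cublas line above shows the container can already see the RTX 4090, so the GPU passthrough from the easy-setup-docker-gpu guide appears to be working and the missing offload is most likely down to the model YAML. For completeness, a typical Compose GPU reservation (standard docker-compose syntax, not copied from the guide; image tag and volume path are assumptions) looks roughly like:

services:
  api:
    image: quay.io/go-skynet/local-ai:latest   # assumption: use the CUDA-enabled image/tag from the guide
    volumes:
      - ./models:/models                       # folder holding the model file and its yaml
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]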