I am attempting to deploy the multilingual-e5-base embedding model for local inference on Windows 11 using LocalAI via Docker Compose with NVIDIA GPU acceleration (GTX 1660 SUPER, CUDA 12).
Despite configuring the model via a YAML file and manually placing a compatible GGUF file, I encounter inconsistent behavior depending on how the model is referenced in the API call.
When calling the embeddings API using the model name specified in the YAML (multilingual-e5-base), the request fails with a backend not found error, specifically referencing llama-embeddings.
When calling the embeddings API directly using the GGUF filename (multilingual-e5-base-Q8_0.gguf), the model loads successfully via the llama-cpp backend and utilizes the GPU, but the returned embedding vector is consistently empty ([]), with logs indicating embedding disabled.
This suggests an issue with the integration or routing of the llama-embeddings backend within the Docker image builds for CUDA 12, or potentially a parameter passing issue when using the underlying llama-cpp library directly.
Steps to Reproduce:
Environment Setup:
Operating System: Windows 11
Docker Desktop installed and running.
NVIDIA GPU: GeForce GTX 1660 SUPER
NVIDIA Driver: Compatible with CUDA 12 (logs showed CUDA Version: 12.7).
LocalAI deployed using Docker Compose.
docker-compose.yaml Configuration:
Used a standard docker-compose.yaml obtained from the LocalAI GitHub repository.
Modified the image: to use CUDA 12 compatible tags (tested master-cublas-cuda12 and master-aio-gpu-nvidia-cuda-12). The logs provided below are from master-aio-gpu-nvidia-cuda-12.
Added deploy: section for NVIDIA GPU.
Ensured volumes: maps ./models to /models:cached.
Ensured environment: includes MODELS_PATH=/models and DEBUG=true.
Crucially, removed or commented out the default command: line.
Removed or commented out DOWNLOAD_MODELS=true.
```yaml
# Relevant parts of docker-compose.yaml
services:
  api:
    image: quay.io/go-skynet/local-ai:master-aio-gpu-nvidia-cuda-12 # Or master-cublas-cuda12
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    ports:
      - 8080:8080
    environment:
      - MODELS_PATH=/models
      - DEBUG=true
      # - DOWNLOAD_MODELS=true # Removed
    volumes:
      - ./models:/models:cached
    # command: # Removed or commented out
    #   - some-model
```
Model File and Configuration Setup:
Manually downloaded the multilingual-e5-base-Q8_0.gguf file from https://huggingface.co/yixuan-chia/multilingual-e5-base-gguf (a download command sketch follows this setup list).
Created the ./models/ directory in the LocalAI project root.
Placed the downloaded multilingual-e5-base-Q8_0.gguf file in the ./models/ directory.
Created the multilingual-e5-base.yaml file in the ./models/ directory with the following content:
```yaml
# ./models/multilingual-e5-base.yaml
name: multilingual-e5-base
backend: llama-embeddings # Specify backend
embeddings: true          # Mark as embeddings model
parameters:
  model: multilingual-e5-base-Q8_0.gguf # File name relative to MODELS_PATH
  n_gpu_layers: -1                      # Attempt to offload all layers to GPU
  embedding: true                       # Explicitly set embedding parameter
  f16: true
```
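For the manual download step at the top of this list, the command I used was roughly the following (the resolve URL is an assumption based on the standard Hugging Face layout; adjust if the file lives in a subfolder of the repository):

```powershell
# Hypothetical download path; assumes the GGUF sits at the root of the Hugging Face repo
Invoke-WebRequest `
  -Uri "https://huggingface.co/yixuan-chia/multilingual-e5-base-gguf/resolve/main/multilingual-e5-base-Q8_0.gguf" `
  -OutFile ".\models\multilingual-e5-base-Q8_0.gguf"
```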
Deploy LocalAI:
Open PowerShell in the directory containing docker-compose.yaml.
Run docker-compose down.
Run docker-compose pull <selected_image_tag>.
Run docker-compose up -d.
Attempt Embeddings API Calls: Wait for LocalAI to start (check logs or /readyz).
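For the wait step, a simple readiness poll can be used (assuming the default 8080 port mapping from the compose file above):

```powershell
# Poll LocalAI readiness; returns 200 once the API is up
curl -v http://localhost:8080/readyz
```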
Attempt 1 (Using YAML name):
```powershell
# Use YAML name
curl -X POST http://localhost:8080/v1/embeddings `
  -H "Content-Type: application/json" `
  -d '{"input": "这是一个测试句子。", "model": "multilingual-e5-base"}' `
  -v
```
Attempt 2 (Using GGUF filename):
```powershell
# Use GGUF filename
curl -X POST http://localhost:8080/v1/embeddings `
  -H "Content-Type: application/json" `
  -d '{"input": "这是一个测试句子。", "model": "multilingual-e5-base-Q8_0.gguf"}' `
  -v
```
(Note: Adding "embeddings": true to the JSON body in Attempt 2 yielded the same result).
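For reference, the variant request body mentioned in that note looked like this (same result, empty vector):

```json
{
  "input": "这是一个测试句子。",
  "model": "multilingual-e5-base-Q8_0.gguf",
  "embeddings": true
}
```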
Expected Behavior:
Both Attempt 1 and Attempt 2 should return a 200 OK response with a JSON body containing a data array, where each element has a non-empty embedding list (the vector).
Logs should indicate successful loading and use of the model, preferably utilizing the GPU.
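For illustration, the expected response would have roughly this OpenAI-compatible shape (the vector values below are placeholders and the real vector would be much longer):

```json
{
  "object": "list",
  "model": "multilingual-e5-base",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [0.0123, -0.0456, 0.0789]
    }
  ],
  "usage": { "prompt_tokens": 8, "total_tokens": 8 }
}
```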
Observed Behavior:
Attempt 1 (Using YAML name): Returns 500 Internal Server Error with the message "failed to load model with internal loader: backend not found: /tmp/localai/backend_data/backend-assets/grpc/llama-embeddings". (See Curl Output 1 below).
Attempt 2 (Using GGUF filename): Returns 200 OK status, but the embedding list in the JSON response is empty ([]). (See Curl Output 2 below). Docker logs show the model is loaded but embedding is disabled.
Environment Information:
OS: Windows 11
Docker Desktop Version: (Please specify your version, e.g., 4.29.0)
GPU: NVIDIA GeForce GTX 1660 SUPER
NVIDIA Driver Version: (Please specify your driver version)
CUDA Version (as reported by nvidia-smi in logs): 12.7
LocalAI Docker Image Tags Tested: quay.io/go-skynet/local-ai:master-cublas-cuda12, quay.io/go-skynet/local-ai:master-aio-gpu-nvidia-cuda-12, and potentially others from sha-*-cuda12. All tested tags exhibited the "backend not found" error when using the YAML name.
LocalAI Version (as reported in logs): 4076ea0 (from the master branch)
Relevant Logs:
Curl Output 1 (Attempt 1 - calling with YAML name):
```
(base) PS E:\AI\LocalAI> curl -X POST http://localhost:8080/v1/embeddings `
>> -H "Content-Type: application/json" `
>> -d '{"input": "这是一个测试句子。", "model": "multilingual-e5-base"}' ` # <-- Use YAML name
{"error":{"code":500,"message":"failed to load model with internal loader: backend not found: /tmp/localai/backend_data/backend-assets/grpc/llama-embeddings","type":""}}
... (rest of curl -v output showing 500 Internal Server Error) ...
```
Curl Output 2 (Attempt 2 - calling with GGUF filename):
(The output is the same when adding "embeddings": true to the request body).
Docker Logs (Excerpt showing "backend not found" for YAML name call):
```
... (startup logs) ...
8:59AM INF Preloading models from /models # LocalAI finds the YAML and GGUF
Model name: multilingual-e5-base
8:59AM DBG Model: multilingual-e5-base (config: {... parameters:{model:multilingual-e5-base-Q8_0.gguf ... Backend:llama-embeddings Embeddings:true ...}}) # Correct config loaded
... (user sends curl request with model: "multilingual-e5-base") ...
8:59AM INF BackendLoader starting backend=llama-embeddings modelID=multilingual-e5-base o.model=multilingual-e5-base-Q8_0.gguf # Attempting to load via backend name
8:59AM DBG Loading model in memory from file: /models/multilingual-e5-base-Q8_0.gguf # Attempting to load file
8:59AM DBG Loading Model multilingual-e5-base with gRPC (file: /models/multilingual-e5-base-Q8_0.gguf) (backend: llama-embeddings): {...}
8:59AM ERR Server error error="failed to load model with internal loader: backend not found: /tmp/localai/backend_data/backend-assets/grpc/llama-embeddings" ip=172.19.0.1 latency=2m22.975112253s method=POST status=500 url=/v1/embeddings # Backend executable not found
...
```
Docker Logs (Excerpt showing model loaded but embedding disabled for GGUF filename call):
```
... (user sends curl request with model: "multilingual-e5-base-Q8_0.gguf") ...
9:04AM DBG Model file loaded: multilingual-e5-base-Q8_0.gguf architecture=bert bosTokenID=0 eosTokenID=2 modelName= # File identified
...
9:04AM INF Trying to load the model 'multilingual-e5-base-Q8_0.gguf' with the backend '[llama-cpp llama-cpp-fallback ...]' # Tries multiple backends, including llama-cpp
9:04AM INF [llama-cpp] Attempting to load
...
9:04AM DBG GRPC(multilingual-e5-base-Q8_0.gguf-...): stderr llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce GTX 1660 SUPER) - 5134 MiB free # GPU detected and used
...
9:04AM DBG GRPC(multilingual-e5-base-Q8_0.gguf-...): stderr llama_model_loader: loaded meta data with 35 key-value pairs ... from /models/multilingual-e5-base-Q8_0.gguf (version GGUF V3 (latest)) # GGUF loaded successfully
...
9:04AM INF [llama-cpp] Loads OK # Model loaded successfully by llama-cpp
...
9:04AM DBG GRPC(multilingual-e5-base-Q8_0.gguf-...): stdout {"timestamp":...,"level":"WARNING","function":"send_embedding","line":1368,"message":"embedding disabled","params.embedding":false} # Embedding is explicitly disabled
...
9:04AM DBG Response: {"created":...,"object":"list","id":...,"model":"multilingual-e5-base-Q8_0.gguf","data":[{"embedding":[],"index":0,"object":"embedding"}],"usage":{...}} # Empty embedding returned
...
```
Additional Context:
The text-embedding-ada-002 model, which also uses the llama-cpp backend (based on its YAML configuration in LocalAI's AIO image), successfully loads and returns embedding vectors using the same LocalAI Docker image and the /v1/embeddings endpoint. This confirms that the core llama-cpp library and the general embeddings functionality are working correctly within the container and with the GPU.
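For reference, the working control call was of this form (illustrative; on the same image and endpoint it returns a populated vector):

```powershell
# Control check: the bundled text-embedding-ada-002 alias returns a non-empty embedding
curl -X POST http://localhost:8080/v1/embeddings `
  -H "Content-Type: application/json" `
  -d '{"input": "test sentence", "model": "text-embedding-ada-002"}'
```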
This issue seems specific to how the multilingual-e5-base model (perhaps due to its architecture being "bert" as shown in logs, or differences in its GGUF structure) interacts with LocalAI's llama-embeddings backend abstraction, or how parameters (like embeddings: true) are passed to llama-cpp in different loading scenarios.
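One diagnostic sketch that would isolate the parameter-passing question, assuming the plain llama-cpp backend honors the same YAML keys when resolved through a model alias (which is exactly what is in question), is a config variant like the following (untested, hypothetical name):

```yaml
# Hypothetical diagnostic config: same GGUF, but pinned to the generic llama-cpp backend
# to test whether the embeddings/embedding flags propagate when calling by YAML name.
name: multilingual-e5-base-llamacpp
backend: llama-cpp
embeddings: true
parameters:
  model: multilingual-e5-base-Q8_0.gguf
  embedding: true
  n_gpu_layers: -1
```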
I have tried different CUDA 12 master branch tags (master-cublas-cuda12, master-aio-gpu-nvidia-cuda-12) and they all exhibit the same "backend not found" error when calling by YAML name.
This detailed information should help the LocalAI developers diagnose the specific issue within their build or model loading logic for llama-embeddings with this type of model/GGUF.