
[Bug] Multilingual-e5-base Embeddings Issue with llama-embeddings Backend on CUDA 12 Docker (Windows 11) #5289

Open
catiglu opened this issue May 1, 2025 · 0 comments
Labels: bug (Something isn't working), unconfirmed

catiglu commented May 1, 2025

Problem Description:

I am attempting to deploy the multilingual-e5-base embedding model for local inference on Windows 11 using LocalAI via Docker Compose with NVIDIA GPU acceleration (GeForce GTX 1660 SUPER, CUDA 12).

Despite configuring the model via a YAML file and manually placing a compatible GGUF file, I encounter inconsistent behavior depending on how the model is referenced in the API call.

  • When calling the embeddings API using the model name specified in the YAML (multilingual-e5-base), the request fails with a backend not found error, specifically referencing llama-embeddings.
  • When calling the embeddings API directly using the GGUF filename (multilingual-e5-base-Q8_0.gguf), the model loads successfully via the llama-cpp backend and utilizes the GPU, but the returned embedding vector is consistently empty ([]), with logs indicating embedding disabled.

This suggests an issue with how the llama-embeddings backend is integrated or routed in the CUDA 12 Docker image builds, or potentially a parameter-passing issue when the underlying llama-cpp library is used directly.
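As a quick sanity check on the first failure mode, the backend-assets directory named in the error can be listed from inside the running container (a sketch only: the service name api is taken from the docker-compose.yaml below, and the path is copied verbatim from the error message under Relevant Logs):

    # List the gRPC backend binaries the image actually extracts at startup
    docker-compose exec api ls /tmp/localai/backend_data/backend-assets/grpc/
    # If no llama-embeddings entry appears here, the "backend not found" error is expected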

Steps to Reproduce:

  1. Environment Setup:

    • Operating System: Windows 11
    • Docker Desktop installed and running.
    • NVIDIA GPU: GeForce GTX 1660 SUPER
    • NVIDIA Driver: Compatible with CUDA 12 (logs showed CUDA Version: 12.7).
    • LocalAI deployed using Docker Compose.
  2. docker-compose.yaml Configuration:

    • Used a standard docker-compose.yaml obtained from the LocalAI GitHub repository.
    • Modified the image: to use CUDA 12 compatible tags (tested master-cublas-cuda12 and master-aio-gpu-nvidia-cuda-12). The logs provided below are from master-aio-gpu-nvidia-cuda-12.
    • Added deploy: section for NVIDIA GPU.
    • Ensured volumes: maps ./models to /models:cached.
    • Ensured environment: includes MODELS_PATH=/models and DEBUG=true.
    • Crucially, removed or commented out the default command: line.
    • Removed or commented out DOWNLOAD_MODELS=true.
    # Relevant parts of docker-compose.yaml
    services:
      api:
        image: quay.io/go-skynet/local-ai:master-aio-gpu-nvidia-cuda-12 # Or master-cublas-cuda12
    
        deploy:
          resources:
            reservations:
              devices:
                - driver: nvidia
                  count: 1
                  capabilities: [gpu]
        ports:
          - 8080:8080
        environment:
          - MODELS_PATH=/models
          - DEBUG=true
          # - DOWNLOAD_MODELS=true # Removed
        volumes:
          - ./models:/models:cached
        # command: # Removed or commented out
        # - some-model
  3. Model File and Configuration Setup:

    • Manually downloaded the multilingual-e5-base-Q8_0.gguf file from https://huggingface.co/yixuan-chia/multilingual-e5-base-gguf.
    • Created the ./models/ directory in the LocalAI project root.
    • Placed the downloaded multilingual-e5-base-Q8_0.gguf file in the ./models/ directory.
    • Created the multilingual-e5-base.yaml file in the ./models/ directory with the following content (a variant that targets the llama-cpp backend directly is sketched after this list):
    # ./models/multilingual-e5-base.yaml
    name: multilingual-e5-base
    backend: llama-embeddings # Specify backend
    embeddings: true          # Mark as embeddings model
    parameters:
      model: multilingual-e5-base-Q8_0.gguf # File name relative to MODELS_PATH
      n_gpu_layers: -1 # Attempt to offload all layers to GPU
      embedding: true # Explicitly set embedding parameter
    f16: true
  4. Deploy LocalAI:

    • Open PowerShell in the directory containing docker-compose.yaml.
    • Run docker-compose down.
    • Run docker-compose pull to fetch the image for the selected tag.
    • Run docker-compose up -d.
  5. Attempt Embeddings API Calls: Wait for LocalAI to start (check logs or /readyz).

    • Attempt 1 (Using YAML name):
      curl -X POST http://localhost:8080/v1/embeddings `
           -H "Content-Type: application/json" `
           -d '{"input": "这是一个测试句子。", "model": "multilingual-e5-base"}' `
           -v
    • Attempt 2 (Using GGUF filename):
      curl -X POST http://localhost:8080/v1/embeddings `
           -H "Content-Type: application/json" `
           -d '{"input": "这是一个测试句子。", "model": "multilingual-e5-base-Q8_0.gguf"}' `
           -v
      (Note: Adding "embeddings": true to the JSON body in Attempt 2 yielded the same result).
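For comparison with the YAML in step 3, here is a variant that points at the generic llama-cpp backend (the one Attempt 2 ends up loading) instead of llama-embeddings. This is only a sketch; I have not verified that it avoids either the "backend not found" error or the empty-vector problem:

    # ./models/multilingual-e5-base.yaml (variant, untested sketch)
    name: multilingual-e5-base
    backend: llama-cpp        # generic backend instead of llama-embeddings
    embeddings: true          # mark as embeddings model
    parameters:
      model: multilingual-e5-base-Q8_0.gguf
      n_gpu_layers: -1        # attempt to offload all layers to GPU
    f16: true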

Expected Behavior:

  • Both Attempt 1 and Attempt 2 should return a 200 OK response with a JSON body containing a data array, where each element has a non-empty embedding list (the vector).
  • Logs should indicate successful loading and use of the model, preferably utilizing the GPU.
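For illustration, a successful call should return a body shaped roughly like the following (the numbers and token counts are placeholders and the vector is truncated; the real embedding should have the model's full dimensionality, presumably 768 for e5-base):

    {
      "object": "list",
      "model": "multilingual-e5-base",
      "data": [
        {
          "object": "embedding",
          "index": 0,
          "embedding": [0.0123, -0.0456, 0.0789]
        }
      ],
      "usage": {"prompt_tokens": 12, "completion_tokens": 0, "total_tokens": 12}
    }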

Observed Behavior:

  • Attempt 1 (Using YAML name): Returns 500 Internal Server Error with the message "failed to load model with internal loader: backend not found: /tmp/localai/backend_data/backend-assets/grpc/llama-embeddings". (See Curl Output 1 below).
  • Attempt 2 (Using GGUF filename): Returns 200 OK status, but the embedding list in the JSON response is empty ([]). (See Curl Output 2 below). Docker logs show the model is loaded but embedding is disabled.

Environment Information:

  • OS: Windows 11
  • Docker Desktop Version: not recorded
  • GPU: NVIDIA GeForce GTX 1660 SUPER
  • NVIDIA Driver Version: not recorded
  • CUDA Version (as reported by nvidia-smi in logs): 12.7
  • LocalAI Docker Image Tags Tested: quay.io/go-skynet/local-ai:master-cublas-cuda12, quay.io/go-skynet/local-ai:master-aio-gpu-nvidia-cuda-12, and potentially others from sha-*-cuda12. All tested tags exhibited the "backend not found" error when using the YAML name.
  • LocalAI Version (as reported in logs): 4076ea0 (from the master branch)

Relevant Logs:

  • Curl Output 1 (Attempt 1 - calling with YAML name):

    (base) PS E:\AI\LocalAI> curl -X POST http://localhost:8080/v1/embeddings `
    >>      -H "Content-Type: application/json" `
    >>      -d '{"input": "这是一个测试句子。", "model": "multilingual-e5-base"}' ` # <-- Use YAML name
    {"error":{"code":500,"message":"failed to load model with internal loader: backend not found: /tmp/localai/backend_data/backend-assets/grpc/llama-embeddings","type":""}}
    ... (rest of curl -v output showing 500 Internal Server Error) ...
    
  • Curl Output 2 (Attempt 2 - calling with GGUF filename):

    (base) PS E:\AI\LocalAI> curl -X POST http://localhost:8080/v1/embeddings `
    >>      -H "Content-Type: application/json" `
    >>      -d '{"input": "这是一个测试句子。", "model": "multilingual-e5-base-Q8_0.gguf"}' ` # <-- Use GGUF filename
    {"created":1746090262,"object":"list","id":"a4e28026-95c6-46d5-ad7b-3a3ce87a14e5","model":"multilingual-e5-base-Q8_0.gguf","data":[{"embedding":[],"index":0,"object":"embedding"}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
    ... (rest of curl -v output showing 200 OK) ...
    

    (The output is the same when adding "embeddings": true to the request body).

  • Docker Logs (Excerpt showing "backend not found" for YAML name call):

    ... (startup logs) ...
    8:59AM INF Preloading models from /models # LocalAI finds the YAML and GGUF
      Model name: multilingual-e5-base
    8:59AM DBG Model: multilingual-e5-base (config: {... parameters:{model:multilingual-e5-base-Q8_0.gguf ... Backend:llama-embeddings Embeddings:true ...}}) # Correct config loaded
    ... (user sends curl request with model: "multilingual-e5-base") ...
    8:59AM INF BackendLoader starting backend=llama-embeddings modelID=multilingual-e5-base o.model=multilingual-e5-base-Q8_0.gguf # Attempting to load via backend name
    8:59AM DBG Loading model in memory from file: /models/multilingual-e5-base-Q8_0.gguf # Attempting to load file
    8:59AM DBG Loading Model multilingual-e5-base with gRPC (file: /models/multilingual-e5-base-Q8_0.gguf) (backend: llama-embeddings): {...}
    8:59AM ERR Server error error="failed to load model with internal loader: backend not found: /tmp/localai/backend_data/backend-assets/grpc/llama-embeddings" ip=172.19.0.1 latency=2m22.975112253s method=POST status=500 url=/v1/embeddings # Backend executable not found
    ...
    
  • Docker Logs (Excerpt showing model loaded but embedding disabled for GGUF filename call):

    ... (user sends curl request with model: "multilingual-e5-base-Q8_0.gguf") ...
    9:04AM DBG Model file loaded: multilingual-e5-base-Q8_0.gguf architecture=bert bosTokenID=0 eosTokenID=2 modelName= # File identified
    ...
    9:04AM INF Trying to load the model 'multilingual-e5-base-Q8_0.gguf' with the backend '[llama-cpp llama-cpp-fallback ...]' # Tries multiple backends, including llama-cpp
    9:04AM INF [llama-cpp] Attempting to load
    ...
    9:04AM DBG GRPC(multilingual-e5-base-Q8_0.gguf-...): stderr llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce GTX 1660 SUPER) - 5134 MiB free # GPU detected and used
    ...
    9:04AM DBG GRPC(multilingual-e5-base-Q8_0.gguf-...): stderr llama_model_loader: loaded meta data with 35 key-value pairs ... from /models/multilingual-e5-base-Q8_0.gguf (version GGUF V3 (latest)) # GGUF loaded successfully
    ...
    9:04AM INF [llama-cpp] Loads OK # Model loaded successfully by llama-cpp
    ...
    9:04AM DBG GRPC(multilingual-e5-base-Q8_0.gguf-...): stdout {"timestamp":...,"level":"WARNING","function":"send_embedding","line":1368,"message":"embedding disabled","params.embedding":false} # Embedding is explicitly disabled
    ...
    9:04AM DBG Response: {"created":...,"object":"list","id":...,"model":"multilingual-e5-base-Q8_0.gguf","data":[{"embedding":[],"index":0,"object":"embedding"}],"usage":{...}} # Empty embedding returned
    ...
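The "embedding disabled","params.embedding":false warning above comes from the underlying llama.cpp server, which only computes embeddings when its embeddings flag is enabled. To isolate whether the GGUF itself can produce embeddings outside LocalAI, it could be tested against a standalone llama.cpp server build (a sketch only; the flag is spelled --embedding or --embeddings depending on the llama.cpp version, and a pooling option such as --pooling mean may also be needed for BERT-style models):

    # Standalone llama.cpp test, independent of LocalAI (flag/endpoint spelling may vary by version)
    llama-server -m ./models/multilingual-e5-base-Q8_0.gguf --embeddings --port 8081

    curl -X POST http://localhost:8081/v1/embeddings `
         -H "Content-Type: application/json" `
         -d '{"input": "这是一个测试句子。", "model": "multilingual-e5-base-Q8_0.gguf"}'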
    

Additional Context:

  • The text-embedding-ada-002 model, which also uses the llama-cpp backend (based on its YAML configuration in LocalAI's AIO image), successfully loads and returns embedding vectors using the same LocalAI Docker image and the /v1/embeddings endpoint. This confirms that the core llama-cpp library and the general embeddings functionality are working correctly within the container and with the GPU.
  • This issue seems specific to how the multilingual-e5-base model (perhaps due to its architecture being "bert" as shown in logs, or differences in its GGUF structure) interacts with LocalAI's llama-embeddings backend abstraction, or how parameters (like embeddings: true) are passed to llama-cpp in different loading scenarios.
  • I have tried different CUDA 12 master branch tags (master-cublas-cuda12, master-aio-gpu-nvidia-cuda-12) and they all exhibit the same "backend not found" error when calling by YAML name.

This detailed information should help the LocalAI developers diagnose the specific issue within their build or model loading logic for llama-embeddings with this type of model/GGUF.

