NODE_LLAMA_CPP_GPU=false ignored — GPU detection bypasses env var #426

@TioGlo

Description

Problem

QMD's ensureLlama() in llm.js calls getLlamaGpuTypes() and tries to use CUDA/Vulkan/Metal regardless of the NODE_LLAMA_CPP_GPU environment variable. On systems without a GPU (or without the CUDA toolkit installed), getLlamaGpuTypes() can still report ["cuda", "vulkan", false] because node-llama-cpp checks for prebuilt binaries rather than actual CUDA installation.

This causes a full cmake build attempt on every qmd invocation, which fails noisily:

-- Could not find nvcc, please set CUDAToolkit_ROOT.
CMake Error at llama.cpp/ggml/src/ggml-cuda/CMakeLists.txt:258 (message):
  CUDA Toolkit not found
-- Configuring incomplete, errors occurred!
ERR! OMG Process terminated: 1

[node-llama-cpp] Failed to build llama.cpp with CUDA support.
QMD Warning: cuda reported available but failed to initialize. Falling back to CPU.

This happens twice per invocation (the cmake error block appears twice in output), adding significant latency and noise before falling back to CPU anyway.

Expected Behavior

Setting NODE_LLAMA_CPP_GPU=false (which node-llama-cpp itself recognizes as a valid "off" value) should cause QMD to skip GPU detection entirely and go straight to CPU mode.

Root Cause

In dist/llm.js, the ensureLlama() method (around line 247) does its own GPU detection:

const gpuTypes = await getLlamaGpuTypes();
const preferred = ["cuda", "metal", "vulkan"].find(g => gpuTypes.includes(g));

This bypasses the NODE_LLAMA_CPP_GPU env var that node-llama-cpp's own config system respects. The comment in the code explains why: gpu:"auto" was returning false even when CUDA was available. But this workaround creates the inverse problem — it forces CUDA attempts on systems that explicitly opt out.

Suggested Fix

Check NODE_LLAMA_CPP_GPU before running GPU type detection, and skip the detection call entirely when the user has opted out:

// Respect the NODE_LLAMA_CPP_GPU env var before probing for GPUs
const gpuEnv = process.env.NODE_LLAMA_CPP_GPU;
const gpuDisabled = gpuEnv && ["false", "off", "none", "disable", "disabled"].includes(gpuEnv.toLowerCase());
const gpuTypes = gpuDisabled ? [] : await getLlamaGpuTypes();
const preferred = ["cuda", "metal", "vulkan"].find(g => gpuTypes.includes(g));

This preserves the existing workaround for systems where gpu:"auto" incorrectly returns false, while allowing users to explicitly disable GPU via the standard env var.
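If it helps during review, the "off" check above can be factored into a small standalone helper. This is a sketch: the helper name and the exact list of accepted "off" values are assumptions drawn from this issue, not verified against node-llama-cpp's own env-var parser.

```javascript
// Hypothetical helper for the env-var check described above.
// The set of "off" values is an assumption based on this issue report,
// not confirmed against node-llama-cpp's source.
const GPU_OFF_VALUES = new Set(["false", "off", "none", "disable", "disabled"]);

function isGpuExplicitlyDisabled(env = process.env) {
  const raw = env.NODE_LLAMA_CPP_GPU;
  // Only a string value can opt out; unset or non-string means "no opinion".
  return typeof raw === "string" && GPU_OFF_VALUES.has(raw.trim().toLowerCase());
}
```

With a helper like this, `ensureLlama()` would call `getLlamaGpuTypes()` only when `isGpuExplicitlyDisabled()` returns false, keeping the opt-out check in one place if other call sites need it later.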

Environment

  • QMD version: 1.0.6
  • node-llama-cpp: bundled with QMD
  • OS: Ubuntu (Linux 6.17.0-14-generic x64)
  • Node: v22.22.0
  • GPU: None (CPU-only system)
  • getLlamaGpuTypes() returns: ["cuda", "vulkan", false] despite no CUDA toolkit installed
