diff --git a/.github/workflows/test-deploy.yml b/.github/workflows/test-deploy.yml
index 2294d83..0d9e2d1 100644
--- a/.github/workflows/test-deploy.yml
+++ b/.github/workflows/test-deploy.yml
@@ -17,7 +17,7 @@ jobs:
           fetch-depth: 0
       - uses: actions/setup-node@v4
         with:
-          node-version: 18
+          node-version: 24.2.0
           cache: npm
 
       - name: Install dependencies
diff --git a/docs/running-locally.md b/docs/running-locally.md
index e09b7d1..ad58f08 100644
--- a/docs/running-locally.md
+++ b/docs/running-locally.md
@@ -2,18 +2,20 @@
 id: running-locally
 slug: running-locally
 title: Run Moondream Locally
-description: High-performance local inference with Photon on NVIDIA GPUs
+description: High-performance local inference with Photon on NVIDIA GPUs and Apple Silicon
 ---
 
 # Run Moondream Locally
 
-Photon is Moondream's high-performance inference engine for running Moondream locally on NVIDIA GPUs. It features custom CUDA kernels, automatic batching, paged KV caching, and prefix caching — the same engine that powers [Moondream Cloud](https://moondream.ai), now available for local and on-prem deployment.
+Photon is Moondream's high-performance inference engine for running Moondream locally — on NVIDIA GPUs (Linux x86_64 / aarch64 or Windows AMD64), or on Apple Silicon Macs with native Metal kernels. It features custom CUDA and Metal kernels, automatic batching, paged KV caching, and prefix caching — the same engine that powers [Moondream Cloud](https://moondream.ai), now available for local and on-prem deployment.
 
 ## Requirements
 
-- **GPU**: NVIDIA GPU (Ampere or newer) — see [Supported GPUs](#supported-gpus) for the full list
-- **Python**: 3.10+
-- **API Key**: Get one from [moondream.ai](https://moondream.ai/c/cloud/api-keys)
+- One of:
+  - **NVIDIA GPU** (Ampere or newer) on Linux x86_64 / aarch64 or Windows AMD64 — see [Supported Hardware](#supported-hardware) for the full list.
+  - **Apple Silicon Mac** (M-series) on macOS 13 (Ventura) or later, Python 3.12.
+- **Python**: 3.10+ on Linux / Windows; 3.12 on macOS.
+- **API Key**: Get one from [moondream.ai](https://moondream.ai/c/cloud/api-keys).
 
 ## Installation
 
@@ -29,7 +31,7 @@ This installs the Moondream Python client with built-in Photon support.
 import moondream as md
 from PIL import Image
 
-# Initialize with local GPU inference
+# Initialize with local inference (NVIDIA GPU or Apple Silicon)
 model = md.vl(api_key="YOUR_API_KEY", local=True)
 
 # Load an image
@@ -107,48 +109,78 @@ The model string format is `{base_model}/{finetune_id}@{step}` where:
 
 Adapters are automatically downloaded and cached on first use.
 
-## Supported GPUs
+## Supported Hardware
 
-### Server / Datacenter
+### NVIDIA GPU
 
 | GPU | VRAM | Architecture |
 |-----|------|--------------|
+| B200 | 192 GB | Blackwell (SM100) |
 | H200 | 141 GB | Hopper (SM90) |
 | H100 | 80 GB | Hopper (SM90) |
 | GH200 | 96 GB | Hopper (SM90) |
+| RTX PRO 6000 | 96 GB | Blackwell (SM120) |
 | A100 | 80 GB | Ampere (SM80) |
 | L40S | 48 GB | Ada Lovelace (SM89) |
 | A40 | 48 GB | Ampere (SM86) |
-| A10 | 24 GB | Ampere (SM86) |
 | L4 | 24 GB | Ada Lovelace (SM89) |
+| A10 | 24 GB | Ampere (SM86) |
+
+Any Ampere (SM80) or newer NVIDIA GPU should work; the cards above are explicitly tested and tuned.
 
-### Desktop
+### Apple Silicon
 
-Any Ampere (SM80+) or newer NVIDIA GPU should work — the server/datacenter GPUs listed above have been explicitly tested and optimized.
+Photon runs natively on Apple M-series Macs through Metal kernels — no NVIDIA CUDA, no Triton, no extra setup beyond `pip install moondream`. KV cache size auto-tunes to your machine's unified memory.
 
-### Edge
+| Hardware | Notes |
+|----------|-------|
+| MacBook Pro (M5 Max, 48 GB) | macOS 13+, Python 3.12 |
+| Mac mini / Studio (M2 / M3 / M4 Pro / M4 Max, ≥24 GB) | macOS 13+, Python 3.12 |
+| Mac mini (M4 base, 16 GB) | macOS 13+, Python 3.12 — fits Moondream 2; Moondream 3 weights exceed unified memory |
 
-| Device | VRAM | Notes |
-|--------|------|-------|
-| Jetson AGX Orin | 32/64 GB | JetPack 6.0+ required |
-| Jetson Orin NX | 16 GB | JetPack 6.0+ required |
-| Jetson Orin Nano | 8 GB | JetPack 6.0+ required |
+### NVIDIA Jetson
 
-See [Jetson Setup](#jetson-setup) for installation instructions.
+| Device | VRAM | JetPack |
+|--------|------|---------|
+| Jetson AGX Thor | 64 GB | JetPack 7 (CUDA 13) |
+| Jetson AGX Orin | 32 / 64 GB | JetPack 6.0+ |
+| Jetson Orin NX | 16 GB | JetPack 6.0+ |
+| Jetson Orin Nano | 8 GB | JetPack 6.0+ |
+
+Jetson needs an extra setup step for `LD_LIBRARY_PATH` — see [Jetson Setup](#jetson-setup) below.
 
 ## Jetson Setup
 
-Photon supports NVIDIA Jetson Orin (AGX Orin, Orin NX, Orin Nano) with JetPack 6.0, 6.1, or 6.2.
+Jetson Thor (JetPack 7) and Jetson Orin (JetPack 6) install differently because the two JetPack versions ship different CUDA major versions and PyTorch wheels. The instructions below cover the common path; for extra troubleshooting (cuSPARSELt errors, missing CUDA packages on minimal images, etc.) see the canonical [kestrel Jetson setup guide](https://github.com/m87-labs/kestrel/blob/main/docs/jetson.md).
+
+### Jetson AGX Thor (JetPack 7)
 
-### Prerequisites
+JetPack 7 ships CUDA 13 and is supported by the standard PyPI PyTorch aarch64 wheel — no custom NVIDIA wheel needed:
 
-- Jetson Orin with JetPack 6.x flashed
-- Python 3.10 (required — NVIDIA's Jetson PyTorch wheels are cp310-only)
-- CUDA runtime (included with JetPack)
+```bash
+pip install moondream
+```
 
-### Install PyTorch
+This pulls in PyTorch along with the `nvidia-*-cu13` runtime packages and `nvpl` (NVIDIA Performance Libraries: BLAS / LAPACK / FFT for aarch64). Those libraries live under your venv's `site-packages` rather than `/usr/local/cuda`, so you need to point `LD_LIBRARY_PATH` at them once before importing torch:
 
-Jetson requires NVIDIA's custom PyTorch wheels. Install the version matching your JetPack release.
+```bash
+SITE=$(python -c "import sysconfig; print(sysconfig.get_paths()['purelib'])")
+export LD_LIBRARY_PATH="$SITE/nvidia/cu13/lib:$SITE/nvidia/cudnn/lib:$SITE/nvpl/lib:$LD_LIBRARY_PATH"
+```
+
+Add the export to your shell profile (`~/.bashrc` or similar) so it persists across sessions.
+
+### Jetson AGX Orin / Orin NX / Orin Nano (JetPack 6)
+
+JetPack 6 ships an older CUDA 12.x and requires NVIDIA's custom PyTorch wheel.
+
+#### Prerequisites
+
+- Jetson Orin device with JetPack 6.x flashed.
+- Python 3.10 (matches NVIDIA's JetPack 6 PyTorch wheel).
+- CUDA runtime included with JetPack.
+
+#### Install PyTorch
 
 **JetPack 6.1 / 6.2:**
 ```bash
@@ -160,17 +192,17 @@ pip install https://developer.download.nvidia.com/compute/redist/jp/v61/pytorch/
 pip install https://developer.download.nvidia.com/compute/redist/jp/v60/pytorch/torch-2.4.0a0+07cecf4168.nv24.05.14710581-cp310-cp310-linux_aarch64.whl
 ```
 
-### Install Moondream
+#### Install Moondream
 
 ```bash
 pip install "numpy<2" moondream
 ```
 
-The Jetson PyTorch wheels are built against NumPy 1.x, so pinning `numpy<2` avoids compatibility warnings.
+JetPack 6's PyTorch wheel is built against NumPy 1.x, so pinning `numpy<2` avoids the import-time compatibility warning.
 
-### Set `LD_LIBRARY_PATH`
+#### Set `LD_LIBRARY_PATH`
 
-NVIDIA's Jetson PyTorch wheel needs JetPack CUDA libraries on the library path. If `import torch` fails with errors about missing `libnvToolsExt.so.1`, `libcublas.so`, or `libcupti.so`:
+JetPack 6's PyTorch wheel loads CUDA libraries from the system JetPack install. If `import torch` fails with errors about missing `libnvToolsExt.so.1`, `libcublas.so`, or `libcupti.so`:
 
 ```bash
 export LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/cuda/targets/aarch64-linux/lib:$LD_LIBRARY_PATH
@@ -184,12 +216,14 @@ sudo apt install cuda-cupti-12-6 libnvtoolsext1
 
 Add the export to your shell profile (`~/.bashrc` or similar) so it persists across sessions.
 
-### Verify
+### Verify (Orin or Thor)
 
 ```bash
-python3 -c "import torch; print(torch.__version__); import moondream; print('moondream OK')"
+python3 -c "import torch, moondream; print(torch.__version__, torch.cuda.get_device_name(0))"
 ```
 
+You should see something like `2.9.1 NVIDIA Thor` (Thor) or `2.5.0a0+... Orin` (Orin). If you see a `libcudart.so.X` / `libnvToolsExt.so.1` / `libcupti.so` `cannot open shared object file` error, your `LD_LIBRARY_PATH` doesn't cover the right directory — re-check the previous step.
+
 ## Triton Inference Server
 
 Photon can be deployed as a [Triton Inference Server](https://github.com/triton-inference-server/server) backend for production serving.
@@ -240,9 +274,16 @@ docker run --gpus all --rm -it \
 
 ## Performance
 
-Photon uses custom CUDA kernels and optimized scheduling to deliver high throughput. On an H100, Photon achieves over 60 requests/second for visual Q&A with Moondream 2 and over 58 requests/second with Moondream 3.
+Headline ChartQA req/s on Moondream 2 / Moondream 3 visual Q&A:
+
+| Hardware | Batch | Moondream 2 | Moondream 3 |
+|----------|------:|------------:|------------:|
+| B200 (Blackwell) | 64 | 93 | 71 |
+| H100 (Hopper) | 64 | 63 | 58 |
+| RTX PRO 6000 (Blackwell) | 64 | 39 | 40 |
+| MacBook Pro M5 Max | 4 | 7.3 | 4.6 |
 
-For detailed benchmarks across all supported GPUs, see [PERFORMANCE.md](https://github.com/m87-labs/kestrel/blob/main/PERFORMANCE.md).
+For the full breakdown across every supported card and batch size — including P50/P90/P99 latency and Jetson Thor / Orin numbers — see [PERFORMANCE.md](https://github.com/m87-labs/kestrel/blob/main/PERFORMANCE.md).
 
 ## Environment Variables
 
@@ -253,4 +294,4 @@ For detailed benchmarks across all supported GPUs, see [PERFORMANCE.md](https://
 
 ## Hugging Face Transformers
 
-If you're running on non-NVIDIA hardware, Moondream can also be loaded via [Hugging Face Transformers](/transformers). On NVIDIA GPUs, Photon is strongly recommended — it delivers ~5x higher throughput and ~2.4x lower latency.
+If your hardware isn't on the [Supported Hardware](#supported-hardware) list — for example, an Intel Mac, an AMD GPU, or a non-Ampere NVIDIA GPU — Moondream can also be loaded via [Hugging Face Transformers](/transformers). On supported hardware (NVIDIA Ampere+ or Apple Silicon), Photon is strongly recommended — it delivers ~5× higher throughput and ~2.4× lower latency than the Transformers path on NVIDIA.
diff --git a/docs/transformers.md b/docs/transformers.md
index 578efb6..a4529f9 100644
--- a/docs/transformers.md
+++ b/docs/transformers.md
@@ -288,7 +288,7 @@ print(result)
 
 ## Moondream3
 
-[Sign up for early access](https://huggingface.co/moondream/moondream3-preview) to start using [Moondream3](https://moondream.ai/blog/moondream-3-preview). Currently, only Nvidia GPUs with 24GB+ of memory are supported; quantized and Apple Silicon versions coming soon.
+[Sign up for early access](https://huggingface.co/moondream/moondream3-preview) to start using [Moondream3](https://moondream.ai/blog/moondream-3-preview). Through this Transformers path, only NVIDIA GPUs with 24GB+ of memory are supported today; quantized and Apple Silicon Transformers variants are coming soon. To run Moondream 3 on Apple Silicon now, use [Photon](/running-locally) — `md.vl(api_key="...", local=True, model="moondream3-preview")` works on any M-series Mac with ≥24GB unified memory.
 
 ```python
 model = AutoModelForCausalLM.from_pretrained(