Official Python client library for Moondream, a fast multi-function VLM. This client can target Moondream Cloud or run locally via Photon — on NVIDIA GPUs (Linux x86_64 / aarch64 or Windows) or Apple Silicon Macs.
Moondream goes beyond the typical VLM "query" ability to include more visual functions:
| Method | Description |
|---|---|
| `caption` | Generate descriptive captions for images |
| `query` | Ask questions about image content |
| `detect` | Find bounding boxes around objects in images |
| `point` | Identify the center location of specified objects |
| `segment` | Generate an SVG path segmentation mask for objects |
Try it out on Moondream's playground.
```bash
pip install moondream
```

Choose how you want to run Moondream:
- Moondream Cloud — Get an API key from the cloud console
- Moondream Photon — High-performance local inference engine on NVIDIA GPUs (Linux / Windows) or Apple Silicon Macs (macOS 13+). Requires an API key.
```python
import moondream as md
from PIL import Image

# Initialize with Moondream Cloud
model = md.vl(api_key="<your-api-key>")

# Or initialize with local inference (Photon: NVIDIA GPU or Apple Silicon)
model = md.vl(api_key="<your-api-key>", local=True)

# Load an image
image = Image.open("path/to/image.jpg")

# Generate a caption
caption = model.caption(image)["caption"]
print("Caption:", caption)

# Ask a question
answer = model.query(image, "What's in this image?")["answer"]
print("Answer:", answer)

# Stream the response
for chunk in model.caption(image, stream=True)["caption"]:
    print(chunk, end="", flush=True)
```

Create a client with `md.vl()`, choosing one of the following configurations:

```python
model = md.vl(api_key="<your-api-key>")  # Cloud
model = md.vl(api_key="<your-api-key>", local=True)  # Photon (local: NVIDIA GPU or Apple Silicon)
model = md.vl(api_key="<your-api-key>", model="moondream3-preview/ft_id@step")  # Finetune
```

Generate a caption for an image.
Parameters:
- `image`: `Image.Image` or `EncodedImage`
- `length`: `"normal"`, `"short"`, or `"long"` (default: `"normal"`)
- `stream`: `bool` (default: `False`)
Returns: CaptionOutput — {"caption": str | Generator}
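Because the `"caption"` value is either a string or a generator depending on `stream`, calling code may want a single path that handles both. The sketch below shows one way to do that. `collect_text` is a hypothetical helper name, not part of the moondream library, and it assumes streamed chunks are plain strings.

```python
from typing import Iterable, Union


def collect_text(value: Union[str, Iterable[str]]) -> str:
    """Return the full text whether `value` is a string or a stream of chunks.

    Hypothetical helper (not part of moondream): assumes streamed chunks are
    plain strings, as in the streaming examples in this README.
    """
    if isinstance(value, str):
        return value
    # Joining consumes the generator; it cannot be iterated a second time.
    return "".join(value)
```

It could be called as `collect_text(model.caption(image, stream=True)["caption"])`, and would work the same way on the `"answer"` value returned by `query`.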
```python
caption = model.caption(image, length="short")["caption"]

# With streaming
for chunk in model.caption(image, stream=True)["caption"]:
    print(chunk, end="", flush=True)
```

Ask a question about an image.
Parameters:
- `image`: `Image.Image` or `EncodedImage`
- `question`: `str`
- `stream`: `bool` (default: `False`)
Returns: QueryOutput — {"answer": str | Generator}
```python
answer = model.query(image, "What's in this image?")["answer"]

# With streaming
for chunk in model.query(image, "What's in this image?", stream=True)["answer"]:
    print(chunk, end="", flush=True)
```

Detect specific objects in an image.
Parameters:
- `image`: `Image.Image` or `EncodedImage`
- `object`: `str`
Returns: DetectOutput — {"objects": List[Region]}
```python
objects = model.detect(image, "car")["objects"]
```

Get coordinates of specific objects in an image.
Parameters:
- `image`: `Image.Image` or `EncodedImage`
- `object`: `str`
Returns: PointOutput — {"points": List[Point]}
```python
points = model.point(image, "person")["points"]
```

Segment an object from an image and return an SVG path.
Parameters:
- `image`: `Image.Image` or `EncodedImage`
- `object`: `str`
- `spatial_refs`: `List[[x, y] | [x1, y1, x2, y2]]`, optional spatial hints (normalized 0-1)
- `stream`: `bool` (default: `False`)
Returns:
- Non-streaming: `SegmentOutput`, `{"path": str, "bbox": Region}`
- Streaming: Generator yielding update dicts
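When streaming, it can be convenient to fold the update dicts back into the same shape as the non-streaming result. The following is a minimal sketch, assuming updates carry the keys `bbox`, `chunk`, `completed`, and `path` as used in this section's streaming example. `collect_segment_stream` is a hypothetical helper, not part of the moondream library.

```python
from typing import Any, Dict, Iterable, List, Optional


def collect_segment_stream(updates: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Drain a stream of segment updates into one {"path", "bbox"} dict.

    Hypothetical helper: assumes the update keys "bbox", "chunk",
    "completed", and "path" seen in this README's streaming example.
    """
    coarse_chunks: List[str] = []
    bbox: Optional[Dict[str, float]] = None
    path: Optional[str] = None
    for update in updates:
        if "bbox" in update:
            bbox = update["bbox"]  # later messages overwrite earlier ones
        if "chunk" in update:
            coarse_chunks.append(update["chunk"])
        if update.get("completed"):
            path = update.get("path")  # refined path from the final message
    if path is None:
        # No completion message was seen: fall back to the coarse path.
        path = "".join(coarse_chunks)
    return {"path": path, "bbox": bbox}
```

A call such as `collect_segment_stream(model.segment(image, "cat", stream=True))` would then return a dict that can be handled like the non-streaming output.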
```python
result = model.segment(image, "cat")
svg_path = result["path"]
bbox = result["bbox"]  # {"x_min": ..., "y_min": ..., "x_max": ..., "y_max": ...}

# With spatial hint (point)
result = model.segment(image, "cat", spatial_refs=[[0.5, 0.5]])

# With streaming
for update in model.segment(image, "cat", stream=True):
    if "bbox" in update and not update.get("completed"):
        print(f"Bbox: {update['bbox']}")  # Available in first message
    if "chunk" in update:
        print(update["chunk"], end="")  # Coarse path chunks
    if update.get("completed"):
        print(f"Final path: {update['path']}")  # Refined path
        print(f"Final bbox: {update['bbox']}")
```

Pre-encode an image for reuse across multiple calls.
Parameters:
- `image`: `Image.Image` or `EncodedImage`
Returns: Base64EncodedImage
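Since every method accepts an `EncodedImage` in place of a PIL image, one encoded image can be shared across several calls. The sketch below illustrates that pattern. `caption_and_ask` is a hypothetical helper, and `model` is assumed to expose `encode_image`, `caption`, and `query` as documented in this README.

```python
from typing import Any, Dict, List


def caption_and_ask(model: Any, image: Any, questions: List[str]) -> Dict[str, Any]:
    """Encode `image` once, then run a caption and several queries on it.

    Hypothetical helper: `model` is assumed to expose encode_image, caption,
    and query with the signatures documented in this README.
    """
    encoded = model.encode_image(image)  # the only encoding call made here
    result: Dict[str, Any] = {
        "caption": model.caption(encoded)["caption"],
        "answers": {},
    }
    for question in questions:
        result["answers"][question] = model.query(encoded, question)["answer"]
    return result
```

For example, `caption_and_ask(model, image, ["What's in this image?"])` would issue one caption call and one query call against the same encoded image.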
```python
encoded = model.encode_image(image)
```

| Type | Description |
|---|---|
| `Image.Image` | PIL Image object |
| `EncodedImage` | Base class for encoded images |
| `Base64EncodedImage` | Output of `encode_image()`, subtype of `EncodedImage` |
| `Region` | Bounding box with `x_min`, `y_min`, `x_max`, `y_max` |
| `Point` | Coordinates with `x`, `y` indicating object center |
| `SpatialRef` | `[x, y]` point or `[x1, y1, x2, y2]` bbox, normalized to [0, 1] |
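To draw results or build spatial hints, it often helps to convert between pixel coordinates and the normalized values in the table above. The sketch below is one way to do that. Both function names are hypothetical, and it assumes `Region` values are normalized to [0, 1] in the same way as `SpatialRef`, which the table does not state explicitly.

```python
from typing import Dict, List, Tuple


def region_to_pixels(
    region: Dict[str, float], width: int, height: int
) -> Tuple[int, int, int, int]:
    """Scale a Region to pixel (left, top, right, bottom).

    Assumption: Region values are normalized to [0, 1], like SpatialRef.
    """
    return (
        round(region["x_min"] * width),
        round(region["y_min"] * height),
        round(region["x_max"] * width),
        round(region["y_max"] * height),
    )


def pixel_to_spatial_ref(x: float, y: float, width: int, height: int) -> List[float]:
    """Turn a pixel coordinate into a normalized [x, y] SpatialRef point."""
    return [x / width, y / height]
```

With a PIL image, `image.size` returns `(width, height)`, so a detected region could be converted with `region_to_pixels(obj, *image.size)`.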