From d2f68af615c924db067478e1b2d6bc58310fd2fd Mon Sep 17 00:00:00 2001
From: Peter Steinberger
Date: Tue, 21 Apr 2026 22:33:53 +0100
Subject: [PATCH] docs: document Ollama image understanding

---
 docs/cli/infer.md                 | 29 ++++++++++--------
 docs/nodes/media-understanding.md |  6 ++++
 docs/providers/ollama.md          | 51 +++++++++++++++++++++++++++++++
 3 files changed, 73 insertions(+), 13 deletions(-)

diff --git a/docs/cli/infer.md b/docs/cli/infer.md
index 0a4c9d3a53b..4bbfc2a9de3 100644
--- a/docs/cli/infer.md
+++ b/docs/cli/infer.md
@@ -104,18 +104,18 @@ Benefits:
 
 This table maps common inference tasks to the corresponding infer command.
 
-| Task                    | Command                                                                 | Notes                                                |
-| ----------------------- | ----------------------------------------------------------------------- | ---------------------------------------------------- |
-| Run a text/model prompt | `openclaw infer model run --prompt "..." --json`                        | Uses the normal local path by default                |
-| Generate an image       | `openclaw infer image generate --prompt "..." --json`                   | Use `image edit` when starting from an existing file |
-| Describe an image file  | `openclaw infer image describe --file ./image.png --json`               | `--model` must be `<provider>/<model>`               |
-| Transcribe audio        | `openclaw infer audio transcribe --file ./memo.m4a --json`              | `--model` must be `<provider>/<model>`               |
-| Synthesize speech       | `openclaw infer tts convert --text "..." --output ./speech.mp3 --json`  | `tts status` is gateway-oriented                     |
-| Generate a video        | `openclaw infer video generate --prompt "..." --json`                   |                                                      |
-| Describe a video file   | `openclaw infer video describe --file ./clip.mp4 --json`                | `--model` must be `<provider>/<model>`               |
-| Search the web          | `openclaw infer web search --query "..." --json`                        |                                                      |
-| Fetch a web page        | `openclaw infer web fetch --url https://example.com --json`             |                                                      |
-| Create embeddings       | `openclaw infer embedding create --text "..." --json`                   |                                                      |
+| Task                    | Command                                                                 | Notes                                                    |
+| ----------------------- | ----------------------------------------------------------------------- | -------------------------------------------------------- |
+| Run a text/model prompt | `openclaw infer model run --prompt "..." --json`                        | Uses the normal local path by default                    |
+| Generate an image       | `openclaw infer image generate --prompt "..." --json`                   | Use `image edit` when starting from an existing file     |
+| Describe an image file  | `openclaw infer image describe --file ./image.png --json`               | `--model` must be an image-capable `<provider>/<model>`  |
+| Transcribe audio        | `openclaw infer audio transcribe --file ./memo.m4a --json`              | `--model` must be `<provider>/<model>`                   |
+| Synthesize speech       | `openclaw infer tts convert --text "..." --output ./speech.mp3 --json`  | `tts status` is gateway-oriented                         |
+| Generate a video        | `openclaw infer video generate --prompt "..." --json`                   |                                                          |
+| Describe a video file   | `openclaw infer video describe --file ./clip.mp4 --json`                | `--model` must be `<provider>/<model>`                   |
+| Search the web          | `openclaw infer web search --query "..." --json`                        |                                                          |
+| Fetch a web page        | `openclaw infer web fetch --url https://example.com --json`             |                                                          |
+| Create embeddings       | `openclaw infer embedding create --text "..." --json`                   |                                                          |
 
 ## Behavior
 
@@ -123,6 +123,7 @@ This table maps common inference tasks to the corresponding infer command.
 - Use `--json` when the output will be consumed by another command or script.
 - Use `--provider` or `--model provider/model` when a specific backend is required.
 - For `image describe`, `audio transcribe`, and `video describe`, `--model` must use the form `<provider>/<model>`.
+- For `image describe`, an explicit `--model` runs that provider/model directly. The model must be image-capable in the model catalog or provider config.
 - Stateless execution commands default to local.
 - Gateway-managed state commands default to gateway.
 - The normal local path does not require the gateway to be running.
@@ -152,12 +153,14 @@
 openclaw infer image generate --prompt "friendly lobster illustration" --json
 openclaw infer image generate --prompt "cinematic product photo of headphones" --json
 openclaw infer image describe --file ./photo.jpg --json
 openclaw infer image describe --file ./ui-screenshot.png --model openai/gpt-4.1-mini --json
+openclaw infer image describe --file ./photo.jpg --model ollama/qwen2.5vl:7b --json
 ```
 
 Notes:
 
 - Use `image edit` when starting from existing input files.
-- For `image describe`, `--model` must be `<provider>/<model>`.
+- For `image describe`, `--model` must be an image-capable `<provider>/<model>`.
+- For local Ollama vision models, pull the model first and set `OLLAMA_API_KEY` to any placeholder value, for example `ollama-local`. See [Ollama](/providers/ollama#vision-and-image-description).
 
 ## Audio
diff --git a/docs/nodes/media-understanding.md b/docs/nodes/media-understanding.md
index c58afbe4005..ff4e60f92be 100644
--- a/docs/nodes/media-understanding.md
+++ b/docs/nodes/media-understanding.md
@@ -136,6 +136,9 @@ Rules:
 - If the active primary image model already supports vision natively, OpenClaw
   skips the `[Image]` summary block and passes the original image into the
   model instead.
+- Explicit `openclaw infer image describe --model <provider>/<model>` requests
+  are different: they run that image-capable provider/model directly, including
+  Ollama refs such as `ollama/qwen2.5vl:7b`.
 - If `<capability>.enabled: true` but no models are configured, OpenClaw
   tries the **active reply model** when its provider supports the
   capability.
@@ -157,6 +160,9 @@ working option**:
 tried before the bundled fallback order.
 - Image-only config providers with an image-capable model auto-register for
   media understanding even when they are not a bundled vendor plugin.
+  - Ollama image understanding is available when selected explicitly, for
+    example through `agents.defaults.imageModel` or
+    `openclaw infer image describe --model ollama/<model>`.
 - Bundled fallback order:
   - Audio: OpenAI → Groq → Deepgram → Google → Mistral
   - Image: OpenAI → Anthropic → Google → MiniMax → MiniMax Portal → Z.AI
diff --git a/docs/providers/ollama.md b/docs/providers/ollama.md
index 011d2397837..84d4fb70170 100644
--- a/docs/providers/ollama.md
+++ b/docs/providers/ollama.md
@@ -3,6 +3,7 @@ summary: "Run OpenClaw with Ollama (cloud and local models)"
 read_when:
   - You want to run OpenClaw with cloud or local models via Ollama
   - You need Ollama setup and configuration guidance
+  - You want Ollama vision models for image understanding
 title: "Ollama"
 ---
 
@@ -182,6 +183,56 @@
 The new model will be automatically discovered and available to use.
 
 If you set `models.providers.ollama` explicitly, auto-discovery is skipped and
 you must define models manually. See the explicit config section below.
 
+## Vision and image description
+
+The bundled Ollama plugin registers Ollama as an image-capable media-understanding provider. This lets OpenClaw route explicit image-description requests and configured image-model defaults through local or hosted Ollama vision models.
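+
+Whether a model counts as image-capable under implicit discovery comes from Ollama itself (see the note at the end of this section). To inspect it by hand, a sketch, assuming the default local host, `jq` installed, and the `capabilities` field returned by recent Ollama releases:
+
+```bash
+# Ask Ollama which capabilities a model advertises; vision models
+# include "vision" in the list, e.g. ["completion", "vision"].
+curl -s http://localhost:11434/api/show -d '{"model": "qwen2.5vl:7b"}' | jq '.capabilities'
+```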
+
+For local vision, pull a model that supports images:
+
+```bash
+ollama pull qwen2.5vl:7b
+export OLLAMA_API_KEY="ollama-local"
+```
+
+Then verify with the infer CLI:
+
+```bash
+openclaw infer image describe \
+  --file ./photo.jpg \
+  --model ollama/qwen2.5vl:7b \
+  --json
+```
+
+`--model` must be a full `<provider>/<model>` ref. When it is set, `openclaw infer image describe` runs that model directly; the native-vision skip that applies during implicit media understanding does not apply to explicit requests.
+
+To make Ollama the default image-understanding model for inbound media, configure `agents.defaults.imageModel`:
+
+```json5
+{
+  agents: {
+    defaults: {
+      imageModel: {
+        primary: "ollama/qwen2.5vl:7b",
+      },
+    },
+  },
+}
+```
+
+If you define `models.providers.ollama.models` manually, mark vision models with image input support:
+
+```json5
+{
+  id: "qwen2.5vl:7b",
+  name: "qwen2.5vl:7b",
+  input: ["text", "image"],
+  contextWindow: 128000,
+  maxTokens: 8192,
+}
+```
+
+OpenClaw rejects image-description requests for models that are not marked image-capable. With implicit discovery, OpenClaw reads this from Ollama when `/api/show` reports a vision capability.
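+
+For orientation, the manual entry above sits under the documented `models.providers.ollama.models` path. A sketch of the surrounding shape (the nesting follows the dotted path; any provider-level fields belong to the explicit config section below):
+
+```json5
+{
+  models: {
+    providers: {
+      ollama: {
+        // provider-level settings go here; see the explicit config section
+        models: [
+          {
+            id: "qwen2.5vl:7b",
+            name: "qwen2.5vl:7b",
+            input: ["text", "image"],
+            contextWindow: 128000,
+            maxTokens: 8192,
+          },
+        ],
+      },
+    },
+  },
+}
+```
+
 ## Configuration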