From d2f68af615c924db067478e1b2d6bc58310fd2fd Mon Sep 17 00:00:00 2001
From: Peter Steinberger
Date: Tue, 21 Apr 2026 22:33:53 +0100
Subject: [PATCH] docs: document Ollama image understanding

---
 docs/cli/infer.md                 | 29 ++++++++++--------
 docs/nodes/media-understanding.md |  6 ++++
 docs/providers/ollama.md          | 51 +++++++++++++++++++++++++++++++
 3 files changed, 73 insertions(+), 13 deletions(-)

diff --git a/docs/cli/infer.md b/docs/cli/infer.md
index 0a4c9d3a53b..4bbfc2a9de3 100644
--- a/docs/cli/infer.md
+++ b/docs/cli/infer.md
@@ -104,18 +104,18 @@ Benefits:
 
 This table maps common inference tasks to the corresponding infer command.
 
-| Task                    | Command                                                                 | Notes                                                |
-| ----------------------- | ----------------------------------------------------------------------- | ---------------------------------------------------- |
-| Run a text/model prompt | `openclaw infer model run --prompt "..." --json`                        | Uses the normal local path by default                |
-| Generate an image       | `openclaw infer image generate --prompt "..." --json`                   | Use `image edit` when starting from an existing file |
-| Describe an image file  | `openclaw infer image describe --file ./image.png --json`               | `--model` must be `<provider>/<model>`               |
-| Transcribe audio        | `openclaw infer audio transcribe --file ./memo.m4a --json`              | `--model` must be `<provider>/<model>`               |
-| Synthesize speech       | `openclaw infer tts convert --text "..." --output ./speech.mp3 --json`  | `tts status` is gateway-oriented                     |
-| Generate a video        | `openclaw infer video generate --prompt "..." --json`                   |                                                      |
-| Describe a video file   | `openclaw infer video describe --file ./clip.mp4 --json`                | `--model` must be `<provider>/<model>`               |
-| Search the web          | `openclaw infer web search --query "..." --json`                        |                                                      |
-| Fetch a web page        | `openclaw infer web fetch --url https://example.com --json`             |                                                      |
-| Create embeddings       | `openclaw infer embedding create --text "..." --json`                   |                                                      |
+| Task                    | Command                                                                 | Notes                                                    |
+| ----------------------- | ----------------------------------------------------------------------- | -------------------------------------------------------- |
+| Run a text/model prompt | `openclaw infer model run --prompt "..." --json`                        | Uses the normal local path by default                    |
+| Generate an image       | `openclaw infer image generate --prompt "..." --json`                   | Use `image edit` when starting from an existing file     |
+| Describe an image file  | `openclaw infer image describe --file ./image.png --json`               | `--model` must be an image-capable `<provider>/<model>`  |
+| Transcribe audio        | `openclaw infer audio transcribe --file ./memo.m4a --json`              | `--model` must be `<provider>/<model>`                   |
+| Synthesize speech       | `openclaw infer tts convert --text "..." --output ./speech.mp3 --json`  | `tts status` is gateway-oriented                         |
+| Generate a video        | `openclaw infer video generate --prompt "..." --json`                   |                                                          |
+| Describe a video file   | `openclaw infer video describe --file ./clip.mp4 --json`                | `--model` must be `<provider>/<model>`                   |
+| Search the web          | `openclaw infer web search --query "..." --json`                        |                                                          |
+| Fetch a web page        | `openclaw infer web fetch --url https://example.com --json`             |                                                          |
+| Create embeddings       | `openclaw infer embedding create --text "..." --json`                   |                                                          |
 
 ## Behavior
 
@@ -123,6 +123,7 @@ This table maps common inference tasks to the corresponding infer command.
 - Use `--json` when the output will be consumed by another command or script.
 - Use `--provider` or `--model provider/model` when a specific backend is required.
 - For `image describe`, `audio transcribe`, and `video describe`, `--model` must use the form `<provider>/<model>`.
+- For `image describe`, an explicit `--model` runs that provider/model directly. The model must be image-capable in the model catalog or provider config.
 - Stateless execution commands default to local.
 - Gateway-managed state commands default to gateway.
 - The normal local path does not require the gateway to be running.
@@ -152,12 +153,14 @@
 openclaw infer image generate --prompt "friendly lobster illustration" --json
 openclaw infer image generate --prompt "cinematic product photo of headphones" --json
 openclaw infer image describe --file ./photo.jpg --json
 openclaw infer image describe --file ./ui-screenshot.png --model openai/gpt-4.1-mini --json
+openclaw infer image describe --file ./photo.jpg --model ollama/qwen2.5vl:7b --json
 ```
 
 Notes:
 
 - Use `image edit` when starting from existing input files.
-- For `image describe`, `--model` must be `<provider>/<model>`.
+- For `image describe`, `--model` must be an image-capable `<provider>/<model>`.
+- For local Ollama vision models, pull the model first and set `OLLAMA_API_KEY` to any placeholder value, for example `ollama-local`. See [Ollama](/providers/ollama#vision-and-image-description).
 
 ## Audio
diff --git a/docs/nodes/media-understanding.md b/docs/nodes/media-understanding.md
index c58afbe4005..ff4e60f92be 100644
--- a/docs/nodes/media-understanding.md
+++ b/docs/nodes/media-understanding.md
@@ -136,6 +136,9 @@ Rules:
 - If the active primary image model already supports vision natively, OpenClaw
   skips the `[Image]` summary block and passes the original image into the
   model instead.
+- Explicit `openclaw infer image describe --model <provider>/<model>` requests
+  are different: they run that image-capable provider/model directly, including
+  Ollama refs such as `ollama/qwen2.5vl:7b`.
 - If `<capability>.enabled: true` but no models are configured, OpenClaw
   tries the **active reply model** when its provider supports the
   capability.
@@ -157,6 +160,9 @@ working option**:
 tried before the bundled fallback order.
 - Image-only config providers with an image-capable model auto-register for
   media understanding even when they are not a bundled vendor plugin.
+  - Ollama image understanding is available when selected explicitly, for
+    example through `agents.defaults.imageModel` or
+    `openclaw infer image describe --model ollama/<model>`.
 - Bundled fallback order:
   - Audio: OpenAI → Groq → Deepgram → Google → Mistral
   - Image: OpenAI → Anthropic → Google → MiniMax → MiniMax Portal → Z.AI
diff --git a/docs/providers/ollama.md b/docs/providers/ollama.md
index 011d2397837..84d4fb70170 100644
--- a/docs/providers/ollama.md
+++ b/docs/providers/ollama.md
@@ -3,6 +3,7 @@ summary: "Run OpenClaw with Ollama (cloud and local models)"
 read_when:
   - You want to run OpenClaw with cloud or local models via Ollama
   - You need Ollama setup and configuration guidance
+  - You want Ollama vision models for image understanding
 title: "Ollama"
 ---
 
@@ -182,6 +183,56 @@
 The new model will be automatically discovered and available to use.
 
 If you set `models.providers.ollama` explicitly, auto-discovery is skipped and
 you must define models manually. See the explicit config section below.
 
+## Vision and image description
+
+The bundled Ollama plugin registers Ollama as an image-capable media-understanding provider. This lets OpenClaw route explicit image-description requests and configured image-model defaults through local or hosted Ollama vision models.
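+
+Whether a model counts as image-capable under implicit discovery comes from Ollama itself (see the note at the end of this section). To inspect it by hand, a sketch, assuming the default local host, `jq` installed, and the `capabilities` field returned by recent Ollama releases:
+
+```bash
+# Ask Ollama which capabilities a model advertises; vision models
+# include "vision" in the list, e.g. ["completion", "vision"].
+curl -s http://localhost:11434/api/show -d '{"model": "qwen2.5vl:7b"}' | jq '.capabilities'
+```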
+
+For local vision, pull a model that supports images:
+
+```bash
+ollama pull qwen2.5vl:7b
+export OLLAMA_API_KEY="ollama-local"
+```
+
+Then verify with the infer CLI:
+
+```bash
+openclaw infer image describe \
+  --file ./photo.jpg \
+  --model ollama/qwen2.5vl:7b \
+  --json
+```
+
+`--model` must be a full `<provider>/<model>` ref. When it is set, `openclaw infer image describe` runs that model directly; the native-vision skip that applies during implicit media understanding does not apply to explicit requests.
+
+To make Ollama the default image-understanding model for inbound media, configure `agents.defaults.imageModel`:
+
+```json5
+{
+  agents: {
+    defaults: {
+      imageModel: {
+        primary: "ollama/qwen2.5vl:7b",
+      },
+    },
+  },
+}
+```
+
+If you define `models.providers.ollama.models` manually, mark vision models with image input support:
+
+```json5
+{
+  id: "qwen2.5vl:7b",
+  name: "qwen2.5vl:7b",
+  input: ["text", "image"],
+  contextWindow: 128000,
+  maxTokens: 8192,
+}
+```
+
+OpenClaw rejects image-description requests for models that are not marked image-capable. With implicit discovery, OpenClaw reads this from Ollama when `/api/show` reports a vision capability.
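+
+For orientation, the manual entry above sits under the documented `models.providers.ollama.models` path. A sketch of the surrounding shape (the nesting follows the dotted path; any provider-level fields belong to the explicit config section below):
+
+```json5
+{
+  models: {
+    providers: {
+      ollama: {
+        // provider-level settings go here; see the explicit config section
+        models: [
+          {
+            id: "qwen2.5vl:7b",
+            name: "qwen2.5vl:7b",
+            input: ["text", "image"],
+            contextWindow: 128000,
+            maxTokens: 8192,
+          },
+        ],
+      },
+    },
+  },
+}
+```
+
 ## Configuration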