mirror of
https://github.com/openclaw/openclaw.git
synced 2026-05-06 05:50:43 +00:00
docs: document Ollama image understanding
@@ -104,18 +104,18 @@ Benefits:
This table maps common inference tasks to the corresponding infer command.
| Task                    | Command                                                                | Notes                                                 |
| ----------------------- | ---------------------------------------------------------------------- | ----------------------------------------------------- |
| Run a text/model prompt | `openclaw infer model run --prompt "..." --json`                       | Uses the normal local path by default                 |
| Generate an image       | `openclaw infer image generate --prompt "..." --json`                  | Use `image edit` when starting from an existing file  |
| Describe an image file  | `openclaw infer image describe --file ./image.png --json`              | `--model` must be an image-capable `<provider/model>` |
| Transcribe audio        | `openclaw infer audio transcribe --file ./memo.m4a --json`             | `--model` must be `<provider/model>`                  |
| Synthesize speech       | `openclaw infer tts convert --text "..." --output ./speech.mp3 --json` | `tts status` is gateway-oriented                      |
| Generate a video        | `openclaw infer video generate --prompt "..." --json`                  |                                                       |
| Describe a video file   | `openclaw infer video describe --file ./clip.mp4 --json`               | `--model` must be `<provider/model>`                  |
| Search the web          | `openclaw infer web search --query "..." --json`                       |                                                       |
| Fetch a web page        | `openclaw infer web fetch --url https://example.com --json`            |                                                       |
| Create embeddings       | `openclaw infer embedding create --text "..." --json`                  |                                                       |

## Behavior
@@ -123,6 +123,7 @@ This table maps common inference tasks to the corresponding infer command.
- Use `--json` when the output will be consumed by another command or script.
- Use `--provider` or `--model provider/model` when a specific backend is required.
- For `image describe`, `audio transcribe`, and `video describe`, `--model` must use the form `<provider/model>`.
- For `image describe`, an explicit `--model` runs that provider/model directly. The model must be image-capable in the model catalog or provider config.
- Stateless execution commands default to local.
- Gateway-managed state commands default to gateway.
- The normal local path does not require the gateway to be running.
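
As a sketch of the `--json` consumption pattern, the snippet below parses a saved payload. The field names here are assumptions for illustration only, not a documented output schema:

```python
import json

# Hypothetical --json payload from `openclaw infer image describe`;
# the "text" and "model" field names are assumptions, not a documented schema.
raw = '{"text": "A red bicycle leaning against a wall.", "model": "ollama/qwen2.5vl:7b"}'

payload = json.loads(raw)
print(payload["text"])
```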
@@ -152,12 +153,14 @@ openclaw infer image generate --prompt "friendly lobster illustration" --json
openclaw infer image generate --prompt "cinematic product photo of headphones" --json
openclaw infer image describe --file ./photo.jpg --json
openclaw infer image describe --file ./ui-screenshot.png --model openai/gpt-4.1-mini --json
openclaw infer image describe --file ./photo.jpg --model ollama/qwen2.5vl:7b --json
```

Notes:

- Use `image edit` when starting from existing input files.
- For `image describe`, `--model` must be an image-capable `<provider/model>`.
- For local Ollama vision models, pull the model first and set `OLLAMA_API_KEY` to any placeholder value, for example `ollama-local`. See [Ollama](/providers/ollama#vision-and-image-description).

## Audio
@@ -136,6 +136,9 @@ Rules:
- If the active primary image model already supports vision natively, OpenClaw skips the `[Image]` summary block and passes the original image into the model instead.
- Explicit `openclaw infer image describe --model <provider/model>` requests are different: they run that image-capable provider/model directly, including Ollama refs such as `ollama/qwen2.5vl:7b`.
- If `<capability>.enabled: true` but no models are configured, OpenClaw tries the **active reply model** when its provider supports the capability.
@@ -157,6 +160,9 @@ working option**:
  tried before the bundled fallback order.
- Image-only config providers with an image-capable model auto-register for media understanding even when they are not a bundled vendor plugin.
- Ollama image understanding is available when selected explicitly, for example through `agents.defaults.imageModel` or `openclaw infer image describe --model ollama/<vision-model>`.
- Bundled fallback order:
  - Audio: OpenAI → Groq → Deepgram → Google → Mistral
  - Image: OpenAI → Anthropic → Google → MiniMax → MiniMax Portal → Z.AI
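
The selection order above can be sketched as follows. The provider identifiers and the availability check are illustrative assumptions, not OpenClaw's actual internals:

```python
# Sketch: configured providers are tried before the bundled image
# fallback order, and the first available option wins.
BUNDLED_IMAGE_ORDER = [
    "openai", "anthropic", "google", "minimax", "minimax-portal", "zai",
]

def pick_image_provider(configured, available):
    """Return the first provider that is actually available, or None."""
    for provider in list(configured) + BUNDLED_IMAGE_ORDER:
        if provider in available:
            return provider
    return None

# A configured provider (e.g. Ollama) is preferred over the bundled order.
print(pick_image_provider(["ollama"], {"ollama", "openai"}))
```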
@@ -3,6 +3,7 @@ summary: "Run OpenClaw with Ollama (cloud and local models)"
read_when:
  - You want to run OpenClaw with cloud or local models via Ollama
  - You need Ollama setup and configuration guidance
  - You want Ollama vision models for image understanding
title: "Ollama"
---

@@ -182,6 +183,56 @@ The new model will be automatically discovered and available to use.
If you set `models.providers.ollama` explicitly, auto-discovery is skipped and you must define models manually. See the explicit config section below.
</Note>

## Vision and image description
The bundled Ollama plugin registers Ollama as an image-capable media-understanding provider. This lets OpenClaw route explicit image-description requests and configured image-model defaults through local or hosted Ollama vision models.

For local vision, pull a model that supports images:

```bash
ollama pull qwen2.5vl:7b
export OLLAMA_API_KEY="ollama-local"
```

Then verify with the infer CLI:

```bash
openclaw infer image describe \
  --file ./photo.jpg \
  --model ollama/qwen2.5vl:7b \
  --json
```

`--model` must be a full `<provider/model>` ref. When it is set, `openclaw infer image describe` runs that model directly rather than skipping the description step on the grounds that the model supports native vision.

To make Ollama the default image-understanding model for inbound media, configure `agents.defaults.imageModel`:

```json5
{
  agents: {
    defaults: {
      imageModel: {
        primary: "ollama/qwen2.5vl:7b",
      },
    },
  },
}
```


If you define `models.providers.ollama.models` manually, mark vision models with image input support:

```json5
{
  id: "qwen2.5vl:7b",
  name: "qwen2.5vl:7b",
  input: ["text", "image"],
  contextWindow: 128000,
  maxTokens: 8192,
}
```


OpenClaw rejects image-description requests for models that are not marked image-capable. With implicit discovery, OpenClaw reads this from Ollama when `/api/show` reports a vision capability.

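A minimal sketch of that gating, assuming a simplified in-memory catalog shape (only the `input` list mirrors the config above; everything else is illustrative):

```python
# Sketch of image-capability gating: a describe request is accepted
# only when the catalog entry lists "image" among its inputs.
CATALOG = {
    "ollama/qwen2.5vl:7b": {"input": ["text", "image"]},
    "ollama/llama3.1:8b": {"input": ["text"]},
}

def can_describe_images(model_ref):
    """True when the model is marked image-capable in the catalog."""
    entry = CATALOG.get(model_ref)
    return entry is not None and "image" in entry["input"]

print(can_describe_images("ollama/qwen2.5vl:7b"))
```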
## Configuration
<Tabs>