fix(cli): wire image describe prompt options

2026-05-06 12:10:42 +00:00 · 2026-04-28 10:53:47 +01:00
parent 0bc8b9a95a
commit 6f8792f3f1
7 changed files with 187 additions and 20 deletions
--- a/docs/cli/infer.md
+++ b/docs/cli/infer.md
@@ -107,18 +107,18 @@ and the shared capability runtime before the provider request is made.

 This table maps common inference tasks to the corresponding infer command.

-| Task                    | Command                                                                | Notes                                                 |
-| ----------------------- | ---------------------------------------------------------------------- | ----------------------------------------------------- |
-| Run a text/model prompt | `openclaw infer model run --prompt "..." --json`                       | Uses the normal local path by default                 |
-| Generate an image       | `openclaw infer image generate --prompt "..." --json`                  | Use `image edit` when starting from an existing file  |
-| Describe an image file  | `openclaw infer image describe --file ./image.png --json`              | `--model` must be an image-capable `<provider/model>` |
-| Transcribe audio        | `openclaw infer audio transcribe --file ./memo.m4a --json`             | `--model` must be `<provider/model>`                  |
-| Synthesize speech       | `openclaw infer tts convert --text "..." --output ./speech.mp3 --json` | `tts status` is gateway-oriented                      |
-| Generate a video        | `openclaw infer video generate --prompt "..." --json`                  | Supports provider hints such as `--resolution`        |
-| Describe a video file   | `openclaw infer video describe --file ./clip.mp4 --json`               | `--model` must be `<provider/model>`                  |
-| Search the web          | `openclaw infer web search --query "..." --json`                       |                                                       |
-| Fetch a web page        | `openclaw infer web fetch --url https://example.com --json`            |                                                       |
-| Create embeddings       | `openclaw infer embedding create --text "..." --json`                  |                                                       |
+| Task                    | Command                                                                  | Notes                                                 |
+| ----------------------- | ------------------------------------------------------------------------ | ----------------------------------------------------- |
+| Run a text/model prompt | `openclaw infer model run --prompt "..." --json`                         | Uses the normal local path by default                 |
+| Generate an image       | `openclaw infer image generate --prompt "..." --json`                    | Use `image edit` when starting from an existing file  |
+| Describe an image file  | `openclaw infer image describe --file ./image.png --prompt "..." --json` | `--model` must be an image-capable `<provider/model>` |
+| Transcribe audio        | `openclaw infer audio transcribe --file ./memo.m4a --json`               | `--model` must be `<provider/model>`                  |
+| Synthesize speech       | `openclaw infer tts convert --text "..." --output ./speech.mp3 --json`   | `tts status` is gateway-oriented                      |
+| Generate a video        | `openclaw infer video generate --prompt "..." --json`                    | Supports provider hints such as `--resolution`        |
+| Describe a video file   | `openclaw infer video describe --file ./clip.mp4 --json`                 | `--model` must be `<provider/model>`                  |
+| Search the web          | `openclaw infer web search --query "..." --json`                         |                                                       |
+| Fetch a web page        | `openclaw infer web fetch --url https://example.com --json`              |                                                       |
+| Create embeddings       | `openclaw infer embedding create --text "..." --json`                    |                                                       |

 ## Behavior

@@ -176,8 +176,10 @@ openclaw infer image generate --prompt "slow image backend" --timeout-ms 180000
 openclaw infer image edit --file ./logo.png --model openai/gpt-image-1.5 --output-format png --background transparent --prompt "keep the logo, remove the background" --json
 openclaw infer image edit --file ./poster.png --prompt "make this a vertical story ad" --size 2160x3840 --aspect-ratio 9:16 --resolution 4K --json
 openclaw infer image describe --file ./photo.jpg --json
+openclaw infer image describe --file ./receipt.jpg --prompt "Extract the merchant, date, and total" --json
+openclaw infer image describe-many --file ./before.png --file ./after.png --prompt "Compare the screenshots and list visible UI changes" --json
 openclaw infer image describe --file ./ui-screenshot.png --model openai/gpt-4.1-mini --json
-openclaw infer image describe --file ./photo.jpg --model ollama/qwen2.5vl:7b --json
+openclaw infer image describe --file ./photo.jpg --model ollama/qwen2.5vl:7b --prompt "Describe the image in one sentence" --timeout-ms 300000 --json
 ```

 Notes:
@@ -208,6 +210,8 @@ Notes:
  output paths. When `--output` is set, the final extension may follow the
  provider's returned MIME type.

+- For `image describe` and `image describe-many`, use `--prompt` to give the vision model a task-specific instruction such as OCR, comparison, UI inspection, or concise captioning.
+- Use `--timeout-ms` to raise the timeout for slow local vision models or cold Ollama starts.
 - For `image describe`, `--model` must be an image-capable `<provider/model>`.
 - For local Ollama vision models, pull the model first and set `OLLAMA_API_KEY` to any placeholder value, for example `ollama-local`. See [Ollama](/providers/ollama#vision-and-image-description).