fix(cli): streamline local model probes

Peter Steinberger
2026-04-27 23:02:26 +01:00
parent d7dcd0e21e
commit 42dddbbe78
14 changed files with 605 additions and 56 deletions

View File

@@ -130,7 +130,8 @@ This table maps common inference tasks to the corresponding infer command.
- Stateless execution commands default to local.
- Gateway-managed state commands default to gateway.
- The normal local path does not require the gateway to be running.
- `model run` is one-shot. MCP servers opened through the agent runtime for that command are retired after the reply for both local and `--gateway` execution, so repeated scripted invocations do not keep stdio MCP child processes alive.
- Local `model run` is a lean one-shot provider completion. It resolves the configured agent model and auth, but does not start a chat-agent turn, load tools, or open bundled MCP servers.
- `model run --gateway` still uses the Gateway agent runtime so it can exercise the same routed runtime path as a normal Gateway-backed turn. MCP servers opened through that runtime are retired after the reply, so repeated scripted invocations do not keep stdio MCP child processes alive. Both invocation shapes are sketched below.
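As a rough sketch of the two paths (the model ref is an example; substitute any configured `<provider/model>`, and the Gateway must be running for the second command):
```bash
# Lean local probe: one-shot provider completion, no agent turn or MCP servers
openclaw infer model run --local --model anthropic/claude-sonnet-4-6 \
  --prompt "Reply with exactly: pong" --json

# Gateway probe: routes through the Gateway agent runtime
openclaw infer model run --gateway --model anthropic/claude-sonnet-4-6 \
  --prompt "Reply with exactly: pong" --json
```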
## Model
@@ -143,10 +144,22 @@ openclaw infer model providers --json
openclaw infer model inspect --name gpt-5.5 --json
```
Use full `<provider/model>` refs to smoke-test a specific provider without
starting the Gateway or loading the full agent tool surface:
```bash
openclaw infer model run --local --model anthropic/claude-sonnet-4-6 --prompt "Reply with exactly: pong" --json
openclaw infer model run --local --model cerebras/zai-glm-4.7 --prompt "Reply with exactly: pong" --json
openclaw infer model run --local --model google/gemini-2.5-flash --prompt "Reply with exactly: pong" --json
openclaw infer model run --local --model groq/llama-3.1-8b-instant --prompt "Reply with exactly: pong" --json
openclaw infer model run --local --model mistral/mistral-small-latest --prompt "Reply with exactly: pong" --json
openclaw infer model run --local --model openai/gpt-4.1 --prompt "Reply with exactly: pong" --json
```
Notes:
- `model run` reuses the agent runtime so provider/model overrides behave like normal agent execution.
- Because `model run` is intended for headless automation, it does not retain per-session bundled MCP runtimes after the command finishes.
- Local `model run` is the narrowest CLI smoke test for provider/model/auth health because it sends only the supplied prompt to the selected model.
- Use `model run --gateway` when you need to test Gateway routing, agent-runtime setup, or Gateway-managed provider state instead of the lean local completion path.
- `model auth login`, `model auth logout`, and `model auth status` manage saved provider auth state.
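If a probe fails with an authentication error, check saved credentials first. A minimal sketch using the auth subcommands above (exact flags, if any, may differ):
```bash
# Show which providers have saved credentials
openclaw infer model auth status

# Re-authenticate a provider, then re-run the failing probe
openclaw infer model auth login
```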
## Image

View File

@@ -239,14 +239,20 @@ Compatibility notes for stricter OpenAI-compatible backends:
```
- Some smaller or stricter local backends are unstable with OpenClaw's full
agent-runtime prompt shape, especially when tool schemas are included. If the
backend works for tiny direct `/v1/chat/completions` calls but fails on normal
OpenClaw agent turns, first try
agent-runtime prompt shape, especially when tool schemas are included. First
verify the provider path with the lean local probe:
```bash
openclaw infer model run --local --model <provider/model> --prompt "Reply with exactly: pong" --json
```
If that succeeds but normal OpenClaw agent turns fail, first try
`agents.defaults.experimental.localModelLean: true` to drop heavyweight
default tools like `browser`, `cron`, and `message`; this is an experimental
flag, not a stable default-mode setting. See
[Experimental Features](/concepts/experimental-features). If that still fails, try
`models.providers.<provider>.models[].compat.supportsTools: false`.
- If the backend still fails only on larger OpenClaw runs, the remaining issue
is usually upstream model/server capacity or a backend bug, not OpenClaw's
transport layer (see the direct-call sketch below for comparison).
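For comparison, a "tiny direct `/v1/chat/completions` call" means a raw request straight to the backend with no OpenClaw prompt shaping or tool schemas. A minimal sketch, assuming a local OpenAI-compatible server (host, port, and model name are placeholders for your setup):
```bash
# Direct request to the backend; bypasses OpenClaw entirely
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-local-model",
    "messages": [{"role": "user", "content": "Reply with exactly: pong"}]
  }'
```
If this direct call works but the lean local probe fails, suspect OpenClaw's provider URL, model ref, or auth config rather than the backend itself.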
@@ -264,10 +270,11 @@ Compatibility notes for stricter OpenAI-compatible backends:
- Context errors? Lower `contextWindow` or raise your server limit.
- OpenAI-compatible server returns `messages[].content ... expected a string`?
Add `compat.requiresStringContent: true` on that model entry.
- Direct tiny `/v1/chat/completions` calls work, but `openclaw infer model run`
fails on Gemma or another local model? Disable tool schemas first with
`compat.supportsTools: false`, then retest. If the server still crashes only
on larger OpenClaw prompts, treat it as an upstream server/model limitation.
- Direct tiny `/v1/chat/completions` calls work, but `openclaw infer model run --local`
fails on Gemma or another local model? Check the provider URL, model ref, auth
marker, and server logs first; local `model run` does not include agent tools.
If local `model run` succeeds but larger agent turns fail, reduce the agent
tool surface with `localModelLean` or `compat.supportsTools: false`.
- Tool calls show up as raw JSON/XML/ReAct text, or the provider returns an
empty `tool_calls` array? Do not add a proxy that blindly converts assistant
text into tool execution. Fix the server chat template/parser first. If the

View File

@@ -185,7 +185,7 @@ When you set `OLLAMA_API_KEY` (or an auth profile) and **do not** define `models
| Token limits | Sets `maxTokens` to the default Ollama max-token cap used by OpenClaw |
| Costs | Sets all costs to `0` |
This avoids manual model entries while keeping the catalog aligned with the local Ollama instance.
This avoids manual model entries while keeping the catalog aligned with the local Ollama instance. You can use a full ref such as `ollama/<pulled-model>:latest` in local `infer model run`; OpenClaw resolves that installed model from Ollama's live catalog without requiring a hand-written `models.json` entry.
```bash
# See what models are available
@@ -193,6 +193,31 @@ ollama list
openclaw models list
```
For a narrow text-generation smoke test that avoids the full agent tool surface,
use local `infer model run` with a full Ollama model ref:
```bash
OLLAMA_API_KEY=ollama-local \
openclaw infer model run \
--local \
--model ollama/llama3.2:latest \
--prompt "Reply with exactly: pong" \
--json
```
That path still uses OpenClaw's configured provider, auth, and native Ollama
transport, but it does not start a chat-agent turn or load MCP/tool context. If
this succeeds while normal agent replies fail, troubleshoot the model's agent
prompt/tool capacity next.
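To then exercise the fuller agent-runtime path against the same model, one option is the Gateway-routed probe (assuming the Gateway is running and already has the Ollama provider configured):
```bash
# Same probe, routed through the Gateway agent runtime instead of the lean local path
openclaw infer model run \
  --gateway \
  --model ollama/llama3.2:latest \
  --prompt "Reply with exactly: pong" \
  --json
```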
Live-verify the local text path, native stream path, and embeddings against
local Ollama with:
```bash
OPENCLAW_LIVE_TEST=1 OPENCLAW_LIVE_OLLAMA=1 OPENCLAW_LIVE_OLLAMA_WEB_SEARCH=0 \
pnpm test:live -- extensions/ollama/ollama.live.test.ts
```
To add a new model, simply pull it with Ollama:
```bash