From 624eaf5d4a7939c7b565257ecc379cdba1114d4b Mon Sep 17 00:00:00 2001
From: Vincent Koc
Date: Sat, 2 May 2026 14:43:43 -0700
Subject: [PATCH] docs(local-models): add hardware floor + backend picker,
 restructure stricter backend section as numbered triage flow

---
 docs/gateway/local-models.md | 56 +++++++++++++++++++----------
 1 file changed, 30 insertions(+), 26 deletions(-)

diff --git a/docs/gateway/local-models.md b/docs/gateway/local-models.md
index 2e6b26cf456..d5bcf482a12 100644
--- a/docs/gateway/local-models.md
+++ b/docs/gateway/local-models.md
@@ -7,9 +7,22 @@ read_when:
 title: "Local models"
 ---
 
-Local is doable, but OpenClaw expects large context + strong defenses against prompt injection. Small cards truncate context and leak safety. Aim high: **≥2 maxed-out Mac Studios or equivalent GPU rig (~$30k+)**. A single **24 GB** GPU works only for lighter prompts with higher latency. Use the **largest / full-size model variant you can run**; aggressively quantized or “small” checkpoints raise prompt-injection risk (see [Security](/gateway/security)).
+Local models are doable. They also raise the bar on hardware, context size, and prompt-injection defense: small or aggressively quantized models truncate context and weaken safety behavior. This page is the opinionated guide for higher-end local stacks and custom OpenAI-compatible local servers. For the lowest-friction onboarding, start with [LM Studio](/providers/lmstudio) or [Ollama](/providers/ollama) and `openclaw onboard`.
 
-If you want the lowest-friction local setup, start with [LM Studio](/providers/lmstudio) or [Ollama](/providers/ollama) and `openclaw onboard`. This page is the opinionated guide for higher-end local stacks and custom OpenAI-compatible local servers.
+## Hardware floor
+
+Aim high: **≥2 maxed-out Mac Studios or an equivalent GPU rig (~$30k+)** for a comfortable agent loop. A single **24 GB** GPU works only for lighter prompts at higher latency. Always run the **largest / full-size variant you can host**; small or heavily quantized checkpoints raise prompt-injection risk (see [Security](/gateway/security)).
+
+## Pick a backend
+
+| Backend                                              | Use when                                                                     |
+| ---------------------------------------------------- | ---------------------------------------------------------------------------- |
+| [LM Studio](/providers/lmstudio)                     | First-time local setup, GUI loader, native Responses API                     |
+| [Ollama](/providers/ollama)                          | CLI workflow, model library, hands-off systemd service                       |
+| MLX / vLLM / SGLang                                  | High-throughput self-hosted serving with an OpenAI-compatible HTTP endpoint  |
+| LiteLLM / OAI-proxy / custom OpenAI-compatible proxy | You front another model API and need OpenClaw to treat it as OpenAI          |
+
+Use the Responses API (`api: "openai-responses"`) when the backend supports it (LM Studio does). Otherwise stick to Chat Completions (`api: "openai-completions"`).
 
 **WSL2 + Ollama + NVIDIA/CUDA users:** The official Ollama Linux installer enables a systemd service with `Restart=always`. On WSL2 GPU setups, autostart can reload the last model during boot and pin host memory. If your WSL2 VM repeatedly restarts after enabling Ollama, see [WSL2 crash loop](/providers/ollama#wsl2-crash-loop-repeated-reboots).
 
@@ -279,36 +292,27 @@ Compatibility notes for stricter OpenAI-compatible backends:
 }
 ```
 
-- Some smaller or stricter local backends are unstable with OpenClaw's full
-  agent-runtime prompt shape, especially when tool schemas are included. First
-  verify the provider path with the lean local probe:
-
-  ```bash
-  openclaw infer model run --local --model <model> --prompt "Reply with exactly: pong" --json
-  ```
-
-  To verify the Gateway route without the full agent prompt shape, use the
-  Gateway model probe instead:
-
-  ```bash
-  openclaw infer model run --gateway --model <model> --prompt "Reply with exactly: pong" --json
-  ```
-
-  Both local and Gateway model probes send only the supplied prompt. The
-  Gateway probe still validates Gateway routing, auth, and provider selection,
-  but it intentionally skips prior session transcript, AGENTS/bootstrap context,
-  context-engine assembly, tools, and bundled MCP servers.
-
-  If that succeeds but normal OpenClaw agent turns fail, first try
-  `agents.defaults.experimental.localModelLean: true` to drop heavyweight
-  default tools like `browser`, `cron`, and `message`; this is an experimental
-  flag, not a stable default-mode setting. See
-  [Experimental Features](/concepts/experimental-features). If that still fails, try
-  `models.providers.<provider>.models[].compat.supportsTools: false`.
-
-- If the backend still fails only on larger OpenClaw runs, the remaining issue
-  is usually upstream model/server capacity or a backend bug, not OpenClaw's
-  transport layer.
+## Smaller or stricter backends
+
+If the model loads cleanly but full agent turns misbehave, work top-down: confirm transport first, then narrow the surface.
+
+1. **Confirm the local model itself responds.** No tools, no agent context:
+
+   ```bash
+   openclaw infer model run --local --model <model> --prompt "Reply with exactly: pong" --json
+   ```
+
+2. **Confirm Gateway routing.** This probe sends only the supplied prompt. It skips the prior session transcript, AGENTS/bootstrap context, context-engine assembly, tools, and bundled MCP servers, while still exercising Gateway routing, auth, and provider selection:
+
+   ```bash
+   openclaw infer model run --gateway --model <model> --prompt "Reply with exactly: pong" --json
+   ```
+
+3. **Try lean mode.** If both probes pass but real agent turns fail with malformed tool calls or oversized prompts, enable `agents.defaults.experimental.localModelLean: true`. It drops the three heaviest default tools (`browser`, `cron`, `message`) so the prompt shape is smaller and less brittle. See [Experimental Features → Local model lean mode](/concepts/experimental-features#local-model-lean-mode) for the full explanation, when to use it, and how to confirm it is on.
+
+4. **Disable tools entirely as a last resort.** If lean mode is not enough, set `models.providers.<provider>.models[].compat.supportsTools: false` for that model entry. The agent then operates without tool calls on that model; see the config sketch after this list.
+
+5. **Past that, the bottleneck is upstream.** If the backend still fails only on larger OpenClaw runs after lean mode and `supportsTools: false`, the remaining issue is usually upstream model or server capacity (context window, GPU memory, KV-cache eviction) or a backend bug, not OpenClaw's transport layer.
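+
+As a concrete illustration of steps 3 and 4, here is a minimal config sketch. Only the dotted key paths come from the steps above; the JSON5 shape, the `my-local` provider name, and the `<model>` id are placeholder assumptions to adapt to your setup:
+
+```json5
+{
+  agents: {
+    defaults: {
+      // Step 3: lean mode drops the heavyweight browser/cron/message tools
+      experimental: { localModelLean: true },
+    },
+  },
+  models: {
+    providers: {
+      // "my-local" is a placeholder provider name
+      "my-local": {
+        models: [
+          {
+            id: "<model>",
+            // Step 4 (last resort): disable tool calls for this model only
+            compat: { supportsTools: false },
+          },
+        ],
+      },
+    },
+  },
+}
+```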
 
 ## Troubleshooting