feat(memory): configurable local embedding contextSize (default 4096) (#70544)

node-llama-cpp defaults contextSize to "auto", which on large embedding
models like Qwen3-Embedding-8B (trained context 40,960) inflates gateway
VRAM from ~8.8 GB to ~32 GB and causes OOM on single-GPU hosts that share
the gateway with an LLM runtime.

Expose memorySearch.local.contextSize in openclaw.json (number | "auto"),
defaulting to 4096, which comfortably covers typical memory-search chunks
(128–512 tokens) while keeping non-weight VRAM bounded.

Closes #69667.
aalekh-sarvam
2026-04-24 02:51:53 +05:30
committed by GitHub
parent 88b3fa14f0
commit d40dd9088e
11 changed files with 97 additions and 6 deletions


@@ -198,10 +198,11 @@ arn:aws:bedrock:*::foundation-model/amazon.titan-embed-text-v2:0
## Local embedding config
| Key | Type | Default | Description |
| --------------------- | -------- | ---------------------- | ------------------------------- |
| `local.modelPath` | `string` | auto-downloaded | Path to GGUF model file |
| `local.modelCacheDir` | `string` | node-llama-cpp default | Cache dir for downloaded models |
| Key                   | Type               | Default                | Description |
| --------------------- | ------------------ | ---------------------- | ----------- |
| `local.modelPath`     | `string`           | auto-downloaded        | Path to GGUF model file |
| `local.modelCacheDir` | `string`           | node-llama-cpp default | Cache dir for downloaded models |
| `local.contextSize`   | `number \| "auto"` | `4096`                 | Context window size for the embedding context. 4096 covers typical chunks (128–512 tokens) while bounding non-weight VRAM. Lower to 1024–2048 on constrained hosts. `"auto"` uses the model's trained maximum; not recommended for 8B+ models (Qwen3-Embedding-8B: 40,960 tokens → ~32 GB VRAM vs ~8.8 GB at 4096). |
Default model: `embeddinggemma-300m-qat-Q8_0.gguf` (~0.6 GB, auto-downloaded).
Requires native build: `pnpm approve-builds` then `pnpm rebuild node-llama-cpp`.
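
For reference, a hedged sketch of how a bounded `contextSize` maps onto
node-llama-cpp's v3 embedding API; the model path is illustrative and this is
not the repo's actual wiring, which the diff above doesn't show:

```ts
import { getLlama } from "node-llama-cpp";

const llama = await getLlama();
const model = await llama.loadModel({
    // Illustrative path; openclaw auto-downloads the default model.
    modelPath: "embeddinggemma-300m-qat-Q8_0.gguf"
});

// contextSize: 4096 bounds the non-weight (context) VRAM allocation;
// "auto" would instead size the context to the model's trained maximum.
const embeddingContext = await model.createEmbeddingContext({
    contextSize: 4096
});

const embedding = await embeddingContext.getEmbeddingFor("memory-search chunk text");
console.log(embedding.vector.length);
```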