feat(memory): configurable local embedding contextSize (default 4096) (#70544)

node-llama-cpp defaults contextSize to "auto", which on large embedding
models like Qwen3-Embedding-8B (trained context 40,960) inflates gateway
VRAM from ~8.8 GB to ~32 GB and causes OOM on single-GPU hosts that share
the gateway with an LLM runtime.

Expose memorySearch.local.contextSize in openclaw.json (number | "auto"),
defaulting to 4096, which comfortably covers typical memory-search chunks
(128–512 tokens) while keeping non-weight VRAM bounded.

Closes #69667.
aalekh-sarvam
2026-04-24 02:51:53 +05:30
committed by GitHub
parent 88b3fa14f0
commit d40dd9088e
11 changed files with 97 additions and 6 deletions


@@ -198,10 +198,11 @@ arn:aws:bedrock:*::foundation-model/amazon.titan-embed-text-v2:0
## Local embedding config
| Key | Type | Default | Description |
| --------------------- | -------- | ---------------------- | ------------------------------- |
| `local.modelPath` | `string` | auto-downloaded | Path to GGUF model file |
| `local.modelCacheDir` | `string` | node-llama-cpp default | Cache dir for downloaded models |
| Key                   | Type               | Default                | Description |
| --------------------- | ------------------ | ---------------------- | ----------- |
| `local.modelPath`     | `string`           | auto-downloaded        | Path to GGUF model file |
| `local.modelCacheDir` | `string`           | node-llama-cpp default | Cache dir for downloaded models |
| `local.contextSize`   | `number \| "auto"` | `4096`                 | Context window size for the embedding context. 4096 covers typical chunks (128–512 tokens) while bounding non-weight VRAM. Lower to 1024–2048 on constrained hosts. `"auto"` uses the model's trained maximum; not recommended for 8B+ models (Qwen3-Embedding-8B: 40,960 tokens → ~32 GB VRAM vs ~8.8 GB at 4096). |
Default model: `embeddinggemma-300m-qat-Q8_0.gguf` (~0.6 GB, auto-downloaded).
Requires native build: `pnpm approve-builds` then `pnpm rebuild node-llama-cpp`.
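
For reference, a hedged sketch of how a bounded `contextSize` maps onto
node-llama-cpp's v3 embedding API; the model path is illustrative and this is
not the repo's actual wiring, which the diff above doesn't show:

```ts
import { getLlama } from "node-llama-cpp";

const llama = await getLlama();
const model = await llama.loadModel({
    // Illustrative path; openclaw auto-downloads the default model.
    modelPath: "embeddinggemma-300m-qat-Q8_0.gguf"
});

// contextSize: 4096 bounds the non-weight (context) VRAM allocation;
// "auto" would instead size the context to the model's trained maximum.
const embeddingContext = await model.createEmbeddingContext({
    contextSize: 4096
});

const embedding = await embeddingContext.getEmbeddingFor("memory-search chunk text");
console.log(embedding.vector.length);
```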