fix(gateway): prefer linux child OOM victims

Raise eligible Linux child processes' own oom_score_adj from a child-side /bin/sh exec shim so cgroup memory pressure prefers transient workers over the long-lived gateway. Cover supervisor children, PTY shells, MCP stdio servers, and OpenClaw-launched browser processes through the shared process runtime seam.

Harden the wrapper for distroless images, shell startup env, per-child and process-level opt-outs, dash-compatible exec, and leading-dash command names. Document Linux verification and OOM behavior.

Fixes #70404.

Co-authored-by: Neerav Makwana <261249544+neeravmakwana@users.noreply.github.com>
Peter Steinberger
2026-04-23 05:10:30 +01:00
parent d3a2e993a8
commit cc9dcd3d69
14 changed files with 451 additions and 25 deletions

@@ -1,2 +1,2 @@
e10f01ce10a381ecb098b805cee95b7278d16de42e02c7873f54448eb2b6c5cc plugin-sdk-api-baseline.json
918b646ff2e0849c4feba5ef930a08187a7bdad3a2d35ba4e1dd456fe3ea2cea plugin-sdk-api-baseline.jsonl
6297ca54fecbf277f3ed2e76410cc79aef95cf7dd887ab2383858a2132f81777 plugin-sdk-api-baseline.json
aa3343fda656a0034f9dd5ec7e28fcf45d49b15c1ed64329673ac1629285730c plugin-sdk-api-baseline.jsonl

@@ -3,6 +3,7 @@ summary: "Linux support + companion app status"
read_when:
- Looking for Linux companion app status
- Planning platform coverage or contributions
- Debugging Linux OOM kills or exit 137 on a VPS or container
title: "Linux App"
---
@@ -98,3 +99,39 @@ Enable it:
```
systemctl --user enable --now openclaw-gateway[-<profile>].service
```

## Memory pressure and OOM kills

On Linux, the kernel chooses an OOM victim when a host, VM, or container cgroup
runs out of memory. The Gateway can be a poor victim because it owns long-lived
sessions and channel connections. OpenClaw therefore biases transient child
processes to be killed before the Gateway when possible.

For eligible Linux child spawns, OpenClaw starts the child through a short
`/bin/sh` wrapper that raises the child's own `oom_score_adj` to `1000`, then
`exec`s the real command. This is an unprivileged operation because the child is
only increasing its own OOM kill likelihood.
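
The wrapper described above can be sketched roughly as follows. This is an
illustration only, not the shim shipped in this commit: the helper name
`oom_wrap` is hypothetical, and the sketch omits the hardening the commit
message mentions (distroless images, opt-outs, leading-dash command names).
On a Linux host where the write succeeds, `oom_wrap cat /proc/self/oom_score_adj`
should print `1000`, because `exec` keeps the same PID and the raised score.

```bash
# Hypothetical helper: run a command through a /bin/sh shim that raises
# its own oom_score_adj before replacing itself with the real command.
oom_wrap() {
  /bin/sh -c '
    # Best effort: ignore failures (non-Linux host, read-only /proc).
    { echo 1000 > /proc/self/oom_score_adj; } 2>/dev/null || true
    # Replace the shell with the real command; the PID and the raised
    # oom_score_adj carry over to it.
    exec "$0" "$@"
  ' "$@"
}
```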

Covered child process surfaces include:

- supervisor-managed command children,
- PTY shell children,
- MCP stdio server children,
- OpenClaw-launched browser/Chrome processes.

The wrapper is Linux-only and is skipped when `/bin/sh` is unavailable. It is
also skipped if the child env sets `OPENCLAW_CHILD_OOM_SCORE_ADJ` to `0`,
`false`, `no`, or `off`.
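
The opt-out check can be pictured as a small predicate over the variable's
value. The sketch below is an assumption about shape, not the project's code:
`oom_bias_enabled` is a hypothetical name, and it matches only the four
lowercase values listed above. How the real implementation treats case,
whitespace, or other values is not stated here. A call site might look like
`oom_bias_enabled "${OPENCLAW_CHILD_OOM_SCORE_ADJ-}" && use_wrapper=1`.

```bash
# Hypothetical helper: succeed (return 0) when the wrapper should be used
# for the given OPENCLAW_CHILD_OOM_SCORE_ADJ value, fail when opted out.
oom_bias_enabled() {
  case "${1-}" in
    0|false|no|off) return 1 ;;  # the documented opt-out values
    *) return 0 ;;               # assumption: anything else keeps the wrapper
  esac
}
```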

To verify a child process:

```bash
cat /proc/<child-pid>/oom_score_adj
```

The expected value for covered children is `1000`. The Gateway process should keep
its normal score, usually `0`.
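
To compare several processes at once, a small loop over `/proc` is enough.
The helper name `print_oom_score_adj` is hypothetical. One possible call is
`print_oom_score_adj "$gateway_pid" $(pgrep -P "$gateway_pid")`, where
`gateway_pid` is a placeholder you set yourself and `pgrep -P` lists only
direct children of that PID.

```bash
# Hypothetical helper: print "<pid> <oom_score_adj>" for each PID argument,
# or "<pid> unavailable" when the process is gone or /proc is not readable.
print_oom_score_adj() {
  for pid in "$@"; do
    if [ -r "/proc/$pid/oom_score_adj" ]; then
      printf '%s %s\n' "$pid" "$(cat "/proc/$pid/oom_score_adj")"
    else
      printf '%s unavailable\n' "$pid"
    fi
  done
}
```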

This does not replace normal memory tuning. If a VPS or container repeatedly
kills children, increase the memory limit, reduce concurrency, or add stronger
resource controls such as systemd `MemoryMax=` or container-level memory limits.
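
When diagnosing `exit 137`, it helps to recall the common shell convention
that a process killed by a signal is reported as 128 plus the signal number,
so 137 corresponds to signal 9 (SIGKILL), the signal the kernel OOM killer
sends. The helper below is a hypothetical sketch of that arithmetic. Note that
the status alone shows only that SIGKILL was delivered, not who sent it, so
the kernel log is still needed to confirm an OOM kill.

```bash
# Hypothetical helper: print the signal number encoded in an exit status
# above 128, or fail for statuses that do not encode a signal.
exit_status_signal() {
  status=$1
  if [ "$status" -gt 128 ]; then
    echo $((status - 128))
  else
    return 1
  fi
}
```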

@@ -114,3 +114,6 @@ If you deliberately installed a system unit instead, edit
How `Restart=` policies help automated recovery:
[systemd can automate service recovery](https://www.redhat.com/en/blog/systemd-automate-recovery).

For Linux OOM behavior, child process victim selection, and `exit 137`
diagnostics, see [Linux memory pressure and OOM kills](/platforms/linux#memory-pressure-and-oom-kills).