When a child openclaw process is spawned via a backgrounded subshell that
exits before the new process reaches the stale-pid sweep, the new process
is reparented to the supervisor (PID 1 / launchd) and the ancestor walk
in getSelfAndAncestorPidsSync can no longer see the running gateway. The
running gateway then shows up on lsof as an unrelated sibling on the
port and gets SIGKILL'd by cleanStaleGatewayProcessesSync, recreating
the issue #68451 supervisor restart loop across a reparent boundary.
Real-world trigger: a user ~/.zshrc auto-start block
if ! pgrep -x openclaw-gateway >/dev/null; then
(openclaw gateway >/dev/null 2>&1 &)
fi
combined with codex per-turn `zsh -c "set -e; . shell_snapshot"` invocations
caused every chat turn on rh-bot to SIGKILL its launchd-managed gateway,
producing HTTP 000 errors and ~33 kill events captured by a forensic
launchd unified-log tracker before the zshrc was patched.
Fix: gateway-cli captures OPENCLAW_GATEWAY_SERVICE_PID from inherited env
BEFORE overwriting it with process.pid, then threads the captured PID
through cleanStaleGatewayProcessesSync into getSelfAndAncestorPidsSync's
exclusion set. The protection is opt-in per call site so existing
maintainer paths (openclaw update / openclaw doctor restart helpers) keep
their ability to terminate a running gateway intentionally.
The inherited-PID parser is strict positive-integer only: a malformed
inherited env value (`"123abc"`, `"123.4"`, `"0x7b"`, etc.) is rejected
rather than silently protecting PID 123 from cleanup and leaving the
stale listener alive. New focused unit tests cover the parser
contract.
Existing regression tests cover the reparent suicide-kill scenario and
the defensive ignore-non-positive-PID contract on the cleanup side.
* Mark active main sessions during restart shutdown
* Type restart marker mock in close tests
* fix(gateway): preserve active run ownership across restart
* fix(gateway): preserve active runs across restart
* fix(gateway): close restart recovery edge cases
* fix(cron): preserve lifecycle ownership across restart
* fix(gateway): release rejected run contexts
* fix(gateway): preserve restart lifecycle ownership
* fix(cron): retain overlapping run ownership
* fix(agents): preserve restart terminal precedence
---------
Co-authored-by: Peter Steinberger <steipete@gmail.com>
Reject malformed or explicit empty Gateway RPC timeout values before opening Gateway calls, align the shared Gateway RPC omitted-timeout fallback with the 30000 ms CLI default, and validate explicit `cron add --timeout-seconds` values at the CLI boundary.
Carries forward the useful source work from #54646 and the earlier timeout-validation context from #40953. #60661 remains separate accepted-run timeout semantics work and is intentionally not folded into this change.
Validation:
- `npm run review-results -- /tmp/clownfish-check-27341769444`
- `git diff --check`
- OpenClaw PR checks on `ce7bd8b9388a5689b14ddc2b3a984f7b4647e5ca`: 132 pass, 0 pending, 0 failing
- ClawSweeper re-review: https://github.com/openclaw/clawsweeper/actions/runs/27344244608
Co-authored-by: RayRuan <43744645+ruanrrn@users.noreply.github.com>
Co-authored-by: Homeran <11574611+comeran@users.noreply.github.com>
* redact tool output secrets
* Expand tool-output secret redaction
* fix(security): keep redaction prefilter in sync with expanded defaults
- build DEFAULT_REDACT_PREFILTER_RE from sources covering every default
pattern family: new vendor prefixes, webhook hosts, bare query/form keys,
userinfo/connection-string passwords, and percent/plus/invisible
obfuscated keys (including trailing separator splices)
- run default-pattern redaction tests through the default options path and
redact the vendor corpus per token so prefilter gaps fail tests
- fix quoted standalone assignment values containing the other quote char
or an unterminated quote; never re-mask *** placeholders
- align net-policy URL query-name separator stripping with logging key
normalization (Hangul fillers)
* fix(security): keep base64-prefix redaction out of media payloads
- pure-base64-alphabet token prefixes (gAAAA, AKIA, ASIA, dapi,
ATCTT3xFfG, ATATT, ATBB) now require a non-alphanumeric left boundary,
skip explicit ;base64, payload spans, and run unchunked so chunk starts
cannot fake the boundary or hide the container from the lookbehind
- tokens after URL/path delimiters or assignments still mask; data-URL
media survives redaction byte-identical (fixes chat media mirror CI)
- regression tests: tiny-PNG data URL, in-blob plus boundary,
chunk-aligned large data URL, reset-path Fernet token, path AWS key
---------
Co-authored-by: Alex Knight <15041791+amknight@users.noreply.github.com>
Extract shared normalization/coercion helpers into private @openclaw/normalization-core workspace package while preserving existing plugin SDK helper subpaths.\n\nAlso keeps direct normalization-core imports internal, wires UI/build/loader resolution, and replaces the slow PR network CodeQL lane with a fast added-line boundary scan while retaining full CodeQL for scheduled/manual runs.\n\nVerification: local moved tests, plugin SDK boundary tests, extension loader tests, agents-support shard, UI build/test, build artifacts, lint, workflow guards, autoreview, and GitHub CI passed on PR head 963d893715.
Add actionable operator guidance when an unauthorized SIGUSR1 gateway restart is ignored because unmanaged restart is disabled.
The change is log-only: restart authorization and scheduling semantics are unchanged, and the existing run-loop test now asserts both the reason warning and the recovery hint.
Refs #79577
Refs #78110
Refs #82433
Co-authored-by: wAngByg <281221101+wAngByg@users.noreply.github.com>
Summary:
- The PR moves the runtime `HEARTBEAT.md` bootstrap template into `src/agents/templates`, keeps docs templates ... or other workspace files, adds a legacy heartbeat-template doctor repair, and updates package guards/tests.
- PR surface: Source +281, Tests +283, Docs +11, Config +1, Other 0. Total +576 across 15 files.
- Reproducibility: yes. from source inspection: current main loads `HEARTBEAT.md` from the docs template, and ... pty heartbeat file non-empty to the runtime. I did not run a live heartbeat repro in this read-only review.
Automerge notes:
- PR branch already contained follow-up commit before automerge: fix(doctor): recognize heartbeat docs boilerplate
- PR branch already contained follow-up commit before automerge: fix(agents): update heartbeat workspace test
- PR branch already contained follow-up commit before automerge: fix(doctor): tighten heartbeat template repair
Validation:
- ClawSweeper review passed for head e34e85864c.
- Required merge gates passed before the squash merge.
Prepared head SHA: e34e85864c
Review: https://github.com/openclaw/openclaw/pull/85416#issuecomment-4519851630
Co-authored-by: Mason Huang <masonxhuang@tencent.com>
Co-authored-by: clawsweeper[bot] <274271284+clawsweeper[bot]@users.noreply.github.com>
Approved-by: hxy91819
Co-authored-by: hxy91819 <8814856+hxy91819@users.noreply.github.com>
After config.patch writes new values to openclaw.json, a subsequent
SIGUSR1 in-process restart could overwrite them with a stale snapshot.
Root cause: run-loop's onIteration hook resets lanes and task registry,
but leaves the runtimeConfigSnapshot intact. loadConfig() then returns
the old snapshot via loadPinnedRuntimeConfig() instead of re-reading disk.
Fix: clearRuntimeConfigSnapshot() in the restart iteration hook so the
next startup reads fresh config from disk.
Refs #86350
Honor configured restart drain budgets for embedded runs and avoid a second active-work drain after forced deferral timeout restarts.
Includes maintainer changelog entry.
* fix(gateway): normalize explicit state dir overrides at startup
* test(gateway): simplify state-dir startup coverage
* test: fix state dir startup coverage
---------
Co-authored-by: Peter Steinberger <steipete@gmail.com>
The SIGTERM handler's fire-and-forget IIFE can reject if the graceful
drain or tunnel-teardown throws. Without a catch, this becomes an
unhandled promise rejection. Add .catch() that logs the error and
falls back to a hard stop request. Same treatment for SIGUSR1.
Signed-off-by: Sebastien Tardif <sebtardif@ncf.ca>
* fix(gateway): eager-load lifecycle runtime to survive in-place upgrades
After a package-swap update (e.g. via update.run), dist/ chunk hashes
rotate while the gateway is still running. The SIGUSR1 listener's first
dynamic import of the lifecycle runtime module then throws
ERR_MODULE_NOT_FOUND inside its async IIFE, silently rejects, and leaves
restart.ts's emittedRestartToken permanently unconsumed. From that point
every scheduleGatewaySigusr1Restart() — including the one update.run
schedules for itself — returns { coalesced: true } without scheduling
anything, and the gateway never restarts until manually kickstarted.
Fix:
1. Eagerly resolve the lifecycle runtime module as the first statement
of runGatewayLoop, before any signal listener is installed. lifecycle.runtime
is a 36-line re-export hub, so loading it once pulls the entire restart
/ respawn / queue / sentinel / handoff graph into memory, immune to
later disk rotation. If the module is missing at startup, fail fast
with a loud error so the supervisor can recover instead of running
half-broken.
2. Defense in depth: catch SIGUSR1 IIFE rejections and call
markGatewaySigusr1RestartHandled() via the eagerly captured reference,
so a transient listener failure doesn't permanently stick the restart
token.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* docs(changelog): mention lifecycle restart eager load
---------
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Co-authored-by: Vincent Koc <vincentkoc@ieee.org>
Fix logs.tail credential-header redaction and JSON-mode gateway transport errors.\n\nFixes #66832.\nFixes #79108.\nSupersedes #67041.\nSupersedes #79233.\n\nCo-authored-by: Mil Wang <mingjwan@microsoft.com>\nCo-authored-by: Andy Ye <35905412+TurboTheTurtle@users.noreply.github.com>