When the gateway restarts with many overdue cron jobs, they are now
executed with staggered delays to prevent overwhelming the gateway.
- Add missedJobStaggerMs config (default 5s between jobs)
- Add maxMissedJobsPerRestart limit (default 5 jobs immediately)
- Prioritize most overdue jobs by sorting by nextRunAtMs
- Reschedule deferred jobs to fire gradually via normal timer
Fixes#18892
AbortSignal.any() fails in Node.js when signals come from different module
contexts (grammY's internal signal vs local AbortController), producing:
"The signals[0] argument must be an instance of AbortSignal. Received an
instance of AbortSignal".
Replace with manual event forwarding that works across all realms.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When the gateway receives SIGTERM, runner.stop() stops the grammY polling
loop but does not abort the in-flight getUpdates HTTP request. That request
hangs for up to 30 seconds (the Telegram API timeout). If a new gateway
instance starts polling during that window, Telegram returns a 409 Conflict
error, causing message loss and requiring exponential backoff recovery.
This is especially problematic with service managers (launchd, systemd)
that restart the process immediately after SIGTERM.
Wire an AbortController into the fetch layer so every Telegram API request
(especially the long-polling getUpdates) aborts immediately on shutdown:
- bot.ts: Accept optional fetchAbortSignal in TelegramBotOptions; wrap
the grammY fetch with AbortSignal.any() to merge the shutdown signal.
- monitor.ts: Create a per-iteration AbortController, pass its signal to
createTelegramBot, and abort it from the SIGTERM handler, force-restart
path, and finally block.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
On macOS, launchd sets XPC_SERVICE_NAME on managed processes but does
not set LAUNCH_JOB_LABEL or LAUNCH_JOB_NAME. Without checking
XPC_SERVICE_NAME, isLikelySupervisedProcess() returns false for
launchd-managed gateways, causing restartGatewayProcessWithFreshPid()
to fork a detached child instead of returning "supervised". The
detached child holds the gateway lock while launchd simultaneously
respawns the original process (KeepAlive=true), leading to an infinite
lock-timeout / restart loop.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Release gateway lock when in-process restart fails, so daemon
restart/stop can still manage the process (Codex P2)
- P1 (env mismatch) already addressed: best-effort by design, documented
in JSDoc
- Remove dead 'return false' in runServiceStart (Greptile)
- Include stack trace in run-loop crash guard error log (Greptile)
- Only catch startup errors on subsequent restarts, not initial start (Codex P1)
- Add JSDoc note about env var false positive edge case (Codex P1)
Address Greptile review: add test coverage for runServiceStart path.
The error message copy-paste issue was already fixed in the DRY refactor
(uses params.serviceNoun instead of hardcoded 'restart').
When an in-process restart (SIGUSR1) triggers a config-triggered restart
and the new config is invalid, params.start() throws and the while loop
exits, killing the process. On macOS this loses TCC permissions.
Wrap params.start() in try/catch: on failure, set server=null, log the
error, and wait for the next SIGUSR1 instead of crashing.
When 'openclaw gateway restart' is run with an invalid config, the new
process crashes on startup due to config validation failure. On macOS,
this causes Full Disk Access (TCC) permissions to be lost because the
respawned process has a different PID.
Add getConfigValidationError() helper and pre-flight config validation
in both runServiceRestart() and runServiceStart(). If config is invalid,
abort with a clear error message instead of crashing.
The config watcher's hot-reload path already had this guard
(handleInvalidSnapshot), but the CLI restart/start commands did not.
AI-assisted (OpenClaw agent, fully tested)
When a config-change restart hits the force-exit timeout, exit with
code 1 instead of 0 so launchd/systemd treats it as a failure and
triggers a clean process restart. Stop-timeout stays at exit(0)
since graceful stops should not cause supervisor recovery.
Closes#36822
The repair/recovery path had the same missing `enable` guard as
`restartLaunchAgent`. If launchd persists a "disabled" state after a
previous `bootout`, the `bootstrap` call in `repairLaunchAgentBootstrap`
fails silently, leaving the gateway unloaded in the recovery flow.
Add the same `enable` guard before `bootstrap` that was already applied
to `installLaunchAgent` and (in this PR) `restartLaunchAgent`.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
restartLaunchAgent was missing the launchctl enable call that
installLaunchAgent already performs. launchd can persist a "disabled"
state after bootout, causing bootstrap to silently fail and leaving the
gateway unloaded until a manual reinstall.
Fixes#39211
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>