Commit Graph

90 Commits

Author SHA1 Message Date
Peter Steinberger
680eff63fb fix: land SIGUSR1 orphan recovery regressions (#47719) (thanks @joeykrug) 2026-03-15 22:32:36 -07:00
Peter Steinberger
4e055d8df2 refactor: share gateway timeout parsing 2026-03-14 01:41:16 +00:00
Peter Steinberger
158a3b49a7 test: deduplicate cli option collision fixtures 2026-03-10 20:34:54 +00:00
Charles Dusek
54be30ef89 fix(agents): bound compaction retry wait and drain embedded runs on restart (#40324)
Merged via squash.

Prepared head SHA: cfd99562d6
Co-authored-by: cgdusek <38732970+cgdusek@users.noreply.github.com>
Co-authored-by: jalehman <550978+jalehman@users.noreply.github.com>
Reviewed-by: @jalehman
2026-03-09 08:27:29 -07:00
Peter Steinberger
3caab9260c test: narrow gateway loop signal harness 2026-03-09 07:42:15 +00:00
Peter Steinberger
cc0f30f5fb test: fix windows runtime and restart loop harnesses 2026-03-09 07:22:23 +00:00
merlin
f84adcbe88 fix: release gateway lock on restart failure + reply to Codex reviews
- Release gateway lock when in-process restart fails, so daemon
  restart/stop can still manage the process (Codex P2)
- P1 (env mismatch) already addressed: best-effort by design, documented
  in JSDoc
2026-03-09 05:53:52 +00:00
merlin
c79a0dbdb4 fix: address bot review feedback on #35862
- Remove dead 'return false' in runServiceStart (Greptile)
- Include stack trace in run-loop crash guard error log (Greptile)
- Only catch startup errors on subsequent restarts, not initial start (Codex P1)
- Add JSDoc note about env var false positive edge case (Codex P1)
2026-03-09 05:53:52 +00:00
merlin
6740cdf160 fix(gateway): catch startup failure in run loop to prevent process exit (#35862)
When an in-process restart (SIGUSR1) triggers a config-triggered restart
and the new config is invalid, params.start() throws and the while loop
exits, killing the process. On macOS this loses TCC permissions.

Wrap params.start() in try/catch: on failure, set server=null, log the
error, and wait for the next SIGUSR1 instead of crashing.
2026-03-09 05:53:52 +00:00
Daniel dos Santos Reis
1d6a2d0165 fix(gateway): exit non-zero on restart shutdown timeout
When a config-change restart hits the force-exit timeout, exit with
code 1 instead of 0 so launchd/systemd treats it as a failure and
triggers a clean process restart. Stop-timeout stays at exit(0)
since graceful stops should not cause supervisor recovery.

Closes #36822
2026-03-09 05:38:54 +00:00
Vincent Koc
76a028a50a Gateway CLI: allowlist password-file fixture 2026-03-07 18:28:18 -08:00
Vincent Koc
4062aa5e5d Gateway: add safer password-file input for gateway run (#39067)
* CLI: add gateway password-file option

* Docs: document safer gateway password input

* Update src/cli/gateway-cli/run.ts

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* Tests: clean up gateway password temp dirs

* CLI: restore gateway password warning flow

* Security: harden secret file reads

---------

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
2026-03-07 18:20:17 -08:00
Peter Steinberger
1b9e4800eb test: fix gateway register option collision mock 2026-03-08 01:58:33 +00:00
Vincent Koc
2c7fb54956 Config: fail closed invalid config loads (#39071)
* Config: fail closed invalid config loads

* CLI: keep diagnostics on explicit best-effort config

* Tests: cover invalid config best-effort diagnostics

* Changelog: note invalid config fail-closed fix

* Status: pass best-effort config through status-all gateway RPCs

* CLI: pass config through gateway secret RPC

* CLI: skip plugin loading from invalid config

* Tests: align daemon token drift env precedence
2026-03-07 17:48:13 -08:00
Peter Steinberger
cc7e61612a fix(gateway): harden service-mode stale process cleanup (#38463, thanks @spirittechie)
Co-authored-by: Jesse Paul <drzin69@gmail.com>
2026-03-07 21:36:24 +00:00
Ayaan Zaidi
05c240fad6 fix: restart Windows gateway via Scheduled Task (#38825) (#38825) 2026-03-07 18:00:38 +05:30
Josh Avant
72cf9253fc Gateway: add SecretRef support for gateway.auth.token with auth-mode guardrails (#35094) 2026-03-05 12:53:56 -06:00
Tak Hoffman
1be39d4250 fix(gateway): synthesize lifecycle robustness for restart and startup probes (#33831)
* fix(gateway): correct launchctl command sequence for gateway restart (closes #20030)

* fix(restart): expand HOME and escape label in launchctl plist path

* fix(restart): poll port free after SIGKILL to prevent EADDRINUSE restart loop

When cleanStaleGatewayProcessesSync() kills a stale gateway process,
the kernel may not immediately release the TCP port. Previously the
function returned after a fixed 500ms sleep (300ms SIGTERM + 200ms
SIGKILL), allowing triggerOpenClawRestart() to hand off to systemd
before the port was actually free. The new systemd process then raced
the dying socket for port 18789, hit EADDRINUSE, and exited with
status 1, causing systemd to retry indefinitely — the zombie restart
loop reported in #33103.

Fix: add waitForPortFreeSync() that polls lsof at 50ms intervals for
up to 2 seconds after SIGKILL. cleanStaleGatewayProcessesSync() now
blocks until the port is confirmed free (or the budget expires with a
warning) before returning. The increased SIGTERM/SIGKILL wait budgets
(600ms / 400ms) also give slow processes more time to exit cleanly.

Fixes #33103
Related: #28134

* fix: add EADDRINUSE retry and TIME_WAIT port-bind checks for gateway startup

* fix(ports): treat EADDRNOTAVAIL as non-retryable and fix flaky test

* fix(gateway): hot-reload agents.defaults.models allowlist changes

The reload plan had a rule for `agents.defaults.model` (singular) but
not `agents.defaults.models` (plural — the allowlist array).  Because
`agents.defaults.models` does not prefix-match `agents.defaults.model.`,
it fell through to the catch-all `agents` tail rule (kind=none), so
allowlist edits in openclaw.json were silently ignored at runtime.

Add a dedicated reload rule so changes to the models allowlist trigger
a heartbeat restart, which re-reads the config and serves the updated
list to clients.

Fixes #33600

Co-authored-by: HCL <chenglunhu@gmail.com>
Signed-off-by: HCL <chenglunhu@gmail.com>

* test(restart): 100% branch coverage — audit round 2

Audit findings fixed:
- remove dead guard: terminateStaleProcessesSync pids.length===0 check was
  unreachable (only caller cleanStaleGatewayProcessesSync already guards)
- expose __testing.callSleepSyncRaw so sleepSync's real Atomics.wait path
  can be unit-tested directly without going through the override
- fix broken sleepSync Atomics.wait test: previous test set override=null
  but cleanStaleGatewayProcessesSync returned before calling sleepSync —
  replaced with direct callSleepSyncRaw calls that actually exercise L36/L42-47
- fix pid collision: two tests used process.pid+304 (EPERM + dead-at-SIGTERM);
  EPERM test changed to process.pid+305
- fix misindented tests: 'deduplicates pids' and 'lsof status 1 container
  edge case' were outside their intended describe blocks; moved to correct
  scopes (findGatewayPidsOnPortSync and pollPortOnce respectively)
- add missing branch tests:
  - status 1 + non-empty stdout with zero openclaw pids → free:true (L145)
  - mid-loop non-openclaw cmd in &&-chain (L67)
  - consecutive p-lines without c-line between them (L67)
  - invalid PID in p-line (p0 / pNaN) — ternary false branch (L67)
  - unknown lsof output line (else-if false branch L69)

Coverage: 100% stmts / 100% branch / 100% funcs / 100% lines (36 tests)

* test(restart): fix stale-pid test typing for tsgo

* fix(gateway): address lifecycle review findings

* test(update): make restart-helper path assertions windows-safe

---------

Signed-off-by: HCL <chenglunhu@gmail.com>
Co-authored-by: Glucksberg <markuscontasul@gmail.com>
Co-authored-by: Efe Büken <efe@arven.digital>
Co-authored-by: Riccardo Marino <rmarino@apple.com>
Co-authored-by: HCL <chenglunhu@gmail.com>
2026-03-03 21:31:12 -06:00
Peter Steinberger
2287d1ec13 test: micro-optimize slow suites and CLI command setup 2026-03-02 23:00:49 +00:00
Peter Steinberger
d92fc85555 refactor(cli): dedupe gateway run mode parsing 2026-02-26 19:50:49 +01:00
Peter Steinberger
a909019078 fix: align gateway run auth modes (#27469) (thanks @s1korrrr) 2026-02-26 18:20:27 +00:00
Rafal
1087033abd fix(cli): list all supported auth modes in gateway run --auth help
Made-with: Cursor
2026-02-26 18:20:27 +00:00
Peter Steinberger
c397a02c9a fix(queue): harden drain/abort/timeout race handling
- reject new lane enqueues once gateway drain begins
- always reset lane draining state and isolate onWait callback failures
- persist per-session abort cutoff and skip stale queued messages
- avoid false 600s agentTurn timeout in isolated cron jobs

Fixes #27407
Fixes #27332
Fixes #27427

Co-authored-by: Kevin Shenghui <shenghuikevin@github.com>
Co-authored-by: zjmy <zhangjunmengyang@gmail.com>
Co-authored-by: suko <miha.sukic@gmail.com>
2026-02-26 13:43:39 +01:00
Peter Steinberger
2f8c68ae4d refactor(test): dedupe run-loop signal harness setup 2026-02-22 21:19:09 +00:00
Peter Steinberger
e6383a2c13 fix(gateway): probe port liveness for stale lock recovery
Co-authored-by: Operative-001 <261882263+Operative-001@users.noreply.github.com>
2026-02-22 22:11:51 +01:00
Peter Steinberger
296b19e413 test: dedupe gateway browser discord and channel coverage 2026-02-22 17:11:54 +00:00
Peter Steinberger
ee7a43b895 test: replace slow gateway SIGTERM integration coverage 2026-02-22 17:06:35 +00:00
Peter Steinberger
edaa5ef7a5 refactor(gateway): simplify restart flow and expand lock tests 2026-02-22 10:44:47 +01:00
Peter Steinberger
dd07c06d00 fix: tighten gateway restart loop handling (#23416) (thanks @jeffwnli) 2026-02-22 10:38:32 +01:00
jeffr
9c30243c8f fix: release gateway lock before spawning restart child
Move lock.release() before restartGatewayProcessWithFreshPid() so the
spawned child can immediately acquire the lock without racing against
a zombie parent. This eliminates the root cause of the restart loop
where the child times out waiting for a lock held by its now-dead parent.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 10:38:32 +01:00
jeffr
01bd83d644 fix: release gateway lock before process.exit in run-loop
process.exit() called from inside an async IIFE bypasses the outer
try/finally block that releases the gateway lock. This leaves a stale
lock file pointing to a zombie PID, preventing the spawned child or
systemctl restart from acquiring the lock. Release the lock explicitly
before calling exit in both the restart-spawned and stop code paths.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 10:38:32 +01:00
Peter Steinberger
a1cb700a05 test: dedupe and optimize test suites 2026-02-19 15:19:38 +00:00
Peter Steinberger
b4dbe03298 refactor: unify restart gating and update availability sync 2026-02-19 10:00:41 +01:00
Gustavo Madeira Santana
c5698caca3 Security: default gateway auth bootstrap and explicit mode none (#20686)
Merged via /review-pr -> /prepare-pr -> /merge-pr.

Prepared head SHA: be1b73182c
Co-authored-by: gumadeiras <5599352+gumadeiras@users.noreply.github.com>
Co-authored-by: gumadeiras <5599352+gumadeiras@users.noreply.github.com>
Reviewed-by: @gumadeiras
2026-02-19 02:35:50 -05:00
Peter Steinberger
8f866d51c4 test(cli): dedupe runtime capture fixtures across command specs 2026-02-18 13:34:03 +00:00
Gustavo Madeira Santana
40a6661597 test(cli): fix option-collision mock typings 2026-02-17 21:32:04 -05:00
Gustavo Madeira Santana
5a31da8eec chore: format imports in gateway and session tools 2026-02-17 21:10:38 -05:00
Gustavo Madeira Santana
985ec71c55 CLI: resolve parent/subcommand option collisions (#18725)
Merged via /review-pr -> /prepare-pr -> /merge-pr.

Prepared head SHA: b7e51cf909
Co-authored-by: gumadeiras <5599352+gumadeiras@users.noreply.github.com>
Co-authored-by: gumadeiras <5599352+gumadeiras@users.noreply.github.com>
Reviewed-by: @gumadeiras
2026-02-17 20:57:09 -05:00
Peter Steinberger
b8b43175c5 style: align formatting with oxfmt 0.33 2026-02-18 01:34:35 +00:00
Peter Steinberger
31f9be126c style: run oxfmt and fix gate failures 2026-02-18 01:29:02 +00:00
cpojer
048e29ea35 chore: Fix types in tests 45/N. 2026-02-17 15:50:07 +09:00
cpojer
f2f17bafbc chore: Fix types in tests 30/N. 2026-02-17 14:32:57 +09:00
cpojer
d0cb8c19b2 chore: wtf. 2026-02-17 13:36:48 +09:00
Sebastian
ed11e93cf2 chore(format) 2026-02-16 23:20:16 -05:00
Gustavo Madeira Santana
7b172d61cd Revert "fix: respect OPENCLAW_HOME for isolated gateway instances"
This reverts commit 34b18ea9db.
2026-02-16 20:36:01 -05:00
cpojer
90ef2d6bdf chore: Update formatting. 2026-02-17 09:18:40 +09:00
Zaf (via OpenClaw)
34b18ea9db fix: respect OPENCLAW_HOME for isolated gateway instances
When OPENCLAW_HOME is set (indicating an isolated instance), the gateway
port should be read from config rather than inheriting OPENCLAW_GATEWAY_PORT
from a parent process. This fixes running multiple OpenClaw instances
where a child process would incorrectly use the parent's port.

Changes:
- resolveGatewayPort() now prioritizes config.gateway.port when OPENCLAW_HOME is set
- Added getConfigPath() function for runtime-evaluated config path
- Deprecated CONFIG_PATH constant with warning about module-load-time evaluation
- Updated gateway run command to use getConfigPath() instead of CONFIG_PATH

Fixes the issue where spawning a sandbox OpenClaw instance from within
another OpenClaw process would fail because OPENCLAW_GATEWAY_PORT from
the parent (set in server.impl.ts) would override the child's config.
2026-02-16 23:52:16 +01:00
Benjamin Jesuiter
b25f334fa2 CLI: improve command descriptions in help output (#18486)
* CLI: clarify config vs configure descriptions

* CLI: improve top-level command descriptions

* CLI: make direct command help more descriptive

* CLI: add commands hint to root help

* CLI: show root help hint in implicit help output

* CLI: add help example for command-specific help

* CLI: tweak root subcommand marker spacing

* CLI: mark clawbot as subcommand root in help

* CLI: derive subcommand markers from registry metadata

* CLI: escape help regex CLI name
2026-02-16 22:06:25 +01:00
Peter Steinberger
f50e1e8015 perf(test): fold gateway discover tests into run-loop 2026-02-16 02:45:00 +00:00
Peter Steinberger
92f8c0fac3 perf(test): speed up suites and reduce fs churn 2026-02-15 19:29:27 +00:00