Commit Graph

269 Commits

Author SHA1 Message Date
Peter Steinberger
bb46b79d3c refactor: internalize OpenClaw agent runtime (#85341)
* refactor: extract agent core package

Introduce packages/agent-core as the OpenClaw-owned home for reusable agent loop, harness, session, prompt, and runtime dependency contracts.

* refactor: extract shared llm runtime

Move provider model registries, stream wrappers, OAuth helpers, and LLM utilities into src/llm with plugin-sdk barrels instead of depending on the old embedded runtime layout.

* refactor: remove pi runtime internals

Rename remaining Pi-shaped agent surfaces to OpenClaw agent runtime names, delete obsolete Pi docs and package graph checks, and add the third-party notice for incorporated code.

* refactor: tighten agent session runtime

Make agent-core/runtime dependencies explicit, consolidate compaction and session transcript helpers, and move model/session helpers behind OpenClaw-owned contracts.

* refactor: remove static model and pi auth paths

Drop static model catalogs and Pi auth bridges, move model/provider facts to manifest-owned runtime contracts, and harden internal embedded-agent utilities.

* refactor: remove legacy provider compat paths

* docs: remove agent parity notes

* fix: skip provider wildcard metadata parsing

* refactor: share session extension sdk loading

* refactor: inline acpx proxy error formatter

* refactor: fold edit recovery into edit tool

* fix: accept extension batch separator

* test: align startup provider plugin expectations

* fix: restore provider-scoped release discovery

* test: align static asset packaging expectations

* fix: run static provider catalogs during scoped discovery

* fix: add provider entry catalogs for scoped live discovery

* fix: load lightweight provider catalog entries

* fix: refresh provider-scoped plugin metadata

* fix: keep provider catalog entries on release live path

* fix: keep static manifest models in release live checks

* fix: harden release model discovery

* fix: reduce OpenAI live cache probe reasoning

* fix: disable OpenAI cache probe reasoning

* ci: extend OpenAI gateway live timeout

* fix: extend live gateway model budget

* fix: stabilize release validation regressions

* fix: honor provider aliases in model rows

* fix: stabilize release validation lanes

* fix: stabilize release memory qa

* ci: stabilize release validation lanes

* ci: prefer ipv4 for live docker node calls

* fix: restore shared tool-call stream wrapper

* ci: remove legacy pi test shard alias

* fix: clean up embedded agent test drift

* fix: stabilize runtime alias status

* fix: clean up embedded agent ci drift

* fix: restore release ci invariants

* fix: clean up post-rebase runtime drift

* fix: restore release ci checks

* fix: restore release ci after rebase

* fix: remove stale pi runtime path

* test: align compaction runtime expectations

* test: update plugin prerelease expectations

* fix: handle claude live tool approvals

* fix: stabilize release validation gates

* fix: finish agent runtime import

* test: finish post-rebase agent runtime mocks

* fix: keep codex compaction native

* fix: stabilize codex app-server hook tests

* test: isolate codex diagnostic active run

* test: remove codex diagnostic completion race

# Conflicts:
#	extensions/codex/src/app-server/run-attempt.test.ts

* ci: fix full release manifest performance run id

* refactor: narrow llm plugin sdk boundary

* chore: drop generated google boundary stamps

* fix: repair rebase fallout

* fix: clean up rebased runtime references

* fix: decode codex jwt payloads as base64url

* fix: preserve shipped pi runtime alias

* fix: add scoped sdk virtual modules

* fix: decode llm codex oauth jwt as base64url

* fix: avoid stale vertex adc negative cache

* fix: harden tool arg decoding and codeql path

* fix: keep vertex adc negative checks live

* refactor: consolidate codex jwt and edit helpers

* fix: await codex oauth node runtime imports

* fix: preserve sdk tool and notice contracts

* fix: preserve shipped compat config boundaries

* fix: align codex oauth callback host

* fix: terminate agent-core loop streams on failure

* fix: keep codex oauth callback alive during fallback

* ci: include session tools in critical codeql scans

* fix: keep Cloudflare Anthropic provider auth header

* docs: redirect legacy pi runtime pages

* fix: honor bundled web provider compat discovery

* fix: protect session output spill files

* fix: keep legacy agent dir env blocked

* fix: contain auto-discovered skill symlinks

* fix: harden agent core sdk proxy surfaces

* fix: restore approval reaction sdk compat

* fix: keep live docker runs bounded

* fix: keep codex oauth redirect host aligned

* fix: resolve post-rebase agent runtime drift

* fix: redact anthropic oauth parse failures

* fix: preserve responses strict tool shaping

* fix: repair agent runtime rebase cleanup

* docs: redirect retired parity pages

* fix: bound auto-discovered resources to roots

* fix: repair post-rebase agent test drift

* fix: preserve bundled provider allowlist migration

* fix: preserve manifest-owned provider aliases

* fix: declare photon image dependency

* fix: keep provider headers out of proxy body

* fix: preserve shipped env aliases

* fix: refresh control ui i18n generated state

* fix: quote read fallback paths

* fix: preview edits through configured backend

* test: satisfy core test typecheck

* fix: preserve ZAI usage auth fallback

* test: repair codex diagnostic test

* fix: repair agent runtime rebase drift

* test: finish embedded runner import rename

* fix: repair agent runtime rebase integrations

* test: align compaction oauth fallback expectations

* fix: allow sdk-auth session models

* fix: update doctor tool schema import

* fix: preserve bedrock plugin region

* fix: stream harmony-like prose immediately

* ci: include session runtime in codeql shards

* fix: repair latest rebase integrations

* fix: honor explicit codex websocket transport

* fix: keep openai-compatible credentials provider-scoped

* fix: refresh sdk api baseline after rebase

* fix: route cli runtime aliases through openclaw harness

* test: rename stale harness mock expectation

* test: rename embedded agent overflow calls

* test: clean embedded auth test wording

* test: use openclaw stream types in deepinfra cache test

* fix: refresh sdk api baseline on latest main

* fix: honor bundled discovery compat allowlists

* fix: refresh sdk api baseline after latest rebase

* fix: remove stale rebase imports

* test: rename stale model catalog mock

* test: mock renamed doctor runtime modules

* fix: map canonical kimi env auth

* fix: use internal model registry in bench script

* fix: migrate deepinfra provider catalog entry

* fix: enforce builtin tool suppression

* fix: route compaction auth and proxy payloads safely

* refactor: prune unused llm registry leftovers

* test: update codex hooks session import

* test: fix model picker ci coverage

* test: align model picker auth mock types
2026-05-27 19:24:04 +01:00
Peter Steinberger
94fb547fe2 fix(agents): handle deferred maintenance drain
Ensure deferred context-engine maintenance rejects cleanly when the gateway command queue is draining, including coalesced active-run requests. This prevents budget compaction from treating an unscheduled deferred maintenance run as successful and leaving the context engine alive.

Verification:
- pnpm exec oxfmt --check --threads=1 src/process/command-queue.ts src/agents/pi-embedded-runner/compact.queued.ts src/agents/pi-embedded-runner/context-engine-maintenance.ts src/agents/pi-embedded-runner/context-engine-maintenance.test.ts
- pnpm test src/auto-reply/reply/agent-runner-memory.test.ts src/agents/pi-embedded-runner/compact.hooks.test.ts src/agents/pi-embedded-runner/context-engine-maintenance.test.ts src/tasks/task-flow-registry.store.test.ts src/auto-reply/reply/commands-compact.test.ts src/agents/pi-embedded-runner/compact-reasons.test.ts
- .agents/skills/autoreview/scripts/autoreview --mode branch --base origin/main
- GitHub Actions CI run 26475226442: relevant Node/Linux, lint, type, security, CodeQL, OpenGrep, Socket, Real behavior proof, and build jobs passed; Windows job failed before tests due current runner image Node 22.19.0 vs required 24.x, matching current main infra failure.
2026-05-26 22:17:19 +01:00
WhatsSkiLL
b13166bc0c fix: gracefully escalate process supervisor cancellations (#85865)
* fix: gracefully escalate supervisor cancellations

* fix: preserve process-tree cancellation during grace

* fix: satisfy signal monitor allSettled lint

* fix(process): split graceful cancel signal escalation

---------

Co-authored-by: JARVIS-Glasses <284122573+JARVIS-Glasses@users.noreply.github.com>
Co-authored-by: Peter Steinberger <steipete@gmail.com>
2026-05-24 03:35:37 +01:00
Patrick Erichsen
c0312748c4 feat: support git and local skill installs (#84793) 2026-05-20 21:12:03 -07:00
Peter Steinberger
4f4d108639 chore(lint): remove underscore-dangle allow list (#83542)
* chore(lint): reduce underscore-dangle exceptions

* chore(lint): reduce more underscore exceptions

* chore(lint): remove underscore-dangle allow list

* fix(lint): repair underscore cleanup regressions

* test(lint): track version define suppression
2026-05-18 14:56:06 +01:00
Vincent Koc
54d063167e test: use platform spy helper in cli tests 2026-05-17 17:03:23 +08:00
Galin Iliev
18812bfc03 fix(process): clarify lane wait diagnostics (#82792)
Merged via squash.

Prepared head SHA: 1a09b724a5
Co-authored-by: galiniliev <5711535+galiniliev@users.noreply.github.com>
Reviewed-by: @galiniliev
2026-05-16 21:26:31 -07:00
Vincent Koc
e0c3c80ebc test: share Windows platform spy helpers 2026-05-17 08:56:56 +08:00
Galin Iliev
7c151b212b fix(agents): prioritize manual session turns (#82765)
* fix(agents): prioritize manual session turns

* docs: update changelog for session priority

---------

Co-authored-by: Galin Iliev <Galin.Iliev@microsoft.com>
2026-05-16 17:49:48 -07:00
Peter Steinberger
21c5f8dc6d fix(codex): keep run lane timeout progress-aware 2026-05-16 16:21:34 +01:00
samzong
2fa853dce5 fix(gateway): isolate gmail watcher restart and abort handling (#82395)
Merged via squash.

Prepared head SHA: 61502846df
Co-authored-by: samzong <13782141+samzong@users.noreply.github.com>
Co-authored-by: frankekn <4488090+frankekn@users.noreply.github.com>
Reviewed-by: @frankekn
2026-05-16 15:31:51 +08:00
Vincent Koc
d656cda46d fix(process): normalize Windows child env keys 2026-05-14 14:34:00 +08:00
Peter Steinberger
f468f0f61f test: guard windows exec mock calls 2026-05-12 04:19:02 +01:00
Peter Steinberger
60e5804a9d test: guard windows exec spawn call 2026-05-11 22:42:30 +01:00
Peter Steinberger
c69b58157f test: guard process child adapter calls 2026-05-11 22:41:27 +01:00
Peter Steinberger
ca6a513c55 test: guard process queue calls 2026-05-11 22:40:24 +01:00
Peter Steinberger
70a0236ccc test: guard process spawn calls 2026-05-11 22:36:06 +01:00
Shakker
3cebb019d0 test: tighten windows exec assertions 2026-05-11 04:51:55 +01:00
Peter Steinberger
aa3421e2d7 test: tighten command queue assertions 2026-05-10 22:59:02 +01:00
Peter Steinberger
41859bb3fc fix: preserve cron lane timeout result 2026-05-10 19:03:17 +01:00
bitloi
ed6b030a43 feat(process): show input-wait hints in log and poll
Show input-wait hints in process log/poll for idle interactive background sessions, keep list markers and structured stdin metadata, and document the recovery flow through log plus existing input actions.

Docs: updated docs/gateway/background-process.md.

Verification:
- pnpm test src/agents/bash-tools.test.ts
- pnpm test src/agents/bash-tools.process.input-hints.test.ts
- pnpm test src/agents/bash-tools.process.input-hints.test.ts src/agents/bash-tools.process.poll-timeout.test.ts src/agents/bash-tools.process.supervisor.test.ts src/agents/bash-tools.process-send-keys.test.ts
- pnpm check:docs
- git diff --check
- CI on 4aea1f11fe: check, check-additional, check-docs, checks-node-core, process/security relevant shards, real behavior proof passed

Fixes #33957.
Thanks @bitloi and @vincentkoc.

Co-authored-by: bitloi <89318445+bitloi@users.noreply.github.com>
Co-authored-by: bitloi <raphaelaloi.eth@gmail.com>
2026-05-10 04:13:07 -04:00
Peter Steinberger
b447d30349 test: tighten process assertions 2026-05-09 12:28:59 +01:00
Shakker
8cd978c02a test: tighten core empty array assertions 2026-05-09 05:12:12 +01:00
Peter Steinberger
03e7fcfcc8 test: simplify supervisor adapter fixture 2026-05-08 20:13:35 +01:00
Peter Steinberger
0c34f7ac1c test: reuse command queue deferred helper 2026-05-08 19:26:34 +01:00
Shakker
40998a8152 test: tighten command queue wait assertion 2026-05-08 16:57:46 +01:00
Peter Steinberger
03ac05a3cd test: tighten core helper assertions 2026-05-08 16:48:41 +01:00
Peter Steinberger
5c589673ec test: clarify loose boolean assertions 2026-05-08 14:00:34 +01:00
Peter Steinberger
b67bc04c43 test: clarify command queue reset assertions 2026-05-08 12:44:20 +01:00
Peter Steinberger
9ef37d1907 test: tighten assertions and harness coverage 2026-05-08 05:28:12 +01:00
Mert Başar
029ca8c268 feat(agents): implement state-aware failover and lane suspension
Summary:
- Persist quota-suspension state transitions and reload fresh suspension state before failover handoff injection.
- Restore suspended lanes to configured concurrency and share failover-to-suspension reason mapping across fallback and embedded runner paths.
- Export model.failover diagnostics via OTLP and cover queueing/resume behavior with regressions.

Verification:
- pnpm test src/config/sessions/store.pruning.integration.test.ts src/process/command-queue.test.ts src/agents/session-suspension.test.ts src/agents/model-fallback.test.ts extensions/diagnostics-otel/src/service.test.ts
- git diff --check
- pnpm exec oxfmt --check --threads=1 on changed TypeScript files
- GitHub checks: 92 successful, 0 pending, 0 failed on head 962146be88
- Review threads: none unresolved
2026-05-07 18:34:05 -05:00
Devin Robison
982d123b80 Harden Windows command wrapper resolution (#77472)
* Harden Windows command wrapper resolution

* clawsweeper: route Windows cmd.exe wrapper through getWindowsInstallRoots

Replace the local SystemRoot/windir/SYSTEMROOT/WINDIR scan in
resolveTrustedWindowsCmdExe with the shared getWindowsInstallRoots()
resolver from src/infra/windows-install-roots.ts. The shared resolver
already rejects UNC paths, root-relative values, semicolon-delimited
path-lists, and missing-drive-letter roots, and prefers registry-derived
roots over env, so the wrapper-launch trust boundary now matches the
existing Windows install-root boundary on main.

Tests:
- _resetWindowsInstallRootsForTests in beforeEach so cached roots track
  per-test process.env mutations
- expectedTrustedCmdExe helper now joins the resolved systemRoot, so the
  expected wrapper executable matches the production resolver on Linux
  CI (where it falls back to DEFAULT_WINDOWS_SYSTEM_ROOT)
- new "rejects unsafe Windows root values" test covers UNC,
  semicolon-delimited path-list, root-relative, and bare-relative
  SystemRoot inputs

* Add CHANGELOG entry for #77472 Windows command wrapper hardening

* clawsweeper: stub registry probe in Windows wrapper tests

On real Windows CI runners getWindowsInstallRoots() reads the canonical
SystemRoot from the registry (e.g. C:\WINDOWS) before falling back to
process.env, which shadowed the env-only setup in the ComSpec-poisoning
and unsafe-root tests and produced casing mismatches like
"C:\WINDOWS\System32\cmd.exe" vs the expected "C:\Windows\...". Pass a
queryRegistryValue stub returning null in beforeEach (and inside the
unsafe-root loop) so install-root resolution is fully driven by the
test's process.env setup on every platform.

* clawsweeper: overwrite WINDIR alongside SystemRoot in unsafe-root test

Real Windows runners did not honor `delete process.env.windir`, so the
unsafe-root iteration's WINDIR fallback still resolved to the canonical
`C:\WINDOWS` and produced a casing mismatch against the expected default
`C:\Windows\System32\cmd.exe`. Set both `SystemRoot` and `WINDIR` to the
unsafe payload so every install-root env source is rejected by
`normalizeWindowsInstallRoot` and the resolver falls through to
`DEFAULT_WINDOWS_SYSTEM_ROOT`.
2026-05-04 14:33:18 -06:00
Peter Steinberger
3921e1b0b7 fix(process): kill Windows command trees on timeout
(cherry picked from commit 9cc3ae100b)
2026-05-04 20:44:27 +01:00
Peter Steinberger
2f0c9358b1 refactor: hide shared constants 2026-05-02 08:29:21 +01:00
Peter Steinberger
ad1e14af53 refactor: delete unused test helper code 2026-05-01 13:11:42 +01:00
Alex Knight
e1a7c5b860 fix: handle EPIPE errors on child process stdin writes (#75602)
Fix three child-process stdin write paths that let async EPIPE errors
escape to uncaughtException and crash the gateway.

extensions/imessage/src/client.ts (the actual #75438 crash path):
- Add child.stdin.on('error') listener in start() to catch async EPIPE
  and reject all pending requests via failAll().
- Add write callback to request() stdin.write() that rejects the
  specific pending request on error, instead of leaving it hanging
  until timeout.

src/agents/mcp-stdio-transport.ts:
- Fix write callback race in send(): previously resolved the promise
  immediately when write() returned true, then the write callback with
  EPIPE would fire after the promise was already fulfilled. Now always
  settles the promise from the write callback so the outcome is known
  before resolving.

src/process/exec.ts:
- Add stdin.on('error') before writing input so EPIPE from a
  prematurely-exited child is swallowed — the process exit handler
  reports the real status.

One reporter observed a gateway crash after 10.5 hours of stable
uptime — a single EPIPE on an iMessage RPC child process stdin write
killed the gateway with code 1.

Fixes: #75438
2026-05-01 21:45:12 +10:00
Peter Steinberger
42d73fd955 refactor: remove dead private helpers 2026-05-01 06:55:26 +01:00
Peter Steinberger
470098bd26 fix: keep embedded run lanes from wedging 2026-04-29 21:37:17 +01:00
Peter Steinberger
f5e7557c70 fix(heartbeat): defer during cron and nested lane pressure 2026-04-29 10:08:48 +01:00
Peter Steinberger
14e8a2d00b chore: remove unused internal dead code 2026-04-29 09:34:40 +01:00
Peter Steinberger
c500e8704f fix(gateway): recover stale session lanes 2026-04-28 20:37:29 +01:00
Peter Steinberger
e1acb61317 refactor: expose SDK test helper subpaths 2026-04-28 03:28:17 +01:00
Peter Steinberger
7d74c29dcc fix: isolate cron nested lane concurrency 2026-04-27 09:39:10 +01:00
Vincent Koc
e6d2c9b080 fix(process): decode Windows command output with console codepage awareness (#72393)
* fix(process): decode Windows command output with console codepage awareness

* fix(clownfish): address review for ghcrawl-199248-agentic-merge (1)
2026-04-26 23:10:59 -07:00
hcl
4a72e1b990 fix(process): skip kill-tree group kill when child wasn't detached (#71662) (#71681)
* fix(process): skip kill-tree group kill when child wasn't detached (#71662)

When the supervisor spawns a child with detached:false (service-managed
runtime under launchd/systemd), the child shares the gateway's process
group. On session abort or SIGKILL, killProcessTree was unconditionally
issuing process.kill(-pid, 'SIGTERM') — which targets the entire process
GROUP (negative pid is POSIX group-kill semantics) and therefore
SIGTERMs the gateway parent along with the child.

Reporter saw this on macOS (LaunchAgent + KeepAlive=true): aborting a
claude-cli/claude-opus-4-7 session caused the gateway to receive
SIGTERM, then auto-restart, dropping all in-flight sessions. Switching
the primary model to a non-cli provider eliminated it because the
non-cli paths don't go through this kill-tree call. Did not occur on
Linux VPS where the gateway runs detached, because there
useDetached === true and the child got its own process group.

Fix:
- killProcessTree now accepts opts.detached?: boolean. When detached:false,
  killProcessTreeUnix skips the `-pid` group-kill and goes straight to
  direct-pid SIGTERM/SIGKILL. Group-kill default (detached:true) is
  preserved so all existing callers behave exactly as before.
- supervisor/adapters/child.ts:286 now threads the spawn-time `useDetached`
  flag into killProcessTree, so the kill-tree path matches the spawn-time
  detachment decision (line 45 of the same file already computes
  useDetached = process.platform !== 'win32' && !isServiceManagedRuntime()).

Tests:
- new: detached:false skips group kill and uses direct pid SIGTERM only.
- new: default behaviour (detached:true) still uses group kill (regression
  guard so the existing test case isn't accidentally weakened).

Existing tests still pass (6/6 in kill-tree.test.ts). Lint clean.

Out of scope: other killProcessTree callers (mcp-stdio-transport,
bash-tools.process, etc.) keep the default group-kill behaviour because
those processes are typically detached from the gateway. Only the
supervisor/adapters/child.ts path threads `detached` through, since it's
the path that knows whether the child was actually spawned detached.

* fixup(process): also gate kill-tree group-kill on the no-detach spawn fallback (#71662)

Greptile review on the original PR caught a P1 gap: when
spawnWithFallback's initial detached spawn fails and it retries with the
no-detach fallback (label: "no-detach", options.detached: false), the
child runs detached:false but my variable useDetached was still true.
The kill closure then passed `detached: useDetached` = true to
killProcessTree, which still group-killed the gateway — same bug, just
on the fallback path.

Compute the actual detachment as
`useDetached && !spawned.usedFallback` after spawn returns, and pass
that through. This closes the gap: the kill path now correctly skips
group-kill in BOTH:
1. Service-managed runtime (useDetached=false from the start, original case)
2. Detached-spawn fallback to no-detach (useDetached=true at intent
   time but spawned.usedFallback=true)

Tests:
- existing 'uses process-tree kill for default SIGKILL' updated to
  assert the new {detached} option is forwarded.
- new: passes detached:false to killProcessTree when spawn fell back.
- new: passes detached:false in service-managed mode (regression guard
  for the original fix).

11/11 tests pass in child.test.ts. 6/6 in kill-tree.test.ts.
2026-04-25 17:08:53 -04:00
Vincent Koc
ec1f72b6c5 fix(gateway): preserve restart drain for active runs
Fixes https://github.com/openclaw/openclaw/issues/65485
2026-04-25 01:35:47 -07:00
Peter Steinberger
01bc49c88c test: move pty cleanup coverage to adapter 2026-04-24 11:09:55 +01:00
Peter Steinberger
cc9dcd3d69 fix(gateway): prefer linux child OOM victims
Raise eligible Linux child processes own oom_score_adj from a child-side /bin/sh exec shim so cgroup memory pressure prefers transient workers over the long-lived gateway. Cover supervisor children, PTY shells, MCP stdio servers, and OpenClaw-launched browser processes through the shared process runtime seam.

Harden the wrapper for distroless images, shell startup env, per-child and process-level opt-outs, dash-compatible exec, and leading-dash command names. Document Linux verification and OOM behavior.

Fixes #70404.

Co-authored-by: Neerav Makwana <261249544+neeravmakwana@users.noreply.github.com>
2026-04-23 05:23:40 +01:00
Peter Steinberger
0195da6b0e refactor: cache optional runtime imports 2026-04-18 20:45:26 +01:00
Peter Steinberger
4fa961d4f1 refactor(lint): enable map spread rule 2026-04-18 20:37:12 +01:00