Tighten the shutdown finalizer so it actually waits for plugin handlers
under its bounded budget and so it covers every session lifecycle path,
not just the centralized emitters in `session-reset-service.ts`.
- `drainActiveSessionsForShutdown` previously called
`emitGatewaySessionEndPluginHook`, which fires `runSessionEnd` as
fire-and-forget (`void hookRunner.runSessionEnd(...)`). The bounded
2 s timeout then raced only the synchronous for-loop, so the close
handler could proceed to subsystem teardown while a database-writing
`session_end` plugin was still in flight -- the exact ghost-session
failure this PR is supposed to fix. Inline the emit path: build the
`buildSessionEndHookPayload` + `resolveStableSessionEndTranscript`
payload directly in the drain and `await hookRunner.runSessionEnd(...)`
under the bounded race. A never-resolving handler now surfaces as
`timedOut=true`; the close handler records `session-end-drain` as a
warning but is never blocked.
- The channel reply path in `src/auto-reply/reply/session.ts` and the
compaction lifecycle helper in `src/auto-reply/reply/session-updates.ts`
emit `session_start` / `session_end` directly through the global hook
runner without going through `emitGatewaySessionStartPluginHook`, so
the shutdown tracker never saw normal channel sessions or rolled-over
compacted sessions. Wire the tracker `note` / `forget` calls into both
paths so every public lifecycle emitter participates in the same
tracker, and so a compacted session is both forgotten (previous id)
and re-noted (new id) on rollover.
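For reviewers, the awaited-drain idea from the first bullet can be sketched roughly as below. This is a minimal illustration, not the code in `session-reset-service.ts`: `drainWithBudget`, `SessionEndHandler`, and `DrainResult.settled` are hypothetical names, and running the handlers sequentially is an assumption.

```typescript
// Hypothetical shapes for illustration; the real hook runner and payload
// builders live in the gateway and are not reproduced here.
type DrainReason = "shutdown" | "restart";
type SessionEndHandler = (sessionId: string, reason: DrainReason) => Promise<void>;

interface DrainResult {
  settled: number; // sessions whose handler finished (or failed) within the budget
  timedOut: boolean;
}

async function drainWithBudget(
  sessionIds: string[],
  reason: DrainReason,
  runSessionEnd: SessionEndHandler,
  budgetMs = 2000,
): Promise<DrainResult> {
  let settled = 0;

  // Await each handler so the race below covers the handler's own work,
  // not just the synchronous loop that kicks it off.
  const work = (async () => {
    for (const id of sessionIds) {
      try {
        await runSessionEnd(id, reason);
      } catch {
        // Assumption: one failing plugin should not abort the remaining sessions.
      }
      settled += 1;
    }
    return "done" as const;
  })();

  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<"timeout">((resolve) => {
    timer = setTimeout(() => resolve("timeout"), budgetMs);
  });

  const outcome = await Promise.race([work, timeout]);
  if (timer !== undefined) clearTimeout(timer);
  return { settled, timedOut: outcome === "timeout" };
}
```

With a fire-and-forget call (`void runSessionEnd(...)`) the `work` promise would resolve as soon as the loop finished, which is the behavior this commit removes.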
Tests:
- `src/gateway/drain-active-sessions-for-shutdown.test.ts` gains two
cases: one proves the drain genuinely waits for an in-flight handler
to settle before returning, the other proves a never-resolving handler
is cut off at the configured budget with `timedOut=true`.
Refs #57790.
`session_end` was only fired when a session was replaced, reset, deleted, or
compacted -- the gateway shutdown/restart paths closed the process without
enumerating active sessions, so downstream `session_end` plugins
(e.g. claude-mem) accumulated ghost rows in `active` state across restarts.
The issue reporter saw 11 orphaned sessions cause 63 timeouts/day through
agent pool exhaustion.
Add an in-memory active-session tracker
(`src/gateway/active-sessions-shutdown-tracker.ts`) populated by
`emitGatewaySessionStartPluginHook` and forgotten unconditionally by
`emitGatewaySessionEndPluginHook` (even when no plugin listens), so any
session that has already been finalized through the normal lifecycle is
never re-fired by the shutdown drain. The close handler then calls a new
`drainActiveSessionsForShutdown({ reason })` in `session-reset-service.ts`
between the `gateway:shutdown`/`gateway:pre-restart` lifecycle hooks and
the subsystem teardown steps; the drain races a bounded 2 s total timeout
so a slow plugin cannot block SIGTERM/SIGINT, surfacing the timeout as a
`session-end-drain` warning on the shutdown result.
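A rough sketch of what such a tracker could look like follows. The class name and method signatures are assumptions for illustration; only the `note` / `forget` vocabulary and the `storePath`, `sessionFile`, and `agentId` fields come from this PR.

```typescript
// Hypothetical tracker shape; the real module is
// src/gateway/active-sessions-shutdown-tracker.ts and may differ.
interface TrackedSession {
  sessionId: string;
  sessionKey?: string;
  storePath?: string;
  sessionFile?: string;
  agentId?: string;
}

class ActiveSessionTracker {
  private readonly sessions = new Map<string, TrackedSession>();

  // Record a started session. Empty ids are ignored, and re-noting the
  // same id overwrites the earlier entry instead of duplicating it.
  note(session: TrackedSession): void {
    if (!session.sessionId) return;
    this.sessions.set(session.sessionId, { ...session });
  }

  // Drop a session that ended through the normal lifecycle so the
  // shutdown drain never fires session_end for it a second time.
  forget(sessionId: string): boolean {
    return this.sessions.delete(sessionId);
  }

  // Return copies so callers cannot mutate tracker state.
  snapshot(): TrackedSession[] {
    return Array.from(this.sessions.values()).map((s) => ({ ...s }));
  }

  clear(): void {
    this.sessions.clear();
  }
}
```

Forgetting unconditionally on `session_end`, even when no plugin listens, keeps the tracker from accumulating ids that the drain would otherwise replay at shutdown.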
Extend `PluginHookSessionEndReason` with `"shutdown"` and `"restart"` so
plugins can distinguish a graceful close from a planned restart; the close
handler picks `restart` when `restartExpectedMs` is set and `shutdown`
otherwise. Update `emitGatewaySessionStartPluginHook` to also accept
`storePath`, `sessionFile`, and `agentId` so the shutdown drain can build
the same `session_end` payload shape the normal lifecycle path emits, and
update the existing call sites in `session-reset-service.ts` and
`server-methods/sessions.ts` to pass those fields through.
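The reason selection and a plugin-side consumer might look like the sketch below. `pickDrainReason` and `describeSessionEnd` are hypothetical helpers, and treating `null`/`undefined` as "not set" is an assumption about how `restartExpectedMs` is represented.

```typescript
type ShutdownDrainReason = "shutdown" | "restart";

// Close-handler side: a planned restart is signalled by restartExpectedMs being set.
function pickDrainReason(restartExpectedMs?: number | null): ShutdownDrainReason {
  return restartExpectedMs == null ? "shutdown" : "restart";
}

// Plugin side: branch on the new reasons and fall through for existing ones.
function describeSessionEnd(reason: string): string {
  switch (reason) {
    case "restart":
      return "closed for planned restart";
    case "shutdown":
      return "closed for shutdown";
    default:
      return `session ended (${reason})`;
  }
}
```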
Tests:
- `src/gateway/active-sessions-shutdown-tracker.test.ts` (new) -- tracker
insert/forget/clear semantics, idempotent re-noting, empty-id guard,
snapshot isolation.
- `src/gateway/drain-active-sessions-for-shutdown.test.ts` (new) -- drain
fires `session_end` with the right reason for every tracked session,
skips sessions already finalized via reset/delete/compaction, and still
forgets sessions even when no `session_end` plugin is registered.
- `src/gateway/server-close.test.ts` -- four new cases covering the
shutdown/restart drain wiring, the bounded timeout warning, and the
drain-skipped-when-no-helper case.
Docs:
- `docs/plugins/hooks.md` documents the new `shutdown`/`restart` values
on `PluginHookSessionEndReason`.
- `docs/automation/hooks.md` documents the post-`gateway:shutdown`
`session_end` drain step and its bounded execution guarantee.
Fixes #57790.
When `cron.wake` is called with only an agent-prefixed `sessionKey` (no
explicit `agentId`), the gateway cron adapter must derive the same agentId
on both `enqueueSystemEvent` and `requestHeartbeat` so events land in (and
heartbeats fire on) the same agent target. Pre-PR, only `requestHeartbeat`
derived agentId from the key; `enqueueSystemEvent` ran through
`resolveCronSessionKey` with the configured-default agent and was rerouted
to that agent's main session under multi-agent deployments where `main`
exists but is not the default.
The new test exercises the cron-adapter directly via `state.cron.state.deps`
with a multi-agent config (`primary` default + `ops` non-default) and a
`agent:ops:cron:nightly:run:abc-123` foreign-agent session key, asserting
that both call sites resolve the agent target to "ops" rather than falling
back to "primary".
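The shared-resolution idea can be illustrated as follows. This is a sketch under assumptions: `AgentsConfigSketch`, `parseAgentPrefix`, and `resolveWakeAgent` are hypothetical names, and falling back to the default agent for an unknown prefix is assumed rather than taken from the adapter.

```typescript
// Hypothetical config shape for illustration.
interface AgentsConfigSketch {
  defaultAgentId: string;
  agentIds: string[];
}

// Extract "<id>" from keys shaped like "agent:<id>:...", otherwise undefined.
function parseAgentPrefix(sessionKey: string): string | undefined {
  const parts = sessionKey.split(":");
  if (parts.length < 3 || parts[0] !== "agent") return undefined;
  return parts[1] ? parts[1] : undefined;
}

// One resolver used by both the enqueue and the heartbeat call sites,
// so the event and the wake cannot land on different agents.
function resolveWakeAgent(
  cfg: AgentsConfigSketch,
  params: { agentId?: string; sessionKey?: string },
): string {
  const fromKey = params.sessionKey ? parseAgentPrefix(params.sessionKey) : undefined;
  const candidate = params.agentId ?? fromKey;
  if (candidate !== undefined && cfg.agentIds.includes(candidate)) return candidate;
  return cfg.defaultAgentId; // relative keys use the configured default, not a hardcoded "main"
}
```

With the test's config (`primary` default, `ops` non-default), the key `agent:ops:cron:nightly:run:abc-123` resolves to `ops` at both call sites, and a relative key resolves to `primary`.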
Refs #78687.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex review on PR #78687 [P3] flagged that the docs say next-heartbeat
"waits for the next scheduled tick" while the patched timer collapses
next-heartbeat+sessionKey to an immediate targeted wake. Add a callout
describing the exception and pointing callers who want delayed delivery
back at the no-session-key path.
Refs #78687.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Caught by oxlint typescript-eslint(no-unnecessary-type-assertion) in CI.
mock.calls is typed as any[][], so the trailing `!` adds nothing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address review findings from successive codex rounds:
1. next-heartbeat + sessionKey now fires a targeted immediate wake.
The regularly-scheduled heartbeat fires for the agent's main session,
not the supplied sessionKey, so an event queued for a non-main session
would sit stranded indefinitely; an "event"-intent wake is also
deferred as not-due by the heartbeat runner and not retried, so
neither path delivers without an explicit immediate wake.
2. resolveCronWakeTarget now always runs through resolveCronAgent, both
for agent-prefixed session keys (so non-default agents are honored)
and relative keys (so the configured default agent is used instead
of the hardcoded "main" returned by resolveAgentIdFromSessionKey).
Mirrors the matching fix in the enqueueSystemEvent adapter so wake
and enqueue resolve to the same target.
3. Generated Swift `WakeParams` models now expose the new optional
`sessionkey` field (codingKey "sessionKey") in both the macOS and
shared OpenClawKit copies. Regenerating locally from agent.ts via
protocol:gen + protocol:gen:swift would have produced this diff, but
the environment couldn't run the generators (fs-safe transitive
typecheck errors), so the change was applied by hand to match what
pnpm protocol:check would output.
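The wake-mode rule from item 1 can be summarized in a small sketch. `WakeMode`, `WakePlan`, and `planWake` are hypothetical names, and `"now"` as the other mode value is an assumption; only `next-heartbeat` appears in this commit.

```typescript
type WakeMode = "now" | "next-heartbeat";

interface WakePlan {
  immediate: boolean;
  targetSessionKey: string | null; // null means the agent's main session
}

function planWake(mode: WakeMode, sessionKey?: string): WakePlan {
  // A targeted wake always fires immediately: the scheduled heartbeat only
  // serves the main session, so a queued event for another session would
  // otherwise never be delivered.
  if (sessionKey) return { immediate: true, targetSessionKey: sessionKey };
  return { immediate: mode === "now", targetSessionKey: null };
}
```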
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds an optional sessionKey to the WakeParamsSchema and threads it through
the gateway wake handler, CronService.wake(), and the underlying timer.wake()
ops so callers can target a specific session for async-task completion
relays instead of always hitting the agent's main session.
Also adds --session-key to `openclaw system event`.
The schema rejects empty/non-string sessionKey at the gateway boundary;
mismatched session keys (a key that does not belong to the resolving agent)
fall back to the agent's main session inside resolveCronSessionKey, which
is the existing safety path.
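The two layers of checking could be sketched like this. Both helpers are hypothetical stand-ins for `WakeParamsSchema` and `resolveCronSessionKey`, and the `agent:<id>:main` main-session key format and the scoping of relative keys under the agent are assumptions.

```typescript
// Boundary check: the field is optional, but when present it must be a
// non-empty string.
function parseOptionalSessionKey(raw: unknown): string | undefined {
  if (raw === undefined) return undefined;
  if (typeof raw !== "string" || raw.length === 0) {
    throw new TypeError("sessionKey must be a non-empty string");
  }
  return raw;
}

// Safety path: a key that belongs to a different agent falls back to the
// resolving agent's main session instead of crossing agents.
function resolveSessionKeyForAgent(agentId: string, sessionKey?: string): string {
  const prefix = `agent:${agentId}:`;
  const mainKey = `${prefix}main`; // assumed main-session key format
  if (sessionKey === undefined) return mainKey;
  if (sessionKey.startsWith("agent:")) {
    return sessionKey.startsWith(prefix) ? sessionKey : mainKey;
  }
  return prefix + sessionKey; // assumption: relative keys are scoped under the agent
}
```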
Refs #52305 (companion to PR #50818, which closes the related cron-run
remap slice at internal enqueue sites). Doesn't depend on #50818.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>