* Prevent hidden channel lifecycle runs from staying stuck as running
Hidden channel-routed runs were dropping session keys on lifecycle events at
our shared agent-event bus. Gateway lifecycle persistence then had to rely on
run-context lookup surviving until the terminal event, which is unnecessarily
fragile for the exact sessions that are intentionally hidden from Control UI.
This keeps session keys on hidden lifecycle events only, preserving the existing
privacy boundary for assistant/tool traffic while making terminal session-state
persistence explicit and test-covered.
Constraint: Hidden channel runs must stay out of Control UI chat/tool streams
Rejected: Broaden sessionKey preservation to every hidden event | would expose more hidden traffic than needed
Confidence: medium
Scope-risk: narrow
Reversibility: clean
Directive: If hidden-run event redaction changes again, keep lifecycle persistence independent from ephemeral run-context lookup
Tested: pnpm exec oxfmt --check --threads=1 CHANGELOG.md src/infra/agent-events.ts src/infra/agent-events.test.ts; pnpm tsgo:core; pnpm tsgo:extensions; pnpm tsgo:core:test; pnpm tsgo:extensions:test; pnpm test src/infra/agent-events.test.ts; pnpm test src/gateway/server-chat.agent-events.test.ts; pnpm test src/gateway/session-lifecycle-state.test.ts; pnpm lint:extensions:bundled; codex exec review returned ship it
Not-tested: Live gateway reproduction against Knox's local stuck-session install
* Clarify hidden lifecycle redaction and cover context fallback
The follow-up review asked for two things: document why the separate error
stream stays redacted for hidden runs, and cover the registered-context fallback
branch for hidden lifecycle events when callers omit sessionKey.
Constraint: Hidden assistant/tool/error diagnostics must remain redacted from Control UI
Rejected: Preserve sessionKey on the generic error stream | terminal persistence already flows through lifecycle phase:error, so widening the visible identity surface is unnecessary
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Keep hidden-run identity exceptions tightly scoped to terminal lifecycle persistence unless a concrete downstream consumer requires more
Tested: pnpm exec oxfmt --write --threads=1 src/infra/agent-events.ts src/infra/agent-events.test.ts; pnpm test src/infra/agent-events.test.ts; pnpm test src/gateway/server-chat.agent-events.test.ts; pnpm test src/gateway/session-lifecycle-state.test.ts
Not-tested: Full repo gate rerun; previous branch-wide gates remain from the parent PR commit
* fix(gateway): keep hidden agent broadcasts redacted
---------
Co-authored-by: Peter Steinberger <steipete@gmail.com>
* fix(heartbeat): clamp scheduler delay to Node setTimeout cap (#71414)
When `agents.defaults.heartbeat.every` resolves to >2_147_483_647 ms
(~24.85d), the previous scheduleNext() called setTimeout with the raw
delay. Node clamps any delay > 2^31-1 to 1 ms, fires the callback, and
the heartbeat re-arms with the same oversized value - a tight loop that
floods the log with TimeoutOverflowWarning and crashes the gateway with
exit code 1.
Clamp the computed delay to HEARTBEAT_MAX_TIMEOUT_MS (2_147_483_647)
before calling setTimeout. The worst case is now one heartbeat every
~24.85d instead of crash-loop. Warn once per process when clamping
fires, so a misconfigured "365d" remains visible without flooding.
This is a defense-in-depth fix at the scheduler layer; loadConfig-level
rejection is a broader change with more blast radius and a separate
question (some users may legitimately want "every: 365d" to mean
"effectively never"). The clamped behaviour is closer to that intent
than the crash is.
Test: new scheduler test sets heartbeat.every="365d" with fake timers,
advances 60s, and asserts runSpy was never called (with the bug, it
would be called ~60_000 times).
* style: format heartbeat scheduler clamp
* fix: share safe timeout delay clamp (#71478) (thanks @hclsys)
---------
Co-authored-by: Peter Steinberger <steipete@gmail.com>
* fix: make cleanup "keep" persist subagent sessions indefinitely
* feat: expose subagent session metadata in sessions list
* fix: include status and timing in sessions_list tool
* fix: hide injected timestamp prefixes in chat ui
* feat: push session list updates over websocket
* feat: expose child subagent sessions in subagents list
* feat: add admin http endpoint to kill sessions
* Emit session.message websocket events for transcript updates
* Estimate session costs in sessions list
* Add direct session history HTTP and SSE endpoints
* Harden dashboard session events and history APIs
* Add session lifecycle gateway methods
* Add dashboard session API improvements
* Add dashboard session model and parent linkage support
* fix: tighten dashboard session API metadata
* Fix dashboard session cost metadata
* Persist accumulated session cost
* fix: stop followup queue drain cfg crash
* Fix dashboard session create and model metadata
* fix: stop guessing session model costs
* Gateway: cache OpenRouter pricing for configured models
* Gateway: add timeout session status
* Fix subagent spawn test config loading
* Gateway: preserve operator scopes without device identity
* Emit user message transcript events and deduplicate plugin warnings
* feat: emit sessions.changed lifecycle event on subagent spawn
Adds a session-lifecycle-events module (similar to transcript-events)
that emits create events when subagents are spawned. The gateway
server.impl.ts listens for these events and broadcasts sessions.changed
with reason=create to SSE subscribers, so dashboards can pick up new
subagent sessions without polling.
* Gateway: allow persistent dashboard orchestrator sessions
* fix: preserve operator scopes for token-authenticated backend clients
Backend clients (like agent-dashboard) that authenticate with a valid gateway
token but don't present a device identity were getting their scopes stripped.
The scope-clearing logic ran before checking the device identity decision,
so even when evaluateMissingDeviceIdentity returned 'allow' (because
roleCanSkipDeviceIdentity passed for token-authed operators), scopes were
already cleared.
Fix: also check decision.kind before clearing scopes, so token-authenticated
operators keep their requested scopes.
* Gateway: allow operator-token session kills
* Fix stale active subagent status after follow-up runs
* Fix dashboard image attachments in sessions send
* Fix completed session follow-up status updates
* feat: stream session tool events to operator UIs
* Add sessions.steer gateway coverage
* Persist subagent timing in session store
* Fix subagent session transcript event keys
* Fix active subagent session status in gateway
* bump session label max to 512
* Fix gateway send session reactivation
* fix: publish terminal session lifecycle state
* feat: change default session reset to effectively never
- Change DEFAULT_RESET_MODE from "daily" to "idle"
- Change DEFAULT_IDLE_MINUTES from 60 to 0 (0 = disabled/never)
- Allow idleMinutes=0 through normalization (don't clamp to 1)
- Treat idleMinutes=0 as "no idle expiry" in evaluateSessionFreshness
- Default behavior: mode "idle" + idleMinutes 0 = sessions never auto-reset
- Update test assertion for new default mode
* fix: prep session management followups (#50101) (thanks @clay-datacurve)
---------
Co-authored-by: Tyler Yust <TYTYYUST@YAHOO.COM>
Complete the stop reason propagation chain so ACP clients can
distinguish end_turn from max_tokens:
- server-chat.ts: emitChatFinal accepts optional stopReason param,
includes it in the final payload, reads it from lifecycle event data
- translator.ts: read stopReason from the final payload instead of
hardcoding end_turn
Chain: LLM API → run.ts (meta.stopReason) → agent.ts (lifecycle event)
→ server-chat.ts (final payload) → ACP translator (PromptResponse)
* fix(gateway): flush throttled delta before emitChatFinal
The 150ms throttle in emitChatDelta can suppress the last text chunk
before emitChatFinal fires, causing streaming clients (e.g. ACP) to
receive truncated responses. The final event carries the complete text,
but clients that build responses incrementally from deltas miss the
tail end.
Flush one last unthrottled delta with the complete buffered text
immediately before sending the final event. This ensures all streaming
consumers have the full response without needing to reconcile deltas
against the final payload.
* fix(gateway): avoid duplicate delta flush when buffer unchanged
Track the text length at the time of the last broadcast. The flush in
emitChatFinal now only sends a delta if the buffer has grown since the
last broadcast, preventing duplicate sends when the final delta passed
the 150ms throttle and was already broadcast.
* fix(gateway): honor heartbeat suppression in final delta flush
* test(gateway): add final delta flush and dedupe coverage
* fix(gateway): skip final flush for silent lead fragments
* docs(changelog): note gateway final-delta flush fix credits
---------
Co-authored-by: Jonathan Taylor <visionik@pobox.com>
Co-authored-by: Vincent Koc <vincentkoc@ieee.org>
The webchat channel sent NO_REPLY as visible text to clients instead
of suppressing it. Other channels (Telegram, Discord) already filter
this token via the reply dispatcher, but the webchat streaming path
bypassed this check.
Fixes#16269