fix: decouple Discord inbound worker timeout from listener timeout (#36602) (thanks @dutifulbob) (#36602)

Co-authored-by: Onur Solmaz <2453968+osolmaz@users.noreply.github.com>
2026-03-12 07:20:45 +00:00 · 2026-03-06 00:09:14 +01:00
parent 97ea9df57f
commit 063e493d3d
17 changed files with 1047 additions and 253 deletions
--- a/docs/channels/discord.md
+++ b/docs/channels/discord.md
@@ -1102,12 +1102,19 @@ openclaw logs --follow

    - `Listener DiscordMessageListener timed out after 30000ms for event MESSAGE_CREATE`
    - `Slow listener detected ...`
+    - `discord inbound worker timed out after ...`

-    Canonical knob:
+    Listener budget knob:

    - single-account: `channels.discord.eventQueue.listenerTimeout`
    - multi-account: `channels.discord.accounts.<accountId>.eventQueue.listenerTimeout`

+    Worker run timeout knob:
+
+    - single-account: `channels.discord.inboundWorker.runTimeoutMs`
+    - multi-account: `channels.discord.accounts.<accountId>.inboundWorker.runTimeoutMs`
+    - default: `1800000` (30 minutes); set `0` to disable
+
    Recommended baseline:

 ```json5
@@ -1119,6 +1126,9 @@ openclaw logs --follow
          eventQueue: {
            listenerTimeout: 120000,
          },
+          inboundWorker: {
+            runTimeoutMs: 1800000,
+          },
        },
      },
    },
@@ -1126,7 +1136,8 @@ openclaw logs --follow
 }
 ```

-    Tune this first before adding alternate timeout controls elsewhere.
+    Use `eventQueue.listenerTimeout` for slow listener setup and `inboundWorker.runTimeoutMs`
+    only if you want a separate safety valve for queued agent turns.

  </Accordion>

@@ -1177,7 +1188,8 @@ High-signal Discord fields:
 - startup/auth: `enabled`, `token`, `accounts.*`, `allowBots`
 - policy: `groupPolicy`, `dm.*`, `guilds.*`, `guilds.*.channels.*`
 - command: `commands.native`, `commands.useAccessGroups`, `configWrites`, `slashCommand.*`
- event queue: `eventQueue.listenerTimeout` (canonical), `eventQueue.maxQueueSize`, `eventQueue.maxConcurrency`
+- event queue: `eventQueue.listenerTimeout` (listener budget), `eventQueue.maxQueueSize`, `eventQueue.maxConcurrency`
+- inbound worker: `inboundWorker.runTimeoutMs`
 - reply/history: `replyToMode`, `historyLimit`, `dmHistoryLimit`, `dms.*.historyLimit`
 - delivery: `textChunkLimit`, `chunkMode`, `maxLinesPerMessage`
 - streaming: `streaming` (legacy alias: `streamMode`), `draftChunk`, `blockStreaming`, `blockStreamingCoalesce`
--- a/docs/experiments/plans/discord-async-inbound-worker.md
+++ b/docs/experiments/plans/discord-async-inbound-worker.md
@@ -0,0 +1,337 @@
+---
+summary: "Status and next steps for decoupling Discord gateway listeners from long-running agent turns with a Discord-specific inbound worker"
+owner: "openclaw"
+status: "in_progress"
+last_updated: "2026-03-05"
+title: "Discord Async Inbound Worker Plan"
+---
+
+# Discord Async Inbound Worker Plan
+
+## Objective
+
+Remove Discord listener timeout as a user-facing failure mode by making inbound Discord turns asynchronous:
+
+1. Gateway listener accepts and normalizes inbound events quickly.
+2. A Discord run queue stores serialized jobs keyed by the same ordering boundary we use today.
+3. A worker executes the actual agent turn outside the Carbon listener lifetime.
+4. Replies are delivered back to the originating channel or thread after the run completes.
+
+This is the long-term fix for queued Discord runs timing out at `channels.discord.eventQueue.listenerTimeout` while the agent run itself is still making progress.
+
+## Current status
+
+This plan is partially implemented.
+
+Already done:
+
+- Discord listener timeout and Discord run timeout are now separate settings.
+- Accepted inbound Discord turns are enqueued into `src/discord/monitor/inbound-worker.ts`.
+- The worker now owns the long-running turn instead of the Carbon listener.
+- Existing per-route ordering is preserved by queue key.
+- Timeout regression coverage exists for the Discord worker path.
+
+What this means in plain language:
+
+- the production timeout bug is fixed
+- the long-running turn no longer dies just because the Discord listener budget expires
+- the worker architecture is not finished yet
+
+What is still missing:
+
+- `DiscordInboundJob` is still only partially normalized and still carries live runtime references
+- command semantics (`stop`, `new`, `reset`, future session controls) are not yet fully worker-native
+- worker observability and operator status are still minimal
+- there is still no restart durability
+
+## Why this exists
+
+Current behavior ties the full agent turn to the listener lifetime:
+
+- `src/discord/monitor/listeners.ts` applies the timeout and abort boundary.
+- `src/discord/monitor/message-handler.ts` keeps the queued run inside that boundary.
+- `src/discord/monitor/message-handler.process.ts` performs media loading, routing, dispatch, typing, draft streaming, and final reply delivery inline.
+
+That architecture has two bad properties:
+
+- long but healthy turns can be aborted by the listener watchdog
+- users can see no reply even when the downstream runtime would have produced one
+
+Raising the timeout makes aborts less frequent but does not remove the failure mode: any turn that outlives the listener budget is still aborted.
+
+## Non-goals
+
+- Do not redesign non-Discord channels in this pass.
+- Do not broaden this into a generic all-channel worker framework in the first implementation.
+- Do not extract a shared cross-channel inbound worker abstraction yet; only share low-level primitives when duplication is obvious.
+- Do not add durable crash recovery in the first pass unless needed to land safely.
+- Do not change route selection, binding semantics, or ACP policy in this plan.
+
+## Current constraints
+
+The current Discord processing path still depends on some live runtime objects that should not stay inside the long-term job payload:
+
+- Carbon `Client`
+- raw Discord event shapes
+- in-memory guild history map
+- thread binding manager callbacks
+- live typing and draft stream state
+
+We already moved execution onto a worker queue, but the normalization boundary is still incomplete. Right now the worker is "run later in the same process with some of the same live objects," not a fully data-only job boundary.
+
+## Target architecture
+
+### 1. Listener stage
+
+`DiscordMessageListener` remains the ingress point, but its job becomes:
+
+- run preflight and policy checks
+- normalize accepted input into a serializable `DiscordInboundJob`
+- enqueue the job into a per-session or per-channel async queue
+- return immediately to Carbon once the enqueue succeeds
+
+The listener should no longer own the end-to-end LLM turn lifetime.
+
+### 2. Normalized job payload
+
+Introduce a serializable job descriptor that contains only the data needed to run the turn later.
+
+Minimum shape:
+
+- route identity
+  - `agentId`
+  - `sessionKey`
+  - `accountId`
+  - `channel`
+- delivery identity
+  - destination channel id
+  - reply target message id
+  - thread id if present
+- sender identity
+  - sender id, label, username, tag
+- channel context
+  - guild id
+  - channel name or slug
+  - thread metadata
+  - resolved system prompt override
+- normalized message body
+  - base text
+  - effective message text
+  - attachment descriptors or resolved media references
+- gating decisions
+  - mention requirement outcome
+  - command authorization outcome
+  - bound session or agent metadata if applicable
+
+The job payload must not contain live Carbon objects or mutable closures.
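
A minimal TypeScript sketch of what a plain-data descriptor along these lines could look like. Apart from `agentId`, `sessionKey`, `accountId`, and `channel`, every field name here is hypothetical, and this is not the type defined in `src/discord/monitor/inbound-job.ts`. The `isPlainData` helper is likewise an assumption: it shows one way to check that a job carries no functions or class instances before it is enqueued.

```typescript
// Hypothetical plain-data shape; field names are illustrative only.
interface DiscordInboundJobSketch {
  route: { agentId: string; sessionKey: string; accountId: string; channel: string };
  delivery: { channelId: string; replyToMessageId?: string; threadId?: string };
  sender: { id: string; label: string; username: string; tag?: string };
  context: { guildId?: string; channelName?: string; systemPromptOverride?: string };
  message: {
    baseText: string;
    effectiveText: string;
    attachments: { url: string; contentType?: string }[];
  };
  gating: { mentionRequired: boolean; wasMentioned: boolean; commandAuthorized: boolean };
}

// Returns true only for JSON-like values: primitives, arrays, and plain objects.
// Functions, class instances (for example a live client), Maps, and symbols fail.
function isPlainData(value: unknown): boolean {
  if (value === null) return true;
  const kind = typeof value;
  if (kind === "string" || kind === "boolean") return true;
  if (kind === "number") return Number.isFinite(value as number);
  if (kind !== "object") return false; // function, symbol, bigint, undefined
  if (Array.isArray(value)) return value.every((item) => isPlainData(item));
  const proto = Object.getPrototypeOf(value);
  if (proto !== Object.prototype && proto !== null) return false;
  // Optional fields may be present as `undefined`; JSON serialization drops them.
  return Object.values(value as Record<string, unknown>).every(
    (item) => item === undefined || isPlainData(item),
  );
}

// Example job built only from serializable values.
const exampleJob: DiscordInboundJobSketch = {
  route: { agentId: "main", sessionKey: "discord:chan-1", accountId: "default", channel: "discord" },
  delivery: { channelId: "chan-1", replyToMessageId: "msg-42" },
  sender: { id: "user-7", label: "Ada", username: "ada" },
  context: { guildId: "guild-1", channelName: "general" },
  message: { baseText: "hello", effectiveText: "hello", attachments: [] },
  gating: { mentionRequired: true, wasMentioned: true, commandAuthorized: false },
};
```

A check like `isPlainData(job)` at enqueue time would fail as soon as a live handle such as a client instance or a callback is attached to the payload, which is the property this section asks for.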
+
+Current implementation status:
+
+- partially done
+- `src/discord/monitor/inbound-job.ts` exists and defines the worker handoff
+- the payload still contains live Discord runtime context and should be reduced further
+
+### 3. Worker stage
+
+Add a Discord-specific worker runner responsible for:
+
+- reconstructing the turn context from `DiscordInboundJob`
+- loading media and any additional channel metadata needed for the run
+- dispatching the agent turn
+- delivering final reply payloads
+- updating status and diagnostics
+
+Recommended location:
+
+- `src/discord/monitor/inbound-worker.ts`
+- `src/discord/monitor/inbound-job.ts`
+
+### 4. Ordering model
+
+Ordering must remain equivalent to today for a given route boundary.
+
+Recommended key:
+
+- use the same queue key logic as `resolveDiscordRunQueueKey(...)`
+
+This preserves existing behavior:
+
+- one bound agent conversation does not interleave with itself
+- different Discord channels can still progress independently
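
The ordering guarantee above can be sketched as one promise chain per queue key. This is an illustrative stand-in, not the actual `DiscordInboundWorkerQueue`; the class name `KeyedSerialQueue` and its methods are hypothetical, and the key is assumed to be whatever string `resolveDiscordRunQueueKey(...)` would produce.

```typescript
// Hypothetical sketch: serialize tasks per key, run different keys independently.
class KeyedSerialQueue {
  private tails = new Map<string, Promise<void>>();

  // Schedules `task` behind earlier tasks with the same key and returns a
  // promise for its result. Tasks under different keys do not wait on each other.
  enqueue<T>(key: string, task: () => Promise<T>): Promise<T> {
    const previous = this.tails.get(key) ?? Promise.resolve();
    const result = previous.then(() => task());
    // The stored tail swallows failures so one failed job does not block later ones.
    const tail: Promise<void> = result.then(
      () => undefined,
      () => undefined,
    );
    this.tails.set(key, tail);
    // Drop the entry once the key is idle so the map does not grow without bound.
    void tail.then(() => {
      if (this.tails.get(key) === tail) this.tails.delete(key);
    });
    return result;
  }

  // Number of keys that currently have queued or running work.
  get busyKeys(): number {
    return this.tails.size;
  }
}
```

With this shape, two jobs enqueued under the same key run strictly one after the other, while a job under a different key starts without waiting. A rejected job still rejects the promise returned to its caller, but the next job for that key runs regardless.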
+
+### 5. Timeout model
+
+After cutover, there are two separate timeout classes:
+
+- listener timeout
+  - only covers normalization and enqueue
+  - should be short
+- run timeout
+  - optional, worker-owned, explicit, and user-visible
+  - should not be inherited accidentally from Carbon listener settings
+
+This removes the current accidental coupling between "Discord gateway listener stayed alive" and "agent run is healthy."
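
A minimal sketch of a worker-owned run timeout, assuming the run accepts an `AbortSignal`. The helper name `runWithTimeout`, the `RunTimeoutError` class, and its message text are hypothetical. The default of `1800000` ms and the "`0` disables" rule are taken from the `inboundWorker.runTimeoutMs` documentation in this change.

```typescript
// Documented default for `inboundWorker.runTimeoutMs` (30 minutes).
const DEFAULT_RUN_TIMEOUT_MS = 1_800_000;

// Hypothetical error type so callers can tell a timeout from other failures.
class RunTimeoutError extends Error {
  timeoutMs: number;
  constructor(timeoutMs: number) {
    super(`run exceeded ${timeoutMs}ms`);
    this.name = "RunTimeoutError";
    this.timeoutMs = timeoutMs;
  }
}

// Runs `run` with a signal that is aborted once `timeoutMs` elapses.
// A value of 0 disables the timeout entirely.
async function runWithTimeout<T>(
  run: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number = DEFAULT_RUN_TIMEOUT_MS,
): Promise<T> {
  const controller = new AbortController();
  if (timeoutMs <= 0) return run(controller.signal);

  let timer: ReturnType<typeof setTimeout> | undefined;
  const timedOut = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      controller.abort(); // let the run observe cancellation
      reject(new RunTimeoutError(timeoutMs));
    }, timeoutMs);
  });

  try {
    // Whichever settles first wins; the loser's result is ignored.
    return await Promise.race([run(controller.signal), timedOut]);
  } finally {
    clearTimeout(timer);
  }
}
```

The point of the design is that this budget belongs to the worker: nothing here reads the listener's `eventQueue.listenerTimeout`, so the two limits cannot be coupled by accident.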
+
+## Recommended implementation phases
+
+### Phase 1: normalization boundary
+
+- Status: partially implemented
+- Done:
+  - extracted `buildDiscordInboundJob(...)`
+  - added worker handoff tests
+- Remaining:
+  - make `DiscordInboundJob` plain data only
+  - move live runtime dependencies to worker-owned services instead of per-job payload
+  - stop rebuilding process context by stitching live listener refs back into the job
+
+### Phase 2: in-memory worker queue
+
+- Status: implemented
+- Done:
+  - added `DiscordInboundWorkerQueue` keyed by resolved run queue key
+  - listener enqueues jobs instead of directly awaiting `processDiscordMessage(...)`
+  - worker executes jobs in-process, in memory only
+
+This is the first functional cutover.
+
+### Phase 3: process split
+
+- Status: not started
+- Move delivery, typing, and draft streaming ownership behind worker-facing adapters.
+- Replace direct use of live preflight context with worker context reconstruction.
+- Keep `processDiscordMessage(...)` temporarily as a facade if needed, then split it.
+
+### Phase 4: command semantics
+
+- Status: not started
+
+Make sure native Discord commands still behave correctly when work is queued:
+
+- `stop`
+- `new`
+- `reset`
+- any future session-control commands
+
+The worker queue must expose enough run state for commands to target the active or queued turn.
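
One way to expose that run state is a small registry that tracks each turn from enqueue to completion. The sketch below is hypothetical (`RunRegistry` and its methods do not exist in the codebase as far as this plan states), and it deliberately leaves open what `new` and `reset` should do with queued work, since the plan has not decided that.

```typescript
type RunState = "queued" | "active";

interface TrackedRun {
  id: number;
  sessionKey: string;
  state: RunState;
  controller: AbortController;
}

// Hypothetical registry a worker queue could own so commands can find runs.
class RunRegistry {
  private nextId = 1;
  private runs = new Map<number, TrackedRun>();

  // Called at enqueue time so commands can see work that has not started yet.
  track(sessionKey: string): TrackedRun {
    const run: TrackedRun = {
      id: this.nextId++,
      sessionKey,
      state: "queued",
      controller: new AbortController(),
    };
    this.runs.set(run.id, run);
    return run;
  }

  markActive(id: number): void {
    const run = this.runs.get(id);
    if (run) run.state = "active";
  }

  finish(id: number): void {
    this.runs.delete(id);
  }

  // A `stop`-style command: abort the active run for a session, if any.
  stopActive(sessionKey: string): boolean {
    for (const run of Array.from(this.runs.values())) {
      if (run.sessionKey === sessionKey && run.state === "active") {
        run.controller.abort();
        return true;
      }
    }
    return false;
  }

  // Aborts and forgets every tracked run for a session, queued or active.
  // Whether any command should use this is an open design question.
  cancelAll(sessionKey: string): number {
    let cancelled = 0;
    for (const run of Array.from(this.runs.values())) {
      if (run.sessionKey !== sessionKey) continue;
      run.controller.abort();
      this.runs.delete(run.id);
      cancelled += 1;
    }
    return cancelled;
  }
}
```

The worker would pass `run.controller.signal` into the agent turn, so a command handler only needs the session key to reach the right run.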
+
+### Phase 5: observability and operator UX
+
+- Status: not started
+- emit queue depth and active worker counts into monitor status
+- record enqueue time, start time, finish time, and timeout or cancellation reason
+- surface worker-owned timeout or delivery failures clearly in logs
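
The bookkeeping listed above could be sketched as a small stats recorder. All names here (`WorkerStats`, `RunOutcome`, the method names) are hypothetical, and the injectable clock is only there to make the sketch testable. A real implementation would also need to prune finished records.

```typescript
type RunOutcome = "ok" | "timeout" | "cancelled" | "error";

interface RunRecord {
  enqueuedAt: number;
  startedAt?: number;
  finishedAt?: number;
  outcome?: RunOutcome;
}

// Hypothetical recorder for per-job timestamps and aggregate queue counts.
class WorkerStats {
  private records = new Map<string, RunRecord>();
  private now: () => number;

  constructor(now: () => number = Date.now) {
    this.now = now;
  }

  enqueued(jobId: string): void {
    this.records.set(jobId, { enqueuedAt: this.now() });
  }

  started(jobId: string): void {
    const record = this.records.get(jobId);
    if (record) record.startedAt = this.now();
  }

  finished(jobId: string, outcome: RunOutcome): void {
    const record = this.records.get(jobId);
    if (!record) return;
    record.finishedAt = this.now();
    record.outcome = outcome;
  }

  // Queue depth counts jobs not yet started; active counts started but unfinished.
  snapshot(): { queueDepth: number; active: number } {
    let queueDepth = 0;
    let active = 0;
    this.records.forEach((record) => {
      if (record.finishedAt !== undefined) return;
      if (record.startedAt === undefined) queueDepth += 1;
      else active += 1;
    });
    return { queueDepth, active };
  }

  get(jobId: string): RunRecord | undefined {
    return this.records.get(jobId);
  }
}
```

A monitor status endpoint could then report `snapshot()` directly, and a log line for a timed-out run could include the gap between `enqueuedAt` and `startedAt` to separate queue wait from run time.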
+
+### Phase 6: optional durability follow-up
+
+- Status: not started
+
+Only after the in-memory version is stable:
+
+- decide whether queued Discord jobs should survive gateway restart
+- if yes, persist job descriptors and delivery checkpoints
+- if no, document the explicit in-memory boundary
+
+This should be a separate follow-up unless restart recovery is required to land.
+
+## File impact
+
+Current primary files:
+
+- `src/discord/monitor/listeners.ts`
+- `src/discord/monitor/message-handler.ts`
+- `src/discord/monitor/message-handler.preflight.ts`
+- `src/discord/monitor/message-handler.process.ts`
+- `src/discord/monitor/status.ts`
+
+Current worker files:
+
+- `src/discord/monitor/inbound-job.ts`
+- `src/discord/monitor/inbound-worker.ts`
+- `src/discord/monitor/inbound-job.test.ts`
+- `src/discord/monitor/message-handler.queue.test.ts`
+
+Likely next touch points:
+
+- `src/auto-reply/dispatch.ts`
+- `src/discord/monitor/reply-delivery.ts`
+- `src/discord/monitor/thread-bindings.ts`
+- `src/discord/monitor/native-command.ts`
+
+## Next step now
+
+The next step is to make the worker boundary real instead of partial.
+
+Do this next:
+
+1. Move live runtime dependencies out of `DiscordInboundJob`
+2. Keep those dependencies on the Discord worker instance instead
+3. Reduce queued jobs to plain Discord-specific data:
+   - route identity
+   - delivery target
+   - sender info
+   - normalized message snapshot
+   - gating and binding decisions
+4. Reconstruct worker execution context from that plain data inside the worker
+
+In practice, the following should stop living on each queued job and instead live on the worker itself or behind worker-owned adapters:
+
+- `client`
+- `threadBindings`
+- `guildHistories`
+- `discordRestFetch`
+- other mutable runtime-only handles
+
+After that lands, the next follow-up should be command-state cleanup for `stop`, `new`, and `reset`.
+
+## Testing plan
+
+Keep the existing timeout repro coverage in:
+
+- `src/discord/monitor/message-handler.queue.test.ts`
+
+Add new tests for:
+
+1. listener returns after enqueue without awaiting full turn
+2. per-route ordering is preserved
+3. different channels still run concurrently
+4. replies are delivered to the original message destination
+5. `stop` cancels the active worker-owned run
+6. worker failure produces visible diagnostics without blocking later jobs
+7. ACP-bound Discord channels still route correctly under worker execution
+
+## Risks and mitigations
+
+- Risk: command semantics drift from current synchronous behavior
+  Mitigation: land command-state plumbing in the same cutover, not later
+
+- Risk: reply delivery loses thread or reply-to context
+  Mitigation: make delivery identity first-class in `DiscordInboundJob`
+
+- Risk: duplicate sends during retries or queue restarts
+  Mitigation: keep first pass in-memory only, or add explicit delivery idempotency before persistence
+
+- Risk: `message-handler.process.ts` becomes harder to reason about during migration
+  Mitigation: split into normalization, execution, and delivery helpers before or during worker cutover
+
+## Acceptance criteria
+
+The plan is complete when:
+
+1. Discord listener timeout no longer aborts healthy long-running turns.
+2. Listener lifetime and agent-turn lifetime are separate concepts in code.
+3. Existing per-session ordering is preserved.
+4. ACP-bound Discord channels work through the same worker path.
+5. `stop` targets the worker-owned run instead of the old listener-owned call stack.
+6. Timeout and delivery failures become explicit worker outcomes, not silent listener drops.
+
+## Remaining landing strategy
+
+Finish this in follow-up PRs:
+
+1. make `DiscordInboundJob` plain-data only and move live runtime refs onto the worker
+2. clean up command-state ownership for `stop`, `new`, and `reset`
+3. add worker observability and operator status
+4. decide whether durability is needed or explicitly document the in-memory boundary
+
+This is still a bounded follow-up if kept Discord-only and if we continue to avoid a premature cross-channel worker abstraction.