| summary | owner | status | last_updated | title |
|---|---|---|---|---|
| Status and next steps for decoupling Discord gateway listeners from long-running agent turns with a Discord-specific inbound worker | openclaw | in_progress | 2026-03-05 | Discord Async Inbound Worker Plan |
Discord Async Inbound Worker Plan
Objective
Remove Discord listener timeout as a user-facing failure mode by making inbound Discord turns asynchronous:
- Gateway listener accepts and normalizes inbound events quickly.
- A Discord run queue stores serialized jobs keyed by the same ordering boundary we use today.
- A worker executes the actual agent turn outside the Carbon listener lifetime.
- Replies are delivered back to the originating channel or thread after the run completes.
This is the long-term fix for queued Discord runs timing out at channels.discord.eventQueue.listenerTimeout while the agent run itself is still making progress.
Current status
This plan is partially implemented.
Already done:
- Discord listener timeout and Discord run timeout are now separate settings.
- Accepted inbound Discord turns are enqueued into src/discord/monitor/inbound-worker.ts.
- The worker now owns the long-running turn instead of the Carbon listener.
- Existing per-route ordering is preserved by queue key.
- Timeout regression coverage exists for the Discord worker path.
What this means in plain language:
- the production timeout bug is fixed
- the long-running turn no longer dies just because the Discord listener budget expires
- the worker architecture is not finished yet
What is still missing:
- DiscordInboundJob is still only partially normalized and still carries live runtime references
- command semantics (stop, new, reset, future session controls) are not yet fully worker-native
- worker observability and operator status are still minimal
- there is still no restart durability
Why this exists
Current behavior ties the full agent turn to the listener lifetime:
- src/discord/monitor/listeners.ts applies the timeout and abort boundary.
- src/discord/monitor/message-handler.ts keeps the queued run inside that boundary.
- src/discord/monitor/message-handler.process.ts performs media loading, routing, dispatch, typing, draft streaming, and final reply delivery inline.
That architecture has two bad properties:
- long but healthy turns can be aborted by the listener watchdog
- users can see no reply even when the downstream runtime would have produced one
Raising the timeout helps but does not change the failure mode.
Non-goals
- Do not redesign non-Discord channels in this pass.
- Do not broaden this into a generic all-channel worker framework in the first implementation.
- Do not extract a shared cross-channel inbound worker abstraction yet; only share low-level primitives when duplication is obvious.
- Do not add durable crash recovery in the first pass unless needed to land safely.
- Do not change route selection, binding semantics, or ACP policy in this plan.
Current constraints
The current Discord processing path still depends on some live runtime objects that should not stay inside the long-term job payload:
- Carbon Client
- raw Discord event shapes
- in-memory guild history map
- thread binding manager callbacks
- live typing and draft stream state
We already moved execution onto a worker queue, but the normalization boundary is still incomplete. Right now the worker is "run later in the same process with some of the same live objects," not a fully data-only job boundary.
Target architecture
1. Listener stage
DiscordMessageListener remains the ingress point, but its job becomes:
- run preflight and policy checks
- normalize accepted input into a serializable DiscordInboundJob
- enqueue the job into a per-session or per-channel async queue
- return immediately to Carbon once the enqueue succeeds
The listener should no longer own the end-to-end LLM turn lifetime.
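The listener responsibilities above can be sketched as follows. This is a minimal illustration, not the actual listener: RawInboundEvent, InboundJob, passesPreflight, normalize, and handleInbound are hypothetical names, and the preflight rule is a placeholder. The point is that the listener copies plain data out of the event, enqueues it, and returns without awaiting the turn.

```typescript
// Hypothetical shapes for illustration; the real listener and job types differ.
interface RawInboundEvent {
  channelId: string;
  messageId: string;
  authorId: string;
  content: string;
  isBot: boolean;
}

interface InboundJob {
  queueKey: string;
  channelId: string;
  replyToMessageId: string;
  senderId: string;
  text: string;
}

type ListenerResult =
  | { accepted: true; job: InboundJob }
  | { accepted: false; reason: string };

// Placeholder for the real preflight and policy checks.
function passesPreflight(event: RawInboundEvent): boolean {
  return !event.isBot && event.content.trim().length > 0;
}

// Copy only plain data out of the event: no live client, no closures.
function normalize(event: RawInboundEvent): InboundJob {
  return {
    queueKey: `discord:${event.channelId}`, // assumed key format
    channelId: event.channelId,
    replyToMessageId: event.messageId,
    senderId: event.authorId,
    text: event.content.trim(),
  };
}

// The listener only enqueues. It never awaits the agent turn.
function handleInbound(
  event: RawInboundEvent,
  enqueue: (job: InboundJob) => void,
): ListenerResult {
  if (!passesPreflight(event)) {
    return { accepted: false, reason: "preflight rejected" };
  }
  const job = normalize(event);
  enqueue(job);
  return { accepted: true, job };
}
```

Because handleInbound is synchronous in this sketch, the listener budget only has to cover preflight, normalization, and the enqueue call.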
2. Normalized job payload
Introduce a serializable job descriptor that contains only the data needed to run the turn later.
Minimum shape:
- route identity
  - agentId
  - sessionKey
  - accountId
  - channel
- delivery identity
  - destination channel id
  - reply target message id
  - thread id if present
- sender identity
  - sender id, label, username, tag
- channel context
  - guild id
  - channel name or slug
  - thread metadata
  - resolved system prompt override
- normalized message body
  - base text
  - effective message text
  - attachment descriptors or resolved media references
- gating decisions
  - mention requirement outcome
  - command authorization outcome
  - bound session or agent metadata if applicable
The job payload must not contain live Carbon objects or mutable closures.
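One way to picture the minimum shape is as a TypeScript interface with a guard that rejects anything that is not plain data. This is a sketch only: the interface name, the field names beyond agentId, sessionKey, accountId, and channel, and the isPlainData helper are all assumptions made for illustration, not the shape defined in inbound-job.ts.

```typescript
// Hypothetical layout derived from the minimum shape listed above.
interface DiscordInboundJobSketch {
  route: { agentId: string; sessionKey: string; accountId: string; channel: string };
  delivery: { channelId: string; replyToMessageId?: string; threadId?: string };
  sender: { id: string; label: string; username: string; tag: string };
  channelContext: { guildId?: string; channelName?: string; systemPromptOverride?: string };
  message: { baseText: string; effectiveText: string; attachments: { url: string }[] };
  gating: { mentionSatisfied: boolean; commandAuthorized: boolean };
}

// Returns true only for JSON-like data: primitives, arrays, and plain objects.
// Functions, class instances (clients, Maps, managers), and symbols are rejected.
function isPlainData(value: unknown): boolean {
  if (value === null) return true;
  const kind = typeof value;
  if (kind === "string" || kind === "boolean") return true;
  if (kind === "number") return isFinite(value as number);
  if (kind !== "object") return false;
  if (Array.isArray(value)) return value.every((item) => isPlainData(item));
  const proto = Object.getPrototypeOf(value);
  if (proto !== Object.prototype && proto !== null) return false;
  const record = value as Record<string, unknown>;
  // Properties set to undefined are tolerated because JSON serialization omits them.
  return Object.keys(record).every(
    (key) => record[key] === undefined || isPlainData(record[key]),
  );
}
```

A guard like this could run in a test or at enqueue time to catch a live handle sneaking back into the payload.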
Current implementation status:
- partially done
- src/discord/monitor/inbound-job.ts exists and defines the worker handoff
- the payload still contains live Discord runtime context and should be reduced further
3. Worker stage
Add a Discord-specific worker runner responsible for:
- reconstructing the turn context from DiscordInboundJob
- loading media and any additional channel metadata needed for the run
- dispatching the agent turn
- delivering final reply payloads
- updating status and diagnostics
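The responsibilities listed above could be arranged roughly like this. It is a sketch under assumptions: WorkerJob, WorkerDeps, and runInboundJob are hypothetical names, and the real runner handles typing, draft streaming, and richer reply payloads that are left out here.

```typescript
// Hypothetical job and dependency shapes for illustration only.
interface WorkerJob {
  jobId: string;
  channelId: string;
  replyToMessageId?: string;
  text: string;
  attachmentUrls: string[];
}

interface WorkerDeps {
  loadMedia(urls: string[]): Promise<string[]>;
  runAgentTurn(input: { text: string; media: string[] }): Promise<string[]>;
  deliverReply(
    target: { channelId: string; replyToMessageId?: string },
    text: string,
  ): Promise<void>;
  recordStatus(jobId: string, status: "running" | "delivered" | "failed", detail?: string): void;
}

// Runs one job end to end. Failures are recorded instead of thrown so that
// a bad job does not block later jobs on the same queue.
async function runInboundJob(job: WorkerJob, deps: WorkerDeps): Promise<void> {
  deps.recordStatus(job.jobId, "running");
  try {
    const media = await deps.loadMedia(job.attachmentUrls);
    const replies = await deps.runAgentTurn({ text: job.text, media });
    for (const reply of replies) {
      await deps.deliverReply(
        { channelId: job.channelId, replyToMessageId: job.replyToMessageId },
        reply,
      );
    }
    deps.recordStatus(job.jobId, "delivered");
  } catch (err) {
    const detail = err instanceof Error ? err.message : String(err);
    deps.recordStatus(job.jobId, "failed", detail);
  }
}
```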
Recommended location:
- src/discord/monitor/inbound-worker.ts
- src/discord/monitor/inbound-job.ts
4. Ordering model
Ordering must remain equivalent to today for a given route boundary.
Recommended key:
- use the same queue key logic as resolveDiscordRunQueueKey(...)
This preserves existing behavior:
- one bound agent conversation does not interleave with itself
- different Discord channels can still progress independently
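The ordering guarantee can be illustrated with a small keyed queue that chains promises per key. KeyedSerialQueue is a hypothetical stand-in, not the actual DiscordInboundWorkerQueue, and the keys here are arbitrary strings rather than values produced by resolveDiscordRunQueueKey(...).

```typescript
// Minimal per-key serial queue: tasks with the same key run one at a time in
// enqueue order, while tasks with different keys progress independently.
class KeyedSerialQueue {
  private tails = new Map<string, Promise<void>>();

  enqueue(key: string, task: () => Promise<void>): Promise<void> {
    const previous = this.tails.get(key) ?? Promise.resolve();
    // Start this task only after the previous task for the same key settles.
    const run = previous.then(() => task());
    // The stored tail swallows errors so one failed task does not block the key.
    const tail = run.catch(() => undefined);
    this.tails.set(key, tail);
    // Drop the entry once the key is idle so the map does not grow forever.
    void tail.then(() => {
      if (this.tails.get(key) === tail) this.tails.delete(key);
    });
    return run;
  }
}
```

The returned promise still rejects for the caller when the task fails, so failures can be surfaced as worker outcomes while later jobs for the same key keep running.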
5. Timeout model
After cutover, there are two separate timeout classes:
- listener timeout
  - only covers normalization and enqueue
  - should be short
- run timeout
  - optional, worker-owned, explicit, and user-visible
  - should not be inherited accidentally from Carbon listener settings
This removes the current accidental coupling between "Discord gateway listener stayed alive" and "agent run is healthy."
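A worker-owned run timeout might look like the following sketch. runWithRunTimeout and RunOutcome are hypothetical names, and the shape of the outcome is an assumption. The timeout is passed in explicitly, is optional, and produces a visible outcome rather than a silent drop.

```typescript
type RunOutcome<T> =
  | { status: "completed"; value: T }
  | { status: "timed_out"; afterMs: number };

// Runs a turn with an optional, explicit run timeout. Passing undefined
// means the run has no timeout at all; nothing is inherited from the listener.
async function runWithRunTimeout<T>(
  task: (signal: AbortSignal) => Promise<T>,
  runTimeoutMs: number | undefined,
): Promise<RunOutcome<T>> {
  const controller = new AbortController();
  if (runTimeoutMs === undefined) {
    return { status: "completed", value: await task(controller.signal) };
  }

  let reportTimeout: () => void = () => undefined;
  const timedOut = new Promise<RunOutcome<T>>((resolve) => {
    reportTimeout = () => resolve({ status: "timed_out", afterMs: runTimeoutMs });
  });
  const timer = setTimeout(() => {
    // Settle the timeout outcome first, then tell the task to stop.
    reportTimeout();
    controller.abort();
  }, runTimeoutMs);

  const completed = task(controller.signal).then(
    (value): RunOutcome<T> => ({ status: "completed", value }),
  );
  try {
    return await Promise.race([completed, timedOut]);
  } finally {
    clearTimeout(timer);
  }
}
```

A task error that occurs before the timeout still propagates to the caller, so the worker can report it as a failure distinct from a timeout.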
Recommended implementation phases
Phase 1: normalization boundary
- Status: partially implemented
- Done:
  - extracted buildDiscordInboundJob(...)
  - added worker handoff tests
- Remaining:
  - make DiscordInboundJob plain data only
  - move live runtime dependencies to worker-owned services instead of per-job payload
  - stop rebuilding process context by stitching live listener refs back into the job
Phase 2: in-memory worker queue
- Status: implemented
- Done:
  - added DiscordInboundWorkerQueue keyed by resolved run queue key
  - listener enqueues jobs instead of directly awaiting processDiscordMessage(...)
  - worker executes jobs in-process, in memory only
This is the first functional cutover.
Phase 3: process split
- Status: not started
- Move delivery, typing, and draft streaming ownership behind worker-facing adapters.
- Replace direct use of live preflight context with worker context reconstruction.
- Keep processDiscordMessage(...) temporarily as a facade if needed, then split it.
Phase 4: command semantics
- Status: not started
Make sure native Discord commands still behave correctly when work is queued:
- stop
- new
- reset
- any future session-control commands
The worker queue must expose enough run state for commands to target the active or queued turn.
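As a rough illustration of "enough run state", the sketch below tracks queued and active runs per queue key and lets a stop command abort them. RunStateRegistry and its methods are hypothetical, and the choice that stop also cancels queued turns for the same key is an assumption; the intended semantics for stop, new, and reset are not settled by this plan.

```typescript
// Hypothetical registry; none of these names come from the codebase.
type RunState = "queued" | "active";

interface TrackedRun {
  jobId: string;
  state: RunState;
  controller: AbortController;
}

class RunStateRegistry {
  private runsByKey = new Map<string, TrackedRun[]>();

  // Called at enqueue time. The worker passes the returned signal into the turn.
  addQueued(queueKey: string, jobId: string): AbortSignal {
    const run: TrackedRun = { jobId, state: "queued", controller: new AbortController() };
    const runs = this.runsByKey.get(queueKey) ?? [];
    runs.push(run);
    this.runsByKey.set(queueKey, runs);
    return run.controller.signal;
  }

  markActive(queueKey: string, jobId: string): void {
    const run = (this.runsByKey.get(queueKey) ?? []).find((r) => r.jobId === jobId);
    if (run) run.state = "active";
  }

  finish(queueKey: string, jobId: string): void {
    const remaining = (this.runsByKey.get(queueKey) ?? []).filter((r) => r.jobId !== jobId);
    if (remaining.length > 0) this.runsByKey.set(queueKey, remaining);
    else this.runsByKey.delete(queueKey);
  }

  // Aborts every tracked run for the key and reports what was targeted.
  stop(queueKey: string): { stoppedActive: string[]; cancelledQueued: string[] } {
    const stoppedActive: string[] = [];
    const cancelledQueued: string[] = [];
    (this.runsByKey.get(queueKey) ?? []).forEach((run) => {
      run.controller.abort();
      if (run.state === "active") stoppedActive.push(run.jobId);
      else cancelledQueued.push(run.jobId);
    });
    return { stoppedActive, cancelledQueued };
  }
}
```

With this shape, the worker would check signal.aborted before starting a queued job and pass the signal into the active turn so that stop reaches the worker-owned run.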
Phase 5: observability and operator UX
- Status: not started
- emit queue depth and active worker counts into monitor status
- record enqueue time, start time, finish time, and timeout or cancellation reason
- surface worker-owned timeout or delivery failures clearly in logs
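The timing fields and counters above could be captured with a small in-memory tracker like this one. WorkerTimeline, JobTimeline, and FinishReason are hypothetical names, and how the numbers reach monitor status is left open.

```typescript
type FinishReason = "completed" | "timeout" | "cancelled" | "error";

interface JobTimeline {
  jobId: string;
  queueKey: string;
  enqueuedAt: number;
  startedAt?: number;
  finishedAt?: number;
  finishReason?: FinishReason;
}

class WorkerTimeline {
  private jobs = new Map<string, JobTimeline>();
  private now: () => number;

  // The clock is injectable so tests can control time.
  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  enqueued(jobId: string, queueKey: string): void {
    this.jobs.set(jobId, { jobId, queueKey, enqueuedAt: this.now() });
  }

  started(jobId: string): void {
    const job = this.jobs.get(jobId);
    if (job) job.startedAt = this.now();
  }

  finished(jobId: string, reason: FinishReason): void {
    const job = this.jobs.get(jobId);
    if (job) {
      job.finishedAt = this.now();
      job.finishReason = reason;
    }
  }

  // Queue depth counts jobs not yet started; active workers counts started, unfinished jobs.
  status(): { queueDepth: number; activeWorkers: number } {
    let queueDepth = 0;
    let activeWorkers = 0;
    this.jobs.forEach((job) => {
      if (job.finishedAt !== undefined) return;
      if (job.startedAt === undefined) queueDepth += 1;
      else activeWorkers += 1;
    });
    return { queueDepth, activeWorkers };
  }

  get(jobId: string): JobTimeline | undefined {
    return this.jobs.get(jobId);
  }
}
```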
Phase 6: optional durability follow-up
- Status: not started
Only after the in-memory version is stable:
- decide whether queued Discord jobs should survive gateway restart
- if yes, persist job descriptors and delivery checkpoints
- if no, document the explicit in-memory boundary
This should be a separate follow-up unless restart recovery is required to land.
File impact
Current primary files:
- src/discord/monitor/listeners.ts
- src/discord/monitor/message-handler.ts
- src/discord/monitor/message-handler.preflight.ts
- src/discord/monitor/message-handler.process.ts
- src/discord/monitor/status.ts
Current worker files:
- src/discord/monitor/inbound-job.ts
- src/discord/monitor/inbound-worker.ts
- src/discord/monitor/inbound-job.test.ts
- src/discord/monitor/message-handler.queue.test.ts
Likely next touch points:
- src/auto-reply/dispatch.ts
- src/discord/monitor/reply-delivery.ts
- src/discord/monitor/thread-bindings.ts
- src/discord/monitor/native-command.ts
Next step now
The next step is to make the worker boundary real instead of partial.
Do this next:
- Move live runtime dependencies out of DiscordInboundJob
- Keep those dependencies on the Discord worker instance instead
- Reduce queued jobs to plain Discord-specific data:
  - route identity
  - delivery target
  - sender info
  - normalized message snapshot
  - gating and binding decisions
- Reconstruct worker execution context from that plain data inside the worker
In practice, that means:
- client
- threadBindings
- guildHistories
- discordRestFetch
- other mutable runtime-only handles
should stop living on each queued job and instead live on the worker itself or behind worker-owned adapters.
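The split between worker-owned handles and plain queued data can be sketched like this. WorkerRuntime, PlainJob, and InboundWorkerSketch are hypothetical, and the client and guildHistories members are simplified stand-ins for the real handles named above, not their actual types.

```typescript
// Runtime-only handles, held once by the worker rather than copied onto each job.
interface WorkerRuntime {
  client: { sendMessage(channelId: string, text: string): Promise<void> };
  guildHistories: Map<string, string[]>;
}

// Queued jobs carry identifiers and text only, so they stay serializable.
interface PlainJob {
  channelId: string;
  guildId?: string;
  text: string;
}

class InboundWorkerSketch {
  private runtime: WorkerRuntime;

  constructor(runtime: WorkerRuntime) {
    this.runtime = runtime;
  }

  // Execution context is rebuilt from the job's ids plus worker-owned state.
  async run(
    job: PlainJob,
    runTurn: (ctx: { text: string; history: string[] }) => Promise<string>,
  ): Promise<void> {
    const history = job.guildId
      ? (this.runtime.guildHistories.get(job.guildId) ?? [])
      : [];
    const reply = await runTurn({ text: job.text, history });
    await this.runtime.client.sendMessage(job.channelId, reply);
  }
}
```

Because the job holds a guild id instead of the history map itself, the same job value would behave identically after a JSON round trip.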
After that lands, the next follow-up should be command-state cleanup for stop, new, and reset.
Testing plan
Keep the existing timeout repro coverage in:
src/discord/monitor/message-handler.queue.test.ts
Add new tests for:
- listener returns after enqueue without awaiting full turn
- per-route ordering is preserved
- different channels still run concurrently
- replies are delivered to the original message destination
- stop cancels the active worker-owned run
- worker failure produces visible diagnostics without blocking later jobs
- ACP-bound Discord channels still route correctly under worker execution
Risks and mitigations
- Risk: command semantics drift from current synchronous behavior
  Mitigation: land command-state plumbing in the same cutover, not later
- Risk: reply delivery loses thread or reply-to context
  Mitigation: make delivery identity first-class in DiscordInboundJob
- Risk: duplicate sends during retries or queue restarts
  Mitigation: keep first pass in-memory only, or add explicit delivery idempotency before persistence
- Risk: message-handler.process.ts becomes harder to reason about during migration
  Mitigation: split into normalization, execution, and delivery helpers before or during worker cutover
Acceptance criteria
The plan is complete when:
- Discord listener timeout no longer aborts healthy long-running turns.
- Listener lifetime and agent-turn lifetime are separate concepts in code.
- Existing per-session ordering is preserved.
- ACP-bound Discord channels work through the same worker path.
- stop targets the worker-owned run instead of the old listener-owned call stack.
- Timeout and delivery failures become explicit worker outcomes, not silent listener drops.
Remaining landing strategy
Finish this in follow-up PRs:
- make DiscordInboundJob plain-data only and move live runtime refs onto the worker
- clean up command-state ownership for stop, new, and reset
- add worker observability and operator status
- decide whether durability is needed or explicitly document the in-memory boundary
This is still a bounded follow-up if kept Discord-only and if we continue to avoid a premature cross-channel worker abstraction.