mirror of https://github.com/openclaw/openclaw.git synced 2026-03-13 11:00:50 +00:00

Files

Bob 063e493d3d fix: decouple Discord inbound worker timeout from listener timeout (#36602 ) (thanks @dutifulbob) (#36602 )

Co-authored-by: Onur Solmaz <2453968+osolmaz@users.noreply.github.com>

2026-03-06 00:09:14 +01:00

12 KiB

Raw Permalink Blame History

summary, owner, status, last_updated, title

summary	owner	status	last_updated	title
Status and next steps for decoupling Discord gateway listeners from long-running agent turns with a Discord-specific inbound worker	openclaw	in_progress	2026-03-05	Discord Async Inbound Worker Plan

Discord Async Inbound Worker Plan

Objective

Remove Discord listener timeout as a user-facing failure mode by making inbound Discord turns asynchronous:

Gateway listener accepts and normalizes inbound events quickly.
A Discord run queue stores serialized jobs keyed by the same ordering boundary we use today.
A worker executes the actual agent turn outside the Carbon listener lifetime.
Replies are delivered back to the originating channel or thread after the run completes.

This is the long-term fix for queued Discord runs timing out at channels.discord.eventQueue.listenerTimeout while the agent run itself is still making progress.

Current status

This plan is partially implemented.

Already done:

Discord listener timeout and Discord run timeout are now separate settings.
Accepted inbound Discord turns are enqueued into src/discord/monitor/inbound-worker.ts.
The worker now owns the long-running turn instead of the Carbon listener.
Existing per-route ordering is preserved by queue key.
Timeout regression coverage exists for the Discord worker path.

What this means in plain language:

the production timeout bug is fixed
the long-running turn no longer dies just because the Discord listener budget expires
the worker architecture is not finished yet

What is still missing:

DiscordInboundJob is still only partially normalized and still carries live runtime references
command semantics (stop, new, reset, future session controls) are not yet fully worker-native
worker observability and operator status are still minimal
there is still no restart durability

Why this exists

Current behavior ties the full agent turn to the listener lifetime:

src/discord/monitor/listeners.ts applies the timeout and abort boundary.
src/discord/monitor/message-handler.ts keeps the queued run inside that boundary.
src/discord/monitor/message-handler.process.ts performs media loading, routing, dispatch, typing, draft streaming, and final reply delivery inline.

That architecture has two bad properties:

long but healthy turns can be aborted by the listener watchdog
users can see no reply even when the downstream runtime would have produced one

Raising the timeout helps but does not change the failure mode.

Non-goals

Do not redesign non-Discord channels in this pass.
Do not broaden this into a generic all-channel worker framework in the first implementation.
Do not extract a shared cross-channel inbound worker abstraction yet; only share low-level primitives when duplication is obvious.
Do not add durable crash recovery in the first pass unless needed to land safely.
Do not change route selection, binding semantics, or ACP policy in this plan.

Current constraints

The current Discord processing path still depends on some live runtime objects that should not stay inside the long-term job payload:

Carbon Client
raw Discord event shapes
in-memory guild history map
thread binding manager callbacks
live typing and draft stream state

We already moved execution onto a worker queue, but the normalization boundary is still incomplete. Right now the worker is "run later in the same process with some of the same live objects," not a fully data-only job boundary.

Target architecture

1. Listener stage

DiscordMessageListener remains the ingress point, but its job becomes:

run preflight and policy checks
normalize accepted input into a serializable DiscordInboundJob
enqueue the job into a per-session or per-channel async queue
return immediately to Carbon once the enqueue succeeds

The listener should no longer own the end-to-end LLM turn lifetime.

2. Normalized job payload

Introduce a serializable job descriptor that contains only the data needed to run the turn later.

Minimum shape:

route identity
- agentId
- sessionKey
- accountId
- channel
delivery identity
- destination channel id
- reply target message id
- thread id if present
sender identity
- sender id, label, username, tag
channel context
- guild id
- channel name or slug
- thread metadata
- resolved system prompt override
normalized message body
- base text
- effective message text
- attachment descriptors or resolved media references
gating decisions
- mention requirement outcome
- command authorization outcome
- bound session or agent metadata if applicable

The job payload must not contain live Carbon objects or mutable closures.

Current implementation status:

partially done
src/discord/monitor/inbound-job.ts exists and defines the worker handoff
the payload still contains live Discord runtime context and should be reduced further

3. Worker stage

Add a Discord-specific worker runner responsible for:

reconstructing the turn context from DiscordInboundJob
loading media and any additional channel metadata needed for the run
dispatching the agent turn
delivering final reply payloads
updating status and diagnostics

Recommended location:

src/discord/monitor/inbound-worker.ts
src/discord/monitor/inbound-job.ts

4. Ordering model

Ordering must remain equivalent to today for a given route boundary.

Recommended key:

use the same queue key logic as resolveDiscordRunQueueKey(...)

This preserves existing behavior:

one bound agent conversation does not interleave with itself
different Discord channels can still progress independently

5. Timeout model

After cutover, there are two separate timeout classes:

listener timeout
- only covers normalization and enqueue
- should be short
run timeout
- optional, worker-owned, explicit, and user-visible
- should not be inherited accidentally from Carbon listener settings

This removes the current accidental coupling between "Discord gateway listener stayed alive" and "agent run is healthy."

Recommended implementation phases

Phase 1: normalization boundary

Status: partially implemented
Done:
- extracted buildDiscordInboundJob(...)
- added worker handoff tests
Remaining:
- make DiscordInboundJob plain data only
- move live runtime dependencies to worker-owned services instead of per-job payload
- stop rebuilding process context by stitching live listener refs back into the job

Phase 2: in-memory worker queue

Status: implemented
Done:
- added DiscordInboundWorkerQueue keyed by resolved run queue key
- listener enqueues jobs instead of directly awaiting processDiscordMessage(...)
- worker executes jobs in-process, in memory only

This is the first functional cutover.

Phase 3: process split

Status: not started
Move delivery, typing, and draft streaming ownership behind worker-facing adapters.
Replace direct use of live preflight context with worker context reconstruction.
Keep processDiscordMessage(...) temporarily as a facade if needed, then split it.

Phase 4: command semantics

Status: not started Make sure native Discord commands still behave correctly when work is queued:
stop
new
reset
any future session-control commands

The worker queue must expose enough run state for commands to target the active or queued turn.

Phase 5: observability and operator UX

Status: not started
emit queue depth and active worker counts into monitor status
record enqueue time, start time, finish time, and timeout or cancellation reason
surface worker-owned timeout or delivery failures clearly in logs

Phase 6: optional durability follow-up

Status: not started Only after the in-memory version is stable:
decide whether queued Discord jobs should survive gateway restart
if yes, persist job descriptors and delivery checkpoints
if no, document the explicit in-memory boundary

This should be a separate follow-up unless restart recovery is required to land.

File impact

Current primary files:

src/discord/monitor/listeners.ts
src/discord/monitor/message-handler.ts
src/discord/monitor/message-handler.preflight.ts
src/discord/monitor/message-handler.process.ts
src/discord/monitor/status.ts

Current worker files:

src/discord/monitor/inbound-job.ts
src/discord/monitor/inbound-worker.ts
src/discord/monitor/inbound-job.test.ts
src/discord/monitor/message-handler.queue.test.ts

Likely next touch points:

src/auto-reply/dispatch.ts
src/discord/monitor/reply-delivery.ts
src/discord/monitor/thread-bindings.ts
src/discord/monitor/native-command.ts

Next step now

The next step is to make the worker boundary real instead of partial.

Do this next:

Move live runtime dependencies out of DiscordInboundJob
Keep those dependencies on the Discord worker instance instead
Reduce queued jobs to plain Discord-specific data:
- route identity
- delivery target
- sender info
- normalized message snapshot
- gating and binding decisions
Reconstruct worker execution context from that plain data inside the worker

In practice, that means:

client
threadBindings
guildHistories
discordRestFetch
other mutable runtime-only handles

should stop living on each queued job and instead live on the worker itself or behind worker-owned adapters.

After that lands, the next follow-up should be command-state cleanup for stop, new, and reset.

Testing plan

Keep the existing timeout repro coverage in:

src/discord/monitor/message-handler.queue.test.ts

Add new tests for:

listener returns after enqueue without awaiting full turn
per-route ordering is preserved
different channels still run concurrently
replies are delivered to the original message destination
stop cancels the active worker-owned run
worker failure produces visible diagnostics without blocking later jobs
ACP-bound Discord channels still route correctly under worker execution

Risks and mitigations

Risk: command semantics drift from current synchronous behavior Mitigation: land command-state plumbing in the same cutover, not later
Risk: reply delivery loses thread or reply-to context Mitigation: make delivery identity first-class in DiscordInboundJob
Risk: duplicate sends during retries or queue restarts Mitigation: keep first pass in-memory only, or add explicit delivery idempotency before persistence
Risk: message-handler.process.ts becomes harder to reason about during migration Mitigation: split into normalization, execution, and delivery helpers before or during worker cutover

Acceptance criteria

The plan is complete when:

Discord listener timeout no longer aborts healthy long-running turns.
Listener lifetime and agent-turn lifetime are separate concepts in code.
Existing per-session ordering is preserved.
ACP-bound Discord channels work through the same worker path.
stop targets the worker-owned run instead of the old listener-owned call stack.
Timeout and delivery failures become explicit worker outcomes, not silent listener drops.

Remaining landing strategy

Finish this in follow-up PRs:

make DiscordInboundJob plain-data only and move live runtime refs onto the worker
clean up command-state ownership for stop, new, and reset
add worker observability and operator status
decide whether durability is needed or explicitly document the in-memory boundary

This is still a bounded follow-up if kept Discord-only and if we continue to avoid a premature cross-channel worker abstraction.

12 KiB Raw Permalink Blame History