mirror of
https://github.com/openclaw/openclaw.git
synced 2026-03-12 07:20:45 +00:00
fix: decouple Discord inbound worker timeout from listener timeout (#36602) (thanks @dutifulbob) (#36602)
Co-authored-by: Onur Solmaz <2453968+osolmaz@users.noreply.github.com>
This commit is contained in:
337
docs/experiments/plans/discord-async-inbound-worker.md
Normal file
337
docs/experiments/plans/discord-async-inbound-worker.md
Normal file
@@ -0,0 +1,337 @@
|
||||
---
|
||||
summary: "Status and next steps for decoupling Discord gateway listeners from long-running agent turns with a Discord-specific inbound worker"
|
||||
owner: "openclaw"
|
||||
status: "in_progress"
|
||||
last_updated: "2026-03-05"
|
||||
title: "Discord Async Inbound Worker Plan"
|
||||
---
|
||||
|
||||
# Discord Async Inbound Worker Plan
|
||||
|
||||
## Objective
|
||||
|
||||
Remove Discord listener timeout as a user-facing failure mode by making inbound Discord turns asynchronous:
|
||||
|
||||
1. Gateway listener accepts and normalizes inbound events quickly.
|
||||
2. A Discord run queue stores serialized jobs keyed by the same ordering boundary we use today.
|
||||
3. A worker executes the actual agent turn outside the Carbon listener lifetime.
|
||||
4. Replies are delivered back to the originating channel or thread after the run completes.
|
||||
|
||||
This is the long-term fix for queued Discord runs timing out at `channels.discord.eventQueue.listenerTimeout` while the agent run itself is still making progress.
|
||||
|
||||
## Current status
|
||||
|
||||
This plan is partially implemented.
|
||||
|
||||
Already done:
|
||||
|
||||
- Discord listener timeout and Discord run timeout are now separate settings.
|
||||
- Accepted inbound Discord turns are enqueued into `src/discord/monitor/inbound-worker.ts`.
|
||||
- The worker now owns the long-running turn instead of the Carbon listener.
|
||||
- Existing per-route ordering is preserved by queue key.
|
||||
- Timeout regression coverage exists for the Discord worker path.
|
||||
|
||||
What this means in plain language:
|
||||
|
||||
- the production timeout bug is fixed
|
||||
- the long-running turn no longer dies just because the Discord listener budget expires
|
||||
- the worker architecture is not finished yet
|
||||
|
||||
What is still missing:
|
||||
|
||||
- `DiscordInboundJob` is still only partially normalized and still carries live runtime references
|
||||
- command semantics (`stop`, `new`, `reset`, future session controls) are not yet fully worker-native
|
||||
- worker observability and operator status are still minimal
|
||||
- there is still no restart durability
|
||||
|
||||
## Why this exists
|
||||
|
||||
Current behavior ties the full agent turn to the listener lifetime:
|
||||
|
||||
- `src/discord/monitor/listeners.ts` applies the timeout and abort boundary.
|
||||
- `src/discord/monitor/message-handler.ts` keeps the queued run inside that boundary.
|
||||
- `src/discord/monitor/message-handler.process.ts` performs media loading, routing, dispatch, typing, draft streaming, and final reply delivery inline.
|
||||
|
||||
That architecture has two bad properties:
|
||||
|
||||
- long but healthy turns can be aborted by the listener watchdog
|
||||
- users can see no reply even when the downstream runtime would have produced one
|
||||
|
||||
Raising the timeout helps but does not change the failure mode.
|
||||
|
||||
## Non-goals
|
||||
|
||||
- Do not redesign non-Discord channels in this pass.
|
||||
- Do not broaden this into a generic all-channel worker framework in the first implementation.
|
||||
- Do not extract a shared cross-channel inbound worker abstraction yet; only share low-level primitives when duplication is obvious.
|
||||
- Do not add durable crash recovery in the first pass unless needed to land safely.
|
||||
- Do not change route selection, binding semantics, or ACP policy in this plan.
|
||||
|
||||
## Current constraints
|
||||
|
||||
The current Discord processing path still depends on some live runtime objects that should not stay inside the long-term job payload:
|
||||
|
||||
- Carbon `Client`
|
||||
- raw Discord event shapes
|
||||
- in-memory guild history map
|
||||
- thread binding manager callbacks
|
||||
- live typing and draft stream state
|
||||
|
||||
We already moved execution onto a worker queue, but the normalization boundary is still incomplete. Right now the worker is "run later in the same process with some of the same live objects," not a fully data-only job boundary.
|
||||
|
||||
## Target architecture
|
||||
|
||||
### 1. Listener stage
|
||||
|
||||
`DiscordMessageListener` remains the ingress point, but its job becomes:
|
||||
|
||||
- run preflight and policy checks
|
||||
- normalize accepted input into a serializable `DiscordInboundJob`
|
||||
- enqueue the job into a per-session or per-channel async queue
|
||||
- return immediately to Carbon once the enqueue succeeds
|
||||
|
||||
The listener should no longer own the end-to-end LLM turn lifetime.
|
||||
|
||||
### 2. Normalized job payload
|
||||
|
||||
Introduce a serializable job descriptor that contains only the data needed to run the turn later.
|
||||
|
||||
Minimum shape:
|
||||
|
||||
- route identity
|
||||
- `agentId`
|
||||
- `sessionKey`
|
||||
- `accountId`
|
||||
- `channel`
|
||||
- delivery identity
|
||||
- destination channel id
|
||||
- reply target message id
|
||||
- thread id if present
|
||||
- sender identity
|
||||
- sender id, label, username, tag
|
||||
- channel context
|
||||
- guild id
|
||||
- channel name or slug
|
||||
- thread metadata
|
||||
- resolved system prompt override
|
||||
- normalized message body
|
||||
- base text
|
||||
- effective message text
|
||||
- attachment descriptors or resolved media references
|
||||
- gating decisions
|
||||
- mention requirement outcome
|
||||
- command authorization outcome
|
||||
- bound session or agent metadata if applicable
|
||||
|
||||
The job payload must not contain live Carbon objects or mutable closures.
|
||||
|
||||
Current implementation status:
|
||||
|
||||
- partially done
|
||||
- `src/discord/monitor/inbound-job.ts` exists and defines the worker handoff
|
||||
- the payload still contains live Discord runtime context and should be reduced further
|
||||
|
||||
### 3. Worker stage
|
||||
|
||||
Add a Discord-specific worker runner responsible for:
|
||||
|
||||
- reconstructing the turn context from `DiscordInboundJob`
|
||||
- loading media and any additional channel metadata needed for the run
|
||||
- dispatching the agent turn
|
||||
- delivering final reply payloads
|
||||
- updating status and diagnostics
|
||||
|
||||
Recommended location:
|
||||
|
||||
- `src/discord/monitor/inbound-worker.ts`
|
||||
- `src/discord/monitor/inbound-job.ts`
|
||||
|
||||
### 4. Ordering model
|
||||
|
||||
Ordering must remain equivalent to today for a given route boundary.
|
||||
|
||||
Recommended key:
|
||||
|
||||
- use the same queue key logic as `resolveDiscordRunQueueKey(...)`
|
||||
|
||||
This preserves existing behavior:
|
||||
|
||||
- one bound agent conversation does not interleave with itself
|
||||
- different Discord channels can still progress independently
|
||||
|
||||
### 5. Timeout model
|
||||
|
||||
After cutover, there are two separate timeout classes:
|
||||
|
||||
- listener timeout
|
||||
- only covers normalization and enqueue
|
||||
- should be short
|
||||
- run timeout
|
||||
- optional, worker-owned, explicit, and user-visible
|
||||
- should not be inherited accidentally from Carbon listener settings
|
||||
|
||||
This removes the current accidental coupling between "Discord gateway listener stayed alive" and "agent run is healthy."
|
||||
|
||||
## Recommended implementation phases
|
||||
|
||||
### Phase 1: normalization boundary
|
||||
|
||||
- Status: partially implemented
|
||||
- Done:
|
||||
- extracted `buildDiscordInboundJob(...)`
|
||||
- added worker handoff tests
|
||||
- Remaining:
|
||||
- make `DiscordInboundJob` plain data only
|
||||
- move live runtime dependencies to worker-owned services instead of per-job payload
|
||||
- stop rebuilding process context by stitching live listener refs back into the job
|
||||
|
||||
### Phase 2: in-memory worker queue
|
||||
|
||||
- Status: implemented
|
||||
- Done:
|
||||
- added `DiscordInboundWorkerQueue` keyed by resolved run queue key
|
||||
- listener enqueues jobs instead of directly awaiting `processDiscordMessage(...)`
|
||||
- worker executes jobs in-process, in memory only
|
||||
|
||||
This is the first functional cutover.
|
||||
|
||||
### Phase 3: process split
|
||||
|
||||
- Status: not started
|
||||
- Move delivery, typing, and draft streaming ownership behind worker-facing adapters.
|
||||
- Replace direct use of live preflight context with worker context reconstruction.
|
||||
- Keep `processDiscordMessage(...)` temporarily as a facade if needed, then split it.
|
||||
|
||||
### Phase 4: command semantics
|
||||
|
||||
- Status: not started
|
||||
Make sure native Discord commands still behave correctly when work is queued:
|
||||
|
||||
- `stop`
|
||||
- `new`
|
||||
- `reset`
|
||||
- any future session-control commands
|
||||
|
||||
The worker queue must expose enough run state for commands to target the active or queued turn.
|
||||
|
||||
### Phase 5: observability and operator UX
|
||||
|
||||
- Status: not started
|
||||
- emit queue depth and active worker counts into monitor status
|
||||
- record enqueue time, start time, finish time, and timeout or cancellation reason
|
||||
- surface worker-owned timeout or delivery failures clearly in logs
|
||||
|
||||
### Phase 6: optional durability follow-up
|
||||
|
||||
- Status: not started
|
||||
Only after the in-memory version is stable:
|
||||
|
||||
- decide whether queued Discord jobs should survive gateway restart
|
||||
- if yes, persist job descriptors and delivery checkpoints
|
||||
- if no, document the explicit in-memory boundary
|
||||
|
||||
This should be a separate follow-up unless restart recovery is required to land.
|
||||
|
||||
## File impact
|
||||
|
||||
Current primary files:
|
||||
|
||||
- `src/discord/monitor/listeners.ts`
|
||||
- `src/discord/monitor/message-handler.ts`
|
||||
- `src/discord/monitor/message-handler.preflight.ts`
|
||||
- `src/discord/monitor/message-handler.process.ts`
|
||||
- `src/discord/monitor/status.ts`
|
||||
|
||||
Current worker files:
|
||||
|
||||
- `src/discord/monitor/inbound-job.ts`
|
||||
- `src/discord/monitor/inbound-worker.ts`
|
||||
- `src/discord/monitor/inbound-job.test.ts`
|
||||
- `src/discord/monitor/message-handler.queue.test.ts`
|
||||
|
||||
Likely next touch points:
|
||||
|
||||
- `src/auto-reply/dispatch.ts`
|
||||
- `src/discord/monitor/reply-delivery.ts`
|
||||
- `src/discord/monitor/thread-bindings.ts`
|
||||
- `src/discord/monitor/native-command.ts`
|
||||
|
||||
## Next step now
|
||||
|
||||
The next step is to make the worker boundary real instead of partial.
|
||||
|
||||
Do this next:
|
||||
|
||||
1. Move live runtime dependencies out of `DiscordInboundJob`
|
||||
2. Keep those dependencies on the Discord worker instance instead
|
||||
3. Reduce queued jobs to plain Discord-specific data:
|
||||
- route identity
|
||||
- delivery target
|
||||
- sender info
|
||||
- normalized message snapshot
|
||||
- gating and binding decisions
|
||||
4. Reconstruct worker execution context from that plain data inside the worker
|
||||
|
||||
In practice, that means:
|
||||
|
||||
- `client`
|
||||
- `threadBindings`
|
||||
- `guildHistories`
|
||||
- `discordRestFetch`
|
||||
- other mutable runtime-only handles
|
||||
|
||||
should stop living on each queued job and instead live on the worker itself or behind worker-owned adapters.
|
||||
|
||||
After that lands, the next follow-up should be command-state cleanup for `stop`, `new`, and `reset`.
|
||||
|
||||
## Testing plan
|
||||
|
||||
Keep the existing timeout repro coverage in:
|
||||
|
||||
- `src/discord/monitor/message-handler.queue.test.ts`
|
||||
|
||||
Add new tests for:
|
||||
|
||||
1. listener returns after enqueue without awaiting full turn
|
||||
2. per-route ordering is preserved
|
||||
3. different channels still run concurrently
|
||||
4. replies are delivered to the original message destination
|
||||
5. `stop` cancels the active worker-owned run
|
||||
6. worker failure produces visible diagnostics without blocking later jobs
|
||||
7. ACP-bound Discord channels still route correctly under worker execution
|
||||
|
||||
## Risks and mitigations
|
||||
|
||||
- Risk: command semantics drift from current synchronous behavior
|
||||
Mitigation: land command-state plumbing in the same cutover, not later
|
||||
|
||||
- Risk: reply delivery loses thread or reply-to context
|
||||
Mitigation: make delivery identity first-class in `DiscordInboundJob`
|
||||
|
||||
- Risk: duplicate sends during retries or queue restarts
|
||||
Mitigation: keep first pass in-memory only, or add explicit delivery idempotency before persistence
|
||||
|
||||
- Risk: `message-handler.process.ts` becomes harder to reason about during migration
|
||||
Mitigation: split into normalization, execution, and delivery helpers before or during worker cutover
|
||||
|
||||
## Acceptance criteria
|
||||
|
||||
The plan is complete when:
|
||||
|
||||
1. Discord listener timeout no longer aborts healthy long-running turns.
|
||||
2. Listener lifetime and agent-turn lifetime are separate concepts in code.
|
||||
3. Existing per-session ordering is preserved.
|
||||
4. ACP-bound Discord channels work through the same worker path.
|
||||
5. `stop` targets the worker-owned run instead of the old listener-owned call stack.
|
||||
6. Timeout and delivery failures become explicit worker outcomes, not silent listener drops.
|
||||
|
||||
## Remaining landing strategy
|
||||
|
||||
Finish this in follow-up PRs:
|
||||
|
||||
1. make `DiscordInboundJob` plain-data only and move live runtime refs onto the worker
|
||||
2. clean up command-state ownership for `stop`, `new`, and `reset`
|
||||
3. add worker observability and operator status
|
||||
4. decide whether durability is needed or explicitly document the in-memory boundary
|
||||
|
||||
This is still a bounded follow-up if kept Discord-only and if we continue to avoid a premature cross-channel worker abstraction.
|
||||
Reference in New Issue
Block a user