feat: default discord voice to agent proxy

This commit is contained in:
Peter Steinberger
2026-05-09 12:36:31 +01:00
parent 9859c23bad
commit eb200e369c
10 changed files with 95 additions and 84 deletions

View File

@@ -48,6 +48,7 @@ Docs: https://docs.openclaw.ai
- Discord/voice: add realtime `/vc` modes so Discord voice channels can run as STT/TTS, a realtime talk buffer with the OpenClaw agent brain, or a bidi realtime session with `openclaw_agent_consult`.
- Discord/voice: add bounded realtime gateway logs for voice channel joins, realtime model/voice selection, transcripts, consult routing/answers, and playback start; allow OpenAI realtime Discord sessions to disable input-triggered response interruption in echo-heavy rooms while keeping explicit Discord barge-in available for new and already-active speakers; and allow voice turns to target an existing Discord channel agent session.
- Discord/voice: add `voice.realtime.minBargeInAudioEndMs` and let the realtime provider own playback clearing, so speaker echo no longer cuts OpenAI realtime model audio at `audioEndMs=0` while low-echo rooms can opt back into immediate barge-in with `0`.
- Discord/voice: make `agent-proxy` the default voice mode so realtime voice acts as the microphone/speaker extension of the routed OpenClaw agent session, with `stt-tts` remaining available as an explicit fallback.
- Discord/voice: keep OpenAI realtime bidi consults quiet while the supervisor agent is still working, accept Codex-style `conversation.item.done` function-call events, and preserve continuing tool results through the gateway relay so the OpenAI realtime bridge reliably routes consults before speaking the final answer.
- Discord/voice: include a bounded one-line STT transcript preview in verbose voice logs so live voice debugging shows what speakers said before the agent reply.
- Codex app-server: pin the managed Codex harness and Codex CLI smoke package to `@openai/codex@0.129.0`, defer OpenClaw integration dynamic tools behind Codex tool search by default, and accept current Codex service-tier values so legacy `fast` settings survive the stable harness upgrade as `priority`.
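The `minBargeInAudioEndMs` entry above can be sketched as a config fragment (structure per the docs examples in this commit; the `250` value matches the shipped `DISCORD_REALTIME_DEFAULT_MIN_BARGE_IN_AUDIO_END_MS` default):

```json5
{
  channels: {
    discord: {
      voice: {
        enabled: true,
        realtime: {
          provider: "openai",
          // Ignore speaker-start interrupts that arrive within the first
          // 250 ms of model audio, so speaker echo cannot cut playback at
          // audioEndMs=0. Low-echo rooms can set 0 to restore immediate
          // barge-in.
          minBargeInAudioEndMs: 250,
        },
      },
    },
  },
}
```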

View File

@@ -1,4 +1,4 @@
632c00a35e0ed2413604ff28f5b4df0718131492208863c5d39576d76a9b7c88 config-baseline.json
0c9cdf45265ecb198f11fdc0fee838019f21020272d910596a5202c8074f573d config-baseline.json
7ac9eadabe0119deba4418dbaadc478092fa32617fab3f9618e0a14210720e4b config-baseline.core.json
42264b147fb29e0ba7017b4ec018a0793bb9cd23e58bf5fb796d6b33bf9ca829 config-baseline.channel.json
7d3b4153a6eda2e83c32a0063d2ae179d16c3ce1e3b81a74a9b752fe418831ba config-baseline.channel.json
df93bfde8e3de8d6f80dbf1b0ae43ad250f216f2fc0244c5d9a19afca50806f6 config-baseline.plugin.json

View File

@@ -1172,8 +1172,7 @@ Auto-join example:
discord: {
voice: {
enabled: true,
mode: "stt-tts",
model: "openai/gpt-5.4-mini",
model: "openai-codex/gpt-5.5",
autoJoin: [
{
guildId: "123456789012345678",
@@ -1184,12 +1183,10 @@ Auto-join example:
decryptionFailureTolerance: 24,
connectTimeoutMs: 30000,
reconnectGraceMs: 15000,
tts: {
realtime: {
provider: "openai",
openai: {
model: "gpt-4o-mini-tts",
voice: "cedar",
},
model: "gpt-realtime-2",
voice: "cedar",
},
},
},
@@ -1199,10 +1196,11 @@ Auto-join example:
Notes:
- `voice.tts` overrides `messages.tts` for voice playback only.
- `voice.mode` controls the conversation path: `stt-tts` keeps the existing batch STT plus TTS flow, `talk-buffer` uses a realtime voice shell for turn timing/transcription/playback while the OpenClaw agent produces the answer, and `bidi` lets the realtime model converse directly while exposing `openclaw_agent_consult` for the OpenClaw brain.
- `voice.tts` overrides `messages.tts` for `stt-tts` voice playback only. Realtime modes use `voice.realtime.voice`.
- `voice.mode` controls the conversation path. The default is `agent-proxy`: a realtime voice shell handles turn timing, transcription, interruption, and playback, while the routed OpenClaw agent produces the answer with the same session/tool permissions as a typed Discord prompt from that speaker. `stt-tts` keeps the older batch STT plus TTS flow. `bidi` lets the realtime model converse directly while exposing `openclaw_agent_consult` for the OpenClaw brain.
- `voice.agentSession` controls which OpenClaw conversation receives voice turns. Leave it unset for the voice channel's own session, or set `{ mode: "target", target: "channel:<text-channel-id>" }` to make the voice channel act as the microphone/speaker extension of an existing Discord text channel session such as `#maintainers`.
- `voice.model` overrides the OpenClaw agent brain for Discord voice responses and realtime consults. Leave it unset to inherit the routed agent model. It is separate from `voice.realtime.model`.
- `agent-proxy` routes speech through `discord-voice`, which preserves normal owner/tool authorization for the speaker and target session but hides the agent `tts` tool because Discord voice owns playback.
- In `stt-tts` mode, STT uses `tools.media.audio`; `voice.model` does not affect transcription.
- In realtime modes, `voice.realtime.provider`, `voice.realtime.model`, and `voice.realtime.voice` configure the realtime audio session. For OpenAI Realtime 2 plus the Codex brain, use `voice.realtime.model: "gpt-realtime-2"` and `voice.model: "openai-codex/gpt-5.5"`.
- `voice.realtime.bargeIn` controls whether Discord speaker-start events interrupt active realtime playback. If unset, it follows the realtime provider's input-audio interruption setting.
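The `voice.realtime.bargeIn` note can be sketched as a config fragment (assuming a boolean flag, as the note implies; omit it to inherit the provider's input-audio interruption setting):

```json5
{
  channels: {
    discord: {
      voice: {
        enabled: true,
        realtime: {
          provider: "openai",
          model: "gpt-realtime-2",
          voice: "cedar",
          // Let Discord speaker-start events interrupt active realtime
          // playback even when input-triggered response interruption is
          // disabled for echo-heavy rooms.
          bargeIn: true,
        },
      },
    },
  },
}
```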
@@ -1234,7 +1232,29 @@ STT plus TTS pipeline:
- `voice.model`, when set, overrides only the response LLM for this voice-channel turn.
- `voice.tts` is merged over `messages.tts`; streaming-capable providers feed the player directly, otherwise the resulting audio file is played in the joined channel.
Default voice-channel session example:
Default agent-proxy voice-channel session example:
```json5
{
channels: {
discord: {
voice: {
enabled: true,
model: "openai-codex/gpt-5.5",
realtime: {
provider: "openai",
model: "gpt-realtime-2",
voice: "cedar",
},
},
},
},
}
```
With no `voice.agentSession` block, each voice channel gets its own routed OpenClaw session. For example, `/vc join channel:234567890123456789` talks to the session for that Discord voice channel.
Legacy STT plus TTS example:
```json5
{
@@ -1244,28 +1264,12 @@ Default voice-channel session example:
enabled: true,
mode: "stt-tts",
model: "openai/gpt-5.4-mini",
},
},
},
}
```
With no `voice.agentSession` block, each voice channel gets its own routed OpenClaw session. For example, `/vc join channel:234567890123456789` talks to the session for that Discord voice channel.
Realtime talk-buffer example:
```json5
{
channels: {
discord: {
voice: {
enabled: true,
mode: "talk-buffer",
model: "openai-codex/gpt-5.5",
realtime: {
tts: {
provider: "openai",
model: "gpt-realtime-2",
voice: "cedar",
openai: {
model: "gpt-4o-mini-tts",
voice: "cedar",
},
},
},
},
@@ -1304,7 +1308,7 @@ Voice as an extension of an existing Discord channel session:
discord: {
voice: {
enabled: true,
mode: "bidi",
mode: "agent-proxy",
model: "openai-codex/gpt-5.5",
agentSession: {
mode: "target",
@@ -1314,8 +1318,6 @@ Voice as an extension of an existing Discord channel session:
provider: "openai",
model: "gpt-realtime-2",
voice: "cedar",
toolPolicy: "safe-read-only",
consultPolicy: "always",
},
},
},
@@ -1323,7 +1325,7 @@ Voice as an extension of an existing Discord channel session:
}
```
In this mode the bot joins the configured voice channel, but OpenClaw agent turns use the target channel's normal routed session and agent. The realtime voice session speaks the returned result back into the voice channel. The supervisor agent can still use normal message tools according to its tool policy, including sending a separate Discord message if that is the right action.
In `agent-proxy` mode the bot joins the configured voice channel, but OpenClaw agent turns use the target channel's normal routed session and agent. The realtime voice session speaks the returned result back into the voice channel. The supervisor agent can still use normal message tools according to its tool policy, including sending a separate Discord message if that is the right action.
Useful target forms:

View File

@@ -166,7 +166,7 @@ describe("discord config schema", () => {
it("accepts Discord realtime voice modes", () => {
const cfg = expectValidDiscordConfig({
voice: {
mode: "bidi",
mode: "agent-proxy",
model: "openai-codex/gpt-5.5",
realtime: {
provider: "openai",
@@ -186,7 +186,7 @@ describe("discord config schema", () => {
},
});
expect(cfg.voice?.mode).toBe("bidi");
expect(cfg.voice?.mode).toBe("agent-proxy");
expect(cfg.voice?.model).toBe("openai-codex/gpt-5.5");
expect(cfg.voice?.realtime?.provider).toBe("openai");
expect(cfg.voice?.realtime?.model).toBe("gpt-realtime-2");
@@ -200,11 +200,12 @@ describe("discord config schema", () => {
it("rejects invalid Discord realtime voice modes", () => {
for (const voice of [
{ mode: "realtime" },
{ mode: "talk-buffer" },
{ mode: "bidi", realtime: { toolPolicy: "dangerous" } },
{ mode: "talk-buffer", realtime: { consultPolicy: "substantive" } },
{ mode: "talk-buffer", realtime: { debounceMs: 10_001 } },
{ mode: "talk-buffer", realtime: { minBargeInAudioEndMs: -1 } },
{ mode: "talk-buffer", realtime: { minBargeInAudioEndMs: 10_001 } },
{ mode: "agent-proxy", realtime: { consultPolicy: "substantive" } },
{ mode: "agent-proxy", realtime: { debounceMs: 10_001 } },
{ mode: "agent-proxy", realtime: { minBargeInAudioEndMs: -1 } },
{ mode: "agent-proxy", realtime: { minBargeInAudioEndMs: 10_001 } },
{ agentSession: { mode: "target" } },
]) {
expectInvalidDiscordConfig({ voice });

View File

@@ -183,7 +183,7 @@ export const discordChannelConfigUiHints = {
},
"voice.mode": {
label: "Discord Voice Mode",
help: "Conversation mode: stt-tts uses batch speech-to-text plus TTS, talk-buffer uses a realtime voice shell with the OpenClaw agent as the brain, and bidi lets the realtime provider converse directly with the OpenClaw consult tool.",
help: "Conversation mode: agent-proxy (default) uses realtime voice as the microphone/speaker for the routed OpenClaw agent, stt-tts uses batch speech-to-text plus TTS, and bidi lets the realtime provider converse directly with the OpenClaw consult tool.",
},
"voice.agentSession": {
label: "Discord Voice Agent Session",
@@ -195,7 +195,7 @@ export const discordChannelConfigUiHints = {
},
"voice.realtime.provider": {
label: "Discord Realtime Provider",
help: "Realtime voice provider for talk-buffer or bidi Discord voice modes, such as openai.",
help: "Realtime voice provider for agent-proxy or bidi Discord voice modes, such as openai.",
},
"voice.realtime.model": {
label: "Discord Realtime Model",

View File

@@ -302,7 +302,7 @@ describe("DiscordVoiceManager", () => {
const createManager = (
discordConfig: ConstructorParameters<
typeof managerModule.DiscordVoiceManager
>[0]["discordConfig"] = { voice: { enabled: true } },
>[0]["discordConfig"] = { voice: { enabled: true, mode: "stt-tts" } },
clientOverride?: ReturnType<typeof createClient>,
cfgOverride: ConstructorParameters<typeof managerModule.DiscordVoiceManager>[0]["cfg"] = {},
) =>
@@ -805,7 +805,7 @@ describe("DiscordVoiceManager", () => {
groupPolicy: "open",
voice: {
enabled: true,
mode: "talk-buffer",
mode: "agent-proxy",
realtime: { provider: "openai" },
},
});
@@ -832,7 +832,7 @@ describe("DiscordVoiceManager", () => {
groupPolicy: "open",
voice: {
enabled: true,
mode: "talk-buffer",
mode: "agent-proxy",
realtime: { provider: "openai" },
},
});
@@ -848,13 +848,12 @@ describe("DiscordVoiceManager", () => {
expect(manager.status()).toStrictEqual([]);
});
it("starts Discord realtime voice in talk-buffer mode", async () => {
agentCommandMock.mockResolvedValueOnce({ payloads: [{ text: "buffered brain answer" }] });
it("uses agent-proxy realtime voice by default", async () => {
agentCommandMock.mockResolvedValueOnce({ payloads: [{ text: "agent proxy answer" }] });
const manager = createManager({
groupPolicy: "open",
voice: {
enabled: true,
mode: "talk-buffer",
model: "openai-codex/gpt-5.5",
realtime: {
provider: "openai",
@@ -907,11 +906,12 @@ describe("DiscordVoiceManager", () => {
expect.objectContaining({
model: "openai-codex/gpt-5.5",
messageProvider: "discord-voice",
toolsAllow: undefined,
}),
expect.anything(),
);
expect(realtimeSessionMock.sendUserMessage).toHaveBeenCalledWith(
expect.stringContaining("buffered brain answer"),
expect.stringContaining("agent proxy answer"),
);
});
@@ -920,7 +920,7 @@ describe("DiscordVoiceManager", () => {
groupPolicy: "open",
voice: {
enabled: true,
mode: "talk-buffer",
mode: "agent-proxy",
realtime: { provider: "openai" },
},
});
@@ -960,7 +960,7 @@ describe("DiscordVoiceManager", () => {
groupPolicy: "open",
voice: {
enabled: true,
mode: "talk-buffer",
mode: "agent-proxy",
realtime: {
model: "gpt-realtime-2",
voice: "cedar",
@@ -991,13 +991,13 @@ describe("DiscordVoiceManager", () => {
);
});
it("keeps talk-buffer realtime transcripts on the audio turn speaker context", async () => {
it("keeps agent-proxy realtime transcripts on the audio turn speaker context", async () => {
agentCommandMock.mockResolvedValueOnce({ payloads: [{ text: "non-owner answer" }] });
const manager = createManager({
groupPolicy: "open",
voice: {
enabled: true,
mode: "talk-buffer",
mode: "agent-proxy",
realtime: { provider: "openai", debounceMs: 1 },
},
});
@@ -1040,13 +1040,13 @@ describe("DiscordVoiceManager", () => {
);
});
it("expires closed talk-buffer turns before later speaker audio", async () => {
it("expires closed agent-proxy turns before later speaker audio", async () => {
agentCommandMock.mockResolvedValueOnce({ payloads: [{ text: "guest answer" }] });
const manager = createManager({
groupPolicy: "open",
voice: {
enabled: true,
mode: "talk-buffer",
mode: "agent-proxy",
realtime: { provider: "openai", debounceMs: 1 },
},
});
@@ -1528,7 +1528,7 @@ describe("DiscordVoiceManager", () => {
allowFrom: ["discord:u-speaker"],
voice: {
enabled: true,
mode: "talk-buffer",
mode: "agent-proxy",
realtime: { provider: "openai" },
},
});

View File

@@ -43,7 +43,7 @@ const DISCORD_REALTIME_PENDING_SPEAKER_CONTEXT_LIMIT = 32;
const DISCORD_REALTIME_LOG_PREVIEW_CHARS = 500;
const DISCORD_REALTIME_DEFAULT_MIN_BARGE_IN_AUDIO_END_MS = 250;
export type DiscordVoiceMode = "stt-tts" | "talk-buffer" | "bidi";
export type DiscordVoiceMode = "stt-tts" | "agent-proxy" | "bidi";
type DiscordRealtimeSpeakerContext = VoiceRealtimeSpeakerContext & { userId: string };
@@ -112,11 +112,18 @@ function readProviderConfigBoolean(
export function resolveDiscordVoiceMode(voice: DiscordAccountConfig["voice"]): DiscordVoiceMode {
const mode = voice?.mode;
return mode === "talk-buffer" || mode === "bidi" ? mode : "stt-tts";
if (mode === "stt-tts" || mode === "bidi") {
return mode;
}
return "agent-proxy";
}
export function isDiscordRealtimeVoiceMode(mode: DiscordVoiceMode): boolean {
return mode === "talk-buffer" || mode === "bidi";
return mode === "agent-proxy" || mode === "bidi";
}
function isDiscordAgentProxyVoiceMode(mode: DiscordVoiceMode): boolean {
return mode === "agent-proxy";
}
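The default-resolution behavior above can be illustrated with a standalone restatement (hypothetical simplified signature taking the raw mode string; the real resolver accepts the full `DiscordAccountConfig["voice"]` object):

```typescript
type DiscordVoiceMode = "stt-tts" | "agent-proxy" | "bidi";

// Sketch of the resolver's fallback logic: explicit stt-tts and bidi are
// honored; anything else — unset config or the removed "talk-buffer"
// value — resolves to the new agent-proxy default.
function resolveMode(mode?: string): DiscordVoiceMode {
  if (mode === "stt-tts" || mode === "bidi") {
    return mode;
  }
  return "agent-proxy";
}

console.log(resolveMode(undefined)); // agent-proxy
console.log(resolveMode("stt-tts")); // stt-tts
console.log(resolveMode("talk-buffer")); // agent-proxy
```

This is why the schema test in this commit rejects `{ mode: "talk-buffer" }` outright while unset `voice.mode` silently upgrades existing deployments to `agent-proxy`.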
export function resolveDiscordRealtimeInterruptResponseOnInputAudio(params: {
@@ -237,7 +244,7 @@ export class DiscordRealtimeVoiceSession implements VoiceRealtimeSession {
`discord voice: realtime ${role} transcript (${text.length} chars): ${formatRealtimeLogPreview(text)}`,
);
}
if (!isFinal || role !== "user" || this.params.mode !== "talk-buffer") {
if (!isFinal || role !== "user" || !isDiscordAgentProxyVoiceMode(this.params.mode)) {
return;
}
this.talkback.enqueue(text, this.consumePendingSpeakerContext());
@@ -546,10 +553,10 @@ function buildDiscordRealtimeInstructions(params: {
"You are OpenClaw's Discord voice interface.",
"Keep spoken replies concise, natural, and suitable for a live Discord voice channel.",
].join("\n");
if (params.mode === "talk-buffer") {
if (isDiscordAgentProxyVoiceMode(params.mode)) {
return [
base,
"Mode: buffered OpenClaw agent talkback.",
"Mode: OpenClaw agent proxy.",
"Use audio input only to transcribe the speaker. Do not answer user speech by yourself.",
"When OpenClaw sends an exact answer to speak, say only that answer.",
].join("\n\n");

File diff suppressed because one or more lines are too long

View File

@@ -129,7 +129,7 @@ export type DiscordVoiceAutoJoinConfig = {
channelId: string;
};
export type DiscordVoiceMode = "stt-tts" | "talk-buffer" | "bidi";
export type DiscordVoiceMode = "stt-tts" | "agent-proxy" | "bidi";
export type DiscordVoiceRealtimeConsultPolicy = "auto" | "always";
@@ -168,13 +168,13 @@ export type DiscordVoiceAgentSessionConfig = {
export type DiscordVoiceConfig = {
/** Enable Discord voice channel conversations (default: true). */
enabled?: boolean;
/** Voice conversation mode. Default: stt-tts. */
/** Voice conversation mode. Default: agent-proxy. */
mode?: DiscordVoiceMode;
/** Route voice turns through an existing OpenClaw Discord conversation. */
agentSession?: DiscordVoiceAgentSessionConfig;
/** Optional LLM model override for Discord voice channel responses. */
model?: string;
/** Realtime provider settings for talk-buffer or bidi modes. */
/** Realtime provider settings for agent-proxy or bidi modes. */
realtime?: DiscordVoiceRealtimeConfig;
/** Voice channels to auto-join on startup. */
autoJoin?: DiscordVoiceAutoJoinConfig[];

View File

@@ -576,7 +576,7 @@ const DiscordVoiceAgentSessionSchema = z
const DiscordVoiceSchema = z
.object({
enabled: z.boolean().optional(),
mode: z.enum(["stt-tts", "talk-buffer", "bidi"]).optional(),
mode: z.enum(["stt-tts", "agent-proxy", "bidi"]).optional(),
agentSession: DiscordVoiceAgentSessionSchema.optional(),
model: z.string().min(1).optional(),
realtime: DiscordVoiceRealtimeSchema.optional(),