fix: stabilize Google Meet realtime talkback

Peter Steinberger
2026-05-04 01:23:24 +01:00
parent c42a349b42
commit b13e9f1864
20 changed files with 633 additions and 43 deletions


@@ -190,7 +190,7 @@ then share the returned `meetingUri`.
```
For an observe-only/browser-control join, set `"mode": "transcribe"`. That does
not start the duplex realtime voice bridge, does not require BlackHole or SoX,
and will not talk back into the meeting. Chrome joins in this mode also skip
OpenClaw's microphone/camera permission grant and the Meet **Use
microphone** path. If Meet shows an audio-choice interstitial, automation tries
@@ -1027,6 +1027,12 @@ Defaults:
interruption on `chrome.bargeInInputCommand`
- `chrome.bargeInCooldownMs: 900`: minimum delay between repeated human
interruption clears
- `realtime.strategy: "agent"`: default. Participant speech is transcribed,
sent to the configured OpenClaw agent in a per-meeting sub-agent session, and
the returned answer is spoken back through the realtime provider.
- `realtime.strategy: "bidi"`: direct bidirectional realtime model mode. The
realtime provider answers participant speech directly and may call
`openclaw_agent_consult` for deeper/tool-backed answers.
- `realtime.provider: "openai"`
- `realtime.toolPolicy: "safe-read-only"`
- `realtime.instructions`: brief spoken replies, with
@@ -1072,6 +1078,7 @@ Optional overrides:
node: "parallels-macos",
},
realtime: {
strategy: "agent",
provider: "google",
agentId: "jay",
toolPolicy: "owner",
@@ -1124,7 +1131,10 @@ Agents can use the `google_meet` tool:
Use `transport: "chrome"` when Chrome runs on the Gateway host. Use
`transport: "chrome-node"` when Chrome runs on a paired node such as a Parallels
VM. In both cases the realtime model and `openclaw_agent_consult` run on the
Gateway host, so model credentials stay there. With the default
`realtime.strategy: "agent"`, the realtime provider handles audio and
transcription while the configured OpenClaw agent produces the spoken answer.
With `realtime.strategy: "bidi"`, the realtime model answers directly.
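As a rough sketch of the two transport choices (the exact nesting around `transport` is an assumption; `node` and `realtime` mirror the override example earlier on this page), a Gateway-side lane config for a paired-node Chrome might look like:

```json5
// Sketch only: `transport` placement is assumed, not confirmed by this page.
{
  transport: "chrome-node", // Chrome runs on a paired node (e.g. a Parallels VM)
  node: "parallels-macos",
  realtime: {
    strategy: "agent", // default: the configured agent answers, the provider speaks
  },
}
```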
Use `action: "status"` to list active sessions or inspect a session ID. Use
`action: "speak"` with `sessionId` and `message` to make the realtime agent
@@ -1149,6 +1159,8 @@ a session ended.
not send the intro/test phrase into the audio bridge.
- `providerConnected` / `realtimeReady`: realtime voice bridge state
- `lastInputAt` / `lastOutputAt`: last audio seen from or sent to the bridge
- `audioOutputRouted` / `audioOutputDeviceLabel`: whether the Meet tab's media
output was actively routed to the BlackHole device used by the bridge
- `lastSuppressedInputAt` / `suppressedInputBytes`: loopback input ignored while
assistant playback is active.
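Taken together, the fields above suggest a status payload roughly shaped like the following (illustrative only: the values and the envelope are hypothetical, and only the field names come from this page):

```json5
// Hypothetical status snapshot; field names from the docs, values invented.
{
  providerConnected: true,
  realtimeReady: true,
  lastInputAt: "2026-05-04T00:21:07Z",
  lastOutputAt: "2026-05-04T00:21:09Z",
  audioOutputRouted: true,
  audioOutputDeviceLabel: "BlackHole 2ch",
  lastSuppressedInputAt: "2026-05-04T00:21:09Z",
  suppressedInputBytes: 48000,
}
```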
@@ -1164,8 +1176,20 @@ a session ended.
Chrome realtime mode is optimized for a live voice loop. The realtime voice
provider hears the meeting audio and speaks through the configured audio bridge.
The default `realtime.strategy: "agent"` uses the realtime provider for audio
I/O and transcription, but routes final participant transcripts through the
configured OpenClaw agent before speaking. Set `realtime.strategy: "bidi"` when
you want the realtime model to answer directly.
Nearby final transcript fragments are coalesced before the consult so one spoken
turn does not produce several stale partial answers.
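The coalescing described above can be sketched as a quiet-gap merge over timestamped fragments. This is an illustrative model, not OpenClaw's actual code, and the 600 ms gap is an assumed threshold:

```typescript
// Illustrative sketch: merge final transcript fragments whose timestamps
// fall within a quiet-gap window, so one spoken turn yields one consult.
const QUIET_GAP_MS = 600; // assumed; the real threshold is not documented here

interface Fragment {
  text: string;
  atMs: number; // arrival time of the final transcript fragment
}

function coalesceFragments(
  fragments: Fragment[],
  gapMs: number = QUIET_GAP_MS,
): string[] {
  const turns: string[] = [];
  let current: string[] = [];
  let lastAt = -Infinity;
  for (const f of fragments) {
    // A gap longer than the window closes the previous spoken turn.
    if (f.atMs - lastAt > gapMs && current.length > 0) {
      turns.push(current.join(" "));
      current = [];
    }
    current.push(f.text.trim());
    lastAt = f.atMs;
  }
  if (current.length > 0) turns.push(current.join(" "));
  return turns;
}
```

Each element of the result is one coalesced turn, so a single consult runs per turn instead of one per partial fragment.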

| Strategy | Who decides the answer | Context behavior | Use when |
| -------- | ----------------------------- | ------------------------------------------------------------------------------------ | ----------------------------------------------------- |
| `agent` | The configured OpenClaw agent | Per-meeting sub-agent session plus normal agent policy, tools, workspace, and memory | You want "my agent is in the meeting" behavior |
| `bidi` | The realtime voice model | Realtime session context, with optional `openclaw_agent_consult` calls | You want the lowest-latency conversational voice loop |

In `bidi` strategy, when the realtime model needs deeper reasoning, current
information, or normal OpenClaw tools, it can call `openclaw_agent_consult`.
The consult tool runs the regular OpenClaw agent behind the scenes with recent
meeting transcript context and returns a concise spoken answer to the realtime
@@ -1176,6 +1200,10 @@ By default, consults run against the `main` agent. Set `realtime.agentId` when a
Meet lane should consult a dedicated OpenClaw agent workspace, model defaults,
tool policy, memory, and session history.
Agent strategy consults use a per-meeting `agent:<id>:subagent:google-meet:<session>`
session key so follow-up questions keep meeting context while inheriting normal
agent policy from the configured agent.
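With the `agentId: "jay"` override from the earlier example, such a key could look like this (the trailing session segment is hypothetical):

```
agent:jay:subagent:google-meet:meet-abc123
```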
`realtime.toolPolicy` controls the consult run:
- `safe-read-only`: expose the consult tool and limit the regular agent to
@@ -1414,7 +1442,8 @@ Also verify:
- `BlackHole 2ch` is visible on the Chrome host.
- `sox` exists on the Chrome host.
- Meet microphone and speaker are routed through the virtual audio path used by
OpenClaw. `doctor` should show `meet output routed: yes` for local Chrome
realtime joins.
`googlemeet doctor [session-id]` prints the session, node, in-call state,
manual action reason, realtime provider connection, `realtimeReady`, audio
@@ -1578,7 +1607,7 @@ phone dial-in participation.
Chrome realtime mode needs `BlackHole 2ch` plus either:
- `chrome.audioInputCommand` plus `chrome.audioOutputCommand`: OpenClaw owns the
realtime voice bridge and pipes audio in `chrome.audioFormat` between those
commands and the selected realtime voice provider. The default Chrome path is
24 kHz PCM16; 8 kHz G.711 mu-law remains available for legacy command pairs.
- `chrome.audioBridgeCommand`: an external bridge command owns the whole local