feat(providers): add streaming stt providers

Peter Steinberger
2026-04-23 03:05:44 +01:00
parent 5b68092351
commit 51ed22e608
32 changed files with 2399 additions and 16 deletions


@@ -2,18 +2,22 @@
summary: "Deepgram transcription for inbound voice notes"
read_when:
- You want Deepgram speech-to-text for audio attachments
- You want Deepgram streaming transcription for Voice Call
- You need a quick Deepgram config example
title: "Deepgram"
---
# Deepgram (Audio Transcription)
Deepgram is a speech-to-text API. In OpenClaw it is used for **inbound audio/voice note
transcription** via `tools.media.audio`.
Deepgram is a speech-to-text API. In OpenClaw it is used for inbound
audio/voice-note transcription through `tools.media.audio` and for Voice Call
streaming STT through `plugins.entries.voice-call.config.streaming`.
When enabled, OpenClaw uploads the audio file to Deepgram and injects the transcript
into the reply pipeline (`{{Transcript}}` + `[Audio]` block). This is **not streaming**;
it uses the pre-recorded transcription endpoint.
For batch transcription, OpenClaw uploads the complete audio file to Deepgram
and injects the transcript into the reply pipeline (`{{Transcript}}` +
`[Audio]` block). For Voice Call streaming, OpenClaw forwards live G.711
u-law frames over Deepgram's WebSocket `listen` endpoint and emits partial or
final transcripts as Deepgram returns them.
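The partial/final distinction can be sketched as a small message classifier. This is a minimal illustration, not the plugin's actual code: it assumes Deepgram's streaming messages arrive as JSON with `type: "Results"`, an `is_final` flag, and the text under `channel.alternatives[0].transcript`. The `TranscriptEvent` type and `toTranscriptEvent` function are hypothetical names used only for this example.

```typescript
// Hypothetical event shape for this sketch; not an OpenClaw type.
type TranscriptEvent = { kind: "partial" | "final"; text: string };

// Turn one raw WebSocket text message into a transcript event.
// Returns null for non-transcript messages, invalid JSON, or empty text.
function toTranscriptEvent(raw: string): TranscriptEvent | null {
  let msg: any; // loosely typed: the payload shape is an assumption
  try {
    msg = JSON.parse(raw);
  } catch {
    return null;
  }
  if (msg?.type !== "Results") return null;

  const text = msg.channel?.alternatives?.[0]?.transcript;
  if (typeof text !== "string" || text.length === 0) return null;

  // Interim hypotheses have is_final === false; settled segments have true.
  return { kind: msg.is_final === true ? "final" : "partial", text };
}
```

A caller would typically display `partial` events as live feedback and pass only `final` events on to the reply pipeline.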
| Detail | Value |
| ------------- | ---------------------------------------------------------- |
@@ -101,6 +105,52 @@ it uses the pre-recorded transcription endpoint.
</Tab>
</Tabs>
## Voice Call streaming STT
The bundled `deepgram` plugin also registers a realtime transcription provider
for the Voice Call plugin.
| Setting | Config path | Default |
| --------------- | ----------------------------------------------------------------------- | -------------------------------- |
| API key | `plugins.entries.voice-call.config.streaming.providers.deepgram.apiKey` | Falls back to `DEEPGRAM_API_KEY` |
| Model | `...deepgram.model` | `nova-3` |
| Language | `...deepgram.language` | (unset) |
| Encoding | `...deepgram.encoding` | `mulaw` |
| Sample rate | `...deepgram.sampleRate` | `8000` |
| Endpointing | `...deepgram.endpointingMs` | `800` |
| Interim results | `...deepgram.interimResults` | `true` |
```json5
{
plugins: {
entries: {
"voice-call": {
config: {
streaming: {
enabled: true,
provider: "deepgram",
providers: {
deepgram: {
apiKey: "${DEEPGRAM_API_KEY}",
model: "nova-3",
endpointingMs: 800,
language: "en-US",
},
},
},
},
},
},
},
}
```
<Note>
Voice Call receives telephony audio as 8 kHz G.711 u-law. The Deepgram
streaming provider defaults to `encoding: "mulaw"` and `sampleRate: 8000`, so
Twilio media frames can be forwarded directly.
</Note>
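To illustrate how the settings above could translate into a Deepgram streaming request, here is a rough sketch that builds the WebSocket URL for the `listen` endpoint. The `DeepgramStreamingConfig` interface and `buildListenUrl` function are hypothetical names for this example, and the mapping from config keys to Deepgram query parameters (`sample_rate`, `endpointing`, `interim_results`) is an assumption about how the provider works, not a description of its actual implementation. The defaults are taken from the settings table.

```typescript
// Hypothetical config shape mirroring the settings table; not an OpenClaw type.
interface DeepgramStreamingConfig {
  model?: string;
  language?: string;
  encoding?: string;
  sampleRate?: number;
  endpointingMs?: number;
  interimResults?: boolean;
}

// Build a URL for Deepgram's WebSocket `listen` endpoint.
// Defaults follow the settings table; `language` is omitted when unset.
function buildListenUrl(config: DeepgramStreamingConfig = {}): string {
  const url = new URL("wss://api.deepgram.com/v1/listen");
  const params = url.searchParams;

  params.set("model", config.model ?? "nova-3");
  params.set("encoding", config.encoding ?? "mulaw");
  params.set("sample_rate", String(config.sampleRate ?? 8000));
  params.set("endpointing", String(config.endpointingMs ?? 800));
  params.set("interim_results", String(config.interimResults ?? true));
  if (config.language) {
    params.set("language", config.language);
  }
  return url.toString();
}
```

With the defaults, one second of telephony audio is 8000 bytes (8000 samples per second at one byte per u-law sample), so no resampling or re-encoding step is needed before the frames are sent.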
## Notes
<AccordionGroup>
@@ -118,12 +168,6 @@ it uses the pre-recorded transcription endpoint.
</Accordion>
</AccordionGroup>
<Note>
Deepgram transcription is **pre-recorded only** (not real-time streaming). OpenClaw
uploads the complete audio file and waits for the full transcript before injecting
it into the conversation.
</Note>
## Related
<CardGroup cols={2}>