Mirror of https://github.com/openclaw/openclaw.git (synced 2026-05-06 14:40:43 +00:00)
feat(providers): add streaming stt providers
@@ -2,18 +2,22 @@
 summary: "Deepgram transcription for inbound voice notes"
 read_when:
 - You want Deepgram speech-to-text for audio attachments
+- You want Deepgram streaming transcription for Voice Call
 - You need a quick Deepgram config example
 title: "Deepgram"
 ---
 
 # Deepgram (Audio Transcription)
 
-Deepgram is a speech-to-text API. In OpenClaw it is used for **inbound audio/voice note
-transcription** via `tools.media.audio`.
+Deepgram is a speech-to-text API. In OpenClaw it is used for inbound
+audio/voice-note transcription through `tools.media.audio` and for Voice Call
+streaming STT through `plugins.entries.voice-call.config.streaming`.
 
-When enabled, OpenClaw uploads the audio file to Deepgram and injects the transcript
-into the reply pipeline (`{{Transcript}}` + `[Audio]` block). This is **not streaming**;
-it uses the pre-recorded transcription endpoint.
+For batch transcription, OpenClaw uploads the complete audio file to Deepgram
+and injects the transcript into the reply pipeline (`{{Transcript}}` +
+`[Audio]` block). For Voice Call streaming, OpenClaw forwards live G.711
+u-law frames over Deepgram's WebSocket `listen` endpoint and emits partial or
+final transcripts as Deepgram returns them.
 
 | Detail | Value |
 | ------------- | ---------------------------------------------------------- |
@@ -101,6 +105,52 @@ it uses the pre-recorded transcription endpoint.
 </Tab>
 </Tabs>
 
+## Voice Call streaming STT
+
+The bundled `deepgram` plugin also registers a realtime transcription provider
+for the Voice Call plugin.
+
+| Setting | Config path | Default |
+| --------------- | ----------------------------------------------------------------------- | -------------------------------- |
+| API key | `plugins.entries.voice-call.config.streaming.providers.deepgram.apiKey` | Falls back to `DEEPGRAM_API_KEY` |
+| Model | `...deepgram.model` | `nova-3` |
+| Language | `...deepgram.language` | (unset) |
+| Encoding | `...deepgram.encoding` | `mulaw` |
+| Sample rate | `...deepgram.sampleRate` | `8000` |
+| Endpointing | `...deepgram.endpointingMs` | `800` |
+| Interim results | `...deepgram.interimResults` | `true` |
+
+```json5
+{
+  plugins: {
+    entries: {
+      "voice-call": {
+        config: {
+          streaming: {
+            enabled: true,
+            provider: "deepgram",
+            providers: {
+              deepgram: {
+                apiKey: "${DEEPGRAM_API_KEY}",
+                model: "nova-3",
+                endpointingMs: 800,
+                language: "en-US",
+              },
+            },
+          },
+        },
+      },
+    },
+  },
+}
+```
+
+<Note>
+Voice Call receives telephony audio as 8 kHz G.711 u-law. The Deepgram
+streaming provider defaults to `encoding: "mulaw"` and `sampleRate: 8000`, so
+Twilio media frames can be forwarded directly.
+</Note>
+
 ## Notes
 
 <AccordionGroup>
@@ -118,12 +168,6 @@ it uses the pre-recorded transcription endpoint.
 </Accordion>
 </AccordionGroup>
 
-<Note>
-Deepgram transcription is **pre-recorded only** (not real-time streaming). OpenClaw
-uploads the complete audio file and waits for the full transcript before injecting
-it into the conversation.
-</Note>
-
 ## Related
 
 <CardGroup cols={2}>
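
The streaming settings documented in this commit line up with query parameters of Deepgram's live `listen` WebSocket endpoint. The sketch below shows one plausible way the documented defaults could be turned into a connection URL. It is not taken from the OpenClaw source: the `DeepgramStreamingConfig` interface and the `buildDeepgramListenUrl` helper are hypothetical names used only for illustration, and the real provider may assemble its request differently.

```typescript
// Hypothetical shape mirroring the config keys in the settings table.
interface DeepgramStreamingConfig {
  apiKey?: string;
  model?: string;
  language?: string;
  encoding?: string;
  sampleRate?: number;
  endpointingMs?: number;
  interimResults?: boolean;
}

// Hypothetical helper: applies the documented defaults and maps each
// setting to the corresponding Deepgram query parameter.
function buildDeepgramListenUrl(cfg: DeepgramStreamingConfig = {}): string {
  const params = new URLSearchParams({
    model: cfg.model ?? "nova-3",
    encoding: cfg.encoding ?? "mulaw",
    sample_rate: String(cfg.sampleRate ?? 8000),
    endpointing: String(cfg.endpointingMs ?? 800),
    interim_results: String(cfg.interimResults ?? true),
  });
  // Language is unset by default, so only send it when configured.
  if (cfg.language) {
    params.set("language", cfg.language);
  }
  return `wss://api.deepgram.com/v1/listen?${params.toString()}`;
}

// Example: the config from the json5 snippet in this commit.
const url = buildDeepgramListenUrl({
  model: "nova-3",
  endpointingMs: 800,
  language: "en-US",
});
console.log(url);
```

The API key is deliberately kept out of the URL here; Deepgram accepts it in an `Authorization: Token <key>` header on the WebSocket handshake, which a caller would add when opening the connection.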