refactor(stt): share transcription helpers

2026-05-06 12:30:44 +00:00 · 2026-04-23 04:29:19 +01:00
parent a58633d809
commit c866820fed
24 changed files with 360 additions and 779 deletions
--- a/docs/plugins/sdk-provider-plugins.md
+++ b/docs/plugins/sdk-provider-plugins.md
@@ -716,6 +716,17 @@ API key auth, and dynamic model resolution.
    as `maxInputImages`, `maxInputVideos`, and `maxDurationSeconds` are not
    enough to advertise transform-mode support or disabled modes cleanly.

+    Prefer the shared WebSocket helper for streaming STT providers. It keeps
+    proxy capture, reconnect backoff, close flushing, ready handshakes, audio
+    queueing, and close-event diagnostics consistent across providers while
+    leaving provider code responsible only for the upstream event mapping.
+
+    Batch STT providers that POST multipart audio should use
+    `buildAudioTranscriptionFormData(...)` from
+    `openclaw/plugin-sdk/provider-http` together with the provider HTTP request
+    helpers. The form helper normalizes upload filenames, including AAC uploads
+    that need an M4A-style filename for compatible transcription APIs.
+
    Music-generation providers should follow the same pattern:
    `generate` for prompt-only generation and `edit` for reference-image-based
    generation. Flat aggregate fields such as `maxInputImages`,