Adds Xiaomi MiMo voicedesign TTS support by registering the v2.5 voicedesign model and omitting audio.voice for that model's prompt-driven voice design flow.
Also accepts generic TTS aliases modelId, speakerVoice, and speakerVoiceId for Xiaomi provider config and request overrides.
Fixes exec timeout classification so a process that exits after a missed timeout callback is still reported as timed out, using monotonic deadlines to avoid wall-clock skew.
Verification:
- node scripts/run-vitest.mjs extensions/xiaomi/speech-provider.test.ts
- node scripts/run-vitest.mjs src/process/supervisor/supervisor.test.ts
- node scripts/run-vitest.mjs src/agents/bash-tools.exec-foreground-failures.test.ts
- git diff --check
- autoreview --mode local
- live Xiaomi MiMo voicedesign call returned wav RIFF/WAVE output, 169004 bytes
- GitHub CI success on fb3018ef31: CI 26708919072, CodeQL Critical Quality 26708919082, CodeQL 26708919091, OpenGrep PR Diff 26708919089, Workflow Sanity 26708919083, Dependency Guard 26708918574, Real behavior proof 26708921767
Thanks @GimingRao.
Co-authored-by: Raoyu <2425198313@qq.com>
Co-authored-by: giming <53329020+GimingRao@users.noreply.github.com>
* feat(browser): add optional vision understanding to screenshot tool
* fix(browser): wrap vision output as external content, enforce maxBytes, forward auth profiles
* fix(browser): remove no-op scope/attachments config, drop profile pass-through lacking runtime support
* feat(media-understanding): add profile/preferredProfile to DescribeImageFileWithModelParams and forward to describeImage
* style(browser): add curly braces to satisfy eslint curly rule
* fix(browser): correct tools.browser.enabled help text to match actual behavior
* fix(browser): thread agentDir/workspaceDir from plugin tool context into browser vision
* refactor(browser): move vision config from tools.browser to browser.models
The browser plugin's vision configuration now lives on the top-level
`browser` config namespace (browser.models, browser.visionEnabled,
browser.visionPrompt, etc.) instead of `tools.browser`. This aligns
with the plugin's existing config location and avoids confusion between
tool-level and plugin-level settings.
- Remove tools.browser from ToolsSchema and ToolsConfig
- Add models/vision* fields to BrowserConfig and its zod schema
- Update getBrowserVisionConfig to read from cfg.browser
- Update schema help, labels, and quality test
- Update vision.test.ts to use new config shape
* docs(browser): add screenshot vision configuration section
Document the new browser.models config for automatic screenshot
description via vision models, enabling text-only main models to
reason about web page content.
* fix(browser): remove deliverable media markers from vision result, drop unused import
P1: Vision-success path no longer exposes the raw screenshot as
deliverable media (removes MEDIA: line and details.media.mediaUrl).
This prevents channel delivery from auto-sending sensitive page content
when the intended output is a text description.
P2: Remove unused ToolsMediaUnderstandingSchema import that would fail
noUnusedLocals typecheck.
* fix(browser): add command/args fields to browser models schema
The browser vision model schema uses .strict(), so CLI-type entries
with command/args were rejected by TypeScript. Add these fields to
align with MediaUnderstandingModelSchema.
* chore(browser): remove debug console.log statements
* fix(browser): harden screenshot vision result against MEDIA: directive injection and restore image sanitization on failure fallback
ClawSweeper #84247 review round 2:
P1 (security, high): neutralize line-start MEDIA: directives in vision descriptions
before wrapping with wrapExternalContent. The agent media extractor scans every
browser tool-result text block via splitMediaFromOutput which treats line-start
MEDIA: as a trusted local-media delivery directive, and browser is on the
trusted-media allowlist. Without neutralization, page or vision-provider output
containing 'MEDIA:/tmp/secret.png' could synthesize a channel-deliverable media
artifact from untrusted content. wrapExternalContent itself does not strip
line-start directives. Introduce neutralizeMediaDirectives in vision.ts that
prepends '[neutralized] ' to any line whose trimStart() begins with MEDIA:
(case-insensitive), defanging the parser anchor while keeping the original
text human-readable.
P2 (compatibility): pass resolveRuntimeImageSanitization() to imageResultFromFile
in the vision-failure catch fallback. The non-vision screenshot path already
forwards this option (d5cc0d53b7) so configured agents.defaults.imageMaxDimensionPx
takes effect. Without this fix, any provider timeout/error silently bypasses the
sanitization guard and returns a raw full-resolution screenshot.
Regression coverage:
- vision.test.ts: 6 unit cases for neutralizeMediaDirectives (no-op fast path,
mid-line MEDIA: untouched, line-start defanged, leading-whitespace defanged,
case-insensitive, multiple directives per blob).
- browser-tool.test.ts: 2 integration cases that drive the full screenshot
tool execute path:
- 'neutralizes MEDIA: directives in vision text and does not attach media'
asserts no line matches /^\s*MEDIA:/i in returned text, secret path text
is preserved verbatim, details.media is absent, and imageResultFromFile
is not called on the success path.
- 'preserves screenshot image sanitization on vision failure fallback'
mocks describeImageFileWithModel to reject and asserts the fallback
imageResultFromFile call receives imageSanitization: {maxDimensionPx:1600}
plus the 'browser screenshot vision failed' extraText.
* fix(browser): apply clawsweeper fallback media fix from PR #84247
* refactor: reuse media image understanding for browser screenshots
* refactor: use structured media delivery
* test: update music completion media instruction expectation
* fix: trim buffered reply directive padding
* test: refresh codex prompt snapshots for message media aliases
---------
Co-authored-by: scotthuang <scotthuang@tencent.com>
Co-authored-by: Peter Steinberger <steipete@gmail.com>
Default sidebar label fell back to title 'Text-to-speech', which is fine
on the page header but readers scanning the Tools sidebar look for the
acronym 'TTS'. Add a sidebarTitle so Mintlify renders 'Text to speech
(TTS)' in the sidebar while keeping the canonical page title intact.
Sentence case matches the rest of the Tools sidebar group (e.g.
'Image generation', 'Music generation', 'Video generation').
- docs/tools/tts.md: alphabetize providers in three places that listed
them: the supported-providers table (Azure Speech ... Xiaomi MiMo),
the configuration Tabs (12 provider presets in A-Z), and the field
reference AccordionGroup. Top-level fields stay first; provider
tabs/accordions follow strict alphabetical order. Wording, schema,
and defaults unchanged.
- docs/docs.json: add tools/tts to the main Tools sidebar group
(slotted between trajectory and video-generation, matching the
alphabetical neighborhood with image-generation, music-generation,
video-generation). Previously tts only appeared under
Nodes > Media capabilities, which was a discoverability gap for
readers looking for TTS alongside the other generation tools.
The TTS doc had grown to 1008 lines with 11 separate flat 'X primary'
config blocks, a 100-line dense 'Notes on fields' bullet list, and
the new provider-personas feature (#70748) buried near the bottom.
Restructure for readability and feature visibility:
- Lead with a Steps-based 'Quick start' so first-time readers can
enable TTS in 4 explicit steps.
- Replace the 13-bullet provider list with a single 'Supported
providers' table that names auth env vars and per-provider notes
inline. Add a Warning callout for the Microsoft/edge legacy alias.
- Collapse the 11 'X primary' config blocks into one Tabs component
('OpenAI + ElevenLabs', 'Google Gemini', 'Azure Speech',
'Microsoft (no key)', 'MiniMax', 'Inworld', 'xAI', 'Volcengine',
'Xiaomi MiMo', 'OpenRouter', 'Gradium', 'Local CLI') so users see
one preset at a time and the page is scannable.
- Promote 'Personas' to its own top-level section with two examples
(minimal and the Alfred provider-neutral persona), and add a new
'How providers use persona prompts' AccordionGroup covering Google
(promptTemplate audio-profile-v1, personaPrompt), OpenAI
(instructions auto-mapping), and Other providers, plus a fallback
policy table.
- Note that agents.list[].tts.persona overrides global persona
per-agent (covers the recent feat(tts) per-agent voice-override
work).
- Convert the 100-line 'Notes on fields' wall into a per-provider
AccordionGroup using ParamField, so the field reference is
scannable and field types/defaults are visually distinct.
- Sentence-case headings, drop redundant body H1, fold the flow
diagram inline with Auto-TTS behavior, and refresh the Output
formats section to a table-first layout.
- Schema fields (label/description/provider/fallbackPolicy/prompt
with profile/scene/sampleContext/style/accent/pacing/constraints
and providers map) verified against src/config/types.tts.ts; all
defaults and env-var fallbacks preserved verbatim.
Net diff: 585 insertions, 684 deletions across the same surface
area.
Add Volcengine/BytePlus Seed Speech as a bundled TTS provider with current API-key auth, legacy AppID/token fallback, native Ogg/Opus voice-note output, and MP3 audio-file output.
Co-authored-by: Peter Steinberger <steipete@gmail.com>
Adds the Gradium bundled plugin with TTS and speech-provider registration, docs, label routing, and focused/live coverage.
Also carries the current main lint cleanup needed for the rebased CI lane.
Co-authored-by: laurent <laurent.mazare@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>