The tts tool previously returned a fixed "Generated audio reply."
string in its content, so session transcripts lost what was actually
spoken. Across every channel, a voice-only reply left no text record
for future turns, forcing users to recover transcripts from the
provider's API. Echo the synthesized text back in the tool result
content (audio still delivered via details.media).
Sanitize the transcript before embedding so crafted utterances cannot
inject reply directives when tool output is rendered in verbose mode:
MEDIA: at line start and [[…]] markers are interrupted with a
zero-width word joiner (U+2060) that defuses parseReplyDirectives
without altering the visible text.