Per @steipete review on #68310: the silent-error retry must not fire when the
failed attempt already recorded potential side effects (a messaging tool send,
a cron add, or a mutating tool call that wasn't round-tripped as replay-safe).
Otherwise resubmission can duplicate those actions.
Adds `!attempt.replayMetadata.hadPotentialSideEffects` to the retry condition,
mirroring the gate used by resolveEmptyResponseRetryInstruction and the
planning-only / reasoning-only retry resolvers in run/incomplete-turn.ts.
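A minimal sketch of the gated condition, with illustrative identifiers and an
approximate attempt shape (the real types and loop live in the pi-embedded-runner
sources):

```ts
// Illustrative only: approximate attempt shape and the added side-effect guard.
interface AttemptLike {
  stopReason: string;
  usage: { output: number };
  content: unknown[];
  replayMetadata: { hadPotentialSideEffects: boolean; replaySafe: boolean };
}

function shouldRetrySilentError(attempt: AttemptLike, retriesSoFar: number): boolean {
  const MAX_EMPTY_ERROR_RETRIES = 3;
  return (
    attempt.stopReason === "error" &&
    attempt.usage.output === 0 &&
    attempt.content.length === 0 &&
    // The new gate: never resubmit once side effects may have happened.
    !attempt.replayMetadata.hadPotentialSideEffects &&
    retriesSoFar < MAX_EMPTY_ERROR_RETRIES
  );
}
```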
Adds a new negative regression test:
"does not retry when the failed attempt recorded side effects"
which follows the reviewer's repro: stopReason=error, output=0, and empty
content, but with replayMetadata={hadPotentialSideEffects: true, replaySafe: false}.
Expected: no retry; the incomplete-turn error is surfaced. Confirmed locally.
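Roughly, the failed-attempt shape the test feeds in (field names taken from the
prose above; the actual fixture helpers may differ):

```ts
// Hypothetical fixture for the negative case: a silent error turn that
// nevertheless recorded potential side effects, so no retry may fire.
const failedAttempt = {
  stopReason: "error",
  usage: { output: 0 },
  content: [],
  replayMetadata: { hadPotentialSideEffects: true, replaySafe: false },
};
```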
ollama/glm-5.1:cloud (and occasionally other models) can end a turn with
stopReason="error", usage.output=0, and empty content[] after a successful
tool-call sequence. The existing empty-response retry path in
src/agents/pi-embedded-runner/run/incomplete-turn.ts is gated on
isStrictAgenticSupportedProviderModel (gpt-5 family only), so non-frontier
models fall through to "incomplete turn detected" with payloads=0 and no
recovery. The user sees no reply and has to nudge.
Add a narrow, model-agnostic resubmission inside the attempt loop, placed
before the incompleteTurnText surface-to-user return:
- stopReason === "error"
- usage.output === 0
- content.length === 0 (excludes reasoning-only error turns)
- bounded by MAX_EMPTY_ERROR_RETRIES = 3
No instruction injection and no model gating: the retry reuses the same prompt
and the same session transcript (tool results already captured) and simply lets
the loop try again.
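A sketch of that placement, assuming an illustrative runAttempt() in place of
one pass of the real attempt loop:

```ts
// Hypothetical attempt runner standing in for one pass of the attempt loop.
declare function runAttempt(): Promise<{
  stopReason: string;
  usage: { output: number };
  content: unknown[];
}>;

const MAX_EMPTY_ERROR_RETRIES = 3;

async function runWithSilentErrorRetry() {
  let emptyErrorRetries = 0;
  while (true) {
    const attempt = await runAttempt(); // same prompt, same session transcript

    const silentError =
      attempt.stopReason === "error" &&
      attempt.usage.output === 0 &&
      attempt.content.length === 0; // reasoning-only error turns are excluded

    if (silentError && emptyErrorRetries < MAX_EMPTY_ERROR_RETRIES) {
      emptyErrorRetries += 1;
      continue; // retry before the incompleteTurnText surface-to-user return
    }
    return attempt; // falls through to the existing incomplete-turn handling
  }
}
```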
New test file run.empty-error-retry.test.ts covers:
1. Retries for ollama/glm-5.1:cloud → succeeds on 2nd attempt.
2. Caps at 3 retries → 4 total attempts → surfaces incomplete-turn error.
3. Does NOT retry when output > 0 (preserve produced text).
4. Does NOT retry when stopReason=stop + output=0 (NO_REPLY path).
5. Retries for anthropic/claude-opus-4-7 too — model-agnostic.
Relates to #68281.
* fix(telegram): release undici dispatchers via TelegramTransport.close()
TelegramTransport now exposes an explicit close() that destroys every
owned undici dispatcher (default Agent plus lazily-created IPv4 and
IP-pinned fallback Agents) and the TCP sockets they hold. Dispatcher
constructors are also given bounded keep-alive defaults
(keepAliveTimeout, keepAliveMaxTimeout, connections, pipelining) as a
defence-in-depth layer so the pool cannot grow unbounded even if a
caller forgets to call close().
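A minimal sketch of that lifecycle against undici's public API; the class shape,
tracked-dispatcher list, and option values are illustrative assumptions, not the
actual TelegramTransport code:

```ts
import { Agent, type Dispatcher } from "undici";

class TelegramTransportSketch {
  #dispatchers: Dispatcher[] = [];

  #createAgent(): Agent {
    // Bounded keep-alive defaults (values illustrative) so the pool cannot
    // grow unbounded even if close() is never called.
    const agent = new Agent({
      keepAliveTimeout: 10_000,
      keepAliveMaxTimeout: 30_000,
      connections: 10,
      pipelining: 1,
    });
    this.#dispatchers.push(agent);
    return agent;
  }

  async close(): Promise<void> {
    // destroy() tears down the pooled TCP sockets each dispatcher holds.
    await Promise.all(this.#dispatchers.map((d) => d.destroy()));
    this.#dispatchers = [];
  }
}
```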
Without this, every transport that went through a fallback retry left
its fallback Agents anchored forever in a closure; long-running polling
sessions accumulated hundreds of ESTABLISHED keep-alive sockets to
api.telegram.org, saturating the per-IP quota on upstream forward
proxies and making the currently-active outbound node time out while
every other node still tested healthy.
Mock dispatchers in fetch.test.ts gain destroy() spies so the close()
chain is assertable. Transports built from globalThis.fetch at caller-owned
sites (delivery.resolve-media, test helpers) now expose an async no-op
close(), matching the new required surface.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(telegram): dispose polling transport on shutdown and dirty rebuild
Every recoverable network error and stall-watchdog trip sets
TelegramPollingTransportState.#transportDirty so the next polling
cycle rebuilds the transport inside acquireForNextCycle(). Previously
the rebuild simply overwrote the field, leaving the old transport's
keep-alive sockets anchored in the now-unreferenced dispatcher — the
polling loop has no natural GC point for these resources, and Node's
object GC never touches OS-level sockets.
acquireForNextCycle() now closes the previous transport (fire-and-
forget so the polling cycle is not blocked by a slow destroy) before
swapping in the rebuilt one. dispose() is a new method that the owning
TelegramPollingSession calls from the finally block of runUntilAbort(),
so a single transport is always tied to a single polling session
lifetime. After dispose(), acquireForNextCycle() returns undefined to
prevent zombie rebuilds.
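Roughly, the rebuild-and-dispose flow described above (names mirror the prose,
but the body and the transport factory are illustrative):

```ts
interface TelegramTransportLike {
  close(): Promise<void>;
}

// Hypothetical factory standing in for the real transport construction.
declare function createTelegramTransport(): TelegramTransportLike;

class PollingTransportStateSketch {
  #transport?: TelegramTransportLike;
  #transportDirty = false;
  #disposed = false;

  markDirty(): void {
    // Set on every recoverable network error and stall-watchdog trip.
    this.#transportDirty = true;
  }

  acquireForNextCycle(): TelegramTransportLike | undefined {
    if (this.#disposed) return undefined; // no zombie rebuilds after dispose()
    if (this.#transportDirty || !this.#transport) {
      // Fire-and-forget close so a slow destroy never blocks the polling cycle.
      void this.#transport?.close().catch(() => {});
      this.#transport = createTelegramTransport();
      this.#transportDirty = false;
    }
    return this.#transport;
  }

  async dispose(): Promise<void> {
    // Called from the finally block of the owning polling session.
    this.#disposed = true;
    await this.#transport?.close();
    this.#transport = undefined;
  }
}
```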
Under high sustained polling traffic over long-lived sessions, this is
what stops the per-gateway connection count to api.telegram.org from
growing indefinitely and saturating upstream proxy quotas.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(changelog): note Telegram undici dispatcher lifecycle fix
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(telegram): disable HTTP/2 for all Telegram polling dispatchers
Undici 8 enables HTTP/2 ALPN by default, but Telegram's long-polling
connections stall on Windows due to IPv6 + H2 multiplexing issues. The
core fetch-guard already sets allowH2:false for guarded paths, but the
Telegram extension creates its own Agent/ProxyAgent/EnvHttpProxyAgent
instances directly from undici without this flag.
Apply allowH2:false to all dispatcher constructors in the Telegram
transport layer, matching the approach used in src/infra/net/undici-runtime.ts.
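For reference, a sketch of the flag as passed to the undici constructors (other
options omitted; the proxy URI is a placeholder):

```ts
import { Agent, ProxyAgent } from "undici";

// allowH2:false keeps the Telegram dispatchers on HTTP/1.1, matching the
// guarded paths in src/infra/net/undici-runtime.ts.
const directAgent = new Agent({ allowH2: false });
const proxiedAgent = new ProxyAgent({
  uri: "http://proxy.example:8080", // placeholder, not a real proxy
  allowH2: false,
});
```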
Fixes #66885
* fix: avoid false telegram polling stall restarts
* fix(telegram): publish polling health liveness
---------
Co-authored-by: Ethan Chen <ethanbit@qq.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Magicray1217 <magicray1217@users.noreply.github.com>
Co-authored-by: aoao <aoao@openclaw>