The bug: three persist sites accumulated cost instead of snapshotting
it like tokens. This caused cost to be inflated 1x-72x on multi-persist
sessions because the same cumulative usage was added repeatedly.
Root cause: persistSessionUsageUpdate, updateSessionStoreAfterAgentRun,
and the cron isolated-agent run path all used:
estimatedCostUsd = existingCost + runCost
But runCost was already computed from cumulative run usage, so this
added the same cost repeatedly on redundant persists.
Fix: snapshot cost directly like tokens already do:
estimatedCostUsd = runCost
Files affected:
- src/auto-reply/reply/session-usage.ts
- src/agents/command/session-store.ts
- src/cron/isolated-agent/run.ts
Tests added:
- session-store.test.ts: verify cost is snapshotted, not accumulated
- session.test.ts: updated existing test to verify snapshot behavior
Fixes#69347
Three corrections to the auto-failover self-healing introduced in the prior commit:
1. Reset in-memory provider/model to configured primary after clearing auto override.
get-reply-directives.ts preloads provider/model from the stored override before
calling createModelSelectionState, so clearing only session state still ran the
current turn on the fallback. Now provider/model are reset to defaultProvider/
defaultModel so this turn retries the primary immediately, not on the next turn.
2. Remove resetModelOverride = true from the auto-heal path. That flag triggers a
"Model override not allowed for this agent" system event in
applyInlineDirectiveOverrides, which is incorrect: the override was valid and set
by the fallback loop — it just expired once the primary recovered. Auto-heal is
not an allowlist violation.
3. Add a test case that verifies the in-memory reset when the caller pre-loads the
fallback provider/model (simulating the get-reply-directives.ts preload path).
Known limitation (noted in comment): channel model overrides (channels.modelByChannel)
are skipped on the recovery turn because hasSessionModelOverride was true when they
were evaluated at preload time. They resume on the following turn once session state
is clear. Fixing this cleanly requires changes to the get-reply-directives preload
flow and is out of scope for this PR.
When runWithModelFallback falls back to a secondary provider it writes
providerOverride/modelOverride/modelOverrideSource:"auto" to the session.
On subsequent turns createModelSelectionState read this stored override and
passed the fallback provider directly to runWithModelFallback, so the
configured primary was never retried — the session was permanently pinned to
the fallback even after the primary recovered.
Fix: at model-selection ingress, when the direct session override has
modelOverrideSource "auto" (set by a previous automatic fallback, not a user
/model command), clear the override and retry the configured primary. If the
primary is still down runWithModelFallback will fall back and re-set the auto
override for that turn. Once the primary recovers the override stays clear.
User-selected overrides (modelOverrideSource "user" or legacy undefined+model)
are preserved unchanged.
Covered by four new unit tests in model-selection.test.ts:
- auto-failover override cleared and primary retried
- user-selected override preserved
- legacy override without source field preserved
- parent-session auto-override applied to child (not cleared by child logic)
Adds tiered model pricing support for cost tracking, keeps configured pricing ahead of cached catalog values, and includes latest Moonshot Kimi K2.6/K2.5 cost estimates.\n\nThanks @sliverp.
Adds missing compatibility runtime path metadata for bundled SecretRef-capable web-search providers and keeps the manifest registry covered by a regression test.\n\nThanks @afurm!
Raise the Telegram polling watchdog default from 90s to 120s and add bounded channels.telegram.pollingStallThresholdMs overrides, including per-account config.\n\nThanks @Vitalcheffe.
Per @steipete review on #68310: the silent-error retry must not fire when the
failed attempt already recorded potential side effects (messaging tool sent,
cron add, or a mutating tool call that wasn't round-tripped as replay-safe).
Otherwise resubmission can duplicate those actions.
Adds `!attempt.replayMetadata.hadPotentialSideEffects` to the retry condition,
mirroring the gate used by resolveEmptyResponseRetryInstruction and the
planning-only / reasoning-only retry resolvers in run/incomplete-turn.ts.
Adds a new negative regression test:
"does not retry when the failed attempt recorded side effects"
which reproduces the reviewer's repro — stopReason=error + output=0 + empty
content, but replayMetadata={hadPotentialSideEffects: true, replaySafe: false}.
Expected: no retry, surfaces incomplete-turn error. Confirmed locally.
ollama/glm-5.1:cloud (and occasionally other models) can end a turn with
stopReason="error", usage.output=0, and empty content[] after a successful
tool-call sequence. The existing empty-response retry path in
src/agents/pi-embedded-runner/run/incomplete-turn.ts is gated on
isStrictAgenticSupportedProviderModel (gpt-5 family only), so non-frontier
models fall through to "incomplete turn detected" with payloads=0 and no
recovery. The user sees no reply and has to nudge.
Add a narrow, model-agnostic resubmission inside the attempt loop, placed
before the incompleteTurnText surface-to-user return:
- stopReason === "error"
- usage.output === 0
- content.length === 0 (excludes reasoning-only error turns)
- bounded by MAX_EMPTY_ERROR_RETRIES = 3
No instruction injection, no model gating; same prompt, same session
transcript (tool results already captured), just let the loop try again.
New test file run.empty-error-retry.test.ts covers:
1. Retries for ollama/glm-5.1:cloud → succeeds on 2nd attempt.
2. Caps at 3 retries → 4 total attempts → surfaces incomplete-turn error.
3. Does NOT retry when output > 0 (preserve produced text).
4. Does NOT retry when stopReason=stop + output=0 (NO_REPLY path).
5. Retries for anthropic/claude-opus-4-7 too — model-agnostic.
Relates to #68281.