Commit Graph

52 Commits

Author SHA1 Message Date
Peter Steinberger
694ca50e97 Revert "refactor: move runtime state to SQLite"
This reverts commit f91de52f0d.
2026-05-13 13:33:38 +01:00
Peter Steinberger
f91de52f0d refactor: move runtime state to SQLite
* refactor: remove stale file-backed shims

* fix: harden sqlite state ci boundaries

* refactor: store matrix idb snapshots in sqlite

* fix: satisfy rebased CI guardrails

* refactor: store current conversation bindings in sqlite table

* refactor: store tui last sessions in sqlite table

* refactor: reset sqlite schema history

* refactor: drop unshipped sqlite table migration

* refactor: remove plugin index file rollback

* refactor: drop unshipped sqlite sidecar migrations

* refactor: remove runtime commitments kv migration

* refactor: preserve kysely sync result types

* refactor: drop unshipped sqlite schema migration table

* test: keep session usage coverage sqlite-backed

* refactor: keep sqlite migration doctor-only

* refactor: isolate device legacy imports

* refactor: isolate push voicewake legacy imports

* refactor: isolate remaining runtime legacy imports

* refactor: tighten sqlite migration guardrails

* test: cover sqlite persisted enum parsing

* refactor: isolate legacy update and tui imports

* refactor: tighten sqlite state ownership

* refactor: move legacy imports behind doctor

* refactor: remove legacy session row lookup

* refactor: canonicalize memory transcript locators

* refactor: drop transcript path scope fallbacks

* refactor: drop runtime legacy session delivery pruning

* refactor: store tts prefs only in sqlite

* refactor: remove cron store path runtime

* refactor: use cron sqlite store keys

* refactor: rename telegram message cache scope

* refactor: read memory dreaming status from sqlite

* refactor: rename cron status store key

* refactor: stop remembering transcript file paths

* test: use sqlite locators in agent fixtures

* refactor: remove file-shaped commitments and cron store surfaces

* refactor: keep compaction transcript handles out of session rows

* refactor: derive transcript handles from session identity

* refactor: derive runtime transcript handles

* refactor: remove gateway session locator reads

* refactor: remove transcript locator from session rows

* refactor: store raw stream diagnostics in sqlite

* refactor: remove file-shaped transcript rotation

* refactor: hide legacy trajectory paths from runtime

* refactor: remove runtime transcript file bridges

* refactor: repair database-first rebase fallout

* refactor: align tests with database-first state

* refactor: remove transcript file handoffs

* refactor: sync post-compaction memory by transcript scope

* refactor: run codex app-server sessions by id

* refactor: bind codex runtime state by session id

* refactor: pass memory transcripts by sqlite scope

* refactor: remove transcript locator cleanup leftovers

* test: remove stale transcript file fixtures

* refactor: remove transcript locator test helper

* test: make cron sqlite keys explicit

* test: remove cron runtime store paths

* test: remove stale session file fixtures

* test: use sqlite cron keys in diagnostics

* refactor: remove runtime delivery queue backfill

* test: drop fake export session file mocks

* refactor: rename acp session read failure flag

* refactor: rename acp row session key

* refactor: remove session store test seams

* refactor: move legacy session parser tests to doctor

* refactor: reindex managed memory in place

* refactor: drop stale session store wording

* refactor: rename session row helpers

* refactor: rename sqlite session entry modules

* refactor: remove transcript locator leftovers

* refactor: trim file-era audit wording

* refactor: clean managed media through sqlite

* fix: prefer explicit agent for exports

* fix: use prepared agent for session resets

* fix: canonicalize legacy codex binding import

* test: rename state cleanup helper

* docs: align backup docs with sqlite state

* refactor: drop legacy Pi usage auth fallback

* refactor: move legacy auth profile imports to doctor

* refactor: keep Pi model discovery auth in memory

* refactor: remove MSTeams legacy learning key fallback

* refactor: store model catalog config in sqlite

* refactor: use sqlite model catalog at runtime

* refactor: remove model json compatibility aliases

* refactor: store auth profiles in sqlite

* refactor: seed copied auth profiles in sqlite

* refactor: make auth profile runtime sqlite-addressed

* refactor: migrate hermes secrets into sqlite auth store

* refactor: move plugin install config migration to doctor

* refactor: rename plugin index audit checks

* test: drop auth file assumptions

* test: remove legacy transcript file assertions

* refactor: drop legacy cli session aliases

* refactor: store skill uploads in sqlite

* refactor: keep subagent attachments in sqlite vfs

* refactor: drop subagent attachment cleanup state

* refactor: move legacy session aliases to doctor

* refactor: require node 24 for sqlite state runtime

* refactor: move provider caches into sqlite state

* fix: harden virtual agent filesystem

* refactor: enforce database-first runtime state

* refactor: rename compaction transcript rotation setting

* test: clean sqlite refactor test types

* refactor: consolidate sqlite runtime state

* refactor: model session conversations in sqlite

* refactor: stop deriving cron delivery from session keys

* refactor: stop classifying sessions from key shape

* refactor: hydrate announce targets from typed delivery

* refactor: route heartbeat delivery from typed sqlite context

* refactor: tighten typed sqlite session routing

* refactor: remove session origin routing shadow

* refactor: drop session origin shadow fixtures

* perf: query sqlite vfs paths by prefix

* refactor: use typed conversation metadata for sessions

* refactor: prefer typed session routing metadata

* refactor: require typed session routing metadata

* refactor: resolve group tool policy from typed sessions

* refactor: delete dead session thread info bridge

* Show Codex subscription reset times in channel errors (#80456)

* feat(plugin-sdk): consolidate session workflow APIs

* fix(agents): allow read-only agent mount reads

* [codex] refresh plugin regression fixtures

* fix(agents): restore compaction gateway logs

* test: tighten gateway startup assertions

* Redact persisted secret-shaped payloads [AI] (#79006)

* test: tighten device pair notify assertions

* test: tighten hermes secret assertions

* test: assert matrix client error shapes

* test: assert config compat warnings

* fix(heartbeat): remap cron-run exec events to session keys (#80214)

* fix(codex): route btw through native side threads

* fix(auth): accept friendly OpenAI order for Codex profiles

* fix(codex): rotate auth profiles inside harness

* fix: keep browser status page probe within timeout

* test: assert agents add outputs

* test: pin cron read status

* fix(agents): avoid Pi resource discovery stalls

Co-authored-by: dataCenter430 <titan032000@gmail.com>

* fix: retire timed-out codex app-server clients

* test: tighten qa lab runtime assertions

* test: check security fix outputs

* test: verify extension runtime messages

* feat(wake): expose typed sessionKey on wake protocol + system event CLI

* fix(gateway): await session_end during shutdown drain and track channel + compaction lifecycle paths (#57790)

* test: guard talk consult call helper

* fix(codex): scale context engine projection (#80761)

* fix(codex): scale context engine projection

* fix: document Codex context projection scaling

* fix: document Codex context projection scaling

* fix: document Codex context projection scaling

* fix: document Codex context projection scaling

* chore: align Codex projection changelog

* chore: realign Codex projection changelog

* fix: isolate Codex projection patch

---------

Co-authored-by: Eva (agent) <eva+agent-78055@100yen.org>
Co-authored-by: Josh Lehman <josh@martian.engineering>

* refactor: move agent runtime state toward piless

* refactor: remove cron session reaper

* refactor: move session management to sqlite

* refactor: finish database-first state migration

* chore: refresh generated sqlite db types

* refactor: remove stale file-backed shims

* test: harden kysely type coverage

# Conflicts:
#	.agents/skills/kysely-database-access/SKILL.md
#	src/infra/kysely-sync.types.test.ts
#	src/proxy-capture/store.sqlite.test.ts
#	src/state/openclaw-agent-db.test.ts
#	src/state/openclaw-state-db.test.ts

* refactor: remove cron store path runtime

* refactor: keep compaction transcript handles out of session rows

* refactor: derive embedded transcripts from sqlite identity

* refactor: remove embedded transcript locator handoff

* refactor: remove runtime transcript file bridges

* refactor: remove transcript file handoffs

* refactor: remove MSTeams legacy learning key fallback

* refactor: store model catalog config in sqlite

* refactor: use sqlite model catalog at runtime

# Conflicts:
#	docs/cli/secrets.md
#	docs/gateway/authentication.md
#	docs/gateway/secrets.md

* fix: keep oauth sibling sync sqlite-local

# Conflicts:
#	src/commands/onboard-auth.test.ts

* refactor: remove task session store maintenance

# Conflicts:
#	src/commands/tasks.ts

* refactor: keep diagnostics in state sqlite

* refactor: enforce database-first runtime state

* refactor: consolidate sqlite runtime state

* Show Codex subscription reset times in channel errors (#80456)

* fix(codex): refresh subscription limit resets

* fix(codex): format reset times for channels

* Update CHANGELOG with latest changes and fixes

Updated CHANGELOG with recent fixes and improvements.

* fix(codex): keep command load failures on codex surface

* fix(codex): format account rate limits as rows

* fix(codex): summarize account limits as usage status

* fix(codex): simplify account limit status

* test: tighten subagent announce queue assertion

* test: tighten session delete lifecycle assertions

* test: tighten cron ops assertions

* fix: track cron execution milestones

* test: tighten hermes secret assertions

* test: assert matrix sync store payloads

* test: assert config compat warnings

* fix(codex): align btw side thread semantics

* fix(codex): honor codex fallback blocking

* fix(agents): avoid Pi resource discovery stalls

* test: tighten codex event assertions

* test: tighten cron assertions

* Fix Codex app-server OAuth harness auth

* refactor: move agent runtime state toward piless

* refactor: move device and push state to sqlite

* refactor: move runtime json state imports to doctor

* refactor: finish database-first state migration

* chore: refresh generated sqlite db types

* refactor: clarify cron sqlite store keys

* refactor: remove stale file-backed shims

* refactor: bind codex runtime state by session id

* test: expect sqlite trajectory branch export

* refactor: rename session row helpers

* fix: keep legacy device identity import in doctor

* refactor: enforce database-first runtime state

* refactor: consolidate sqlite runtime state

* build: align pi contract wrappers

* chore: repair database-first rebase

* refactor: remove session file test contracts

* test: update gateway session expectations

* refactor: stop routing from session compatibility shadows

* refactor: stop persisting session route shadows

* refactor: use typed delivery context in clients

* refactor: stop echoing session route shadows

* refactor: repair embedded runner rebase imports

# Conflicts:
#	src/agents/pi-embedded-runner/run/attempt.tool-call-argument-repair.ts

* refactor: align pi contract imports

* refactor: satisfy kysely sync helper guard

* refactor: remove file transcript bridge remnants

* refactor: remove session locator compatibility

* refactor: remove session file test contracts

* refactor: keep rebase database-first clean

* refactor: remove session file assumptions from e2e

* docs: clarify database-first goal state

* test: remove legacy store markers from sqlite runtime tests

* refactor: remove legacy store assumptions from runtime seams

* refactor: align sqlite runtime helper seams

* test: update memory recall sqlite audit mock

* refactor: align database-first runtime type seams

* test: clarify doctor cron legacy store names

* fix: preserve sqlite session route projections

* test: fix copilot token cache test syntax

* docs: update database-first proof status

* test: align database-first test fixtures

* docs: update database-first proof status

* refactor: clean extension database-first drift

* test: align agent session route proof

* test: clarify doctor legacy path fixtures

* chore: clean database-first changed checks

* chore: repair database-first rebase markers

* build: allow baileys git subdependency

* chore: repair exp-vfs rebase drift

* chore: finish exp-vfs rebase cleanup

* chore: satisfy rebase lint drift

* chore: fix qqbot rebase type seam

* chore: fix rebase drift leftovers

* fix: keep auth profile oauth secrets out of sqlite

* fix: repair rebase drift tests

* test: stabilize pairing request ordering

* test: use source manifests in plugin contract checks

* fix: restore gateway session metadata after rebase

* fix: repair database-first rebase drift

* fix: clean up database-first rebase fallout

* test: stabilize line quick reply receipt time

* fix: repair extension rebase drift

* test: keep transcript redaction tests sqlite-backed

* fix: carry injected transcript redaction through sqlite

* chore: clean database branch rebase residue

* fix: repair database branch CI drift

* fix: repair database branch CI guard drift

* fix: stabilize oauth tls preflight test

* test: align database branch fast guards

* test: repair build artifact boundary guards

* chore: clean changelog rebase markers

---------

Co-authored-by: pashpashpash <nik@vault77.ai>
Co-authored-by: Eva <eva@100yen.org>
Co-authored-by: stainlu <stainlu@newtype-ai.org>
Co-authored-by: Jason Zhou <jason.zhou.design@gmail.com>
Co-authored-by: Ruben Cuevas <hi@rubencu.com>
Co-authored-by: Pavan Kumar Gondhi <pavangondhi@gmail.com>
Co-authored-by: Shakker <shakkerdroid@gmail.com>
Co-authored-by: Kaspre <36520309+Kaspre@users.noreply.github.com>
Co-authored-by: dataCenter430 <titan032000@gmail.com>
Co-authored-by: Kaspre <kaspre@gmail.com>
Co-authored-by: pandadev66 <nova.full.stack@outlook.com>
Co-authored-by: Eva <admin@100yen.org>
Co-authored-by: Eva (agent) <eva+agent-78055@100yen.org>
Co-authored-by: Josh Lehman <josh@martian.engineering>
Co-authored-by: jeffjhunter <support@aipersonamethod.com>
2026-05-13 13:15:12 +01:00
Peter Steinberger
936c02e22c fix(models): fail over OpenRouter budget-limit 403s 2026-05-10 06:34:39 +01:00
奥森木
1f75d3c187 Cron: honor server_error retries (#45594)
Merged via squash.

Prepared head SHA: d2b1d73d32
Co-authored-by: clovericbot <258443621+clovericbot@users.noreply.github.com>
Co-authored-by: altaywtf <9790196+altaywtf@users.noreply.github.com>
Reviewed-by: @altaywtf
2026-05-09 20:10:29 +03:00
Peter Steinberger
0a8c98e7cb test: tighten failover recovery assertions 2026-05-09 15:05:16 +01:00
wenxu007
9df0ae6767 fix(agents,failover): propagate sessionId/lane/provider attribution through FailoverError (#73506)
* fix(agents,failover): propagate sessionId/lane/provider attribution through FailoverError

Adds optional `sessionId` and `lane` fields to `FailoverError` and threads
them — together with the existing `provider`, `model`, `profileId` — through
`describeFailoverError` and `coerceToFailoverError` context, so structured
error log ingestion can attribute exhausted-fallback wrapper errors back
to the originating request instead of dropping the per-profile metadata
when the final wrapper is built.

Fixes #42713.

* fix: preserve failover error attribution

---------

Co-authored-by: Altay <altay@uinaf.dev>
2026-05-01 11:26:56 +03:00
Zhang Xiaofeng
a0900926c3 fix: add CJK error patterns to failover classification (#56242)
* fix: add CJK error patterns to failover classification

Chinese LLM providers (ZhipuAI/GLM, Bailian, Kimi/Moonshot, DeepSeek,
etc.) return error messages in Chinese. The existing failover
classification only matches English patterns, causing these errors to
fall through as unclassified — surfacing raw provider errors to users
instead of triggering model fallback.

Real production example: ZhipuAI error code 1234 returns
'网络错误,错误id:xxx,请联系客服。' (network error). This was not
matched by the existing 'network error' English pattern, so no failover
was triggered despite having a configured fallback model.

Changes:
- Add Chinese patterns to all error categories in failover-matches.ts:
  timeout, serverError, rateLimit, billing, auth, overloaded
- Add Chinese network error detection in formatTransportErrorCopy()
  for user-friendly error messages
- Add comprehensive test coverage for all CJK error categories

Follows the existing precedent set by Chinese context overflow patterns
in isContextOverflowError().

* fix: narrow billing pattern and fix placeholder issue URL

- Change '账户余额' to '账户余额不足' to avoid false positives on
  messages that merely mention account balance (per greptile review)
- Replace XXXXX placeholder with actual issue #56242

* fix: wire CJK auth failover patterns

* fix: classify CJK provider failover errors

* fix: place failover changelog entry in unreleased

---------

Co-authored-by: Altay <altay@uinaf.dev>
2026-04-28 10:44:17 +03:00
willamhou
4b5c2f9aa3 fix(agents/failover): classify bare pi-ai stream wrapper as timeout regardless of provider (#71620) 2026-04-25 18:22:06 +01:00
Ted Li
8cc38c1b86 fix: stop session lock failover (#68700) (thanks @MonkeyLeeT)
* fix(agents): stop treating session lock waits as timeout

* fix(agents): ignore abort-wrapped session lock waits

* fix(agents): keep explicit failover metadata authoritative

* fix(agents): respect inferred failover metadata

* fix(agents): ignore generic abort codes for lock waits

* fix(agents): suppress cause-based lock wait fallback

* fix(agents): type session lock timeout errors

* fix: stop session lock failover (#68700) (thanks @MonkeyLeeT)

---------

Co-authored-by: Ayaan Zaidi <hi@obviy.us>
2026-04-25 09:33:19 +05:30
Altay
9d3c56d236 fix: don't classify 400/422 with no body as format error (#67024)
* fix: keep no-body 400/422 failover errors out of format

* fix: keep failover changelog entry in unreleased fixes
2026-04-25 01:37:28 +03:00
Peter Steinberger
f600e98e5b fix(agents): handle OpenAI web search schema rejects 2026-04-23 06:51:29 +01:00
Peter Steinberger
69c78fbef0 perf(test): dedupe model config fixtures 2026-04-20 12:26:47 +01:00
xiwuqi
7fbd31818b fix: classify invalid-model fallback errors (#50028)
Merged via squash.

Prepared head SHA: 04b13e09e1
Co-authored-by: xiwuqi <64734786+xiwuqi@users.noreply.github.com>
Co-authored-by: altaywtf <9790196+altaywtf@users.noreply.github.com>
Reviewed-by: @altaywtf
2026-04-14 20:32:29 +03:00
屈定
95ee120a91 fix: classify openrouter json 404 model errors
Rewrites the stale branch on top of current `main` and preserves the original issue as regression coverage for the exact OpenRouter JSON 404 payload from #51571.

No production behavior changes are introduced here; current `main` already classifies this payload as `model_not_found`, and this merge locks that in across the shared matcher, failover classifier, and fallback loop.

Co-authored-by: 屈定 <mrdear@users.noreply.github.com>
Co-authored-by: Altay <altay@uinaf.dev>
2026-04-13 19:53:55 +01:00
Ayaan Zaidi
004781955c fix: restore model-scoped deprecation fallback matching 2026-04-10 13:57:00 +05:30
Ted Li
d78d91f8c2 fix: continue fallback after OpenRouter no-endpoints 404 (#61472) (thanks @MonkeyLeeT)
* Fix OpenRouter no-endpoints fallback classification

* Restore bare model-not-found matcher coverage

* Preserve model does-not-exist fallback classification

* Narrow does-not-exist model-not-found matching

* Keep runtime model-not-found matcher strict

* style(agents): drop model matcher comment

* fix: continue fallback after OpenRouter no-endpoints 404 (#61472) (thanks @MonkeyLeeT)

---------

Co-authored-by: Ayaan Zaidi <hi@obviy.us>
2026-04-10 13:46:14 +05:30
Altay
ae460eff84 fix(failover): scope openrouter-specific matchers (#60909) 2026-04-04 18:24:03 +03:00
Aaron Zhu
983909f826 fix(agents): classify generic provider errors for failover (#59325)
* fix(agents): classify generic provider errors for failover

Anthropic returns bare 'An unknown error occurred' during API instability
and OpenRouter wraps upstream failures as 'Provider returned error'. Neither
message was recognized by the failover classifier, so the error surfaced
directly to users instead of triggering the configured fallback chain.

Add both patterns to the serverError classifier so they are classified as
transient server errors (timeout) and trigger model failover.

Closes #49706
Closes #45834

* fix(agents): scope unknown-error failover by provider

* docs(changelog): note provider-scoped unknown-error failover

---------

Co-authored-by: Aaron Zhu <aaron@Aarons-MacBook-Air.local>
Co-authored-by: Altay <altay@uinaf.dev>
2026-04-04 18:11:46 +03:00
Rockcent
b2f972e364 fix(failover): OpenRouter 403 Key limit exceeded triggers billing fallback (#59892)
Merged via squash.

Prepared head SHA: 7f8265231c
Co-authored-by: rockcent <128210877+rockcent@users.noreply.github.com>
Co-authored-by: altaywtf <9790196+altaywtf@users.noreply.github.com>
Reviewed-by: @altaywtf
2026-04-04 17:03:21 +03:00
Peter Steinberger
14cfcdba1a docs(test): refresh stale model refs 2026-04-04 08:05:49 +01:00
Peter Steinberger
6739c28718 refactor: clarify auth failover policy 2026-04-04 02:49:18 +09:00
Peter Steinberger
865fa2ba72 fix: narrow auth permanent lockouts 2026-04-04 02:35:27 +09:00
Peter Steinberger
131f6dac37 refactor: unify failover signal classification 2026-04-01 21:50:58 +09:00
Peter Steinberger
cd4b03c568 fix: unify structured provider failover classification (#58856) (thanks @aaron-he-zhu) 2026-04-01 21:31:18 +09:00
nikus-pan
bef4fa55f5 fix(model-fallback): add HTTP 410 to failover reason classification (#55201)
Merged via squash.

Prepared head SHA: 9c1780b739
Co-authored-by: nikus-pan <71585761+nikus-pan@users.noreply.github.com>
Co-authored-by: altaywtf <9790196+altaywtf@users.noreply.github.com>
Reviewed-by: @altaywtf
2026-03-28 14:03:20 +03:00
Craig Allan-McWilliams
984f98be95 Fix: treat HTTP 500 as a transient failover error (#55332)
HTTP 500 (Internal Server Error) was not triggering model fallback,
causing agents to fail outright instead of trying the next candidate.
This is inconsistent with TRANSIENT_HTTP_ERROR_CODES which already
includes 500. Aligns the direct status check with that constant.

Co-authored-by: Craig McWilliams <craigamcw@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 20:05:15 -04:00
Darshil
e403ed6546 fix: harden wrapped rate-limit failover (openclaw#39820) thanks @lupuletic 2026-03-13 23:25:04 -07:00
jnMetaCode
7332e6d609 fix(failover): classify HTTP 422 as format and OpenRouter credits as billing (#43823)
Merged via squash.

Prepared head SHA: 4f48e977fe
Co-authored-by: jnMetaCode <12096460+jnMetaCode@users.noreply.github.com>
Co-authored-by: altaywtf <9790196+altaywtf@users.noreply.github.com>
Reviewed-by: @altaywtf
2026-03-13 00:50:28 +03:00
bwjoke
fd568c4f74 fix(failover): classify ZenMux quota-refresh 402 as rate_limit (#43917)
Merged via squash.

Prepared head SHA: 1d58a36a77
Co-authored-by: bwjoke <1284814+bwjoke@users.noreply.github.com>
Co-authored-by: altaywtf <9790196+altaywtf@users.noreply.github.com>
Reviewed-by: @altaywtf
2026-03-13 00:06:43 +03:00
Wayne
d93db0fc13 fix(failover): classify z.ai network_error stop reason as retryable timeout (#43884)
Merged via squash.

Prepared head SHA: 9660f6cd5b
Co-authored-by: hougangdev <105773686+hougangdev@users.noreply.github.com>
Co-authored-by: altaywtf <9790196+altaywtf@users.noreply.github.com>
Reviewed-by: @altaywtf
2026-03-13 00:00:44 +03:00
jnMetaCode
f640326e31 fix(failover): add missing network errno patterns to text-based timeout classifier (#42830)
Merged via squash.

Prepared head SHA: 91761487e8
Co-authored-by: jnMetaCode <12096460+jnMetaCode@users.noreply.github.com>
Co-authored-by: altaywtf <9790196+altaywtf@users.noreply.github.com>
Reviewed-by: @altaywtf
2026-03-12 12:34:44 +03:00
alan blount
c9a6c542ef Add HTTP 499 to transient error codes for model fallback (#41468)
Merged via squash.

Prepared head SHA: 0053bae140
Co-authored-by: zeroasterisk <23422+zeroasterisk@users.noreply.github.com>
Co-authored-by: altaywtf <9790196+altaywtf@users.noreply.github.com>
Reviewed-by: @altaywtf
2026-03-10 01:55:10 +03:00
Peter Lee
92648f9ba9 fix(agents): broaden 402 temporary-limit detection and allow billing cooldown probe (#38533)
Merged via squash.

Prepared head SHA: 282b9186c6
Co-authored-by: xialonglee <22994703+xialonglee@users.noreply.github.com>
Co-authored-by: altaywtf <9790196+altaywtf@users.noreply.github.com>
Reviewed-by: @altaywtf
2026-03-08 10:27:01 +03:00
Altay
6e962d8b9e fix(agents): handle overloaded failover separately (#38301)
* fix(agents): skip auth-profile failure on overload

* fix(agents): note overload auth-profile fallback fix

* fix(agents): classify overloaded failures separately

* fix(agents): back off before overload failover

* fix(agents): tighten overload probe and backoff state

* fix(agents): persist overloaded cooldown across runs

* fix(agents): tighten overloaded status handling

* test(agents): add overload regression coverage

* fix(agents): restore runner imports after rebase

* test(agents): add overload fallback integration coverage

* fix(agents): harden overloaded failover abort handling

* test(agents): tighten overload classifier coverage

* test(agents): cover all-overloaded fallback exhaustion

* fix(cron): retry overloaded fallback summaries

* fix(cron): treat HTTP 529 as overloaded retry
2026-03-07 01:42:11 +03:00
Xinhua Gu
01b20172b8 fix(failover): classify HTTP 402 as rate_limit when payload indicates usage limit (#30484) (#36802)
* fix(failover): classify HTTP 402 as rate_limit when payload indicates usage limit (#30484)

Some providers (notably Anthropic Claude Max plan) surface temporary
usage/rate-limit failures as HTTP 402 instead of 429. Before this change,
all 402s were unconditionally mapped to 'billing', which produced a
misleading 'run out of credits' warning for Max plan users who simply
hit their usage window.

This follows the same pattern introduced for HTTP 400 in #36783: check
the error message for an explicit rate-limit signal before falling back
to the default status-code classification.

- classifyFailoverReasonFromHttpStatus now returns 'rate_limit' for 402
  when isRateLimitErrorMessage matches the payload text
- Added regression tests covering both the rate-limit and billing paths
  on 402

* fix: narrow 402 rate-limit matcher to prevent billing misclassification

The original implementation used isRateLimitErrorMessage(), which matches
phrases like 'quota exceeded' that legitimately appear in billing errors.

This commit replaces it with a narrow, 402-specific matcher that requires
BOTH retry language (try again/retry/temporary/cooldown) AND limit
terminology (usage limit/rate limit/organization usage).

Prevents misclassification of errors like:
'HTTP 402: exceeded quota, please add credits' -> billing (not rate_limit)

Added regression test for the ambiguous case.

---------

Co-authored-by: Val Alexander <bunsthedev@gmail.com>
2026-03-06 03:45:36 -06:00
zhouhe-xydt
a65d70f84b Fix failover for zhipuai 1310 Weekly/Monthly Limit Exhausted (#33813)
Merged via squash.

Prepared head SHA: 3dc441e58d
Co-authored-by: zhouhe-xydt <265407618+zhouhe-xydt@users.noreply.github.com>
Co-authored-by: altaywtf <9790196+altaywtf@users.noreply.github.com>
Reviewed-by: @altaywtf
2026-03-06 12:04:09 +03:00
Altay
49acb07f9f fix(agents): classify insufficient_quota 400s as billing (#36783) 2026-03-06 01:17:48 +03:00
Altay
6859619e98 test(agents): add provider-backed failover regressions (#36735)
* test(agents): add provider-backed failover fixtures

* test(agents): cover more provider error docs

* test(agents): tighten provider doc fixtures
2026-03-06 00:42:59 +03:00
Altay
f014e255df refactor(agents): share failover HTTP status classification (#36615)
* fix(agents): classify transient failover statuses consistently

* fix(agents): preserve legacy failover status mapping
2026-03-05 23:50:36 +03:00
AI南柯(KingMo)
30ab9b2068 fix(agents): recognize connection errors as retryable timeout failures (#31697)
* fix(agents): recognize connection errors as retryable timeout failures

## Problem

When a model endpoint becomes unreachable (e.g., local proxy down,
relay server offline), the failover system fails to switch to the
next candidate model. Errors like "Connection error." are not
classified as retryable, causing the session to hang on a broken
endpoint instead of falling back to healthy alternatives.

## Root Cause

Connection/network errors are not recognized by the current failover
classifier:
- Text patterns like "Connection error.", "fetch failed", "network error"
- Error codes like ECONNREFUSED, ENOTFOUND, EAI_AGAIN (in message text)

While `failover-error.ts` handles these as error codes (err.code),
it misses them when they appear as plain text in error messages.

## Solution

Extend timeout error patterns to include connection/network failures:

**In `errors.ts` (ERROR_PATTERNS.timeout):**
- Text: "connection error", "network error", "fetch failed", etc.
- Regex: /\beconn(?:refused|reset|aborted)\b/i, /\benotfound\b/i, /\beai_again\b/i

**In `failover-error.ts` (TIMEOUT_HINT_RE):**
- Same patterns for non-assistant error paths

## Testing

Added test cases covering:
- "Connection error."
- "fetch failed"
- "network error: ECONNREFUSED"
- "ENOTFOUND" / "EAI_AGAIN" in message text

## Impact

- **Compatibility:** High - only expands retryable error detection
- **Behavior:** Connection failures now trigger automatic fallback
- **Risk:** Low - changes are additive and well-tested

* style: fix code formatting for test file
2026-03-03 02:37:23 +00:00
Peter Steinberger
1bd20dbdb6 fix(failover): treat stop reason error as timeout 2026-03-03 01:05:24 +00:00
Peter Steinberger
a2fdc3415f fix(failover): handle unhandled stop reason error 2026-03-03 01:05:24 +00:00
Saurabh
1ef9a2a8ea fix: handle HTTP 529 (Anthropic overloaded) in failover error classification
Classify Anthropic's 529 status code as "rate_limit" so model fallback
triggers reliably without depending on fragile message-based detection.

Closes #28502
2026-03-02 18:59:10 +00:00
Sid
40e078a567 fix(auth): classify permission_error as auth_permanent for profile fallback (#31324)
When an OAuth auth profile returns HTTP 403 with permission_error
(e.g. expired plan), the error was not matched by the authPermanent
patterns. This caused the profile to receive only a short cooldown
instead of being disabled, so the gateway kept retrying the same
broken profile indefinitely.

Add "permission_error" and "not allowed for this organization" to
the authPermanent error patterns so these errors trigger the longer
billing/auth_permanent disable window and proper profile rotation.

Closes #31306

Made-with: Cursor

Co-authored-by: Vincent Koc <vincentkoc@ieee.org>
2026-03-01 22:26:05 -08:00
Aleksandrs Tihenko
c0026274d9 fix(auth): distinguish revoked API keys from transient auth errors (#25754)
Merged via /review-pr -> /prepare-pr -> /merge-pr.

Prepared head SHA: 8f9c07a200
Co-authored-by: rrenamed <87486610+rrenamed@users.noreply.github.com>
Co-authored-by: gumadeiras <5599352+gumadeiras@users.noreply.github.com>
Reviewed-by: @gumadeiras
2026-02-25 19:47:16 -05:00
taw0002
3c57bf4c85 fix: treat HTTP 502/503/504 as failover-eligible (timeout reason) (#21017)
* fix: treat HTTP 502/503/504 as failover-eligible (timeout reason)

When a model API returns 502 Bad Gateway, 503 Service Unavailable, or
504 Gateway Timeout, the error object carries the status code directly.
resolveFailoverReasonFromError() only checked 402/429/401/403/408/400,
so 5xx server errors fell through to message-based classification which
requires the status code to appear at the start of the error message.

Many API SDKs (Google, Anthropic) set err.status = 503 without prefixing
the message with '503', so the message classifier never matched and
failover never triggered — the run retried the same broken model.

Add 502/503/504 to the status-code branch, returning 'timeout' (matching
the existing behavior of isTransientHttpError in the message classifier).

Fixes #20999

* Changelog: add failover 502/503/504 note with credits

* Failover: classify HTTP 504 as transient in message parser

* Changelog: credit taw0002 and vincentkoc for failover fix

---------

Co-authored-by: Vincent Koc <vincentkoc@ieee.org>
2026-02-23 03:01:57 -05:00
Peter Steinberger
adace58505 test: reclassify local helper suites out of agents e2e 2026-02-22 10:53:40 +00:00
Peter Steinberger
9131b22a28 test: migrate suites to e2e coverage layout 2026-02-13 14:28:22 +00:00
Oren
71b4be8799 fix: handle 400 status in failover to enable model fallback (#1879) 2026-02-08 23:12:06 -08:00
Peter Steinberger
c379191f80 chore: migrate to oxlint and oxfmt
Co-authored-by: Christoph Nakazawa <christoph.pojer@gmail.com>
2026-01-14 15:02:19 +00:00