Commit Graph

19 Commits

Author SHA1 Message Date
Altay
6e962d8b9e fix(agents): handle overloaded failover separately (#38301)
* fix(agents): skip auth-profile failure on overload

* fix(agents): note overload auth-profile fallback fix

* fix(agents): classify overloaded failures separately

* fix(agents): back off before overload failover

* fix(agents): tighten overload probe and backoff state

* fix(agents): persist overloaded cooldown across runs

* fix(agents): tighten overloaded status handling

* test(agents): add overload regression coverage

* fix(agents): restore runner imports after rebase

* test(agents): add overload fallback integration coverage

* fix(agents): harden overloaded failover abort handling

* test(agents): tighten overload classifier coverage

* test(agents): cover all-overloaded fallback exhaustion

* fix(cron): retry overloaded fallback summaries

* fix(cron): treat HTTP 529 as overloaded retry
2026-03-07 01:42:11 +03:00
Xinhua Gu
01b20172b8 fix(failover): classify HTTP 402 as rate_limit when payload indicates usage limit (#30484) (#36802)
* fix(failover): classify HTTP 402 as rate_limit when payload indicates usage limit (#30484)

Some providers (notably Anthropic Claude Max plan) surface temporary
usage/rate-limit failures as HTTP 402 instead of 429. Before this change,
all 402s were unconditionally mapped to 'billing', which produced a
misleading 'run out of credits' warning for Max plan users who simply
hit their usage window.

This follows the same pattern introduced for HTTP 400 in #36783: check
the error message for an explicit rate-limit signal before falling back
to the default status-code classification.

- classifyFailoverReasonFromHttpStatus now returns 'rate_limit' for 402
  when isRateLimitErrorMessage matches the payload text
- Added regression tests covering both the rate-limit and billing paths
  on 402

* fix: narrow 402 rate-limit matcher to prevent billing misclassification

The original implementation used isRateLimitErrorMessage(), which matches
phrases like 'quota exceeded' that legitimately appear in billing errors.

This commit replaces it with a narrow, 402-specific matcher that requires
BOTH retry language (try again/retry/temporary/cooldown) AND limit
terminology (usage limit/rate limit/organization usage).

Prevents misclassification of errors like:
'HTTP 402: exceeded quota, please add credits' -> billing (not rate_limit)

Added regression test for the ambiguous case.

---------

Co-authored-by: Val Alexander <bunsthedev@gmail.com>
2026-03-06 03:45:36 -06:00
zhouhe-xydt
a65d70f84b Fix failover for zhipuai 1310 Weekly/Monthly Limit Exhausted (#33813)
Merged via squash.

Prepared head SHA: 3dc441e58d
Co-authored-by: zhouhe-xydt <265407618+zhouhe-xydt@users.noreply.github.com>
Co-authored-by: altaywtf <9790196+altaywtf@users.noreply.github.com>
Reviewed-by: @altaywtf
2026-03-06 12:04:09 +03:00
Altay
49acb07f9f fix(agents): classify insufficient_quota 400s as billing (#36783) 2026-03-06 01:17:48 +03:00
Altay
6859619e98 test(agents): add provider-backed failover regressions (#36735)
* test(agents): add provider-backed failover fixtures

* test(agents): cover more provider error docs

* test(agents): tighten provider doc fixtures
2026-03-06 00:42:59 +03:00
Altay
f014e255df refactor(agents): share failover HTTP status classification (#36615)
* fix(agents): classify transient failover statuses consistently

* fix(agents): preserve legacy failover status mapping
2026-03-05 23:50:36 +03:00
AI南柯(KingMo)
30ab9b2068 fix(agents): recognize connection errors as retryable timeout failures (#31697)
* fix(agents): recognize connection errors as retryable timeout failures

## Problem

When a model endpoint becomes unreachable (e.g., local proxy down,
relay server offline), the failover system fails to switch to the
next candidate model. Errors like "Connection error." are not
classified as retryable, causing the session to hang on a broken
endpoint instead of falling back to healthy alternatives.

## Root Cause

Connection/network errors are not recognized by the current failover
classifier:
- Text patterns like "Connection error.", "fetch failed", "network error"
- Error codes like ECONNREFUSED, ENOTFOUND, EAI_AGAIN (in message text)

While `failover-error.ts` handles these as error codes (err.code),
it misses them when they appear as plain text in error messages.

## Solution

Extend timeout error patterns to include connection/network failures:

**In `errors.ts` (ERROR_PATTERNS.timeout):**
- Text: "connection error", "network error", "fetch failed", etc.
- Regex: /\beconn(?:refused|reset|aborted)\b/i, /\benotfound\b/i, /\beai_again\b/i

**In `failover-error.ts` (TIMEOUT_HINT_RE):**
- Same patterns for non-assistant error paths

## Testing

Added test cases covering:
- "Connection error."
- "fetch failed"
- "network error: ECONNREFUSED"
- "ENOTFOUND" / "EAI_AGAIN" in message text

## Impact

- **Compatibility:** High - only expands retryable error detection
- **Behavior:** Connection failures now trigger automatic fallback
- **Risk:** Low - changes are additive and well-tested

* style: fix code formatting for test file
2026-03-03 02:37:23 +00:00
Peter Steinberger
1bd20dbdb6 fix(failover): treat stop reason error as timeout 2026-03-03 01:05:24 +00:00
Peter Steinberger
a2fdc3415f fix(failover): handle unhandled stop reason error 2026-03-03 01:05:24 +00:00
Saurabh
1ef9a2a8ea fix: handle HTTP 529 (Anthropic overloaded) in failover error classification
Classify Anthropic's 529 status code as "rate_limit" so model fallback
triggers reliably without depending on fragile message-based detection.

Closes #28502
2026-03-02 18:59:10 +00:00
Sid
40e078a567 fix(auth): classify permission_error as auth_permanent for profile fallback (#31324)
When an OAuth auth profile returns HTTP 403 with permission_error
(e.g. expired plan), the error was not matched by the authPermanent
patterns. This caused the profile to receive only a short cooldown
instead of being disabled, so the gateway kept retrying the same
broken profile indefinitely.

Add "permission_error" and "not allowed for this organization" to
the authPermanent error patterns so these errors trigger the longer
billing/auth_permanent disable window and proper profile rotation.

Closes #31306

Made-with: Cursor

Co-authored-by: Vincent Koc <vincentkoc@ieee.org>
2026-03-01 22:26:05 -08:00
Aleksandrs Tihenko
c0026274d9 fix(auth): distinguish revoked API keys from transient auth errors (#25754)
Merged via /review-pr -> /prepare-pr -> /merge-pr.

Prepared head SHA: 8f9c07a200
Co-authored-by: rrenamed <87486610+rrenamed@users.noreply.github.com>
Co-authored-by: gumadeiras <5599352+gumadeiras@users.noreply.github.com>
Reviewed-by: @gumadeiras
2026-02-25 19:47:16 -05:00
taw0002
3c57bf4c85 fix: treat HTTP 502/503/504 as failover-eligible (timeout reason) (#21017)
* fix: treat HTTP 502/503/504 as failover-eligible (timeout reason)

When a model API returns 502 Bad Gateway, 503 Service Unavailable, or
504 Gateway Timeout, the error object carries the status code directly.
resolveFailoverReasonFromError() only checked 402/429/401/403/408/400,
so 5xx server errors fell through to message-based classification which
requires the status code to appear at the start of the error message.

Many API SDKs (Google, Anthropic) set err.status = 503 without prefixing
the message with '503', so the message classifier never matched and
failover never triggered — the run retried the same broken model.

Add 502/503/504 to the status-code branch, returning 'timeout' (matching
the existing behavior of isTransientHttpError in the message classifier).

Fixes #20999

* Changelog: add failover 502/503/504 note with credits

* Failover: classify HTTP 504 as transient in message parser

* Changelog: credit taw0002 and vincentkoc for failover fix

---------

Co-authored-by: Vincent Koc <vincentkoc@ieee.org>
2026-02-23 03:01:57 -05:00
Peter Steinberger
adace58505 test: reclassify local helper suites out of agents e2e 2026-02-22 10:53:40 +00:00
Peter Steinberger
9131b22a28 test: migrate suites to e2e coverage layout 2026-02-13 14:28:22 +00:00
Oren
71b4be8799 fix: handle 400 status in failover to enable model fallback (#1879) 2026-02-08 23:12:06 -08:00
Peter Steinberger
c379191f80 chore: migrate to oxlint and oxfmt
Co-authored-by: Christoph Nakazawa <christoph.pojer@gmail.com>
2026-01-14 15:02:19 +00:00
Peter Steinberger
53ec8e36cb refactor: centralize failover error parsing 2026-01-10 01:26:06 +01:00
Peter Steinberger
402c35b91c refactor(agents): centralize failover normalization 2026-01-09 22:15:06 +01:00