Mert Başar
029ca8c268
feat(agents): implement state-aware failover and lane suspension
Summary:
- Persist quota-suspension state transitions and reload fresh suspension state before failover handoff injection.
- Restore suspended lanes to configured concurrency and share failover-to-suspension reason mapping across fallback and embedded runner paths.
- Export model.failover diagnostics via OTLP and cover queueing/resume behavior with regressions.
Verification:
- pnpm test src/config/sessions/store.pruning.integration.test.ts src/process/command-queue.test.ts src/agents/session-suspension.test.ts src/agents/model-fallback.test.ts extensions/diagnostics-otel/src/service.test.ts
- git diff --check
- pnpm exec oxfmt --check --threads=1 on changed TypeScript files
- GitHub checks: 92 successful, 0 pending, 0 failed on head 962146be88
- Review threads: none unresolved
2026-05-07 18:34:05 -05:00
Peter Steinberger
330ba1fa31
refactor: move canvas to plugin surfaces
2026-05-07 09:07:18 +01:00
Sally O'Malley
a74894a954
fix(agents): fail fast on session lock fallback (#78633)
Signed-off-by: sallyom <somalley@redhat.com>
2026-05-06 20:22:47 -04:00
Peter Steinberger
d111605453
test: streamline model fallback probe coverage
2026-05-06 01:12:16 +01:00
Peter Steinberger
cb42efb6e6
test: trim slow agent fallback coverage
2026-05-06 00:53:27 +01:00
Peter Steinberger
64b1f5fbf4
test: speed up changed test paths
2026-05-05 19:48:19 +01:00
Peter Steinberger
59fb9e5ca7
refactor: unify lazy import loaders
2026-05-02 10:55:59 +01:00
wenxu007
9df0ae6767
fix(agents,failover): propagate sessionId/lane/provider attribution through FailoverError (#73506)
* fix(agents,failover): propagate sessionId/lane/provider attribution through FailoverError
Adds optional `sessionId` and `lane` fields to `FailoverError` and threads
them — together with the existing `provider`, `model`, `profileId` — through
`describeFailoverError` and `coerceToFailoverError` context, so structured
error log ingestion can attribute exhausted-fallback wrapper errors back
to the originating request instead of dropping the per-profile metadata
when the final wrapper is built.
Fixes #42713.
* fix: preserve failover error attribution
---------
Co-authored-by: Altay <altay@uinaf.dev>
2026-05-01 11:26:56 +03:00
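The attribution change described in the commit above (9df0ae6767) can be sketched roughly as follows. This is a minimal illustration, not the repository's code: only the names `FailoverError`, `describeFailoverError`, `coerceToFailoverError`, and the five field names come from the commit message. The `FailoverContext` type, all signatures, and the merge rule (fields already on the error win over context) are assumptions.

```typescript
// Hypothetical shapes; only the identifier names are taken from the commit message.
type FailoverContext = {
  provider?: string;
  model?: string;
  profileId?: string;
  sessionId?: string; // described as newly added and optional
  lane?: string; // described as newly added and optional
};

class FailoverError extends Error {
  readonly provider?: string;
  readonly model?: string;
  readonly profileId?: string;
  readonly sessionId?: string;
  readonly lane?: string;

  constructor(message: string, context: FailoverContext = {}) {
    super(message);
    Object.setPrototypeOf(this, new.target.prototype);
    this.name = "FailoverError";
    this.provider = context.provider;
    this.model = context.model;
    this.profileId = context.profileId;
    this.sessionId = context.sessionId;
    this.lane = context.lane;
  }
}

// Wrap any thrown value and attach attribution from the caller's context.
// Assumed rule: metadata already on the error is kept; context fills the gaps.
function coerceToFailoverError(err: unknown, context: FailoverContext = {}): FailoverError {
  if (err instanceof FailoverError) {
    return new FailoverError(err.message, {
      provider: err.provider ?? context.provider,
      model: err.model ?? context.model,
      profileId: err.profileId ?? context.profileId,
      sessionId: err.sessionId ?? context.sessionId,
      lane: err.lane ?? context.lane,
    });
  }
  const message = err instanceof Error ? err.message : String(err);
  return new FailoverError(message, context);
}

// Flatten the error into a record for structured logs, omitting unset fields.
function describeFailoverError(err: FailoverError): Record<string, string> {
  const entries: Array<[string, string | undefined]> = [
    ["message", err.message],
    ["provider", err.provider],
    ["model", err.model],
    ["profileId", err.profileId],
    ["sessionId", err.sessionId],
    ["lane", err.lane],
  ];
  const out: Record<string, string> = {};
  for (const [key, value] of entries) {
    if (value !== undefined) out[key] = value;
  }
  return out;
}
```

Under these assumptions, re-wrapping an already attributed error at the end of the fallback chain keeps its `sessionId` and `lane`, so the final wrapper can still be traced to the originating request.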
Peter Steinberger
90419df663
[codex] Make external CLI credential discovery explicit (#75209)
* refactor(auth): make external CLI discovery explicit
* test(auth): update external cli discovery mocks
* test(auth): cover scoped external cli auth mocks
* [codex] Make external CLI credential discovery explicit
---------
Co-authored-by: clawsweeper-repair <clawsweeper-repair@users.noreply.github.com>
2026-04-30 20:32:55 +00:00
Peter Steinberger
aec5efed8d
fix(agents): resolve model aliases before fallback
2026-04-28 20:39:58 +01:00
Peter Steinberger
ab95812d65
fix: record model fallback steps in trajectories
2026-04-28 05:08:34 +01:00
Peter Steinberger
3da4b28d1b
fix(agents): avoid overload classification for live model switches
2026-04-27 12:28:33 +01:00
Vincent Koc
43a003b8a0
fix: short-circuit live model switch fallback redirects (#72375)
2026-04-26 14:45:02 -07:00
Vincent Koc
480a3f66c9
fix: shortcut live session model redirects during fallback
2026-04-26 11:14:05 -07:00
EVA
40be5ad581
fix: harden GPT-5 runtime paths
Co-authored-by: EVA <100yenadmin@users.noreply.github.com>
2026-04-24 08:55:52 +01:00
Peter Steinberger
5b39be3653
fix(agents): preserve raw fallback schema errors
2026-04-23 07:44:39 +01:00
Peter Steinberger
f600e98e5b
fix(agents): handle OpenAI web search schema rejects
2026-04-23 06:51:29 +01:00
zhulijin1991
92e864a521
fix(image): respect configured provider for bare image overrides
2026-04-21 04:20:22 +01:00
Vincent Koc
93ce76afe3
perf(agents): use lightweight model fallback selection helpers
2026-04-13 18:12:09 +01:00
Vincent Koc
da3977e681
perf(agents): narrow failover helper imports
2026-04-13 17:21:21 +01:00
Vincent Koc
95517edaeb
perf(agents): keep model fallback auth runtime cold
2026-04-13 16:50:30 +01:00
Vincent Koc
bfc77b0f45
perf(agents): keep fallback auth store cold without sources
2026-04-13 15:58:35 +01:00
Vincent Koc
74e7b8d47b
fix(cycles): bulk extract leaf type surfaces
2026-04-11 13:26:50 +01:00
Peter Steinberger
dcc3392a1a
refactor: remove redundant model fallback conversions
2026-04-10 22:24:45 +01:00
Peter Steinberger
0ebeee8b0d
chore: enable consistent-return
2026-04-10 20:56:43 +01:00
Neerav Makwana
75deed54f3
Agents: allow cooldown probe for timeout failover reason
2026-04-10 13:52:37 +05:30
Peter Steinberger
65ea8c60f3
refactor: dedupe agent trimmed readers
2026-04-08 00:09:41 +01:00
Peter Steinberger
a4253deb67
refactor: dedupe agent error formatting
2026-04-07 02:03:34 +01:00
Peter Steinberger
51c6b1c2bc
fix: default fallback decision log targets
2026-04-06 20:57:10 +01:00
Peter Steinberger
e3d6209599
refactor: dedupe model fallback failure tracking
2026-04-06 20:45:32 +01:00
Shakker
11dbcdc46d
refactor: narrow model fallback auth imports
2026-04-03 16:03:10 +01:00
Shakker
fc8ab82aab
refactor: trim cron session startup imports
2026-04-03 16:03:10 +01:00
Han Yang
547154865b
Fix: live session model switch no longer blocks failover (Resolves #58466) (#58589)
* fix: prevent infinite retry loop when live session model switch blocks failover (#58466)
* fix: remove unused resolveOllamaBaseUrlForRun import after rebase
2026-03-31 21:09:41 -04:00
kiranvk2011
84401223c7
fix: per-model cooldown scope, stepped backoff, and user-facing rate-limit message (#49834)
Merged via squash.
Prepared head SHA: 7c488c070c
Co-authored-by: kiranvk-2011 <91108465+kiranvk-2011@users.noreply.github.com>
Co-authored-by: altaywtf <9790196+altaywtf@users.noreply.github.com>
Reviewed-by: @altaywtf
2026-03-25 22:03:49 +03:00
Peter Steinberger
80cd8cd6be
refactor: unify minimax model and failover live policies
2026-03-23 00:02:35 -07:00
Catalin Lupuleti
dac220bd88
fix(agents): normalize abort-wrapped RESOURCE_EXHAUSTED into failover errors (#11972)
2026-03-13 23:25:04 -07:00
VibhorGautam
4473242b4f
fix: use unknown instead of rate_limit as default cooldown reason (#42911)
Merged via squash.
Prepared head SHA: bebf6704d7
Co-authored-by: VibhorGautam <55019395+VibhorGautam@users.noreply.github.com>
Co-authored-by: altaywtf <9790196+altaywtf@users.noreply.github.com>
Reviewed-by: @altaywtf
2026-03-11 21:34:14 +03:00
Charles Dusek
048e25c2b2
fix(agents): avoid duplicate same-provider cooldown probes in fallback runs (#41711)
Merged via squash.
Prepared head SHA: 8be8967bcb
Co-authored-by: cgdusek <38732970+cgdusek@users.noreply.github.com>
Co-authored-by: altaywtf <9790196+altaywtf@users.noreply.github.com>
Reviewed-by: @altaywtf
2026-03-10 15:26:47 +03:00
Altay
531e8362b1
Agents: add fallback error observations (#41337)
Merged via squash.
Prepared head SHA: 852469c82f
Co-authored-by: altaywtf <9790196+altaywtf@users.noreply.github.com>
Reviewed-by: @altaywtf
2026-03-10 01:12:10 +03:00
Altay
0669b0ddc2
fix(agents): probe single-provider billing cooldowns (#41422)
Merged via squash.
Prepared head SHA: bbc4254b94
Co-authored-by: altaywtf <9790196+altaywtf@users.noreply.github.com>
Reviewed-by: @altaywtf
2026-03-10 00:58:51 +03:00
Peter Lee
92648f9ba9
fix(agents): broaden 402 temporary-limit detection and allow billing cooldown probe (#38533)
Merged via squash.
Prepared head SHA: 282b9186c6
Co-authored-by: xialonglee <22994703+xialonglee@users.noreply.github.com>
Co-authored-by: altaywtf <9790196+altaywtf@users.noreply.github.com>
Reviewed-by: @altaywtf
2026-03-08 10:27:01 +03:00
Peter Steinberger
e83094e63f
fix(agents): warn clearly on unresolved model ids (#39215, thanks @ademczuk)
Co-authored-by: ademczuk <andrew.demczuk@gmail.com>
2026-03-07 22:50:27 +00:00
Altay
6e962d8b9e
fix(agents): handle overloaded failover separately (#38301)
* fix(agents): skip auth-profile failure on overload
* fix(agents): note overload auth-profile fallback fix
* fix(agents): classify overloaded failures separately
* fix(agents): back off before overload failover
* fix(agents): tighten overload probe and backoff state
* fix(agents): persist overloaded cooldown across runs
* fix(agents): tighten overloaded status handling
* test(agents): add overload regression coverage
* fix(agents): restore runner imports after rebase
* test(agents): add overload fallback integration coverage
* fix(agents): harden overloaded failover abort handling
* test(agents): tighten overload classifier coverage
* test(agents): cover all-overloaded fallback exhaustion
* fix(cron): retry overloaded fallback summaries
* fix(cron): treat HTTP 529 as overloaded retry
2026-03-07 01:42:11 +03:00
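The overload handling listed in the commit above (6e962d8b9e) can be pictured with a small classifier. This is a sketch under stated assumptions, not the project's implementation. The commit bullets say only that overloaded failures are classified separately, that HTTP 529 is treated as overloaded, and that overload skips the auth-profile failure. The function names, the `FailoverReason` union, the 429 mapping, and the `"unknown"` fallback are hypothetical.

```typescript
// Hypothetical reason set; the real classifier likely knows more reasons.
type FailoverReason = "overloaded" | "rate_limit" | "unknown";

// Map an HTTP status to a failover reason.
// 529 -> overloaded follows the commit bullets; the 429 mapping is an assumption.
function classifyHttpStatus(status: number | undefined): FailoverReason {
  if (status === 529) return "overloaded";
  if (status === 429) return "rate_limit";
  return "unknown";
}

// An overloaded provider says nothing about the credentials in use, so the
// failure is not recorded against the auth profile. Counting every other
// reason against the profile is an assumption made for this sketch.
function countsAgainstAuthProfile(reason: FailoverReason): boolean {
  return reason !== "overloaded";
}
```

Keeping "overloaded" as its own reason lets a caller back off and fail over without penalizing an auth profile that is still valid.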
Vignesh Natarajan
d45353f95b
fix(agents): honor explicit rate-limit cooldown probes in fallback runs
2026-03-05 20:03:06 -08:00
Peter Steinberger
ab8b8dae70
refactor(agents): dedupe model and tool test helpers
2026-03-02 21:31:36 +00:00
Ramez
acbb93be48
fix(agents): comprehensive quota fallback fixes - session overrides + surgical cooldown logic (#23816)
Merged via /review-pr -> /prepare-pr -> /merge-pr.
Prepared head SHA: e6f2b4742b
Co-authored-by: ramezgaberiel <844893+ramezgaberiel@users.noreply.github.com>
Co-authored-by: gumadeiras <5599352+gumadeiras@users.noreply.github.com>
Reviewed-by: @gumadeiras
2026-02-25 20:35:40 -05:00
Sid
156f13aa64
fix(agents): continue fallback loop for unrecognized provider errors (#26106)
* fix(agents): continue fallback loop for unrecognized provider errors
When a provider returns an error that coerceToFailoverError cannot
classify (e.g., custom error messages without standard HTTP status
codes), the fallback loop threw immediately instead of trying the
next candidate. This caused fallback to stop after 2 models even
when 17 were configured.
Only rethrow unrecognized errors when they occur on the last
candidate. For intermediate candidates, record the error as an
attempt and continue to the next model.
Closes #25926
Co-authored-by: Cursor <cursoragent@cursor.com>
* test: cover unknown-error fallback telemetry and land #26106 (thanks @Sid-Qin)
---------
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Peter Steinberger <steipete@gmail.com>
2026-02-25 04:53:26 +00:00
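The loop behavior described in the commit above (156f13aa64) can be sketched as follows: an unrecognized error is rethrown only when it occurs on the last candidate, and otherwise it is recorded as an attempt and the loop moves on. This is a minimal sketch, not the repository's code. `runWithFallback`, `Attempt`, and `isRecognizedFailover` are hypothetical names, the regex classifier is a stand-in for whatever `coerceToFailoverError` recognizes, and the loop is synchronous only to stay short.

```typescript
type Attempt = { model: string; error: string; recognized: boolean };

// Hypothetical stand-in for error classification.
function isRecognizedFailover(err: unknown): boolean {
  return err instanceof Error && /rate limit|overloaded|timeout/i.test(err.message);
}

function runWithFallback<T>(
  candidates: string[],
  run: (model: string) => T,
): { result: T; model: string; attempts: Attempt[] } {
  const attempts: Attempt[] = [];
  for (const [index, model] of candidates.entries()) {
    try {
      return { result: run(model), model, attempts };
    } catch (err) {
      const recognized = isRecognizedFailover(err);
      const isLast = index === candidates.length - 1;
      // Only an unrecognized error on the final candidate is rethrown as-is.
      if (!recognized && isLast) throw err;
      // Otherwise record the failure and try the next candidate.
      attempts.push({
        model,
        error: err instanceof Error ? err.message : String(err),
        recognized,
      });
    }
  }
  // Reached when the list is empty or the last failure was a recognized one.
  throw new Error(`All ${candidates.length} fallback candidates failed`);
}
```

With this shape, a provider that returns a custom error message no longer cuts the chain short; every configured candidate gets a turn before the caller sees a failure.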
Peter Steinberger
9beec48e9c
refactor(agents): centralize model fallback resolution
2026-02-25 04:32:31 +00:00
Peter Steinberger
d2597d5ecf
fix(agents): harden model fallback failover paths
2026-02-25 03:46:34 +00:00
Peter Steinberger
bf5a96ad63
fix(agents): keep fallback chain reachable on configured fallback models (#25922)
2026-02-25 01:46:20 +00:00