Commit Graph

90 Commits

Author SHA1 Message Date
Mert Başar
029ca8c268 feat(agents): implement state-aware failover and lane suspension
Summary:
- Persist quota-suspension state transitions and reload fresh suspension state before failover handoff injection.
- Restore suspended lanes to configured concurrency and share failover-to-suspension reason mapping across fallback and embedded runner paths.
- Export model.failover diagnostics via OTLP and cover queueing/resume behavior with regressions.

Verification:
- pnpm test src/config/sessions/store.pruning.integration.test.ts src/process/command-queue.test.ts src/agents/session-suspension.test.ts src/agents/model-fallback.test.ts extensions/diagnostics-otel/src/service.test.ts
- git diff --check
- pnpm exec oxfmt --check --threads=1 on changed TypeScript files
- GitHub checks: 92 successful, 0 pending, 0 failed on head 962146be88
- Review threads: none unresolved
2026-05-07 18:34:05 -05:00
Peter Steinberger
330ba1fa31 refactor: move canvas to plugin surfaces 2026-05-07 09:07:18 +01:00
Sally O'Malley
a74894a954 fix(agents): fail fast on session lock fallback (#78633)
Signed-off-by: sallyom <somalley@redhat.com>
2026-05-06 20:22:47 -04:00
Peter Steinberger
d111605453 test: streamline model fallback probe coverage 2026-05-06 01:12:16 +01:00
Peter Steinberger
cb42efb6e6 test: trim slow agent fallback coverage 2026-05-06 00:53:27 +01:00
Peter Steinberger
64b1f5fbf4 test: speed up changed test paths 2026-05-05 19:48:19 +01:00
Peter Steinberger
59fb9e5ca7 refactor: unify lazy import loaders 2026-05-02 10:55:59 +01:00
wenxu007
9df0ae6767 fix(agents,failover): propagate sessionId/lane/provider attribution through FailoverError (#73506)
* fix(agents,failover): propagate sessionId/lane/provider attribution through FailoverError

Adds optional `sessionId` and `lane` fields to `FailoverError` and threads
them — together with the existing `provider`, `model`, `profileId` — through
`describeFailoverError` and `coerceToFailoverError` context, so structured
error log ingestion can attribute exhausted-fallback wrapper errors back
to the originating request instead of dropping the per-profile metadata
when the final wrapper is built.

Fixes #42713.

* fix: preserve failover error attribution

---------

Co-authored-by: Altay <altay@uinaf.dev>
2026-05-01 11:26:56 +03:00
Peter Steinberger
90419df663 [codex] Make external CLI credential discovery explicit (#75209)
* refactor(auth): make external CLI discovery explicit

* test(auth): update external cli discovery mocks

* test(auth): cover scoped external cli auth mocks

* [codex] Make external CLI credential discovery explicit

---------

Co-authored-by: clawsweeper-repair <clawsweeper-repair@users.noreply.github.com>
2026-04-30 20:32:55 +00:00
Peter Steinberger
aec5efed8d fix(agents): resolve model aliases before fallback 2026-04-28 20:39:58 +01:00
Peter Steinberger
ab95812d65 fix: record model fallback steps in trajectories 2026-04-28 05:08:34 +01:00
Peter Steinberger
3da4b28d1b fix(agents): avoid overload classification for live model switches 2026-04-27 12:28:33 +01:00
Vincent Koc
43a003b8a0 fix: short-circuit live model switch fallback redirects (#72375) 2026-04-26 14:45:02 -07:00
Vincent Koc
480a3f66c9 fix: shortcut live session model redirects during fallback 2026-04-26 11:14:05 -07:00
EVA
40be5ad581 fix: harden GPT-5 runtime paths
Co-authored-by: EVA <100yenadmin@users.noreply.github.com>
2026-04-24 08:55:52 +01:00
Peter Steinberger
5b39be3653 fix(agents): preserve raw fallback schema errors 2026-04-23 07:44:39 +01:00
Peter Steinberger
f600e98e5b fix(agents): handle OpenAI web search schema rejects 2026-04-23 06:51:29 +01:00
zhulijin1991
92e864a521 fix(image): respect configured provider for bare image overrides 2026-04-21 04:20:22 +01:00
Vincent Koc
93ce76afe3 perf(agents): use lightweight model fallback selection helpers 2026-04-13 18:12:09 +01:00
Vincent Koc
da3977e681 perf(agents): narrow failover helper imports 2026-04-13 17:21:21 +01:00
Vincent Koc
95517edaeb perf(agents): keep model fallback auth runtime cold 2026-04-13 16:50:30 +01:00
Vincent Koc
bfc77b0f45 perf(agents): keep fallback auth store cold without sources 2026-04-13 15:58:35 +01:00
Vincent Koc
74e7b8d47b fix(cycles): bulk extract leaf type surfaces 2026-04-11 13:26:50 +01:00
Peter Steinberger
dcc3392a1a refactor: remove redundant model fallback conversions 2026-04-10 22:24:45 +01:00
Peter Steinberger
0ebeee8b0d chore: enable consistent-return 2026-04-10 20:56:43 +01:00
Neerav Makwana
75deed54f3 Agents: allow cooldown probe for timeout failover reason 2026-04-10 13:52:37 +05:30
Peter Steinberger
65ea8c60f3 refactor: dedupe agent trimmed readers 2026-04-08 00:09:41 +01:00
Peter Steinberger
a4253deb67 refactor: dedupe agent error formatting 2026-04-07 02:03:34 +01:00
Peter Steinberger
51c6b1c2bc fix: default fallback decision log targets 2026-04-06 20:57:10 +01:00
Peter Steinberger
e3d6209599 refactor: dedupe model fallback failure tracking 2026-04-06 20:45:32 +01:00
Shakker
11dbcdc46d refactor: narrow model fallback auth imports 2026-04-03 16:03:10 +01:00
Shakker
fc8ab82aab refactor: trim cron session startup imports 2026-04-03 16:03:10 +01:00
Han Yang
547154865b Fix: live session model switch no longer blocks failover (Resolves #58466) (#58589)
* fix: prevent infinite retry loop when live session model switch blocks failover (#58466)

* fix: remove unused resolveOllamaBaseUrlForRun import after rebase
2026-03-31 21:09:41 -04:00
kiranvk2011
84401223c7 fix: per-model cooldown scope, stepped backoff, and user-facing rate-limit message (#49834)
Merged via squash.

Prepared head SHA: 7c488c070c
Co-authored-by: kiranvk-2011 <91108465+kiranvk-2011@users.noreply.github.com>
Co-authored-by: altaywtf <9790196+altaywtf@users.noreply.github.com>
Reviewed-by: @altaywtf
2026-03-25 22:03:49 +03:00
Peter Steinberger
80cd8cd6be refactor: unify minimax model and failover live policies 2026-03-23 00:02:35 -07:00
Catalin Lupuleti
dac220bd88 fix(agents): normalize abort-wrapped RESOURCE_EXHAUSTED into failover errors (#11972) 2026-03-13 23:25:04 -07:00
VibhorGautam
4473242b4f fix: use unknown instead of rate_limit as default cooldown reason (#42911)
Merged via squash.

Prepared head SHA: bebf6704d7
Co-authored-by: VibhorGautam <55019395+VibhorGautam@users.noreply.github.com>
Co-authored-by: altaywtf <9790196+altaywtf@users.noreply.github.com>
Reviewed-by: @altaywtf
2026-03-11 21:34:14 +03:00
Charles Dusek
048e25c2b2 fix(agents): avoid duplicate same-provider cooldown probes in fallback runs (#41711)
Merged via squash.

Prepared head SHA: 8be8967bcb
Co-authored-by: cgdusek <38732970+cgdusek@users.noreply.github.com>
Co-authored-by: altaywtf <9790196+altaywtf@users.noreply.github.com>
Reviewed-by: @altaywtf
2026-03-10 15:26:47 +03:00
Altay
531e8362b1 Agents: add fallback error observations (#41337)
Merged via squash.

Prepared head SHA: 852469c82f
Co-authored-by: altaywtf <9790196+altaywtf@users.noreply.github.com>
Co-authored-by: altaywtf <9790196+altaywtf@users.noreply.github.com>
Reviewed-by: @altaywtf
2026-03-10 01:12:10 +03:00
Altay
0669b0ddc2 fix(agents): probe single-provider billing cooldowns (#41422)
Merged via squash.

Prepared head SHA: bbc4254b94
Co-authored-by: altaywtf <9790196+altaywtf@users.noreply.github.com>
Co-authored-by: altaywtf <9790196+altaywtf@users.noreply.github.com>
Reviewed-by: @altaywtf
2026-03-10 00:58:51 +03:00
Peter Lee
92648f9ba9 fix(agents): broaden 402 temporary-limit detection and allow billing cooldown probe (#38533)
Merged via squash.

Prepared head SHA: 282b9186c6
Co-authored-by: xialonglee <22994703+xialonglee@users.noreply.github.com>
Co-authored-by: altaywtf <9790196+altaywtf@users.noreply.github.com>
Reviewed-by: @altaywtf
2026-03-08 10:27:01 +03:00
Peter Steinberger
e83094e63f fix(agents): warn clearly on unresolved model ids (#39215, thanks @ademczuk)
Co-authored-by: ademczuk <andrew.demczuk@gmail.com>
2026-03-07 22:50:27 +00:00
Altay
6e962d8b9e fix(agents): handle overloaded failover separately (#38301)
* fix(agents): skip auth-profile failure on overload

* fix(agents): note overload auth-profile fallback fix

* fix(agents): classify overloaded failures separately

* fix(agents): back off before overload failover

* fix(agents): tighten overload probe and backoff state

* fix(agents): persist overloaded cooldown across runs

* fix(agents): tighten overloaded status handling

* test(agents): add overload regression coverage

* fix(agents): restore runner imports after rebase

* test(agents): add overload fallback integration coverage

* fix(agents): harden overloaded failover abort handling

* test(agents): tighten overload classifier coverage

* test(agents): cover all-overloaded fallback exhaustion

* fix(cron): retry overloaded fallback summaries

* fix(cron): treat HTTP 529 as overloaded retry
2026-03-07 01:42:11 +03:00
Vignesh Natarajan
d45353f95b fix(agents): honor explicit rate-limit cooldown probes in fallback runs 2026-03-05 20:03:06 -08:00
Peter Steinberger
ab8b8dae70 refactor(agents): dedupe model and tool test helpers 2026-03-02 21:31:36 +00:00
Ramez
acbb93be48 fix(agents): comprehensive quota fallback fixes - session overrides + surgical cooldown logic (#23816)
Merged via /review-pr -> /prepare-pr -> /merge-pr.

Prepared head SHA: e6f2b4742b
Co-authored-by: ramezgaberiel <844893+ramezgaberiel@users.noreply.github.com>
Co-authored-by: gumadeiras <5599352+gumadeiras@users.noreply.github.com>
Reviewed-by: @gumadeiras
2026-02-25 20:35:40 -05:00
Sid
156f13aa64 fix(agents): continue fallback loop for unrecognized provider errors (#26106)
* fix(agents): continue fallback loop for unrecognized provider errors

When a provider returns an error that coerceToFailoverError cannot
classify (e.g., custom error messages without standard HTTP status
codes), the fallback loop threw immediately instead of trying the
next candidate. This caused fallback to stop after 2 models even
when 17 were configured.

Only rethrow unrecognized errors when they occur on the last
candidate. For intermediate candidates, record the error as an
attempt and continue to the next model.

Closes #25926

Co-authored-by: Cursor <cursoragent@cursor.com>

* test: cover unknown-error fallback telemetry and land #26106 (thanks @Sid-Qin)

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Peter Steinberger <steipete@gmail.com>
2026-02-25 04:53:26 +00:00
Peter Steinberger
9beec48e9c refactor(agents): centralize model fallback resolution 2026-02-25 04:32:31 +00:00
Peter Steinberger
d2597d5ecf fix(agents): harden model fallback failover paths 2026-02-25 03:46:34 +00:00
Peter Steinberger
bf5a96ad63 fix(agents): keep fallback chain reachable on configured fallback models (#25922) 2026-02-25 01:46:20 +00:00