Commit Graph

74 Commits

Author SHA1 Message Date
Altay
6e962d8b9e fix(agents): handle overloaded failover separately (#38301)
* fix(agents): skip auth-profile failure on overload

* fix(agents): note overload auth-profile fallback fix

* fix(agents): classify overloaded failures separately

* fix(agents): back off before overload failover

* fix(agents): tighten overload probe and backoff state

* fix(agents): persist overloaded cooldown across runs

* fix(agents): tighten overloaded status handling

* test(agents): add overload regression coverage

* fix(agents): restore runner imports after rebase

* test(agents): add overload fallback integration coverage

* fix(agents): harden overloaded failover abort handling

* test(agents): tighten overload classifier coverage

* test(agents): cover all-overloaded fallback exhaustion

* fix(cron): retry overloaded fallback summaries

* fix(cron): treat HTTP 529 as overloaded retry
2026-03-07 01:42:11 +03:00
Tak Hoffman
cc5dad81bc cron: unify stale-run recovery and preserve manual-run every anchors (#35363)
* cron: unify stale-run recovery and preserve manual every anchors

* cron: address unresolved review threads on recovery paths

* cron: remove duplicate timestamp helper after rebase
2026-03-04 22:12:32 -06:00
Tak Hoffman
28dc2e8a40 cron: narrow startup replay backoff guard (#35391) 2026-03-04 22:11:11 -06:00
Tak Hoffman
79d00ae398 fix(cron): stabilize restart catch-up replay semantics (#35351)
* Cron: stabilize restart catch-up replay semantics

* Cron: respect backoff in startup missed-run replay
2026-03-04 21:50:16 -06:00
Peter Steinberger
f9cbcfca0d refactor: modularize slack/config/cron/daemon internals 2026-03-02 22:30:21 +00:00
scoootscooob
a3c5d21b4d fix(cron): suppress HEARTBEAT_OK summary from leaking into main session (#32013)
When an isolated cron agent returns HEARTBEAT_OK (nothing to announce),
the direct delivery is correctly skipped, but the fallback path in
timer.ts still enqueues the summary as a system event to the main
session. Filter out heartbeat-only summaries using isCronSystemEvent
before enqueuing, so internal ack tokens never reach user conversations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 19:51:51 +00:00
scoootscooob
4030de6c73 fix(cron): move session reaper to finally block so it runs reliably (#31996)
* fix(cron): move session reaper to finally block so it runs reliably

The cron session reaper was placed inside the try block of onTimer(),
after job execution and state updates. If the locked persist section
threw, the reaper was skipped — causing isolated cron run sessions to
accumulate indefinitely in sessions.json.

Move the reaper into the finally block so it always executes after a
timer tick, regardless of whether job execution succeeded. The reaper
is already self-throttled (MIN_SWEEP_INTERVAL_MS = 5 min) so calling
it more reliably has no performance impact.

Closes #31946

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: strengthen cron reaper failure-path coverage and changelog (#31996) (thanks @scoootscooob)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Peter Steinberger <steipete@gmail.com>
2026-03-02 18:38:59 +00:00
Evgeny Zislis
4b4ea5df8b feat(cron): add failure destination support to failed cron jobs (#31059)
* feat(cron): add failure destination support with webhook mode and bestEffort handling

Extends PR #24789 failure alerts with features from PR #29145:
- Add webhook delivery mode for failure alerts (mode: 'webhook')
- Add accountId support for multi-account channel configurations
- Add bestEffort handling to skip alerts when job has bestEffort=true
- Add separate failureDestination config (global + per-job in delivery)
- Add duplicate prevention (prevents sending to same as primary delivery)
- Add CLI flags: --failure-alert-mode, --failure-alert-account-id
- Add UI fields for new options in web cron editor

* fix(cron): merge failureAlert mode/accountId and preserve failureDestination on updates

- Fix mergeCronFailureAlert to merge mode and accountId fields
- Fix mergeCronDelivery to preserve failureDestination on updates
- Fix isSameDeliveryTarget to use 'announce' as default instead of 'none'
  to properly detect duplicates when delivery.mode is undefined

* fix(cron): validate webhook mode requires URL in resolveFailureDestination

When mode is 'webhook' but no 'to' URL is provided, return null
instead of creating an invalid plan that silently fails later.

* fix(cron): fail closed on webhook mode without URL and make failureDestination fields clearable

- sendCronFailureAlert: fail closed when mode is webhook but URL is missing
- mergeCronDelivery: use per-key presence checks so callers can clear
  nested failureDestination fields via cron.update

Note: protocol:check shows missing internalEvents in Swift models - this is
a pre-existing issue unrelated to these changes (upstream sync needed).

* fix(cron): use separate schema for failureDestination and fix type cast

- Create CronFailureDestinationSchema excluding after/cooldownMs fields
- Fix type cast in sendFailureNotificationAnnounce to use CronMessageChannel

* fix(cron): merge global failureDestination with partial job overrides

When job has partial failureDestination config, fall back to global
config for unset fields instead of treating it as a full override.

* fix(cron): avoid forcing announce mode and clear inherited to on mode change

- UI: only include mode in patch if explicitly set to non-default
- delivery.ts: clear inherited 'to' when job overrides mode, since URL
  semantics differ between announce and webhook modes

* fix(cron): preserve explicit to on mode override and always include mode in UI patches

- delivery.ts: preserve job-level explicit 'to' when overriding mode
- UI: always include mode in failureAlert patch so users can switch between announce/webhook

* fix(cron): allow clearing accountId and treat undefined global mode as announce

- UI: always include accountId in patch so users can clear it
- delivery.ts: treat undefined global mode as announce when comparing for clearing inherited 'to'

* Cron: harden failure destination routing and add regression coverage

* Cron: resolve failure destination review feedback

* Cron: drop unrelated timeout assertions from conflict resolution

* Cron: format cron CLI regression test

* Cron: align gateway cron test mock types

---------

Co-authored-by: Tak Hoffman <781889+Takhoffman@users.noreply.github.com>
2026-03-02 09:27:41 -06:00
Kate Chapman
6df8bd9741 fix(cron): wrap computeJobNextRunAtMs in try-catch inside applyJobResult (#30905)
* fix(cron): wrap computeJobNextRunAtMs in try-catch inside applyJobResult

Without this guard, if the croner library throws during schedule
computation (timezone/expression edge cases), the exception propagates
out of applyJobResult and the entire state update is lost — runningAtMs
never clears, lastRunAtMs never advances, nextRunAtMs never recomputes.
After STUCK_RUN_MS (2h), stuck detection clears runningAtMs and the job
re-fires, creating a ~2h repeat cycle instead of the intended schedule.

The sibling function recomputeJobNextRunAtMs in jobs.ts already wraps
computeJobNextRunAtMs in try-catch; this was an oversight in the
applyJobResult call sites.

Changes:
- Error-backoff path: catch and fall back to backoff-only schedule
- Success path: catch and fall through to the MIN_REFIRE_GAP_MS safety net
- applyOutcomeToStoredJob: log a warning when job not found after forceReload

* fix(cron): use recordScheduleComputeError in applyJobResult catch blocks

Address review feedback: the original catch blocks only logged a warning,
which meant a persistent computeJobNextRunAtMs throw would cause a
MIN_REFIRE_GAP_MS (2s) hot loop on cron-kind jobs.

Now both catch blocks call recordScheduleComputeError (exported from
jobs.ts), which tracks consecutive schedule errors and auto-disables the
job after 3 failures — matching the existing behavior in
recomputeJobNextRunAtMs.

* test(cron): cover applyJobResult schedule-throw fallback paths

---------

Co-authored-by: Tak Hoffman <781889+Takhoffman@users.noreply.github.com>
2026-03-02 07:38:33 -06:00
FlamesCN
aaa7de45fa fix(cron): prevent armTimer tight loop when job has stuck runningAtMs (openclaw#29853) thanks @FlamesCN
Verified:
- pnpm build
- pnpm check
- pnpm test:macmini

Co-authored-by: FlamesCN <12966659+FlamesCN@users.noreply.github.com>
Co-authored-by: Tak Hoffman <781889+Takhoffman@users.noreply.github.com>
2026-03-01 19:58:58 -06:00
0xbrak
4637b90c07 feat(cron): configurable failure alerts for repeated job errors (openclaw#24789) thanks @0xbrak
Verified:
- pnpm install --frozen-lockfile
- pnpm check
- pnpm test -- --run src/cron/service.failure-alert.test.ts src/cli/cron-cli.test.ts src/gateway/protocol/cron-validators.test.ts

Co-authored-by: 0xbrak <181251288+0xbrak@users.noreply.github.com>
Co-authored-by: Tak Hoffman <781889+Takhoffman@users.noreply.github.com>
2026-03-01 08:18:15 -06:00
NIO
ea3955cd78 fix(cron): add retry policy for one-shot jobs on transient errors (#24355) (openclaw#24435) thanks @hugenshen
Verified:
- pnpm install --frozen-lockfile
- pnpm check
- pnpm test -- --run src/cron/service.issue-regressions.test.ts src/config/config-misc.test.ts

Co-authored-by: hugenshen <16300669+hugenshen@users.noreply.github.com>
Co-authored-by: Tak Hoffman <781889+Takhoffman@users.noreply.github.com>
2026-03-01 06:58:03 -06:00
Vignesh Natarajan
2050fd7539 Cron: preserve session scope for main-target reminders 2026-02-28 14:53:19 -08:00
Marcus Widing
8ae1987f2a fix(cron): pass heartbeat target=last for main-session cron jobs (#28508) (#28583)
* fix(cron): pass heartbeat target=last for main-session cron jobs

When a cron job with sessionTarget=main and wakeMode=now fires, it
triggers a heartbeat via runHeartbeatOnce. Since e2362d35 changed the
default heartbeat target from "last" to "none", these cron-triggered
heartbeats silently discard their responses instead of delivering them
to the last active channel (e.g. Telegram).

Fix: pass heartbeat: { target: "last" } from the cron timer to
runHeartbeatOnce for main-session jobs, and wire the override through
the gateway cron service builder. This restores delivery for
sessionTarget=main cron jobs without reverting the intentional default
change for regular heartbeats.

Regression introduced in: e2362d35 (2026-02-25)

Fixes #28508

* Cron: align server-cron wake routing expectations for main-target jobs

---------

Co-authored-by: Tak Hoffman <781889+Takhoffman@users.noreply.github.com>
2026-02-28 11:14:24 -06:00
Sid
fe9a7c4082 fix(cron): force main-target system events onto main session (#28898)
Ignore persisted sessionKey overrides for sessionTarget=main jobs so cron system events consistently route to the agent main session after upgrades.

Closes #28770
2026-02-28 11:08:53 -06:00
Pierre
e1c8094ad0 fix: schedule nextWakeAtMs for isolated sessionTarget cron jobs (#19541)
* fix(cron): repair isolated next wake scheduling

* cron: harden isolated next-wake timestamp guards

---------

Co-authored-by: Tak Hoffman <781889+Takhoffman@users.noreply.github.com>
2026-02-28 10:48:31 -06:00
Peter Steinberger
b402770f63 refactor(reply): split abort cutoff and timeout policy modules 2026-02-26 14:00:35 +01:00
Peter Steinberger
c397a02c9a fix(queue): harden drain/abort/timeout race handling
- reject new lane enqueues once gateway drain begins
- always reset lane draining state and isolate onWait callback failures
- persist per-session abort cutoff and skip stale queued messages
- avoid false 600s agentTurn timeout in isolated cron jobs

Fixes #27407
Fixes #27332
Fixes #27427

Co-authored-by: Kevin Shenghui <shenghuikevin@github.com>
Co-authored-by: zjmy <zhangjunmengyang@gmail.com>
Co-authored-by: suko <miha.sukic@gmail.com>
2026-02-26 13:43:39 +01:00
Peter Steinberger
b37dc42240 fix(cron): suppress fallback summary after attempted announce delivery 2026-02-26 03:09:14 +00:00
Tak Hoffman
a54dc7fe80 Cron: suppress fallback main summary for delivery-target errors (openclaw#24074) thanks @Takhoffman
Verified:
- pnpm build
- pnpm check
- pnpm test:macmini

Co-authored-by: Takhoffman <781889+Takhoffman@users.noreply.github.com>
Co-authored-by: Tak Hoffman <781889+Takhoffman@users.noreply.github.com>
2026-02-22 20:24:08 -06:00
Tak Hoffman
3efe63d1ad Cron: respect aborts in main wake-now retries (#23967)
* Cron: respect aborts in main wake-now retries

* Changelog: add main-session cron abort retry fix note

* Cron tests: format post-rebase conflict resolution
2026-02-22 17:19:27 -06:00
Tak Hoffman
73e5bb7635 Cron: apply timeout to startup catch-up runs (#23966)
* Cron: apply timeout to startup catch-up runs

* Changelog: add cron startup timeout catch-up note
2026-02-22 17:04:30 -06:00
Tak Hoffman
556af3f08b fix(cron): cancel timed-out runs before side effects (openclaw#22411) thanks @Takhoffman
Verified:
- pnpm check
- pnpm vitest run src/memory/qmd-manager.test.ts src/cron/service.issue-regressions.test.ts src/cron/isolated-agent.delivers-response-has-heartbeat-ok-but-includes.test.ts --maxWorkers=1

Co-authored-by: Takhoffman <781889+Takhoffman@users.noreply.github.com>
Co-authored-by: Tak Hoffman <781889+Takhoffman@users.noreply.github.com>
2026-02-22 15:45:27 -06:00
Peter Steinberger
5d90e31807 refactor(cron): share timed job-execution helper 2026-02-22 20:18:20 +00:00
Peter Steinberger
aa4c250eb8 fix(cron): split run and delivery status tracking 2026-02-22 20:19:23 +01:00
Peter Steinberger
8bf3c37c6c fix(cron): keep watchdog timer armed during ticks 2026-02-22 20:19:23 +01:00
Peter Steinberger
5db1ee4ec6 fix(cron): keep manual runs non-blocking 2026-02-22 20:19:22 +01:00
Peter Steinberger
34ea33f057 refactor: dedupe core config and runtime helpers 2026-02-22 17:11:54 +00:00
Frank Yang
1051f42f96 fix(stability): patch regex retries and timeout abort handling 2026-02-22 10:59:34 +01:00
Vignesh Natarajan
2830dafbe9 Cron: keep list/status responsive during startup catch-up 2026-02-21 19:13:04 -08:00
Simone Macario
09d5f508b1 fix(cron): persist delivered flag in job state to surface delivery failures (openclaw#19174) thanks @simonemacario
Verified:
- pnpm build
- pnpm check
- pnpm test:macmini

Co-authored-by: simonemacario <2116609+simonemacario@users.noreply.github.com>
Co-authored-by: Tak Hoffman <781889+Takhoffman@users.noreply.github.com>
2026-02-21 12:47:29 -06:00
Tak Hoffman
7417c36268 fix(cron): honor maxConcurrentRuns in timer loop (openclaw#22413) thanks @Takhoffman
Verified:
- pnpm install --frozen-lockfile
- pnpm build
- pnpm check
- pnpm test:macmini (failed on unrelated baseline test: src/memory/qmd-manager.test.ts > throws when sqlite index is busy)

Co-authored-by: Takhoffman <781889+Takhoffman@users.noreply.github.com>
Co-authored-by: Tak Hoffman <781889+Takhoffman@users.noreply.github.com>
2026-02-20 22:31:58 -06:00
Peter Steinberger
b8b43175c5 style: align formatting with oxfmt 0.33 2026-02-18 01:34:35 +00:00
Peter Steinberger
31f9be126c style: run oxfmt and fix gate failures 2026-02-18 01:29:02 +00:00
Tyler Yust
75001a0490 fix cron announce routing and timeout handling 2026-02-17 11:40:04 -08:00
cpojer
d0cb8c19b2 chore: wtf. 2026-02-17 13:36:48 +09:00
Sebastian
ed11e93cf2 chore(format) 2026-02-16 23:20:16 -05:00
Vignesh Natarajan
f988abf202 Cron: route reminders by session namespace 2026-02-17 01:54:59 +01:00
cpojer
c70597daeb chore: Fix formatting. 2026-02-17 09:40:00 +09:00
Peter Steinberger
80c7d04ad2 refactor(cron): reuse shared run outcome telemetry types 2026-02-17 00:32:34 +00:00
cpojer
90ef2d6bdf chore: Update formatting. 2026-02-17 09:18:40 +09:00
Marcus Widing
8af4712c40 fix(cron): prevent spin loop when job completes within scheduled second (#17821)
When a cron job fires and completes within the same wall-clock second it
was scheduled for, the next-run computation could return undefined or the
same second, causing the scheduler to re-trigger the job hundreds of
times in a tight loop.

Two-layer fix:

1. computeJobNextRunAtMs: When computeNextRunAtMs returns undefined for a
   cron-kind schedule (edge case where floored nowSecondMs matches the
   schedule), retry with the ceiling (next second) as reference time.
   This ensures we always get the next valid occurrence.

2. applyJobResult: Add MIN_REFIRE_GAP_MS (2s) safety net for cron-kind
   jobs.  After a successful run, nextRunAtMs is guaranteed to be at
   least 2s in the future.  This breaks any remaining spin-loop edge
   cases without affecting normal daily/hourly schedules (where the
   natural next run is hours/days away).

Fixes #17821
2026-02-16 23:59:44 +01:00
Rob Dunn
ddea5458d0 cron: log model+token usage per run + add usage report script 2026-02-16 23:58:38 +01:00
pierreeurope
fec4be8dec fix(cron): prevent daily jobs from skipping days (48h jump) #17852 (#17903)
Merged via /review-pr -> /prepare-pr -> /merge-pr.

Prepared head SHA: 1ffe6a45af
Co-authored-by: pierreeurope <248892285+pierreeurope@users.noreply.github.com>
Co-authored-by: sebslight <19554889+sebslight@users.noreply.github.com>
Reviewed-by: @sebslight
2026-02-16 08:35:49 -05:00
Peter Steinberger
17e5a5015c perf: avoid async cron timer callbacks 2026-02-16 02:45:00 +00:00
Peter Steinberger
5b2cb8ba11 refactor(cron): dedupe finished event emit 2026-02-16 01:37:03 +00:00
Peter Steinberger
a73e7786e7 refactor(cron): share runnable job filter 2026-02-16 01:29:01 +00:00
Gustavo Madeira Santana
88caa4b50c chore(cron): simplify enabled checks for lint 2026-02-15 10:30:19 -05:00
Alejandro Santander
9a344da298 fix(cron): treat missing enabled as true in update() (openclaw#15477) thanks @eternauta1337
Verified:
- pnpm exec vitest src/cron/service.issue-regressions.test.ts

Co-authored-by: eternauta1337 <550409+eternauta1337@users.noreply.github.com>
Co-authored-by: Tak Hoffman <781889+Takhoffman@users.noreply.github.com>
2026-02-15 08:52:02 -06:00
Vignesh Natarajan
4c4d2558e3 fix (heartbeat/cron): preserve cron prompts for tagged interval events 2026-02-14 19:46:31 -08:00