* feat(gateway): add auth rate-limiting & brute-force protection
Add a per-IP sliding-window rate limiter to Gateway authentication
endpoints (HTTP, WebSocket upgrade, and WS message-level auth).
When gateway.auth.rateLimit is configured, failed auth attempts are
tracked per client IP. Once the threshold is exceeded within the
sliding window, further attempts are blocked with HTTP 429 + Retry-After
until the lockout period expires. Loopback addresses are exempt by
default so local CLI sessions are never locked out.
The limiter is only created when explicitly configured (undefined
otherwise), keeping the feature fully opt-in and backward-compatible.
* fix(gateway): isolate auth rate-limit scopes and normalize 429 responses
---------
Co-authored-by: buerbaumer <buerbaumer@users.noreply.github.com>
Co-authored-by: Peter Steinberger <steipete@gmail.com>
* fix(gateway): increase WebSocket max payload to 5 MB for image uploads
The 512 KB limit was too small for base64-encoded images — a 400 KB
image becomes ~532 KB after encoding, exceeding the limit and closing
the connection with code 1006.
Bump MAX_PAYLOAD_BYTES to 5 MB and MAX_BUFFERED_BYTES to 8 MB to
support standard image uploads via webchat.
Closes#14400
* fix: align gateway WS limits with 5MB image uploads (#14486) (thanks @0xRaini)
* docs: fix changelog conflict for #14486
---------
Co-authored-by: 0xRaini <0xRaini@users.noreply.github.com>
Co-authored-by: Peter Steinberger <steipete@gmail.com>
* fix(cron): pass agentId to runHeartbeatOnce for main-session jobs
Main-session cron jobs with agentId always ran the heartbeat under
the default agent, ignoring the job's agent binding. enqueueSystemEvent
correctly routed the system event to the bound agent's session, but
runHeartbeatOnce was called without agentId, so the heartbeat ran under
the default agent and never picked up the event.
Thread agentId from job.agentId through the CronServiceDeps type,
timer execution, and the gateway wrapper so heartbeat-runner uses the
correct agent.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* cron: add heartbeat agentId propagation regression test (#14140) (thanks @ishikawa-pro)
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Tak Hoffman <781889+Takhoffman@users.noreply.github.com>
* fix: prune stale session entries, cap entry count, and rotate sessions.json
The sessions.json file grows unbounded over time. Every heartbeat tick (default: 30m)
triggers multiple full rewrites, and session keys from groups, threads, and DMs
accumulate indefinitely with large embedded objects (skillsSnapshot,
systemPromptReport). At >50MB the synchronous JSON parse blocks the event loop,
causing Telegram webhook timeouts and effectively taking the bot down.
Three mitigations, all running inside saveSessionStoreUnlocked() on every write:
1. Prune stale entries: remove entries with updatedAt older than 30 days
(configurable via session.maintenance.pruneDays in openclaw.json)
2. Cap entry count: keep only the 500 most recently updated entries
(configurable via session.maintenance.maxEntries). Entries without updatedAt
are evicted first.
3. File rotation: if the existing sessions.json exceeds 10MB before a write,
rename it to sessions.json.bak.{timestamp} and keep only the 3 most recent
backups (configurable via session.maintenance.rotateBytes).
All three thresholds are configurable under session.maintenance in openclaw.json
with Zod validation. No env vars.
Existing tests updated to use Date.now() instead of epoch-relative timestamps
(1, 2, 3) that would be incorrectly pruned as stale.
27 new tests covering pruning, capping, rotation, and integration scenarios.
* feat: auto-prune expired cron run sessions (#12289)
Add TTL-based reaper for isolated cron run sessions that accumulate
indefinitely in sessions.json.
New config option:
cron.sessionRetention: string | false (default: '24h')
The reaper runs piggy-backed on the cron timer tick, self-throttled
to sweep at most every 5 minutes. It removes session entries matching
the pattern cron:<jobId>:run:<uuid> whose updatedAt + retention < now.
Design follows the Kubernetes ttlSecondsAfterFinished pattern:
- Sessions are persisted normally (observability/debugging)
- A periodic reaper prunes expired entries
- Configurable retention with sensible default
- Set to false to disable pruning entirely
Files changed:
- src/config/types.cron.ts: Add sessionRetention to CronConfig
- src/config/zod-schema.ts: Add Zod validation for sessionRetention
- src/cron/session-reaper.ts: New reaper module (sweepCronRunSessions)
- src/cron/session-reaper.test.ts: 12 tests covering all paths
- src/cron/service/state.ts: Add cronConfig/sessionStorePath to deps
- src/cron/service/timer.ts: Wire reaper into onTimer tick
- src/gateway/server-cron.ts: Pass config and session store path to deps
Closes#12289
* fix: sweep cron session stores per agent
* docs: add changelog for session maintenance (#13083) (thanks @skyfallsin, @Glucksberg)
* fix: add warn-only session maintenance mode
* fix: warn-only maintenance defaults to active session
* fix: deliver maintenance warnings to active session
* docs: add session maintenance examples
* fix: accept duration and size maintenance thresholds
* refactor: share cron run session key check
* fix: format issues and replace defaultRuntime.warn with console.warn
---------
Co-authored-by: Pradeep Elankumaran <pradeepe@gmail.com>
Co-authored-by: Glucksberg <markuscontasul@gmail.com>
Co-authored-by: max <40643627+quotentiroler@users.noreply.github.com>
Co-authored-by: quotentiroler <max.nussbaumer@maxhealth.tech>
Problem:
When users execute `/think off`, they still receive `reasoning_content`
from models configured with `reasoning: true` (e.g., GLM-4.7, GLM-4.6,
Kimi K2.5, MiniMax-M2.1).
Expected: `/think off` should completely disable reasoning content.
Actual: Reasoning content is still returned.
Root Cause:
The directive handlers delete `sessionEntry.thinkingLevel` when user
executes `/think off`. This causes the thinking level to become undefined,
and the system falls back to `resolveThinkingDefault()`, which checks the
model catalog and returns "low" for reasoning-capable models, ignoring the
user's explicit intent.
Why We Must Persist "off" (Design Rationale):
1. **Model-dependent defaults**: Unlike other directives where "off" means
use a global default, `thinkingLevel` has model-dependent defaults:
- Reasoning-capable models (GLM-4.7, etc.) → default "low"
- Other models → default "off"
2. **Existing pattern**: The codebase already follows this pattern for
`elevatedLevel`, which persists "off" explicitly to override defaults
that may be "on". The comment explains:
"Persist 'off' explicitly so `/elevated off` actually overrides defaults."
3. **User intent**: When a user explicitly executes `/think off`, they want
to disable thinking regardless of the model's capabilities. Deleting the
field breaks this intent by falling back to the model's default.
Solution:
Persist "off" value instead of deleting the field in all internal directive handlers:
- `src/auto-reply/reply/directive-handling.impl.ts`: Directive-only messages
- `src/auto-reply/reply/directive-handling.persist.ts`: Inline directives
- `src/commands/agent.ts`: CLI command-line flags
Gateway API Backward Compatibility:
The original implementation incorrectly mapped `null` to "off" in
`sessions-patch.ts` for consistency with internal handlers. This was a
breaking change because:
- Previously, `null` cleared the override (deleted the field)
- API clients lost the ability to "clear to default" via `null`
- This contradicts standard JSON semantics where `null` means "no value"
Restored original null semantics in `src/gateway/sessions-patch.ts`:
- `null` → delete field, fall back to model default (clear override)
- `"off"` → persist explicit override
- Other values → normalize and persist
This ensures backward compatibility for API clients while fixing the `/think off`
issue in internal handlers.
Signed-off-by: Liu Yuan <namei.unix@gmail.com>
- Create shared PNG encoder module (src/media/png-encode.ts)
- Refactor qr-image.ts and live-image-probe.ts to use shared encoder
- Add safeParseJson to utils.ts and plugin-sdk exports
- Update msteams and pairing-store to use centralized safeParseJson
* refactor: consolidate duplicate utility functions
- Add escapeRegExp to src/utils.ts and remove 10 local duplicates
- Rename bash-tools clampNumber to clampWithDefault (different signature)
- Centralize formatError calls to use formatErrorMessage from infra/errors.ts
- Re-export formatErrorMessage from cli/cli-utils.ts to preserve API
* refactor: consolidate remaining escapeRegExp duplicates
* refactor: consolidate sleep, stripAnsi, and clamp duplicates
* Gateway: preserve Pi transcript parentId for injected messages
Thread: unknown
When: 2026-02-08 20:08 CST
Repo: https://github.com/openclaw/openclaw.git
Branch: codex/wip/2026-02-09/compact-post-compaction-parentid-fix
Problem
- Post-compaction turns sometimes lost the compaction summary + kept suffix in the *next* provider request.
- Root cause was session graph corruption: gateway appended "stopReason: injected" transcript lines via raw JSONL writes without `parentId`.
- Pi's `SessionManager.buildSessionContext()` walks the `parentId` chain from the current leaf; missing `parentId` can sever the active branch and hide the compaction entry.
Fix
- Use `SessionManager.appendMessage(...)` for injected assistant transcript writes so `parentId` is set to the current leaf.
- Route `chat.inject` through the same helper to avoid duplicating the broken raw JSONL append logic.
Why This Matters
- The compaction algorithm may be correct, but if the leaf chain is broken right after compaction, the provider payload cannot include the summary/suffix "shape" Pi expects.
Testing
- pnpm test src/agents/pi-embedded-helpers.post-compaction-shape.test.ts src/agents/pi-embedded-runner/run.overflow-compaction.post-context.test.ts
- pnpm build
Notes
- This is provider-shape agnostic: it fixes transcript structure so Anthropic/Gemini/etc all see the same post-compaction context.
Resume
- If post-compaction looks wrong again, inspect the session transcript for entries missing `parentId` immediately after `type: compaction`.
* Gateway: guardrail test for transcript parentId (chat.inject)
* Gateway: guardrail against raw transcript appends (chat.ts)
* Gateway: add local AGENTS.md note to preserve Pi transcript parentId chain
* Changelog: note gateway post-compaction amnesia fix
* Gateway: store injected transcript messages with valid stopReason
* Gateway: use valid stopReason in injected fallback
* fix(paths): structurally resolve home dir to prevent Windows path bugs
Extract resolveRawHomeDir as a private function and gate the public
resolveEffectiveHomeDir through a single path.resolve() exit point.
This makes it structurally impossible for unresolved paths (missing
drive letter on Windows) to escape the function, regardless of how
many return paths exist in the raw lookup logic.
Simplify resolveRequiredHomeDir to only resolve the process.cwd()
fallback, since resolveEffectiveHomeDir now returns resolved values.
Fix shortenMeta in tool-meta.ts: the colon-based split for file:line
patterns (e.g. file.txt:12) conflicts with Windows drive letters
(C:\...) because indexOf(":") matches the drive colon first.
shortenHomeInString already handles file:line patterns correctly via
split/join, so the colon split was both unnecessary and harmful.
Update test assertions across all affected files to use path.resolve()
in expected values and input strings so they match the now-correct
resolved output on both Unix and Windows.
Fixes#12119
* fix(changelog): add paths Windows fix entry (#12125)
---------
Co-authored-by: Sebastian <19554889+sebslight@users.noreply.github.com>
* fix: use .js extension for ESM imports of RoutePeerKind
The imports incorrectly used .ts extension which doesn't resolve
with moduleResolution: NodeNext. Changed to .js and added 'type'
import modifier.
* fix tsconfig
* refactor: unify peer kind to ChatType, rename dm to direct
- Replace RoutePeerKind with ChatType throughout codebase
- Change 'dm' literal values to 'direct' in routing/session keys
- Keep backward compat: normalizeChatType accepts 'dm' -> 'direct'
- Add ChatType export to plugin-sdk, deprecate RoutePeerKind
- Update session key parsing to accept both 'dm' and 'direct' markers
- Update all channel monitors and extensions to use ChatType
BREAKING CHANGE: Session keys now use 'direct' instead of 'dm'.
Existing 'dm' keys still work via backward compat layer.
* fix tests
* test: update session key expectations for dmdirect migration
- Fix test expectations to expect :direct: in generated output
- Add explicit backward compat test for normalizeChatType('dm')
- Keep input test data with :dm: keys to verify backward compat
* fix: accept legacy 'dm' in session key parsing for backward compat
getDmHistoryLimitFromSessionKey now accepts both :dm: and :direct:
to ensure old session keys continue to work correctly.
* test: add explicit backward compat tests for dmdirect migration
- session-key.test.ts: verify both :dm: and :direct: keys are valid
- getDmHistoryLimitFromSessionKey: verify both formats work
* feat: backward compat for resetByType.dm config key
* test: skip unix-path Nix tests on Windows
* initial commit
* feat: implement deriveSessionTotalTokens function and update usage tests
* Added deriveSessionTotalTokens function to calculate total tokens based on usage and context tokens.
* Updated usage tests to include cases for derived session total tokens.
* Refactored session usage calculations in multiple files to utilize the new function for improved accuracy.
* fix: restore overflow truncation fallback + changelog/test hardening (#11551) (thanks @tyler6204)
* fix(cron): comprehensive cron scheduling and delivery fixes
- Fix delivery target resolution for isolated agent cron jobs
- Improve schedule parsing and validation
- Add job retry logic and error handling
- Enhance cron ops with better state management
- Add timer improvements for more reliable cron execution
- Add cron event type to protocol schema
- Support cron events in heartbeat runner (skip empty-heartbeat check,
use dedicated CRON_EVENT_PROMPT for relay)
* fix: remove cron debug test and add changelog/docs notes (#11641) (thanks @tyler6204)
* fix(gateway): use LAN IP for WebSocket/probe URLs when bind=lan (#11329)
When gateway.bind=lan, the HTTP server correctly binds to 0.0.0.0
(all interfaces), but WebSocket connection URLs, probe targets, and
Control UI links were hardcoded to 127.0.0.1. This caused CLI commands
and status probes to show localhost-only URLs even in LAN mode, and
made onboarding display misleading connection info.
- Add pickPrimaryLanIPv4() to gateway/net.ts to detect the machine's
primary LAN IPv4 address (prefers en0/eth0, falls back to any
external interface)
- Update pickProbeHostForBind() to use LAN IP when bind=lan
- Update buildGatewayConnectionDetails() to use LAN IP and report
"local lan <ip>" as the URL source
- Update resolveControlUiLinks() to return LAN-accessible URLs
- Update probe note in status.gather.ts to reflect new behavior
- Add tests for pickPrimaryLanIPv4 and bind=lan URL resolution
Closes#11329
Co-authored-by: Cursor <cursoragent@cursor.com>
* test: move vi.restoreAllMocks to afterEach in pickPrimaryLanIPv4
Per review feedback: avoid calling vi.restoreAllMocks() inside
individual tests as it restores all spies globally and can cause
ordering issues. Use afterEach in the describe block instead.
Co-authored-by: Cursor <cursoragent@cursor.com>
* Changelog: note LAN bind URLs fix (#11448) (thanks @AnonO6)
---------
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Tak Hoffman <781889+Takhoffman@users.noreply.github.com>