17 Commits

Author SHA1 Message Date
Peter Steinberger
8e0ab35b0e refactor(plugins): decouple bundled plugin runtime loading 2026-03-29 09:10:38 +01:00
Tak Hoffman
3ce48aff66 Memory: add configurable FTS5 tokenizer for CJK text support (openclaw#56707)
Verified:
- pnpm build
- pnpm check
- pnpm test -- extensions/memory-core/src/memory/manager-search.test.ts packages/memory-host-sdk/src/host/query-expansion.test.ts
- pnpm test -- extensions/memory-core/src/memory/index.test.ts -t "reindexes when extraPaths change"
- pnpm test -- src/config/schema.base.generated.test.ts
- pnpm test -- src/media-understanding/image.test.ts
- pnpm test

Co-authored-by: Mitsuyuki Osabe <24588751+carrotRakko@users.noreply.github.com>
2026-03-28 20:53:29 -05:00
AaronLuo00
f8547fcae4 fix: guard fine-split against breaking UTF-16 surrogate pairs
When re-splitting CJK-heavy segments at chunking.tokens, check whether the
slice boundary falls on a high surrogate (0xD800–0xDBFF) and if so extend
by one code unit to keep the pair intact. This prevents lone surrogate
halves for CJK Extension B+ characters (U+20000+).

Add test verifying no lone surrogates appear when splitting lines of
surrogate-pair characters with an odd token budget.

Addresses third-round Codex P2 review comment.
2026-03-29 10:22:43 +09:00
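
Note on f8547fcae4: the boundary check described above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit. The names `isHighSurrogate` and `sliceWithoutSplittingPairs` are hypothetical, and the sketch assumes the split works on UTF-16 code units with a fixed step.

```typescript
// Hypothetical helper: true when a UTF-16 code unit is a high (leading) surrogate.
function isHighSurrogate(unit: number): boolean {
  return unit >= 0xd800 && unit <= 0xdbff;
}

// Slice `text` into pieces of at most `step` code units. When a slice would end
// on a high surrogate, extend it by one code unit so the pair stays together.
// Assumption: the input contains no unpaired surrogates to begin with.
function sliceWithoutSplittingPairs(text: string, step: number): string[] {
  if (step < 1) {
    throw new RangeError("step must be at least 1");
  }
  const parts: string[] = [];
  let start = 0;
  while (start < text.length) {
    let end = Math.min(start + step, text.length);
    if (end < text.length && isHighSurrogate(text.charCodeAt(end - 1))) {
      end += 1; // pull the matching low surrogate into this slice
    }
    parts.push(text.slice(start, end));
    start = end;
  }
  return parts;
}
```

With an odd step and a line made only of surrogate-pair characters, every other boundary would land inside a pair, so a slice may come out one code unit longer than `step`. That small overshoot is the trade-off for never emitting half a character.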
AaronLuo00
3b95aa8804 fix: address second-round review — Latin backward compat and emoji consistency
- Two-pass line splitting: first slice at maxChars (unchanged for Latin),
  then re-split only CJK-heavy segments at chunking.tokens. This preserves
  the original ~800-char segments for ASCII lines while keeping CJK chunks
  within the token budget.

- Narrow surrogate-pair adjustment to CJK Extension B+ range (D840–D87E)
  only, so emoji surrogate pairs are not affected. Mixed CJK+emoji text
  is now handled consistently regardless of composition.

- Add tests: emoji handling (2), Latin backward-compat long-line (1).

Addresses Codex P1 (oversized CJK segments) and P2s (Latin over-splitting,
emoji surrogate inconsistency).
2026-03-29 10:22:43 +09:00
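
Note on 3b95aa8804: the two-pass split described above might look something like the sketch below. It is an illustration under stated assumptions, not the commit's implementation. `isCjkHeavy`, `sliceEvery`, and `splitLongLine` are hypothetical names, the "more than half the characters are CJK" threshold is a guess, and surrogate-pair handling is left out to keep the example short.

```typescript
// Hypothetical range check; the real code may cover different Unicode blocks.
function isCjk(cp: number): boolean {
  return (
    (cp >= 0x3040 && cp <= 0x30ff) || // Hiragana and Katakana
    (cp >= 0x3400 && cp <= 0x4dbf) || // CJK Extension A
    (cp >= 0x4e00 && cp <= 0x9fff) || // CJK Unified Ideographs
    (cp >= 0xac00 && cp <= 0xd7af)    // Hangul syllables
  );
}

// Assumption: a segment is "CJK-heavy" when more than half of its characters are CJK.
function isCjkHeavy(segment: string): boolean {
  const chars = Array.from(segment);
  if (chars.length === 0) return false;
  const cjkCount = chars.filter((ch) => isCjk(ch.codePointAt(0) ?? 0)).length;
  return cjkCount * 2 > chars.length;
}

// Cut `text` into consecutive pieces of at most `step` UTF-16 code units.
function sliceEvery(text: string, step: number): string[] {
  if (step < 1) throw new RangeError("step must be at least 1");
  const out: string[] = [];
  for (let i = 0; i < text.length; i += step) {
    out.push(text.slice(i, i + step));
  }
  return out;
}

// Pass 1: slice at maxChars, which leaves Latin lines exactly as before.
// Pass 2: re-split only the CJK-heavy segments at the token budget.
function splitLongLine(line: string, maxChars: number, tokenBudget: number): string[] {
  const result: string[] = [];
  for (const coarse of sliceEvery(line, maxChars)) {
    if (isCjkHeavy(coarse) && coarse.length > tokenBudget) {
      result.push(...sliceEvery(coarse, tokenBudget));
    } else {
      result.push(coarse);
    }
  }
  return result;
}
```

For example, with `maxChars = 800` and `tokenBudget = 200`, a 2000-character ASCII line still yields segments of 800, 800, and 400 characters, while a 1000-character CJK line is cut into five 200-character segments.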
AaronLuo00
a5147d4d88 fix: address bot review — surrogate-pair counting and CJK line splitting
- Use code-point length instead of UTF-16 length in estimateStringChars()
  so that CJK Extension B+ surrogate pairs (U+20000+) are counted as 1
  character, not 2 (fixes ~25% overestimate for rare characters).

- Change long-line split step from maxChars to chunking.tokens so that
  CJK lines are sliced into token-budget-sized segments instead of
  char-budget-sized segments that produce ~4x oversized chunks.

- Add tests for both fixes: surrogate-pair handling and long CJK line
  splitting.

Addresses review feedback from Greptile and Codex bots.
2026-03-29 10:22:43 +09:00
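
Note on a5147d4d88: the difference between UTF-16 length and code-point length is easy to see in isolation. The snippet below is only a demonstration of that difference. `codePointLength` is a hypothetical name and is not claimed to match how `estimateStringChars()` is written.

```typescript
// Count user-visible code points instead of UTF-16 code units.
// Array.from iterates a string by code point, so a surrogate pair counts once.
function codePointLength(text: string): number {
  return Array.from(text).length;
}

// U+20000 is the first CJK Extension B ideograph and needs a surrogate pair in UTF-16.
const rare = String.fromCodePoint(0x20000);
console.log(rare.length, codePointLength(rare)); // prints: 2 1
```

Any estimate built on `text.length` counts such a character twice, which is the overcount the first bullet of this commit addresses.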
AaronLuo00
971ecabe80 fix(memory): account for CJK characters in QMD memory chunking
The QMD memory system uses a fixed 4:1 chars-to-tokens ratio for chunk
sizing, which severely underestimates CJK (Chinese/Japanese/Korean) text
where each character is roughly 1 token. This causes oversized chunks for
CJK users, degrading vector search quality and wasting context window space.

Changes:
- Add shared src/utils/cjk-chars.ts module with CJK-aware character
  counting (estimateStringChars) and token estimation helpers
- Update chunkMarkdown() in src/memory/internal.ts to use weighted
  character lengths for chunk boundary decisions and overlap calculation
- Replace hardcoded estimateTokensFromChars in the context report
  command with the shared utility
- Add 13 unit tests for the CJK estimation module and 5 new tests for
  CJK-aware memory chunking behavior

Backward compatible: pure ASCII/Latin text behavior is unchanged.

Closes #39965
Related: #40216
2026-03-29 10:22:43 +09:00
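
Note on 971ecabe80: one way to make a 4:1 chars-to-tokens heuristic CJK-aware is to weight each CJK character as a full token's worth of characters. The sketch below illustrates that idea only. `estimateWeightedChars` and `estimateTokens` are hypothetical names, the Unicode ranges are an assumption, and the weighting in src/utils/cjk-chars.ts may differ.

```typescript
// The fixed ratio named in the commit message: roughly 4 characters per token.
const CHARS_PER_TOKEN = 4;

// Hypothetical range check; the real module may cover different Unicode blocks.
function isCjkCodePoint(cp: number): boolean {
  return (
    (cp >= 0x3040 && cp <= 0x30ff) || // Hiragana and Katakana
    (cp >= 0x3400 && cp <= 0x4dbf) || // CJK Extension A
    (cp >= 0x4e00 && cp <= 0x9fff) || // CJK Unified Ideographs
    (cp >= 0xac00 && cp <= 0xd7af) || // Hangul syllables
    (cp >= 0x20000 && cp <= 0x2ffff)  // Supplementary Ideographic Plane
  );
}

// Weighted length: a CJK character counts as CHARS_PER_TOKEN characters
// (about one token each), every other character counts as 1.
function estimateWeightedChars(text: string): number {
  let total = 0;
  for (const ch of text) {
    const cp = ch.codePointAt(0) ?? 0;
    total += isCjkCodePoint(cp) ? CHARS_PER_TOKEN : 1;
  }
  return total;
}

// Token estimate derived from the weighted length.
function estimateTokens(text: string): number {
  return Math.ceil(estimateWeightedChars(text) / CHARS_PER_TOKEN);
}
```

Under these assumptions an 8-character ASCII string is estimated at 2 tokens, exactly as with the plain 4:1 rule, while a 4-character Chinese string is estimated at 4 tokens instead of 1. Comparing chunk boundaries against the weighted length is what keeps pure ASCII behavior unchanged while shrinking CJK chunks.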
Peter Steinberger
4c27c90fc2 refactor: finish moving provider runtime into extensions 2026-03-27 05:38:58 +00:00
Peter Steinberger
64bf80d4d5 refactor: move provider runtime into extensions 2026-03-27 05:38:58 +00:00
Peter Steinberger
eebce9e9c7 refactor: move memory host into sdk package 2026-03-27 04:12:04 +00:00
Peter Steinberger
bd6c7969ea refactor: extract memory host sdk package 2026-03-27 02:49:33 +00:00
Peter Steinberger
7695b4842b chore: bump version to 2026.2.12 2026-02-12 18:20:46 +01:00
Peter Steinberger
1872d0c592 chore: bump version to 2026.2.10 2026-02-11 11:27:23 +01:00
cpojer
6fb2d3d7d7 feat: remove slop. 2026-02-03 22:04:17 +09:00
cpojer
8cab78abbc chore: Run pnpm format:fix. 2026-01-31 21:13:13 +09:00
Peter Steinberger
9a7160786a refactor: rename to openclaw 2026-01-30 03:16:21 +01:00
Peter Steinberger
6d16a658e5 refactor: rename clawdbot to moltbot with legacy compat 2026-01-27 12:21:02 +00:00
Peter Steinberger
72fea5e305 chore: bump version to 2026.1.26 2026-01-27 09:10:47 +00:00