fix: address bot review — surrogate-pair counting and CJK line splitting

- Use code-point length instead of UTF-16 length in estimateStringChars() so that CJK Extension B+ surrogate pairs (U+20000+) are counted as 1 character, not 2 (fixes ~25% overestimate for rare characters).
- Change long-line split step from maxChars to chunking.tokens so that CJK lines are sliced into token-budget-sized segments instead of char-budget-sized segments that produce ~4x oversized chunks.
- Add tests for both fixes: surrogate-pair handling and long CJK line splitting.

Addresses review feedback from Greptile and Codex bots.
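
For reviewers unfamiliar with the UTF-16 pitfall, here is a minimal sketch of the two ideas in this commit message. The helper names `codePointLength` and `splitByCodePoints` are hypothetical and are not taken from the repository; the actual estimateStringChars() and chunking code may differ.

```typescript
// Illustrative only: hypothetical helpers, not the repository's implementation.

// Count Unicode code points rather than UTF-16 code units.
// Array.from() iterates a string by code point, so a surrogate pair counts once.
function codePointLength(text: string): number {
  return Array.from(text).length;
}

// Slice a long line into segments of at most `step` code points,
// without ever cutting a surrogate pair in half.
function splitByCodePoints(line: string, step: number): string[] {
  if (step <= 0) {
    throw new RangeError("step must be positive");
  }
  const points = Array.from(line);
  const segments: string[] = [];
  for (let i = 0; i < points.length; i += step) {
    segments.push(points.slice(i, i + step).join(""));
  }
  return segments;
}

// U+20000 is a CJK Extension B character stored as a surrogate pair.
const rare = "\u{20000}";
console.log(rare.length);           // 2 (UTF-16 code units)
console.log(codePointLength(rare)); // 1 (code points)

// A six-character CJK line split with a step of 4 yields "这是一个" and "测试".
console.log(splitByCodePoints("这是一个测试", 4));
```

Passing a token budget rather than a character budget as `step` is what keeps CJK segments small, since each CJK character is estimated at roughly one token.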
2026-04-23 23:22:32 +00:00 · 2026-03-08 18:44:22 -04:00
parent 971ecabe80
commit a5147d4d88
2 changed files with 7 additions and 3 deletions
--- a/src/utils/cjk-chars.test.ts
+++ b/src/utils/cjk-chars.test.ts
@@ -81,7 +81,6 @@ describe("estimateStringChars", () => {
    // "你" counts as 4, emoji remains 2 => total 6
    expect(estimateStringChars("你😀")).toBe(6);
  });
-
  it("yields ~1 token per CJK char when divided by CHARS_PER_TOKEN_ESTIMATE", () => {
    // 10 CJK chars should estimate as ~10 tokens
    const cjk = "这是一个测试用的句子呢";