* fix(memory): build FTS index when no embedding provider is available
* fix(memory): trigger full reindex on provider→FTS-only transition
* fix(memory): return FTS-only keyword hits at default threshold
* fix: keep FTS-only memory hits at default threshold (#56473) (thanks @opriz)
---------
Co-authored-by: Ayaan Zaidi <hi@obviy.us>
The tokenize() function only matched [a-z0-9_]+ patterns, returning an
empty set for CJK-only text. This made Jaccard similarity always 0 (or
always 1 for two empty sets) for CJK content, effectively disabling MMR
diversity detection.
Add support for:
- CJK Unified Ideographs (U+4E00–U+9FFF, U+3400–U+4DBF)
- Hiragana (U+3040–U+309F) and Katakana (U+30A0–U+30FF)
- Hangul Syllables (U+AC00–U+D7AF) and Jamo (U+1100–U+11FF)
Characters are extracted as unigrams, and bigrams are generated only
from characters that are adjacent in the original text (no spurious
bigrams across ASCII boundaries).
Fixes#28000