Default tokenize function splits words containing hyphens #197
Hi @sunknudsen, yes, the default tokenizer splits on spaces and punctuation, and the hyphen is considered punctuation. To keep hyphenated words together, you can pass a custom `tokenize` option:

```javascript
const miniSearch = new MiniSearch({
  fields: [/* ...my fields */],
  tokenize: (text) => text.split(/[^\w-]/)
})
```

Consider, though, that splitting on `/[^\w-]/` has a caveat: `\w` only matches ASCII word characters, so words containing accented or other non-ASCII letters get split apart. For example:

```javascript
const tokenize = (text) => text.split(/[^\w-]/)

// This is fine:
tokenize("I'm drinking a coca-cola")
// => ["I", "m", "drinking", "a", "coca-cola"]

// But this is not:
tokenize("The tokenizer is too naïve")
// => ["The", "tokenizer", "is", "too", "na", "ve"]
```
Excellent points… I naively (pun intended) expected that `\w` would also match non-ASCII word characters.
On modern browsers, a much simpler Unicode-aware regular expression gets the job done. See this transpiler from ES6 Unicode-aware regular expressions to ES5.
#198 translates the long regexp into an equivalent short Unicode-aware one. I will research the implications for older browsers, and I might need to release it as part of the next major version, but browser support looks basically universal by now, as long as we don't need to care about IE anymore. Thanks for bringing my attention to it :)
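The kind of translation described can be illustrated as follows. Both regexps below are hypothetical stand-ins for illustration, not MiniSearch's actual source: a hand-enumerated separator class inevitably misses Unicode punctuation that a short property-escape class catches.

```javascript
// Hypothetical example of translating a long, enumerated separator
// class into a short Unicode-aware one. Neither regexp is taken from
// the MiniSearch codebase.
const longForm  = /[ \t\n\r.,;:!?'"()\[\]{}\-]+/  // separators listed by hand (ASCII only)
const shortForm = /[\p{Z}\p{P}\t\n\r]+/u          // Unicode space and punctuation categories

// Helper to drop the empty strings split() leaves at the edges:
const tokens = (text, separator) => text.split(separator).filter((t) => t.length > 0)

// The long form misses guillemets («») and the ellipsis (…):
console.log(tokens("«Hello», world…", longForm))   // => ["«Hello»", "world…"]
console.log(tokens("«Hello», world…", shortForm))  // => ["Hello", "world"]
```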
See minisearch/src/MiniSearch.ts, line 1934 at commit 1eb584c.
Curious… couldn't we use `/[^\w-]+/g` instead?

By the way, thanks for minisearch @lucaong! Very helpful package. 🙌
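For what it's worth, the `+` quantifier does change the result of `split`: without it, every separator character creates its own split point, so runs of punctuation or whitespace leave empty tokens behind. (The `g` flag, on the other hand, has no effect on `String.prototype.split`.) A quick sketch:

```javascript
// Splitting on a single separator character vs. a run of separators.
const once = (text) => text.split(/[^\w-]/)   // each separator char splits
const runs = (text) => text.split(/[^\w-]+/)  // consecutive separators collapse

console.log(once("well,  spaced -- text"))
// => ["well", "", "", "spaced", "--", "text"]   (empty tokens from ",  ")
console.log(runs("well,  spaced -- text"))
// => ["well", "spaced", "--", "text"]
```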