
Default tokenize function splits words containing hyphens #197

Closed
sunknudsen opened this issue Jan 11, 2023 · 4 comments
sunknudsen commented Jan 11, 2023

See

const SPACE_OR_PUNCTUATION = /[\n\r -#%-*,-/:;?@[-\]_{}\u00A0\u00A1\u00A7\u00AB\u00B6\u00B7\u00BB\u00BF\u037E\u0387\u055A-\u055F\u0589\u058A\u05BE\u05C0\u05C3\u05C6\u05F3\u05F4\u0609\u060A\u060C\u060D\u061B\u061E\u061F\u066A-\u066D\u06D4\u0700-\u070D\u07F7-\u07F9\u0830-\u083E\u085E\u0964\u0965\u0970\u09FD\u0A76\u0AF0\u0C77\u0C84\u0DF4\u0E4F\u0E5A\u0E5B\u0F04-\u0F12\u0F14\u0F3A-\u0F3D\u0F85\u0FD0-\u0FD4\u0FD9\u0FDA\u104A-\u104F\u10FB\u1360-\u1368\u1400\u166E\u1680\u169B\u169C\u16EB-\u16ED\u1735\u1736\u17D4-\u17D6\u17D8-\u17DA\u1800-\u180A\u1944\u1945\u1A1E\u1A1F\u1AA0-\u1AA6\u1AA8-\u1AAD\u1B5A-\u1B60\u1BFC-\u1BFF\u1C3B-\u1C3F\u1C7E\u1C7F\u1CC0-\u1CC7\u1CD3\u2000-\u200A\u2010-\u2029\u202F-\u2043\u2045-\u2051\u2053-\u205F\u207D\u207E\u208D\u208E\u2308-\u230B\u2329\u232A\u2768-\u2775\u27C5\u27C6\u27E6-\u27EF\u2983-\u2998\u29D8-\u29DB\u29FC\u29FD\u2CF9-\u2CFC\u2CFE\u2CFF\u2D70\u2E00-\u2E2E\u2E30-\u2E4F\u3000-\u3003\u3008-\u3011\u3014-\u301F\u3030\u303D\u30A0\u30FB\uA4FE\uA4FF\uA60D-\uA60F\uA673\uA67E\uA6F2-\uA6F7\uA874-\uA877\uA8CE\uA8CF\uA8F8-\uA8FA\uA8FC\uA92E\uA92F\uA95F\uA9C1-\uA9CD\uA9DE\uA9DF\uAA5C-\uAA5F\uAADE\uAADF\uAAF0\uAAF1\uABEB\uFD3E\uFD3F\uFE10-\uFE19\uFE30-\uFE52\uFE54-\uFE61\uFE63\uFE68\uFE6A\uFE6B\uFF01-\uFF03\uFF05-\uFF0A\uFF0C-\uFF0F\uFF1A\uFF1B\uFF1F\uFF20\uFF3B-\uFF3D\uFF3F\uFF5B\uFF5D\uFF5F-\uFF65]+/u

Curious… couldn’t we use /[^\w-]+/g instead? Btw, thanks for minisearch @lucaong! Very helpful package. 🙌
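The reported behavior can be reproduced with a trimmed stand-in for the pattern above (a sketch, not the full regexp; `SPLIT_PATTERN` is a hypothetical name): any pattern whose punctuation class includes the ASCII hyphen will split hyphenated words into separate tokens.

```javascript
// Trimmed stand-in for the full SPACE_OR_PUNCTUATION pattern above
// (hypothetical; only a few of the punctuation ranges are kept).
// The ",-/" range covers U+002C..U+002F, which includes the hyphen.
const SPLIT_PATTERN = /[\n\r ,-/:;?@]+/u

"coca-cola".split(SPLIT_PATTERN)
// => ["coca", "cola"]
```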

lucaong (Owner) commented Jan 11, 2023

Hi @sunknudsen,
Thanks for the kind words :)

Yes, the default tokenizer splits by space or punctuation, and the hyphen is considered punctuation. Therefore, "foo-bar" is tokenized as ["foo", "bar"]. Applications can configure a custom tokenizer to change this behavior. For example, to split by /[^\w-]/ one can do:

const miniSearch = new MiniSearch({
  fields: [/* ...my fields */],
  tokenize: (text) => text.split(/[^\w-]/)
})

Note that splitting by /[^\w-]/ might work for simple English text, but it behaves badly with non-ASCII characters such as accents, umlauts, and other diacritics, which \w does not match even though they are very common in languages other than English.

For example:

const tokenize = (text) => text.split(/[^\w-]/)

// This is fine:
tokenize("I'm drinking a coca-cola")
// => ["I", "m", "drinking", "a", "coca-cola"]

// But this is not:
tokenize("The tokenizer is too naïve")
// => ["The", "tokenizer", "is", "too", "na", "ve"]
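One way to get both behaviors, keeping hyphens while still matching accented letters, is to split on anything that is not a Unicode letter, number, underscore, or hyphen. A sketch using Unicode property escapes (requires the `u` flag and an ES2018+ engine; this is not MiniSearch's built-in behavior):

```javascript
// Split on runs of characters that are neither Unicode letters (\p{L}),
// numbers (\p{N}), underscores, nor hyphens.
const tokenize = (text) => text.split(/[^\p{L}\p{N}_-]+/u)

tokenize("The tokenizer is too naïve")
// => ["The", "tokenizer", "is", "too", "naïve"]

tokenize("I'm drinking a coca-cola")
// => ["I", "m", "drinking", "a", "coca-cola"]
```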

sunknudsen (Author) commented

Excellent points… I naively (pun intended) expected that \w included accented characters.

lucaong (Owner) commented Jan 11, 2023

On modern browsers this much simpler regular expression gets the job done: /[\p{Z}\p{P}]/u. I need to check compatibility and see if it makes sense to use that instead of the huge explicit form you linked from the source. The original reason for the long regexp was browser compatibility.
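To illustrate (a sketch: it adds a `+` to collapse runs of separators, and a filter to drop the empty strings that split can produce around leading or trailing separators), the property-escape form handles accented words correctly while still treating the hyphen as punctuation, matching the default behavior:

```javascript
// \p{Z}: Unicode separators (spaces); \p{P}: Unicode punctuation.
// The hyphen is dash punctuation (\p{Pd}), so it is still a split point.
const tokenize = (text) =>
  text.split(/[\p{Z}\p{P}]+/u).filter((token) => token.length > 0)

tokenize("The tokenizer is too naïve")
// => ["The", "tokenizer", "is", "too", "naïve"]

tokenize("coca-cola")
// => ["coca", "cola"]
```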

See this transpiler from ES6 Unicode-aware regular expressions to ES5.

lucaong (Owner) commented Jan 13, 2023

#198 translates the long regexp to an equivalent short Unicode one. I will research the implications for older browsers, and I might need to ship it as part of the next major release, but browser support looks essentially universal by now, assuming we no longer need to care about IE.

Thanks for bringing my attention to it :)
