Consider pros and cons of having project search ignore all non-alphanumeric characters #1172
Comments
Would it make sense to tokenize both the title and the search query, splitting the input on punctuation (quotes, hyphens, commas, etc.), and then rank results based on token matching? This is more in line with what search-oriented solutions do. For example, say a user searches for "James Copeland" and there is a project titled "Life and bloody career of the executed criminal, James Copeland, the great Southern land pirate., [2d ed.]". We turn that title into a set of tokens, then rank it against other projects by comparing its tokens to the tokenized input "James", "Copeland".
This project would rank highly since 2 tokens match exactly. We could get fancier, weighting tokens like "the" less than others or ranking partial matches, but I wonder how far we could get with just this. This solution would require a well-defined rule for where to split the project title into tokens. A simpler initial solution could be to just remove those token splitters (commas, apostrophes, etc.) and then do a string search on the result. Just throwing ideas out there.
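To make the token idea concrete, here is a rough sketch in Python. It is only an illustration, not how project search works today: the function names, the `LOW_WEIGHT` word list, and the weights 1.0 and 0.1 are all made up for the example.

```python
import re

# Hypothetical low-value words; both the list and the weights are assumptions.
LOW_WEIGHT = {"the", "and", "of", "a"}


def tokenize(text):
    """Lowercase the text and split it on runs of non-alphanumeric characters."""
    return [t for t in re.split(r"[\W_]+", text.lower()) if t]


def score(query, title):
    """Add up a weight for each distinct query token found in the title."""
    title_tokens = set(tokenize(title))
    total = 0.0
    for token in set(tokenize(query)):
        if token in title_tokens:
            total += 0.1 if token in LOW_WEIGHT else 1.0
    return total


def rank(query, titles):
    """Return the titles that match at least one token, best score first."""
    scored = [(score(query, title), title) for title in titles]
    scored.sort(key=lambda pair: -pair[0])
    return [title for s, title in scored if s > 0]
```

With this sketch, the query "James Copeland" scores 2.0 against the example title above, because both tokens match exactly. Partial matches are not handled.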
Any thoughts on which would be easier on the database?
I'd have to defer to someone with more database experience than me. I'm not even sure the full tokenization approach is possible in the database. I think string replacement could be done in SQL to "normalize" our project titles during the search.
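For comparison, the simpler "normalize, then string search" idea might look like the sketch below. It is written in Python only to show the logic; doing it in SQL would mean expressing the same steps with the database's own string functions. The names `normalize` and `title_matches` are hypothetical.

```python
import re


def normalize(text):
    """Lowercase, drop every character that is not a letter, digit, or space,
    and collapse runs of whitespace into single spaces."""
    kept = re.sub(r"[^\w\s]|_", "", text.lower())
    return " ".join(kept.split())


def title_matches(query, title):
    """Plain substring search on the normalized forms of query and title."""
    return normalize(query) in normalize(title)
```

One consequence of this choice: removing a hyphen joins the two halves ("well-known" becomes "wellknown"), so the query and the title have to be normalized the same way for matches to line up.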
We might be able to use the full-text search feature of MySQL. That's what the forums use for their search too. I don't know exactly how it handles punctuation.
I suspect that it's often frustrating for everyone using project search (and may actually feed into PMs doing duplicate searches missing obvious duplicates), but the straw that finally motivated this issue was a curly apostrophe in a project title that is automatically converted to a straight apostrophe in the PG posted notice. 😁
Other problems caused in project search by non-alphanumerics:
I'm sure there are others, but these are the ones that spring to mind. (This obviously won't fix all problems that cause non-matches in project search, but should help with the punctuation-related ones.)
Edit:
After a bit of discussion with other squirrels, I realize this is not one issue. There are two basic issues I've identified:
For the first case I'd recommend treating them all the same, whether straight or curly. For the second, I think ignoring them for search purposes might be best, but would like to hear discussion on both.
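As a rough illustration of "treating them all the same", curly quotes and apostrophes could be folded to their straight forms on both the title and the query before comparing. This is a Python sketch under that assumption; `QUOTE_FOLD` and `fold_quotes` are hypothetical names, and the table covers only the four most common curly marks.

```python
# Map curly quotation marks to their straight equivalents.
QUOTE_FOLD = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark (curly apostrophe)
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def fold_quotes(text):
    """Replace curly quotes in text with straight ones; other characters are unchanged."""
    return text.translate(QUOTE_FOLD)
```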