
Consider pros and cons of having project search ignore all non-alphanumeric characters #1172

Open
srjfoo opened this issue Apr 1, 2024 · 4 comments

Comments

@srjfoo
Member

srjfoo commented Apr 1, 2024

I suspect that it's often frustrating for everyone using project search (and may actually feed into PMs doing duplicate searches missing obvious duplicates), but the straw that finally motivated this issue was a curly apostrophe in a project title that is automatically converted to a straight apostrophe in the PG posted notice. 😁

Other problems caused in project search by non-alphanumerics:

  • presence or absence of punctuation between title and subtitle
  • use of em-dash vs. double-hyphen in project titles
  • spacey vs non-spacey punctuation between title and subtitle

I'm sure there are others, but these are the ones that spring to mind. (This obviously won't fix all problems that cause non-matches in project search, but should help with the punctuation-related ones.)

Edit:
After a bit of discussion with other squirrels, I realize this is not one issue. There are two basic issues I've identified:

  1. Curly apostrophes within words, and single or double curly quotes that are part of a title (possibly, rarely, in author fields).
  2. Punctuation between words, usually to separate parts of the title. Could be commas, semicolons, colons, em-dashes (either the character or the double-hyphen version).

For the first case I'd recommend treating them all the same, whether straight or curly. For the second, I think ignoring them for search purposes might be best, but would like to hear discussion on both.
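As a rough illustration of the two cases above, here is a minimal Python sketch. It is not the project's code: the function name `normalize_for_search`, the exact character sets, and the lowercasing step are all assumptions made for the example.

```python
import re

# Case 1: treat curly quotes and apostrophes the same as straight ones.
QUOTE_MAP = str.maketrans({
    "\u2018": "'",   # left single curly quote
    "\u2019": "'",   # right single curly quote / curly apostrophe
    "\u201c": '"',   # left double curly quote
    "\u201d": '"',   # right double curly quote
})

# Case 2: separators between parts of a title (comma, semicolon, colon,
# em-dash character, double-hyphen), with or without surrounding spaces.
SEPARATORS = re.compile(r"\s*(?:--|[,;:\u2014])\s*")


def normalize_for_search(text):
    """Return a comparison key; hypothetical helper, not an existing API."""
    text = text.translate(QUOTE_MAP)       # curly -> straight
    text = SEPARATORS.sub(" ", text)       # ignore title/subtitle punctuation
    return " ".join(text.split()).lower()  # collapse spaces; lowercasing is an assumption


# Three made-up spellings of the same title all map to one key.
variants = [
    "Tom\u2019s Journey\u2014A Tale",
    "Tom's Journey -- A Tale",
    "Tom's Journey: A Tale",
]
keys = {normalize_for_search(v) for v in variants}
```

With this sketch, spacey and non-spacey punctuation produce the same key because each separator, together with any spaces around it, is replaced by a single space.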

@chrismiceli
Collaborator

Would it make sense to do a sort of tokenization of the title and search query that splits input on punctuation (quotes, hyphens, commas, etc.), then rank results based on token matching? This is more in line with what search-oriented solutions do. An example would be:

Suppose the user enters the search query "James Copeland".

Let's say there is a project with the title "Life and bloody career of the executed criminal, James Copeland, the great Southern land pirate., [2d ed.]". We turn that into a set of tokens:

  1. Life
  2. and
  3. bloody
  4. career
  5. of
  6. the
  7. executed
  8. criminal
  9. James
  10. Copeland
  11. the
  12. great
  13. Southern
  14. land
  15. pirate
  16. 2d
  17. ed

Then rank it against other projects by comparing to the tokenized input "James", "Copeland".

This project would rank highly since 2 tokens match exactly. We could get fancier by weighting common tokens like "the" less than other tokens, or by ranking partial matches, but I wonder how far we could get with just this. This solution would require well-defined rules for where to split the project title into tokens.
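The tokenize-and-rank idea could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the splitting rule (any run of non-alphanumeric characters), the stop-word list, the 0.1 weight, and the function names are all made up for the example, and the second title is invented.

```python
import re

# Assumption: a token is any run of letters or digits; everything else splits.
TOKEN_RE = re.compile(r"[^\W_]+")

# Assumption: a tiny stop-word list whose matches count for less.
STOPWORDS = {"the", "and", "of", "a"}
STOPWORD_WEIGHT = 0.1


def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return [t.lower() for t in TOKEN_RE.findall(text)]


def score(query, title):
    """Sum the weights of distinct query tokens that appear in the title."""
    title_tokens = set(tokenize(title))
    total = 0.0
    for token in set(tokenize(query)):
        if token in title_tokens:
            total += STOPWORD_WEIGHT if token in STOPWORDS else 1.0
    return total


def rank(query, titles):
    """Return titles ordered from best to worst match."""
    return sorted(titles, key=lambda t: score(query, t), reverse=True)


copeland = ("Life and bloody career of the executed criminal, James Copeland, "
            "the great Southern land pirate., [2d ed.]")
other = "The great Southern road"  # invented title for comparison
best = rank("James Copeland", [other, copeland])[0]
```

Here the Copeland title scores 2.0 for the query "James Copeland" and sorts first. Partial matching is not attempted in this sketch.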

A simpler initial solution could be to just remove those token splitters (commas, apostrophes, etc.) and then do a string search on the result. Just throwing ideas out there.
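A minimal Python sketch of that simpler approach, for comparison. The helper names `simplify` and `matches` are hypothetical, and one detail deviates slightly from "just remove": punctuation is replaced by a space rather than deleted, so that a title and subtitle joined by an unspaced dash do not fuse into one word.

```python
import re

# Anything that is not a letter, digit, or whitespace counts as a splitter.
SPLITTERS = re.compile(r"[^\w\s]|_")


def simplify(text):
    """Replace splitters with spaces, collapse whitespace, lowercase."""
    spaced = SPLITTERS.sub(" ", text)
    return " ".join(spaced.split()).lower()


def matches(query, title):
    """Plain substring search on the simplified strings."""
    return simplify(query) in simplify(title)


title = ("Life and bloody career of the executed criminal, James Copeland, "
         "the great Southern land pirate., [2d ed.]")
found = matches("criminal James Copeland the great", title)
```

Unlike token ranking, this stays order-sensitive: "Copeland James" would not match the title above.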

@srjfoo
Member Author

srjfoo commented Apr 2, 2024

Any thoughts on which would be easier on the database?

@chrismiceli
Collaborator

I'd have to defer to someone with more database experience than me. I'm not even sure the full tokenization approach is possible in the database. I think string replacement could be done in SQL to "normalize" our project titles during the search.
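To make the SQL string-replacement idea concrete, here is a small sketch using Python's built-in `sqlite3` module as a stand-in database. Everything specific here is an assumption: the `projects` table and `title` column are hypothetical names, SQLite is not the database the site uses, and only four characters are handled. The `REPLACE()` string function itself exists in both SQLite and MySQL.

```python
import sqlite3

# (old, new) pairs applied in order; an illustrative subset only.
REPLACEMENTS = (("\u2019", "'"), (",", ""), (";", ""), (":", ""))


def normalized(expr):
    """Wrap a SQL expression in nested REPLACE() calls."""
    for old, new in REPLACEMENTS:
        old_sql = old.replace("'", "''")  # escape quotes for a SQL literal
        new_sql = new.replace("'", "''")
        expr = f"REPLACE({expr}, '{old_sql}', '{new_sql}')"
    return expr


def search(conn, query):
    """Substring search comparing normalized title to normalized query."""
    sql = (
        "SELECT title FROM projects "
        f"WHERE {normalized('title')} LIKE '%' || {normalized('?')} || '%'"
    )
    return [row[0] for row in conn.execute(sql, (query,))]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE projects (title TEXT)")  # hypothetical schema
conn.executemany(
    "INSERT INTO projects VALUES (?)",
    [("The executed criminal, James Copeland",), ("Tom\u2019s Journey",)],
)
hits = search(conn, "criminal James")
```

A likely cost of this design is that the normalization runs against every row at query time, so an ordinary index on the column probably cannot help. Storing a pre-normalized copy of the title would be one possible alternative, but that is speculation, not something proposed in this thread. The sketch also does not escape `%` or `_` in the user's query.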

@cpeel
Member

cpeel commented Apr 2, 2024

We might be able to use the full-text search feature of MySQL. That's what the forums use for their search too. I don't know exactly how it handles punctuation.

3 participants