-
Notifications
You must be signed in to change notification settings - Fork 173
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add the initial version of the Polish stemmer #159
base: master
Are you sure you want to change the base?
Conversation
@@ -28,6 +28,7 @@ italian UTF_8,ISO_8859_1 italian,it,ita | |||
lithuanian UTF_8 lithuanian,lt,lit | |||
nepali UTF_8 nepali,ne,nep | |||
norwegian UTF_8,ISO_8859_1 norwegian,no,nor | |||
polish UTF_8 polish,pl,pol |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this could be UTF_8,ISO_8859_2
(if we're going to have ISO-8859-2 support in libstemmer, we ought to have it available for languages for which it covers the alphabet).
The tests need to pass for all programming languages, but currently this fails the tests for C (try (The Pascal backend currently only supports iso-8859-1 and I wasn't able to test Go as there's something up with my local Go setup.) The pattern here is that it's failing for languages that use UTF-8 and working for those that use wide characters. I'll comment on a line of code where I think the problem is. The CI should have shown this, but it hasn't run for this PR. I'm not sure why not as it ran for a push I just made to master, and "Build pushed pull requests" is on in the travis-ci settings. I'll try to get that fixed, but meanwhile please try to run at least the C tests locally (they shouldn't need anything beyond what you must have installed to have built the snowball compiler). |
|
||
define remove_nouns as ( | ||
($(len > 7) | ||
test ($pos = (len - 5) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The 5 here assumes that each character counts as 1
which isn't true when we're working in UTF-8 - it's assumptions such as this which are causing the tests for fail when we're working in UTF-8.
I think this is going to need to significant restructuring to fix satisfactorily.
($pos = (len -2) | ||
hop pos | ||
([tolimit] delete)) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would benefit from using snowball's among
feature. This can be used to remove suffixes from the string without having to do all this tedious calculating of offsets, and that would fix a lot of the current problems when working in UTF-8.
No description provided.