New hyphenation patterns #383
You really don't want to make friends with git? Because this branch being your
I'd say it's not necessary as it's no longer used. But I used it as the reference when I updated frontend's readertypography.lua for alternative lang tags (that you can't put anywhere else, except in frontend and in this languages.json). Do you plan on adding hyphenation for all the languages of the world? :) How many left? |
That worked :) |
Could you find an epub for each of these languages? :) 😅😱😂 |
@poire-z I'll have some more time to sit down and go through your git cheatsheet in the next few weeks. Once again, sorry, my mental acuity hasn't been up to learning these past few weeks. I think the only ones I'd like to fix are the ones in #373 and maybe a select few others I might have a small interest in adding from the usual 2 sources - Afrikaans and Belarusian that I can think of. Honestly the size changes were bothering me a bit too in the long run, and I was curious whether these could somehow be handled in a more discreet manner - having a base pack so to speak, and just offering smaller languages as a bonus download. This is part of the reason why I haven't considered adding the hyphenation patterns available for languages spoken in India, since if I remember correctly some of those were rather large. Example: Friulian pattern isn't available by default in koreader, but if a document tagged for it is opened - an update/download is offered. Another avenue for the same would be switching to a hyphenation that isn't prepackaged. If I'd have to point out files that I never understood why we had - the Russian+English patterns seem pointless to me, though I'm sure there's some use for them, but I believe they were rather large, which seemed extravagant to me when I noticed them. |
Are these different from Dutch and Russian/Ukrainian? :) Or can't they use the NL and RU hyph dicts? (Thanks for the cultural insight to come :) Anyway, I can be fine with the 5 from this PR (very small file sizes, sure they are enough/worth it?) and the ones you'd like added, even with the Indian ones (large books and user base I guess :) even if we never had much feedback from India... but you'd have to test how crengine hyphenation works with Indian letters, and how well harfbuzz draws now-terminating glyphs at the hyphenation cut (if that's a thing) - it's possible there's something limited to Latin/Cyrillic in crengine in what it considers words... would have to be tested). My question was whether you might add 30 or 40 more of them :) and then, it might be a problem. Well, translator.lua lists 105 languages, so we have our languages count :) Dunno about the additional file size they would take, but we already have a menu with 10 pages there - so it could still be ok in the Typography language menu.
Well, currently, frontend is not aware of what languages cre is meeting (we could meet languages in lang= tags, not only in the book metadata known to frontend). Adding new/unknown language callbacks + the ability to auto-download stuff when needed is quite a lot of work, again for these little languages/user bases (days of work, for something that might be used by one person once a year :) We could have a zip of additional hyph dicts to download and unzip in the koreader dir, but I don't know how crengine deals with absent hyph filenames (I'm not looking at the code until it's needed :)
I guess it's from the crengine Russian and FB2 heritage: FB2 probably doesn't have any |
@cramoisi You're welcome, sorry I couldn't find any free options for the Amazon ones, but they might be available somewhere else with some more in-depth searching from your side. @poire-z Afrikaans can be considered a daughter language of Dutch that started off from the Holland dialect a few hundred years ago, so barring a miracle it's probably diverged too much for hyphenation patterns to be that close to one another. @Frenzie could take a look and give his opinion as a Dutch speaker, I'm not fluent in either sadly. I'd say it's worth adding, since it covers a large amount of people and countries at the same time, but whether they have a ton of epubs is something that would have to be looked into. As far as Belarusian, it falls within the same language group, that of East Slavic languages, but the language, while sharing a lot of characteristics with Ukrainian and Russian which fall into the same group, has drifted well enough to make hyphenation patterns unique I'd say. I'm looking at this from the outside, mind you, since I speak Bulgarian which isn't in that Slavic language group. I do know that Belarusian and Ukrainian were both influenced by Polish, unlike Russian, so those two might be closer to one another. Personally I'm guesstimating that it's something akin to how I feel when I read or hear Macedonian - a language that is by far the closest to Bulgarian you can get, and I still don't understand a lot of their grammatical constructs, hyphenation rules and so on, even though I can get the gist of what I read or hear. Here is an example of an eBook seller for their country. For Friulian, Piedmontese and Romansh - I did check the hyphenation patterns and they do share some commonalities, but of the three I'd say Romansh is a must-have, since it covers territories in both Italy and Switzerland. I'm not sure whether they can be covered by the same hyphenation pattern, though at a glance they share enough. 
If we had a system where we could tell koreader to use file X as the base hyphenation rules and then other smaller files for small differences, that would be a way to save space. I'm personally not asking for these - I just saw that there were readily available tex files that were small enough to not be a gigantic burden. If it would take days to do that, it's not worth it in my opinion either. If cre can handle an extra download being unpacked, that seems like a far easier way. Ah, that makes sense. I rarely if ever read something that handles 2 languages in the same document, hence my lack of need for such a combination; I thought there was some other reason for those files. And yeah, I don't plan on adding that many more files - a few others maybe, but there aren't that many sources to get tex or dic files that are clean enough to use the script I have handy on, so that would definitely limit me even if I wanted to, hahah. Ideally I'd like to have hyphenation patterns for most of Europe and a few major languages outside of it, but I think Afrikaans and Belarusian would just about cover what I'd planned on adding for now. All in all I agree that size is a constraint and that if there's a way to lessen the load on koreader's size by offering some of these languages as a bonus pack, I'd be all for that. |
Thanks to CoolReader creators for that, it's useful.
Extremely yes, as well as Belarusian.
Exactly.
Mostly yes. |
The basic rules behind Afrikaans hyphenation are likely similar (as they are for German) but the spelling is different. After all, most of these patterns aren't about what a human would do but about how to get a machine to approximate what we'd do intuitively. ;-)
Take a look at what, the patterns? At least link them then, although I probably won't give them more than a cursory glance regardless. :-P If you mean Afrikaans itself, I can understand that well enough (slowly and clearly) spoken and of course in writing. |
Maybe an idea for the future, to remove the 2 Russian+English*: Also, dunno why there are size differences between GB and US, and if there's really a need to differentiate them:
@hius07 : which one of Russian_EnGB or Russian_EnUS do you use? Any reason for Hungarian to be 10x the size of most others ? Bug/crap, or really needed by this complex language?
|
Personally this one. |
@poire-z Keep in mind that some hyphenation patterns are probably more optimized than others. As far as Hungarian, the only thing that comes to mind is that it has an absurd amount of grammatical cases - I think Basque and Finnish are the same on that part. Another part of it comes down to language complexity - does it use genders, does it have grammatical cases, does it depend on conjugation to express tenses, etc. There's a ton of other things I'm definitely missing, since I'm not that into linguistics, but you get the drift of it. Bulgarian, for instance, relies on verb conjugations to express tenses. On top of that it has a few remnants of grammatical cases here and there, loanwords from different languages which use different rules, etc. All of this theoretically could lead to more complex hyphenation rules and a bigger pattern file, as long as a lot of work was put into it. I'm pretty sure a lot of our patterns could be a lot bigger had there been more people volunteering to make tex patterns, which just isn't the case for languages linked to smaller native speaker populations. |
I've been using Algorithmic hyphenation also, it is good for russian texts (don't know about other languages). |
Waiting on a quick review whenever you have the time @cramoisi - especially on Friulian, Piedmontese and Romansh - check the source files and note that all 3 files start with 10-20 lines that I removed. I'm not sure if I should just add those back with a dot instead of the |
@poire-z If you can figure out how we're supposed to hook up Brazilian Portuguese, I've got a file pattern for it fairly ready :) |
I guess this should be enough (uppercase is fine here; order is important: longer first so they match before shorter):
--- a/crengine/src/textlang.cpp
+++ b/crengine/src/textlang.cpp
@@ -62,2 +62,3 @@ static struct {
     { "pl", "Polish", "Polish.pattern", 2, 2 },
+    { "pt-BR", "Portuguese_BR", "Portuguese_BR.pattern", 2, 3 },
     { "pt", "Portuguese", "Portuguese.pattern", 2, 3 },
We have this in the quotes specs (should be lowercased):
    { "pt-pt", L"\x00ab", L"\x00bb", L"\x201c", L"\x201d" },
    { "pt", L"\x201c", L"\x201d", L"\x2018", L"\x2019" },
Dunno why pt-pt, and what "pt" standalone means - but it should be the catch-all pt*.
In frontend, we have:
https://en.wikipedia.org/wiki/Language_localisation#Language_tags_and_codes
I'll let you do more Wikipedia reading if needed :) |
@poire-z Should be good now. Quotation marks were off, as can be read here. Basically: Since only Brazil follows a different quotation mark system, I changed pt to match pt-pt, so that the other 9 or so countries that have Portuguese as an official language are okay, since they follow the rules/habits of European Portuguese, but don't have a hyphenation pattern themselves. |
So, they will match "pt" and should use the european pt hyph dict. |
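To make the "longer first" ordering concrete, here is a minimal sketch of a first-match prefix lookup. It is only an illustration: `LangEntry`, `findPattern` and `kSampleTable` are hypothetical names, and the naive prefix comparison is an assumption, not a copy of what crengine's textlang.cpp actually does.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical table row, loosely modelled on the entries in the diff above.
struct LangEntry {
    std::string tag;      // language tag prefix, e.g. "pt-BR"
    std::string pattern;  // hyphenation pattern file name
};

// Lowercase an ASCII language tag so "pt-BR" and "pt-br" compare equal.
static std::string lowerTag(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Return the pattern file of the FIRST entry whose tag is a prefix of `lang`.
// The scan stops at the first hit, so "pt-BR" has to be listed before "pt".
std::string findPattern(const std::vector<LangEntry>& table, const std::string& lang) {
    const std::string wanted = lowerTag(lang);
    for (const LangEntry& e : table) {
        const std::string prefix = lowerTag(e.tag);
        if (wanted.compare(0, prefix.size(), prefix) == 0)
            return e.pattern;
    }
    return "";  // no match: the caller would fall back to some default
}

// Sample table in the order suggested above (assumed data, for illustration only).
const std::vector<LangEntry> kSampleTable = {
    {"pl", "Polish.pattern"},
    {"pt-BR", "Portuguese_BR.pattern"},
    {"pt", "Portuguese.pattern"},
};
```

With this ordering, a book tagged `pt-BR` gets the Brazilian file, while `pt`, `pt-PT` or `pt-AO` all fall through to the catch-all `pt` row. Swapping the two Portuguese rows would make the `pt-BR` row unreachable.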
Still a CI warning:
|
@poire-z Uh, that looks normal to me in Notepad++? The original .dic file I ran through my script had "ISO8859-1" noted down as the preferred encoding, but none of that should matter? As far as the dialects: Not sure if we should do something to make sure anything that isn't pt-br is handled by Portuguese.pattern |
Welp, that should fix it. As long as nothing pops up for the other files and the |
Not adding Belarus ? |
@poire-z I might handle that this weekend or in a week or so. Afrikaans has a lot of lines with |
I'll look into it this weekend - no time before that :/ But I can say that the rules in Zulu are made for a right hyphenmin at 1. |
Need a tiny bit of advice for Belarusian - @cramoisi @poire-z Basically I removed the comments and removed all of the lines with a |
It said we can hyphenate after a hyphen. Which is the default behaviour of crengine: a break is normally allowed after a real hyphen, and both the part before and the part after are considered independent words and given (without any info about the other part) to the hyphenation algo if hyphenation is needed on one of these parts. |
@poire-z is right : Anyway, I have the feeling I've already made this comment 2 or 3 times ;-) nothing new in what I've just written. (I've looked into your Belarusian hyph-be-test.zip and it seems OK to me :) ) No need to worry about lines like these, as you just have to set left/right hyphenmin at 2 to deal with them. If the template is not designed for 2,2, then you are out of luck ;)
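As a rough picture of the behaviour described above, the sketch below splits a word at real hyphens so that each part could be handed to the hyphenation algorithm on its own. `splitAtHyphens` is a hypothetical helper written for this comment, not a crengine function, and it only looks at the ASCII hyphen-minus.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Split a word at real hyphens ('-'). The hyphen stays attached to the part
// before it, since a line break is allowed right after it.
// Each returned part would then be hyphenated independently of the others.
std::vector<std::string> splitAtHyphens(const std::string& word) {
    std::vector<std::string> parts;
    std::string current;
    for (char c : word) {
        current.push_back(c);
        if (c == '-') {
            parts.push_back(current);
            current.clear();
        }
    }
    if (!current.empty())
        parts.push_back(current);
    return parts;
}
```

This is why pattern lines that only say "a break is allowed after a hyphen" add nothing: the split already happens before the patterns are consulted.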
|
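For anyone wondering what the left/right hyphenmin values do in practice, here is a small sketch of the filtering step, assuming the pattern matching has already produced candidate break positions. `applyHyphenMin` is a hypothetical name, and positions are counted in letters (not bytes), with position p meaning "break between letter p-1 and letter p".

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Keep only the candidate breaks that leave at least `left_min` letters
// before the break and at least `right_min` letters after it.
std::vector<std::size_t> applyHyphenMin(const std::vector<std::size_t>& breaks,
                                        std::size_t word_len,
                                        std::size_t left_min,
                                        std::size_t right_min) {
    std::vector<std::size_t> kept;
    for (std::size_t p : breaks) {
        if (p >= left_min && p <= word_len && word_len - p >= right_min)
            kept.push_back(p);
    }
    return kept;
}
```

For a 7-letter word with candidate breaks at 1, 2, 5 and 6, a 2,2 setting keeps only 2 and 5. A right hyphenmin of 1 (as mentioned for the Zulu rules) would also keep 6.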
Took the liberty to rebase and make them nice commits. If it is ready as is, just tell me and I'll merge and bump it in a coming up bump. |
Count this one to my bad memory in general, hah :D Feel free to merge, yeah, if anything pops up it can be fixed on its own or with the Afrikaans + Belarusian PR. |
Regarding my comment #383 (comment) :
Maybe we wouldn't need any option: we could mark some languages/hyphenation dicts as being "english/latin-orthogonal" when their hyph dict contains only non-latin words (no Quickly looking at the *.pattern, these languages could be candidates for that:
@hius07 answered "Russian_EnUS". @virxkane @pkb : which one do you use? Dunno how much all these Russian* dicts are up to date - but if Russian.pattern and English_US.pattern are fine, we could get rid of the others and avoid having non-standard ru-GB and ru-US lang tags that would never be found in books. @roshavagarga : would that work/be welcomed with Bulgarian? Or can some users find it preferable to have latin words non-hyphenated when they happen in text in these languages - or better non-hyphenated than wrongly hyphenated as english when we don't know their language? Thoughts ? |
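One possible way to detect such "latin-orthogonal" dicts automatically is sketched below. It assumes the input is just the concatenated pattern strings in UTF-8 (not the whole .pattern file with its markup), and `isLatinOrthogonal` is a hypothetical helper, not something that exists in crengine. It only looks for ASCII a-z/A-Z, so accented Latin letters, which are multi-byte in UTF-8, would go unnoticed.

```cpp
#include <cassert>
#include <string>

// True if the pattern text contains no ASCII Latin letter (a-z, A-Z).
// Digits, dots and bytes >= 0x80 (UTF-8 sequences such as Cyrillic or Greek)
// are ignored, so a purely Cyrillic pattern set counts as "orthogonal".
bool isLatinOrthogonal(const std::string& patterns_utf8) {
    for (unsigned char c : patterns_utf8) {
        if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'))
            return false;
    }
    return true;
}
```

A dict that passes this check could in principle be combined with a Latin-script dict without the two sets of patterns ever competing for the same word.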
@poire-z Typically that should work for some of these. As far as Bulgarian, the only uses of the Latin alphabet I can think of are medical (Latin), mathematics (maybe?) and brands (some people use the Latin alphabet original, others transliterate, no norm I think?). Another option, which I highly doubt but is possible, is if somebody decided to use shlyokavitsa, which is the informal way people transliterate Bulgarian when messaging each other online - typically because they're lazy, are used to doing so, can't install or use the original national keyboard standard (or don't know that there's a qwerty-friendly equivalent) or for numerous other reasons. I can see an author maybe using that in a young adult novel, though I'll admit I've never seen it done, and that may be due to me not being into that genre. Either way, the English pattern should be good, since I doubt that if somebody does use shlyokavitsa it'll be a common thing. Maybe use the GB one, since that's the type taught in schools here, rather than the US norm? Personally, I'd be more annoyed at things not being hyphenated, but that might just be my preference. You might want to have a look at Macedonian, since their alphabet has Serbian is a special case, and while we only have the one hyphenation pattern and had a long talk when it was being added, I will just remind you that Serbian has a Cyrillic and a Latin alphabet, both of which are legally equivalent, and it's the only country in Europe that does that, so technically we should have separate hyphenation patterns for both of those, though I'll admit I have no idea how ebooks work in Serbia and whether only one or the other alphabet is used, or if books get published in both and how common each one is. I'd love to hear more about it from @strn. |
For my personal use, it doesn't really matter, but for people that read translations of old Greek texts it might be handy to have the right hyphenation for both languages. My main problem with hyphenation is that it is not so good (for both Greek and English), and I was wondering if I could find a way to use a file from somewhere else (e.g. an office suite). |
@roshavagarga asked, so @strn responded:
The statement is not correct. Serbian Cyrillic alphabet has absolute pre-eminence and pre-precedence over any other writing system in Serbian language. It is ensured by Serbian constitution, article 10. Hence "Serbian language having Latin alphabet" cannot be true or "legal equivalent" in any way. What is thought to be "Serbian Latin" alphabet, is in fact Croatian Latin alphabet. It was used for writing Serbian language during time of Yugoslav state (1918-1992) and is a political construct since at that time Serbo-Croatian language existed (ISO code: sh). Since Yugoslavia dissolved, a need for Croatian Latin alphabet in Serbian language ceased to exist. Serbian language written by Croatian Latin alphabet is equivalent to Bulgarian language written in shlyokavitsa . A similar mess exists (existed?) for Moldavian language - it can be written in Moldavian Cyrillic or Romanian Latin alphabet. This is just for those wanting to draw comparisons.
Croatian Latin hyphenation patterns will work perfectly for those ebooks written in so-called "Serbian Latin" alphabet.
Ebooks work in Serbia as follows: if ebook is in Serbian language, it can be written (printed?) in either Serbian Cyrillic or Croatian Latin alphabets. Mixing alphabets is strictly forbidden by grammar rules. Only exception of Latin script appearing among Serbian Cyrillic text would be writing foreign (non-Serbian) personal names for the first time when they appear in text, writing chemical formulae, measurement units etc.
Official ebook publishing in Serbia is almost non-existent. What exists of ebooks in Serbian language are 99% pirated editions - scans of printed books used for creation of EPUBs. Paper editions are either in Serbian Cyrillic or Croatian Latin alphabet. |
@strn Did some checking, turns out I was mistaken on the legal aspect and it's just a linguistic quirk (digraphia if you're curious). Wasn't meant to offend you in any way, sorry if that was the case. Bulgarian does have an official transliteration scheme or whatever you'd like to call it, so shlyokavitsa is more along the lines of internet-speak or jargon, but I get the point you're making.
Could you give examples of some (not copyrighted, of course)? My thinking is what codes are used in ePubs that use Gaj's Latin Alphabet and what are used for those in Serbian Cyrillic - especially if somebody might have made a pirated version of, let's say, a book from the Yugoslav era maybe? I'd use classic books or popular modern novels as the baseline myself :) There's a tiny segment of paid epubs here, and a centralized free library (chitanka), which exists because it's legal to recreate written books in any form as long as there's no profit and they're already available in a library for instance. @noembryo Feel free to look at the last few comments in #373 and chime in, I'll try and offer a replacement pattern file this week, been a bit busy. |
No offence taken, do not worry. I just know that situation around Serbian language is complex, even more because it was in unnatural union with Croatian language. It is not easy for foreigners to understand ;-)
Since EPUBs in Serbian language are mostly produced by amateurs (myself included), EPUB tags Hence, hyphenation rules for Serbian language I contributed to koreader will work only if an ebook is in Serbian Cyrillic and has correct language code sr. That cannot be ensured or enforced across Serbian EPUB space. I have yet to find ebook written in so-called "Serbian Latin" alphabet that uses correct language code |
Sometimes yes.
We call it "Support for multilingual documents".
It seems to me that the option is needed. Firstly, there are not always English words in the text; secondly, as has already been written, not everyone needs the hyphenation of English words in Cyrillic text; and thirdly, the option should probably not be a boolean: we should choose "English US" or "English GB".
There is no definite answer here. In one non-English-language book, there may be inserts in "English US", in another - in "English GB". Probably, it would be correct to choose a specific option in this new hypothetical option. |
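Well, I just don't want to have to add any more UI hyphenation option stuff :) I'd just like to have by default what would make the most sense. The current situation is we can have RU | RU+EnUS | RU+EnGB - which I guess is fine for FB2 books, which are/were CoolReader's original target and are mostly Russian text. But we can't have RU+FR, BG+EnUS. I'm not saying we should have them :) I'm just wondering if having only RU+EnUS | BG+EnUS wouldn't be a better alternative.
In that case (except for possible performance/memory usage), having English patterns in the hyphenation data used would not hurt and would just not have any effect.
That's the "not everyone" I'd like to estimate :) I guess one reason to not want them is if these English words are mostly person names, which usually should not be hyphenated. But on the other hand, if one chooses Russian hyphenation, it's also to have less whitespace in justified lines - and they would be best served by also having any English/Latin words hyphenated, even if a little badly (i.e. French words in Russian text hyphenated as English). Dunno.
Dunno much about the difference between EnglishUS and EnglishGB hyphenation - but I think these are the same language :) The difference in hyphenation might be only stylistic ones that only native US/GB snobs might care about :) and hoping the set of such snobs and the set of Russian FB2 book readers do not intersect :) |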
@poire-z: I am this kind of snob. Can't read UK books with US enabled. I just can't.
|
What do you mean with "uk books" ? Books published in the UK, or author is english? (And what when you don't know the book origin ? Can you guess its origin depending on how you can or can't read it with English_US ? :) |
UK English has specific words that don't exist or are somewhat different in US English. I was speaking of UK authors. The fact is I can't say for US English, because every time I read in English it's books by UK authors (not by choice, it just happens to be like that). And each time I read them I notice something is a little off with the hyphenation, so I check it and switch to UK, and everything is right after that. Making a diff of the two files and analyzing the possibility of merging them should be interesting, but it will only be possible if there aren't two rules for the same set of words.
|
Incidentally, can you remember any words that mess up? I admit I haven't checked but I'd expect spelling differences like leveller vs leveler to more or less automatically result in different hyphenation. |
OK, why did you ask then?
Why not? Russian + French: https://en.wikipedia.org/wiki/War_and_Peace
OK.
They are different people, one needs one thing, the other needs another.
OK, let's combine the English US and English UK hyphenation dictionaries. Just joking, of course. |
@Frenzie: Nope, but it will be things like that. I imagine we can find a paper which already does that for us. The Wikipedia entry on this matter is super long, and that's enough to argue that, because so many words are different, it will have an impact on the hyphenation result.
|
@cramoisi Remember, I have a degree in Dutch & English. I'm well aware. ;-) But hyphenation isn't really different: my point is that it follows more or less automatically from the spelling. In the example above: lev·el·er and lev·el·ler. There's no difference there in how to correctly hyphenate a word. If it were spelled leveller in American English, it would be hyphenated as lev·el·ler too. |
@Frenzie: I remembered ;) It's just a subjective impression that the US hyphenation file seemed more aggressive to me last time I read LOTR. But perhaps it's only me being paranoid 😅
|
It could be done by letting the user select a second (or even third) checkbox with a long press (popup maybe?) Edit: OK, I know there is the addition of the popup, but.. :o) |
We can't currently.
This UI stuff would be the most fun stuff - but there's the whole interface/passing these settings from frontend to crengine - and the internal handling of all that by crengine itself - which I really don't want to get into :) |
That's why I'm so eager to think up suggestions about these..
... (he whistles looking at the ceiling) .. ;o) |
@poire-z It's a bad idea to add "English US" (or GB) hyphenation dictionary by default for Russian books - what happens to the hyphenation if the Russian book contains fragments in French? See "War and Peace". We cannot decide for the user which second hyphenation language to use! Therefore, I am in favor of the hypothetical option "additional hyphenation dictionary".
But we can discuss hypothetical variants? :) |
Sure, we can discuss that, as long as I don't have to implement it :) Also, it's not only Russian+FR that would be needed to be supported. It's the multiple combinations of "orthogonal alphabets" - and preventing combinations of same alphabets hyph dicts. And have that working (and clear to the user in the settings) with lang tags... I think the only proper solution to have all that done correctly is for publishers to properly set lang= attributes in the HTML - and KOReader/CoolReader would support them perfectly. |
@poire-z Ok, let's leave it as it is :)
I agree.
As far as I know, it doesn't support it, that's the problem.
Most likely it is. So what? Can I think a little about the future? :) |
Proposed changes:
@cramoisi @poire-z Friulian, Piedmontese and Romansh had 10-20 lines at the beginning, all containing ' or '' in some manner, which might be added back if they're not copies. I also have a fairly complete pattern for Brazilian Portuguese, but I'm not sure how it'll be handled in textlang and the json.
Also, should I add hyphenmins and aliases to all entries in languages.json?