TextLangMan for text typography by language, use libunibreak #337

Merged: 8 commits merged into koreader:master on Apr 18, 2020

Contributor

@poire-z poire-z commented Apr 17, 2020

Use libunibreak for line breaking
Adds TextLangMan for text typography by language
Implement a bit more of the stuff discussed in #307.
What these commits will allow is detailed at #307 (comment)

Parse and store values from lang= attributes, so we can
propagate a TextLangCfg object to all calls dealing with
text, which will allow us to:
- Use specific libunibreak rules for line breaking per lang
  (e.g. reversed quotation marks in German vs French).
- Use the right hyphenation dictionary for each language
- Add more specific line breaking tweaks for some languages
  (some single letter prepositions should not be at end of
  line in Polish and Czech, real hyphens should be duplicated
  at start of next line in Portuguese and Polish...)
- Give the language tag to Harfbuzz so it can pick the
  right glyphs for the language (e.g. different glyphs
  for the same codepoint in zh-CN, zh-TW and ja, and for
  Bulgarian Cyrillic with some fonts).

Update existing global HyphMan to use services from
TextLangMan to ensure legacy single global hyphenation.
TextLangMan still uses the hyphenation methods defined
in hyphman.cpp.

So, this:
image
will render in "best" mode (full harfbuzz) as:
image

I'll bump this up to frontend first without any change to base and frontend, as it should work as-currently with our ReaderHyphenation module (just to have a nightly with this for reference).
And the next day, I'll do the ReaderHyphenation > ReaderTypography swap, that we can discuss in its PR.

One thing to note is that we might now load, and keep loaded, multiple hyphenation dictionaries (each using at most 1 MB of RAM). The TextLangCfg objects are also kept globally and will stick around even when switching documents (but they are cheap).
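
To illustrate the "kept globally" part, here is a minimal sketch of a process-wide cache keyed by lang tag. It is not the TextLangMan code from this PR: `LangCfgSketch` and `getLangCfgSketch` are hypothetical names, and the struct only stands in for whatever a real per-language config would hold.

```cpp
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-in for a per-language typography configuration.
struct LangCfgSketch {
    std::string lang_tag;   // e.g. "fr", "de", "pl"
    bool hyph_dict_loaded;  // a real config would reference a hyphenation dictionary
};

// Returns the shared config for a lang tag, creating it on first use.
// The cache has static storage duration, so entries outlive any single document.
inline LangCfgSketch* getLangCfgSketch(const std::string& lang_tag) {
    static std::map<std::string, std::unique_ptr<LangCfgSketch>> cache;
    std::unique_ptr<LangCfgSketch>& slot = cache[lang_tag];
    if (!slot) {
        slot.reset(new LangCfgSketch{lang_tag, false});
    }
    return slot.get();
}
```

Two lookups with the same tag return the same object, which is what avoids loading a hyphenation dictionary twice when several text nodes (or documents) share a language.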

Also note for CoolReader devs: CR on Android might use HyphMan::activateDictionaryFromStream(), which I tried to adapt and make right - but I couldn't test it.

Also includes:

Add support for <img src="data:image/png;base64,...">
will allow closing koreader/koreader#5529
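
For readers unfamiliar with such image sources, here is a rough sketch of what handling a base64 data URI involves. It is not the LVBase64Stream code from this PR: `decodeBase64Sketch` and `decodeDataUriSketch` are hypothetical helpers, and the sketch decodes the whole payload into memory rather than streaming it.

```cpp
#include <string>

// Decodes standard base64 (RFC 4648 alphabet). Stops at '=' padding and
// skips any character outside the alphabet (e.g. whitespace).
inline std::string decodeBase64Sketch(const std::string& in) {
    std::string out;
    unsigned int buffer = 0; // accumulates 6-bit groups
    int bits = 0;            // number of not yet emitted bits in buffer
    for (char ch : in) {
        int v;
        if (ch >= 'A' && ch <= 'Z') v = ch - 'A';
        else if (ch >= 'a' && ch <= 'z') v = ch - 'a' + 26;
        else if (ch >= '0' && ch <= '9') v = ch - '0' + 52;
        else if (ch == '+') v = 62;
        else if (ch == '/') v = 63;
        else if (ch == '=') break;
        else continue;
        buffer = (buffer << 6) | static_cast<unsigned int>(v);
        bits += 6;
        if (bits >= 8) {
            bits -= 8;
            out.push_back(static_cast<char>((buffer >> bits) & 0xFF));
        }
    }
    return out;
}

// Extracts and decodes the payload of a "data:<mime>;base64,<payload>" URI.
// Returns false if the string is not a base64 data URI.
inline bool decodeDataUriSketch(const std::string& uri, std::string& mime, std::string& data) {
    const std::string scheme = "data:";
    const std::string marker = ";base64,";
    if (uri.compare(0, scheme.size(), scheme) != 0) return false;
    std::string::size_type pos = uri.find(marker);
    if (pos == std::string::npos) return false;
    mime = uri.substr(scheme.size(), pos - scheme.size());
    data = decodeBase64Sketch(uri.substr(pos + marker.size()));
    return true;
}
```

With an image source, `mime` would come out as something like `image/png` and `data` would hold the raw image bytes to hand to the image decoder.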

Text: fix standalone BR not making an empty line
Fix BR with "display: block" not making an empty line
Fix issues noticed at #172 (comment)

XML parsing: add more HTML5 named entities, optimize search
because why not? (Note that this may cause shifts in highlights in text nodes that contain some of the previously unsupported named entities...)



@poire-z poire-z force-pushed the libunibreak_textlangman branch from 9b110d1 to 3e3b6c8 Compare April 17, 2020 19:57
Contributor Author

poire-z commented Apr 17, 2020

The Codacy Quality Review failures are just complaints caused by #ifdef blocks and macros using variables that Codacy doesn't see. I added some comments to make that less confusing.

Contributor Author

poire-z commented Apr 18, 2020

Travis CI checks are faster in the European mornings :) It just ran in 42m, while yesterday evening the 4 runs exceeded a timeout of 60m (or 50m, I don't remember).

@poire-z poire-z force-pushed the libunibreak_textlangman branch from 3e3b6c8 to d17c777 Compare April 18, 2020 08:44
poire-z added 2 commits April 18, 2020 10:44
Mostly some refactoring to make the private LVBase64Stream
in lvxml.cpp be public in lvxml.h.
@poire-z poire-z force-pushed the libunibreak_textlangman branch from d17c777 to d89ae37 Compare April 18, 2020 08:46
Member

Frenzie commented Apr 18, 2020

@poire-z The weekend confounds that further.

Member

@Frenzie Frenzie left a comment

No real comments, looks pretty good to me 👍

} ent_def_t;

// From https://html.spec.whatwg.org/multipage/named-characters.html#named-character-references
Member

Hooray! :-)

if ( !lStr_cmp( def_entity_table[n].name, entname ) ) {
code = def_entity_table[n].code;
break;
// Straight comparisons for the most common ones
Member

nbsp is definitely quite common, I've also seen quot and apos a fair bit but much less so.

Contributor Author

Yeah, I had early checks for some of them.
But when testing, &amp; took 12 loops (in the binary search), &gt; and &lt; around 10, and the ones you suggest maybe 5-6.
Adding too many early tests (say 5, if checking for & < > ' nbsp) will make those still use 1-5 checks, and all the others will then cost 5 more.
So, I'm a bit torn :)
Going to re-check these numbers.

Contributor Author

@poire-z poire-z Apr 18, 2020

Without the early checks:

entities iterations
amp      7
gt      10
lt       8
nbsp     5
quot     9
apos    10
shy     10
eacute  10

Help me decide which are worth an early check (and so, adding a check that will give false to all others).
(nbsp is 5 and may not need an early check, but it's indeed one of the most common. I'm not really sure amp, gt and lt are that popular and need that early check, and not sure about apos and quot in ebooks, where the U+20xx left/right, angled or not, quotation marks are more likely to be used.)
The soft hyphen shy is 10, and there might be thousands of them in some books.

Member

amp is quite widely used online (252 times on this very page); presumably much less so in ebooks because you won't have URL parameters.

I wasn't necessarily suggesting anything though, what was the rationale behind these ones specifically?

Contributor Author

what was the rationale behind these ones specifically?

Just the ones I know I always have to substitute with their named entities in other web related projects. So, no real thinking about if it matters here in our ebook context :)
I think I'll go with 2 early checks, just for &nbsp; and &shy;.

Contributor Author

Anyway, the previous code used a linear iteration over a 350-item table, where nbsp was first and shy was 14th (and all the others much further down), so we won't be slower than before.
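
As a rough illustration of the trade-off discussed in this thread, here is a sketch of a sorted-table binary search preceded by two early checks. It is not the code from this PR: the table is a tiny hypothetical subset, it uses plain `char` strings and `std::strcmp` rather than crengine's own string types, and `lookupEntitySketch` is an invented name.

```cpp
#include <cstring>

struct EntDefSketch { const char* name; unsigned int code; };

// Small illustrative table, sorted by name in strcmp order.
static const EntDefSketch kEntitiesSketch[] = {
    {"amp", 0x26}, {"apos", 0x27}, {"eacute", 0xE9}, {"gt", 0x3E},
    {"lt", 0x3C}, {"nbsp", 0xA0}, {"quot", 0x22}, {"shy", 0xAD},
};

// Returns the codepoint for a named entity, or 0 if unknown.
// If iterations is non-null, it receives the number of binary search steps.
inline unsigned int lookupEntitySketch(const char* name, int* iterations = nullptr) {
    if (iterations) *iterations = 0;
    // Early checks for the two entities expected to be the most frequent.
    if (std::strcmp(name, "nbsp") == 0) return 0xA0;
    if (std::strcmp(name, "shy") == 0) return 0xAD;
    int lo = 0;
    int hi = static_cast<int>(sizeof(kEntitiesSketch) / sizeof(kEntitiesSketch[0])) - 1;
    while (lo <= hi) {
        if (iterations) ++*iterations;
        int mid = lo + (hi - lo) / 2;
        int cmp = std::strcmp(name, kEntitiesSketch[mid].name);
        if (cmp == 0) return kEntitiesSketch[mid].code;
        if (cmp < 0) hi = mid - 1; else lo = mid + 1;
    }
    return 0;
}
```

Each early check is one extra comparison paid by every entity that does not match it, which is why only the two most frequent candidates get one here.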

Member

Sounds good to me! ^_^

@poire-z poire-z force-pushed the libunibreak_textlangman branch from d89ae37 to 45b29ca Compare April 18, 2020 10:38
poire-z added 6 commits April 18, 2020 12:41
This just adds generic support for libunibreak,
which will be tweaked by next commit.
@poire-z poire-z force-pushed the libunibreak_textlangman branch from 45b29ca to e19f4ff Compare April 18, 2020 10:42
@poire-z poire-z merged commit 44eacb3 into koreader:master Apr 18, 2020
@poire-z poire-z deleted the libunibreak_textlangman branch April 18, 2020 11:31
Comment on lines +10 to +21
#include <linebreak.h>
// linebreakdef.h is not wrapped by this, unlike linebreak.h
// (not wrapping results in "undefined symbol" with the original
// function name kinda obfuscated)
#ifdef __cplusplus
extern "C" {
#endif
#include <linebreakdef.h>
#ifdef __cplusplus
}
#endif
#endif
Member

Umm, code does the inverse of comment?

I'd naively assume you'd want C linking on both, actually? It's a C API, it expects C unmangled symbols.

Member

@NiLuJe NiLuJe Apr 19, 2020

Ah, it works because upstream's <linebreak.h> already enforces C linking w/ C++, but NOT <linebreakdef.h>.

TL;DR: It works as-is, but I'd still explicitly move both under C linking here, to avoid future readers having to delve into libunibreak's headers like I just did ;).

Contributor Author

Ah, it works because upstream's <linebreak.h> already enforces C linking w/ C++, but NOT <linebreakdef.h>.

Isn't that what my comment says?
// linebreakdef.h is not wrapped by this, unlike linebreak.h
Guess my indentation (to make that stuff an aside) is confusing :)

I'd still explicitly move both under C linking here,

This would result in
extern "C" { extern "C" { <linebreak.h content> } }
right ? No issue with that ? It compiles.

Member

@NiLuJe NiLuJe Apr 19, 2020

Oh, right, I see what you meant. I initially read that as "I'm not wrapping this...", while you actually meant the header itself ;).

I hadn't thought about the nested externs, but if it builds, I'll take it ;p.

Specs apparently say:

Linkage specifications nest. When linkage specifications nest, the innermost one determines the language linkage.

So, we're good to go ;).
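
A minimal sketch of that nesting, with hypothetical function names standing in for a C header's declarations (this is not the libunibreak API):

```cpp
// Both declarations end up with C language linkage: linkage specifications
// may nest, and the innermost one determines the linkage.
extern "C" {
    extern "C" {
        int nested_c_add_sketch(int a, int b);
    }
    int outer_c_sub_sketch(int a, int b);
}

// Definitions keep the C linkage given by the earlier declarations.
int nested_c_add_sketch(int a, int b) { return a + b; }
int outer_c_sub_sketch(int a, int b) { return a - b; }
```

So wrapping a header that already declares itself `extern "C"` in another `extern "C"` block is harmless, while wrapping one that does not is what makes its symbols resolve against the C library.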

if ( lang_cfg->hasLBCharSubFunc() ) {
next_c = lang_cfg->getLBCharSubFunc()(txt+start, i+1, len-1 - (i+1));
}
int brk = lb_process_next_char(&lbCtx, (utf32_t)next_c);
Member

:D

(I cringe every time I remember CRe uses uint16_t for text, which is just wrong on Linux).

(IIRC, in this context, that shouldn't be an issue with libunibreak, stuff is sane if you happen to point to the middle of a multibyte codepoint).

Member

@NiLuJe NiLuJe Apr 19, 2020

Okay, saving grace: it's actually a wchar_t, which is why stuff mostly works. Name is just confusing, because wrong on Linux (where wchar_t is actually sane and 32 bits, unlike on Windows where it's 16 bits for some probably stupid legacy reason).

Contributor Author

Yep, we had this conversation before :) #252 (comment)

Note, wchar_t is 16 bits on Windows, because Windows uses UTF-16 for all unicode string handling. Therefore a null terminated array of wchar_t on Windows is a UTF-16 string.

Whereas on most other OSes, I imagine wchar_t is mainly used to store codepoints.

This also makes cross platform path handling a right PITA, because unicode filenames must be in UTF-16 (or UCS2), and fopen() doesn't work... :( The Win32 API SUCKS.
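
To make the 16-bit vs 32-bit point concrete, here is a small sketch showing why a single 16-bit unit cannot hold every codepoint. The helper names are hypothetical, and the encoder does no validation (it assumes a valid codepoint that is not itself a surrogate).

```cpp
#include <cstdint>
#include <vector>

// Encodes one Unicode codepoint as UTF-16 code units.
// Codepoints above U+FFFF need a surrogate pair.
inline std::vector<std::uint16_t> toUtf16Sketch(std::uint32_t cp) {
    std::vector<std::uint16_t> units;
    if (cp < 0x10000) {
        units.push_back(static_cast<std::uint16_t>(cp));
    } else {
        cp -= 0x10000;
        units.push_back(static_cast<std::uint16_t>(0xD800 + (cp >> 10)));   // high surrogate
        units.push_back(static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF))); // low surrogate
    }
    return units;
}

// True when wchar_t is wide enough to hold any codepoint in one unit
// (typically the case on Linux, not on Windows).
inline bool wcharHoldsAnyCodepointSketch() {
    return sizeof(wchar_t) >= 4;
}
```

Code that indexes text one unit at a time is only safe for all of Unicode when `wcharHoldsAnyCodepointSketch()` is true; otherwise an index can land between the two halves of a surrogate pair.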

Contributor Author

@shermp : can I request your native English speaker opinion about the use of the word "Honor" in Would you like to honor or ignore embedded lang tags by default?, cf koreader/koreader#6072 (comment) and followup discussion ? Or alternative suggestions ? Thanks :)

Yeah, honor (or honour, because I speak the queen's English, dammit) is probably the right term here.

Although one might just ask: (Do you wish/would you like) to ignore embedded lang tags by default? with a yes/no, or perhaps, more verbosely: We honor embedded lang tags by default, would you like to ignore them instead?

Contributor Author

poire-z commented May 14, 2020

Some slightly related observation:

I have a book which uses inline-block for footnote links:
image
which, because inline-block and images are considered breakable before and after by http://www.unicode.org/reports/tr14/#CB , renders as:
image
which is not super nice - but Calibre renders it the same.

Some of these footnote links follow a closing quote », and there, there is a difference depending on the chosen typography language:

French, which considers » as a closing punctuation, forbids a break before it, but allows a break after it:
image

but if I select EnglishUS, which considers » as a quotation mark (http://www.unicode.org/reports/tr14/#QU), it prevents a break on both sides:
image

So, I'd be happier with that EnglishUS rendering - but it does not help with the first case above when there is no » to help with it.

I considered for a moment adding an option to not enable language specific line breaking rules, that we could use with books that do it properly with appropriate nbsp, like this book does - but as this would not totally solve this situation (the first case above), I'm dropping the idea.

Anyway, in that case, as it shows badly with Calibre too, I guess it's a publisher issue.

In UAX#14:

Object-specific line break behavior is best implemented by querying the object itself, not by replacing the CB line breaking class by another class.

LB1 Assign a line breaking class to each code point of the input. Resolve AI, CB, CJ, SA, SG, and XX into other line breaking classes depending on criteria outside the scope of this algorithm.

LB20 Break before and after unresolved CB (= objects)
Conditional breaks should be resolved external to the line breaking rules. However, the default action is to treat unresolved CB as breaking before and after.
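
The French vs EnglishUS difference described above falls out of the order in which the rules apply. Here is a heavily simplified sketch restricted to four classes and to my reading of rules LB13, LB19, LB20, LB28 and LB31. It is not libunibreak's implementation, later UAX #14 revisions refine some of these rules, and the names are hypothetical.

```cpp
// AL: alphabetic, CL: closing punctuation, QU: quotation, CB: contingent break (object).
enum class LbClassSketch { AL, CL, QU, CB };

// Tells whether a break is allowed between two adjacent characters,
// applying the rules in order; the first matching rule decides.
inline bool breakAllowedBetweenSketch(LbClassSketch before, LbClassSketch after) {
    // LB13: do not break before closing punctuation.
    if (after == LbClassSketch::CL) return false;
    // LB19: do not break before or after quotation marks.
    if (before == LbClassSketch::QU || after == LbClassSketch::QU) return false;
    // LB20: break before and after an unresolved CB.
    if (before == LbClassSketch::CB || after == LbClassSketch::CB) return true;
    // LB28: do not break between alphabetics.
    if (before == LbClassSketch::AL && after == LbClassSketch::AL) return false;
    // LB31: break everywhere else.
    return true;
}
```

With » classified as CL (French), the pair (CL, CB) reaches LB20 and a break is allowed before the inline-block. With » classified as QU (EnglishUS), LB19 matches first and the break is forbidden.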

I guess it's fine/better to break before/after images in general, so probably best to not do anything in the code.

Any idea how I could go about solving that (preventing a break before such an inline-block), with style tweaks or something else?
Only idea that comes to mind would be using this (that crengine does not support):
a.footnotecall:before { content: "&nbsp;" }
that I guess would prevent libunibreak from allowing the break before.

Any other idea/thought?

Member

Frenzie commented May 14, 2020

Not really I'm afraid.

Contributor Author

poire-z commented Jun 4, 2020

Regarding my issue above, I can now solve it after #345 in 2 ways:

With:

a.footnotecall { display: inline !important; }
a.footnotecall:before { content: "\2060" }

or

inlineBox { white-space: nowrap; }

to prevent a wrap on both sides of the inline-block.

Actually, I initially went to quickly implement pseudo elements, to be able to add a &nbsp; before the inlineBox with ::before.
And when I got to test it, I realized (and I knew that all along while coding it, since I had read the specs) that the :before is inserted inside the inlineBox... so it doesn't help at all :)
Then, I realized I could just switch the publisher's display: inline-block to display: inline and it's mostly fine. But it's better when using content: "\2060".

But I was frustrated with the inline-block issue, so I went to hack white-space to be able to specify white-space: nowrap on images and inlineBox, so we can prevent these wraps around.

Oh, and when all that was coded and ready to test, I had finished that book where I needed it...

Contributor Author

poire-z commented Jun 29, 2020

@virxkane : I regularly follow your https://github.com/virxkane/coolreader/commits/koreader-merge-post - I look and usually pick your stuff - but when you're cherry picking some of my (huge) commits, I can't really notice if you did fix some bug when adapting them. Could you keep letting me know if you find some bug and fix it as part of the cherry picked commit (just bugs or typos, not the needed adaptations you have to do for the few differences we have).
You could just leave some small comment around the affected lines by reviewing our commit in https://github.com/koreader/crengine/commits/master - and I'll go look at how you fixed it around there in your cherry picked commit.

I just noticed by chance this minor thing in today's picks of yours, which I'll fix on my side:

-    friend TextLangCfg;
+    friend class TextLangCfg;

Btw, for the TextLangMan stuff, dunno if you saw that in the first post of this PR:

I'll bump this up to frontend first without any change to base and frontend, as it should work as-currently with our ReaderHyphenation module (just to have a nightly with this for reference).
And the next day, I'll do the ReaderHyphenation > ReaderTypography swap, that we can discuss in its PR.

Which means it should stay compatible with your current frontend code, and should not need any change as a first step: you can keep just setting hyphenation dicts with the current Hyphen:: methods, and it will pick the language associated.
There's just one thing that I have not tested, and you might need to check/fix:

Also note for CoolReader devs: CR on Android might use HyphMan::activateDictionaryFromStream(), which I tried to adapt and make right - but I couldn't test it.

@virxkane
Contributor

@poire-z

Could you keep letting me know if you find some bug and fix it as part of the cherry picked commit (just bugs or typos, not the needed adaptations you have to do for the few differences we have).

I always try not to change the source while cherry-picking (except for conflict resolution). I do the adaptation in the next commit. To prevent this from happening: your commit under someone else's authorship: plotn/coolreader@cba0e06. Or did you really write this?
Yes, of course, if I find some bugs, I will write about them.

-    friend TextLangCfg;
+    friend class TextLangCfg;

This change is so small that I did not make a separate commit. But I think omitting the keyword 'class' in a 'friend' declaration is not an error.

Btw, for the TextLangMan stuff, dunno if you saw that in the first post of this PR:

At the moment, not all of your things work yet. I must do some work around this PR. Can you upload some test files which you demonstrated in #337 (comment)?
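
On the `friend` change quoted above: as far as I know, both forms are accepted since C++11, but the short form requires the type to be declared beforehand, while older compilers and standards reject it. A small sketch with hypothetical class names (not the classes from this PR):

```cpp
class LangCfgPeerSketch; // forward declaration, required by the short form below

class HolderA {
    friend LangCfgPeerSketch;      // C++11 short form: names an already declared type
    int secret_ = 1;
};

class HolderB {
    friend class OtherPeerSketch;  // elaborated form: also declares OtherPeerSketch
    int secret_ = 2;
};

class LangCfgPeerSketch {
public:
    static int peek(const HolderA& h) { return h.secret_; }
};

class OtherPeerSketch {
public:
    static int peek(const HolderB& h) { return h.secret_; }
};
```

Writing `friend class ...` is the form that works with both pre-C++11 and later compilers, which is a reasonable argument for the one-word fix.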

Contributor Author

poire-z commented Jun 30, 2020

To prevent this from happening: your commit under someone else's authorship: plotn/coolreader@cba0e06. Or did you really write this?

Of course not :) This fork/branch is really a mess and totally unusable/unfollowable. Hopefully, it's mostly android frontend changes, and nothing much about the engine.

Can you upload some test files which you demonstrated

linebreaking_lang_test_files.zip
A few test files I've been using these last months. The one you're after is test-linebreaking.html - but others might be useful, for some commits you haven't yet picked.

Contributor

virxkane commented Jul 1, 2020

Adapting for CoolReader... Sorry, but I can't not write this. This spaghetti code such... Om nom nom :)
Sorry, again.

Contributor Author

poire-z commented Jul 1, 2020

I initially thought the same about the whole crengine :)
And I always try to adjust to the style of what I'm modifying, so I guess I succeeded ! :)

Seriously, which part ?
HyphMan, that was initially spaghetti, and that I just tried to adapt, keeping the same API, and make it a wrapper to the new TextLangMan ? (doing this was painful, and I did it mainly for you :) I thought it shouldn't need any adaptation.)
Or TextLangMan itself, which is really simple :/ (if that, you'll have made me sad :)
Or else ?
Anyway, I'm always learning, so comments and suggestions welcome.

Contributor

virxkane commented Jul 1, 2020

Yes, crengine (HyphMan also) is already spaghetti.
But in new code: TextLangMan -> HyphMan, HyphMan -> TextLangMan, static fields of TextLangMan penetrate LVDocView, uhh it is very difficult to understand...
Of course, no complaints.

Contributor Author

poire-z commented Jul 1, 2020

Well, I made TextLangMan like HyphMan a single/global/static class instance - because somehow, that makes sense: hyphenation and TextLangCfg instances can (and should, to avoid duplicating hyphenation dicts or lang properties) be shared between multiple documents (you can have multiple docs on CR, we don't on KOReader).
As far as I can see, there are the same little things in lvdocview.cpp for TextLangMan and Hyphman: setting 4 or 5 properties by calling some methods of these 2 global static class instances.

Oh, and yes: I think you should just use one of these 2 ! Either you use only the legacy Hyphman props like PROP_HYPHENATION_DICT - or you use the PROP_TEXTLANG_MAIN_LANG and friends.

And yes, the interaction between TextLangMan <> Hyphman are complicated and tedious. I did not want to change Hyphman too much (mainly, because I want a clean git history with a real log of the past), otherwise, I would have just taken the hyph methods code, and drop the rest.
So, yes, it's ugly. If you need help on some parts, just tell where.

For me, the only issue for you would have been with HyphMan::activateDictionaryFromStream() on Android, because there's no obvious lang associated: you just provide a stream. That's if as a first step, you keep using the old HyphMan/PROP_HYPHENATION_DICT from frontend.
If you want to switch to using from frontend PROP_TEXTLANG_MAIN_LANG and friends, yes, you'll need more work in your frontend code. But that's optional. You still benefit from the new stuff with PROP_HYPHENATION_DICT.

Contributor

virxkane commented Jul 1, 2020

Ok, @poire-z thank you very much.
Screenshot_20200702_004015
https://github.com/virxkane/coolreader/commit/bf60ffe2b67aa0de38d7be33f27c7eb08fa80637
Android build not fixed yet, I think, we must change function prototype (add lang_tag, etc...).

Contributor Author

poire-z commented Jul 1, 2020

I had 2 targets:

  • legacy behaviour: input is hyph dict, get the associated lang to set a main lang, no embedded lang tag support
  • new behaviour: input is only a lang tag, embedded lang tags are supported.

So, your adaptation looks a bit hybrid :)

Dunno if you went looking at our frontend changes for this switch from HyphMan to TextLangMan: see koreader/koreader#6072.
We have other mapping in our frontend code https://github.com/koreader/koreader/blob/master/frontend/apps/reader/modules/readertypography.lua, like HYPH_DICT_NAME_TO_LANG_NAME_TAG or LANGUAGES.
It's a bit ugly to have all this on both sides - but it was the simplest (otherwise, having all this in crengine, I would have needed some API to transfer the info from crengine to koreader to build the menu of available languages, etc... too much work).

Contributor

virxkane commented Jul 2, 2020

While testing this PR: if I replace an ISO639-1 language code with its ISO639-2 (or ISO639-3) equivalent, the 'lang' attribute does not work anymore - neither hyphenation nor HarfBuzz's language-specific glyph selection. For example, replace 'bg' with 'bul'. The 'lang' attribute specification:
https://www.w3.org/International/questions/qa-html-language-declarations
https://developer.mozilla.org/ru/docs/Web/HTML/Global_attributes/lang
BCP47: https://www.rfc-editor.org/rfc/bcp/bcp47.txt
It is clear why hyphenation doesn't work - the table '_hyph_dict_table' does not contain any ISO-639-2(3) codes, but what about HarfBuzz?
I think we must embed into the sources a full language table with ISO639-1, ISO639-2, ISO639-3 codes and the full language name, and write some functions to look up a language (or language code) in this table.

added:
I think it’s not difficult to find a document with the specified language 'eng'.

Contributor

virxkane commented Jul 2, 2020

Ok, in https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry 3-ALPHA codes are omitted if a suitable 2-ALPHA code exists. But I'm unsure that all files on the internet strictly conform to the specification.

Contributor Author

poire-z commented Jul 2, 2020

I think 'eng' or 'bul' are not expected in HTML lang tags (and I guess HarfBuzz does not accept them, https://github.com/harfbuzz/harfbuzz/blob/d5439232946333b60f655d9ed37ec7dadf439287/src/hb-ot-tag-table.hh#L16-L114 ).
https://www.w3.org/International/articles/language-tags/
Dunno about other formats, like FB2.

But we may find them in books metadata.
We handle them (and translate them to 'en' or 'bg') in our frontend code:
https://github.com/koreader/koreader/blob/f7d538b108167a6bb4e89880d2b0cf8b4c69b42f/frontend/apps/reader/modules/readertypography.lua#L52-L62
(It's a lot easier for me to add that kind of stuff in Lua than it is in C :)
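
For anyone wanting the same kind of translation on the C++ side, a minimal sketch could look like the following. It is not a port of the linked Lua code: `normalizeLangTagSketch` is a hypothetical helper, the table is a tiny illustrative subset, and it only looks at the primary subtag without lower-casing or validating anything.

```cpp
#include <string>

// Maps a three-letter ISO 639-2/639-3 primary subtag to its two-letter
// ISO 639-1 equivalent, leaving any following subtags untouched.
inline std::string normalizeLangTagSketch(const std::string& tag) {
    static const struct { const char* alpha3; const char* alpha2; } kMap[] = {
        {"bul", "bg"}, {"deu", "de"}, {"eng", "en"}, {"fra", "fr"}, {"rus", "ru"},
    };
    std::string::size_type dash = tag.find('-');
    std::string primary = tag.substr(0, dash);
    std::string rest = (dash == std::string::npos) ? std::string() : tag.substr(dash);
    for (const auto& entry : kMap) {
        if (primary == entry.alpha3) return entry.alpha2 + rest;
    }
    return tag; // unknown or already two-letter: return unchanged
}
```

Doing this once, where the tag enters the engine, would let both the hyphenation lookup and the tag passed to HarfBuzz see the two-letter form.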

Contributor

virxkane commented Jul 2, 2020

Dunno about other formats, like FB2.

I found fb2 book with specified language 'eng'.

We handle them (and translate them to 'en' or 'bg') in our frontend code

Ok, I'll think about it.

Contributor

virxkane commented Jul 2, 2020

@poire-z As you requested, a report about a SEGFAULT (related to this PR). You introduced multiple uses of the construct m_flags[pos-1] in lvtextfm.cpp, and when pos == 0 a SEGFAULT is caught. Found with the file Dostoievsky.RU.epub that you uploaded earlier.
Maybe you can fix it yourself, I'm not sure I won’t break your code.

Contributor Author

poire-z commented Jul 2, 2020

You don't mention the line, but maybe I've already fixed it in bc4500a, which you may not have picked yet?
(No crash for me with our latest master on that Dostoievsky.RU.)

Contributor

virxkane commented Jul 2, 2020

It seems like that.

@virxkane
Contributor

@poire-z But still bug is not fixed. Try file Petra.AR.epub

m_flags[pos-1] |= LCHAR_ALLOW_WRAP_AFTER;

If pos is equal zero asan tell me about heap-buffer-overflow.

@virxkane
Contributor

Contributor Author

poire-z commented Jul 15, 2020

But still bug is not fixed. Try file Petra.AR.epub

You mean you get a crash? I don't get any crash with the Petra.AR.epub from the DocumentsForTestingRTL.zip from buggins/coolreader#125 (comment) :/

If pos is equal zero asan tell me about heap-buffer-overflow.

Crash or just analyzer warning ? Of course, if pos=0 and we write at pos-1, it should complain. But aren't we wrapping this with if ( pos > 0 ) ? I don't see any access to pos-1 not wrapped with pos > 0 in your lvtextfm.cpp...
So, need more info to understand what you mean :)
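
For reference, the kind of guard being discussed could be sketched like this. It is not the lvtextfm.cpp code: the flags container, the helper name and the flag value are all hypothetical (the real LCHAR_ALLOW_WRAP_AFTER is defined in crengine).

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical flag value, for illustration only.
const std::uint16_t kAllowWrapAfterSketch = 0x0001;

// Marks the character before pos as "wrap allowed after it".
// Returns false and writes nothing when there is no previous character.
inline bool allowWrapBeforeSketch(std::vector<std::uint16_t>& flags, std::size_t pos) {
    if (pos == 0 || pos > flags.size()) return false; // flags[pos-1] would be out of bounds
    flags[pos - 1] |= kAllowWrapAfterSketch;
    return true;
}
```

Routing every `pos-1` write through one helper like this is one way to avoid fixing the text case while missing the image and inlineBox cases.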

Am I corrected correctly?

Looks correct.
And OK :) I fixed it for text (that you picked), but I forgot to fix it for images and inlineBoxes (that your commit fixes the right way it seems)...

Contributor Author

poire-z commented Jul 15, 2020

Or do you mean https://github.com/virxkane/coolreader/commit/086c571c8bfaa711a8c6f9e13b9e52f349fdcf12 did fix your asan issue and all is now fine.

(And it's just that for some reason I don't really need to know, I did not get a crash.)

Contributor

virxkane commented Jul 15, 2020

You mean you get a crash?

It's not a true crash, it's an AddressSanitizer error log (which does not make the bug any less harmful).

Or do you mean virxkane/coolreader@086c571 did fix your asan issue and all is now fine.

Yes.

(And it's just that for some reason I don't really need to know, I did not get a crash.)

Ok, but you are overwriting some data on the heap:

==9505==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x604000963ace at pc 0x55e05f077987 bp 0x7ffc77859b50 sp 0x7ffc77859b40
READ of size 2 at 0x604000963ace thread T0
    #0 0x55e05f077986 in LVFormatter::copyText(int, int) coolreader/crengine/src/lvtextfm.cpp:1109
    #1 0x55e05f0a0acb in LVFormatter::processParagraph(int, int, bool) coolreader/crengine/src/lvtextfm.cpp:3387
    #2 0x55e05f0aa1f5 in LVFormatter::splitParagraphs() coolreader/crengine/src/lvtextfm.cpp:4101
    #3 0x55e05f0aada9 in LVFormatter::format() coolreader/crengine/src/lvtextfm.cpp:4149
    #4 0x55e05f0606c7 in LFormattedText::Format(unsigned short, unsigned short, int, BlockFloatFootprint*) coolreader/crengine/src/lvtextfm.cpp:4246
    #5 0x55e05ec2f523 in ldomNode::renderFinalBlock(LVRef<LFormattedText>&, RenderRectAccessor*, int, BlockFloatFootprint*) coolreader/crengine/src/lvtinydom.cpp:16640
    #6 0x55e05f0d66dc in renderBlockElementEnhanced(FlowState*, ldomNode*, int, int, int) coolreader/crengine/src/lvrend.cpp:7121
    #7 0x55e05f0d536f in renderBlockElementEnhanced(FlowState*, ldomNode*, int, int, int) coolreader/crengine/src/lvrend.cpp:6833
    #8 0x55e05f0d536f in renderBlockElementEnhanced(FlowState*, ldomNode*, int, int, int) coolreader/crengine/src/lvrend.cpp:6833
    #9 0x55e05f0d536f in renderBlockElementEnhanced(FlowState*, ldomNode*, int, int, int) coolreader/crengine/src/lvrend.cpp:6833
    #10 0x55e05f0d536f in renderBlockElementEnhanced(FlowState*, ldomNode*, int, int, int) coolreader/crengine/src/lvrend.cpp:6833
    #11 0x55e05f0d9381 in renderBlockElement(LVRendPageContext&, ldomNode*, int, int, int, int, int*, int) coolreader/crengine/src/lvrend.cpp:7337
    #12 0x55e05f0d9547 in renderBlockElement(LVRendPageContext&, ldomNode*, int, int, int, int, int*) coolreader/crengine/src/lvrend.cpp:7354
    #13 0x55e05f0820b3 in LVFormatter::measureText() coolreader/crengine/src/lvtextfm.cpp:1888
    #14 0x55e05f0a0b05 in LVFormatter::processParagraph(int, int, bool) coolreader/crengine/src/lvtextfm.cpp:3389
    #15 0x55e05f0aa1f5 in LVFormatter::splitParagraphs() coolreader/crengine/src/lvtextfm.cpp:4101
    #16 0x55e05f0aada9 in LVFormatter::format() coolreader/crengine/src/lvtextfm.cpp:4149
    #17 0x55e05f0606c7 in LFormattedText::Format(unsigned short, unsigned short, int, BlockFloatFootprint*) coolreader/crengine/src/lvtextfm.cpp:4246
    #18 0x55e05ec2f523 in ldomNode::renderFinalBlock(LVRef<LFormattedText>&, RenderRectAccessor*, int, BlockFloatFootprint*) coolreader/crengine/src/lvtinydom.cpp:16640
    #19 0x55e05f0d66dc in renderBlockElementEnhanced(FlowState*, ldomNode*, int, int, int) coolreader/crengine/src/lvrend.cpp:7121
    #20 0x55e05f0d536f in renderBlockElementEnhanced(FlowState*, ldomNode*, int, int, int) coolreader/crengine/src/lvrend.cpp:6833
    #21 0x55e05f0d536f in renderBlockElementEnhanced(FlowState*, ldomNode*, int, int, int) coolreader/crengine/src/lvrend.cpp:6833
    #22 0x55e05f0d536f in renderBlockElementEnhanced(FlowState*, ldomNode*, int, int, int) coolreader/crengine/src/lvrend.cpp:6833
    #23 0x55e05f0d536f in renderBlockElementEnhanced(FlowState*, ldomNode*, int, int, int) coolreader/crengine/src/lvrend.cpp:6833
    #24 0x55e05f0d536f in renderBlockElementEnhanced(FlowState*, ldomNode*, int, int, int) coolreader/crengine/src/lvrend.cpp:6833
    #25 0x55e05f0d9381 in renderBlockElement(LVRendPageContext&, ldomNode*, int, int, int, int, int*, int) coolreader/crengine/src/lvrend.cpp:7337
    #26 0x55e05f0d9547 in renderBlockElement(LVRendPageContext&, ldomNode*, int, int, int, int, int*) coolreader/crengine/src/lvrend.cpp:7354
    #27 0x55e05eb5f7ef in ldomDocument::render(LVRendPageList*, LVDocViewCallback*, int, int, bool, int, LVProtectedFastRef<LVFont>, int, LVFastRef<CRPropAccessor>) coolreader/crengine/src/lvtinydom.cpp:4583
    #28 0x55e05ef961ca in LVDocView::Render(int, int, LVRendPageList*) coolreader/crengine/src/lvdocview.cpp:2822
    #29 0x55e05ef48b14 in LVDocView::checkRender() coolreader/crengine/src/lvdocview.cpp:604
    #30 0x55e05efa474a in LVDocView::updateBookMarksRanges() coolreader/crengine/src/lvdocview.cpp:3304
    #31 0x55e05efb0c57 in LVDocView::restorePosition() coolreader/crengine/src/lvdocview.cpp:3791
    #32 0x55e05e92a83d in CR3View::loadDocument(QString) coolreader/cr3qt/src/cr3widget.cpp:474
    #33 0x55e05e981bcc in MainWindow::on_actionOpen_triggered() coolreader/cr3qt/src/mainwindow.cpp:248
    #34 0x55e05ea91cf2 in MainWindow::qt_static_metacall(QObject*, QMetaObject::Call, int, void**) coolreader-debug-build/cr3qt/src/moc_mainwindow.cpp:253
    #35 0x55e05ea92870 in MainWindow::qt_metacall(QMetaObject::Call, int, void**) coolreader-debug-build/cr3qt/src/moc_mainwindow.cpp:295
    #36 0x7fcb510d61fe  (/usr/lib64/libQt5Core.so.5+0x2d91fe)
    #37 0x7fcb51b20791 in QAction::triggered(bool) (/usr/lib64/libQt5Widgets.so.5+0x15d791)
    #38 0x7fcb51b23357 in QAction::activate(QAction::ActionEvent) (/usr/lib64/libQt5Widgets.so.5+0x160357)
    #39 0x7fcb51c2dc31  (/usr/lib64/libQt5Widgets.so.5+0x26ac31)
    #40 0x7fcb51c2dd86 in QAbstractButton::mouseReleaseEvent(QMouseEvent*) (/usr/lib64/libQt5Widgets.so.5+0x26ad86)
    #41 0x7fcb51d36929 in QToolButton::mouseReleaseEvent(QMouseEvent*) (/usr/lib64/libQt5Widgets.so.5+0x373929)
    #42 0x7fcb51b6f7a5 in QWidget::event(QEvent*) (/usr/lib64/libQt5Widgets.so.5+0x1ac7a5)
    #43 0x7fcb51d369da in QToolButton::event(QEvent*) (/usr/lib64/libQt5Widgets.so.5+0x3739da)
    #44 0x7fcb51b284ce in QApplicationPrivate::notify_helper(QObject*, QEvent*) (/usr/lib64/libQt5Widgets.so.5+0x1654ce)
    #45 0x7fcb51b3019d in QApplication::notify(QObject*, QEvent*) (/usr/lib64/libQt5Widgets.so.5+0x16d19d)
    #46 0x7fcb510a0ddf in QCoreApplication::notifyInternal2(QObject*, QEvent*) (/usr/lib64/libQt5Core.so.5+0x2a3ddf)
    #47 0x7fcb51b2f283 in QApplicationPrivate::sendMouseEvent(QWidget*, QMouseEvent*, QWidget*, QWidget*, QWidget**, QPointer<QWidget>&, bool, bool) (/usr/lib64/libQt5Widgets.so.5+0x16c283)
    #48 0x7fcb51b8bc85  (/usr/lib64/libQt5Widgets.so.5+0x1c8c85)
    #49 0x7fcb51b8ebbc  (/usr/lib64/libQt5Widgets.so.5+0x1cbbbc)
    #50 0x7fcb51b284ce in QApplicationPrivate::notify_helper(QObject*, QEvent*) (/usr/lib64/libQt5Widgets.so.5+0x1654ce)
    #51 0x7fcb51b2ff57 in QApplication::notify(QObject*, QEvent*) (/usr/lib64/libQt5Widgets.so.5+0x16cf57)
    #52 0x7fcb510a0ddf in QCoreApplication::notifyInternal2(QObject*, QEvent*) (/usr/lib64/libQt5Core.so.5+0x2a3ddf)
    #53 0x7fcb514a9d7c in QGuiApplicationPrivate::processMouseEvent(QWindowSystemInterfacePrivate::MouseEvent*) (/usr/lib64/libQt5Gui.so.5+0x128d7c)
    #54 0x7fcb514ab3e4 in QGuiApplicationPrivate::processWindowSystemEvent(QWindowSystemInterfacePrivate::WindowSystemEvent*) (/usr/lib64/libQt5Gui.so.5+0x12a3e4)
    #55 0x7fcb514847ea in QWindowSystemInterface::sendWindowSystemEvents(QFlags<QEventLoop::ProcessEventsFlag>) (/usr/lib64/libQt5Gui.so.5+0x1037ea)
    #56 0x7fcb499caec9  (/usr/lib64/libQt5XcbQpa.so.5+0x75ec9)
    #57 0x7fcb4fdf0c3c in g_main_context_dispatch (/usr/lib64/libglib-2.0.so.0+0x4fc3c)
    #58 0x7fcb4fdf0eb7  (/usr/lib64/libglib-2.0.so.0+0x4feb7)
    #59 0x7fcb4fdf0f4e in g_main_context_iteration (/usr/lib64/libglib-2.0.so.0+0x4ff4e)
    #60 0x7fcb510f895f in QEventDispatcherGlib::processEvents(QFlags<QEventLoop::ProcessEventsFlag>) (/usr/lib64/libQt5Core.so.5+0x2fb95f)
    #61 0x7fcb5109f95a in QEventLoop::exec(QFlags<QEventLoop::ProcessEventsFlag>) (/usr/lib64/libQt5Core.so.5+0x2a295a)
    #62 0x7fcb510a7941 in QCoreApplication::exec() (/usr/lib64/libQt5Core.so.5+0x2aa941)
    #63 0x55e05e8efb63 in main coolreader/cr3qt/src/main.cpp:205
    #64 0x7fcb4ff2ce9a in __libc_start_main (/lib64/libc.so.6+0x23e9a)
    #65 0x55e05e8bd439 in _start (coolreader-debug-build/cr3qt/cr3+0x15ac439)

0x604000963ace is located 2 bytes to the left of 34-byte region [0x604000963ad0,0x604000963af2)
allocated by thread T0 here:
    #0 0x7fcb524f8d29 in realloc (/usr/lib/gcc/x86_64-pc-linux-gnu/9.3.0/libasan.so.5+0x10cd29)
    #1 0x55e05f0ab05f in unsigned short* cr_realloc<unsigned short>(unsigned short*, unsigned long) coolreader/crengine/src/../include/lvmemman.h:42
    #2 0x55e05f073093 in LVFormatter::allocate(int, int) coolreader/crengine/src/lvtextfm.cpp:880
    #3 0x55e05f0a0a83 in LVFormatter::processParagraph(int, int, bool) coolreader/crengine/src/lvtextfm.cpp:3385
    #4 0x55e05f0aa1f5 in LVFormatter::splitParagraphs() coolreader/crengine/src/lvtextfm.cpp:4101
    #5 0x55e05f0aada9 in LVFormatter::format() coolreader/crengine/src/lvtextfm.cpp:4149
    #6 0x55e05f0606c7 in LFormattedText::Format(unsigned short, unsigned short, int, BlockFloatFootprint*) coolreader/crengine/src/lvtextfm.cpp:4246
    #7 0x55e05ec2f523 in ldomNode::renderFinalBlock(LVRef<LFormattedText>&, RenderRectAccessor*, int, BlockFloatFootprint*) coolreader/crengine/src/lvtinydom.cpp:16640
    #8 0x55e05f0d66dc in renderBlockElementEnhanced(FlowState*, ldomNode*, int, int, int) coolreader/crengine/src/lvrend.cpp:7121
    #9 0x55e05f0d536f in renderBlockElementEnhanced(FlowState*, ldomNode*, int, int, int) coolreader/crengine/src/lvrend.cpp:6833
    #10 0x55e05f0d536f in renderBlockElementEnhanced(FlowState*, ldomNode*, int, int, int) coolreader/crengine/src/lvrend.cpp:6833
    #11 0x55e05f0d536f in renderBlockElementEnhanced(FlowState*, ldomNode*, int, int, int) coolreader/crengine/src/lvrend.cpp:6833
    #12 0x55e05f0d536f in renderBlockElementEnhanced(FlowState*, ldomNode*, int, int, int) coolreader/crengine/src/lvrend.cpp:6833
    #13 0x55e05f0d9381 in renderBlockElement(LVRendPageContext&, ldomNode*, int, int, int, int, int*, int) coolreader/crengine/src/lvrend.cpp:7337
    #14 0x55e05f0d9547 in renderBlockElement(LVRendPageContext&, ldomNode*, int, int, int, int, int*) coolreader/crengine/src/lvrend.cpp:7354
    #15 0x55e05f0820b3 in LVFormatter::measureText() coolreader/crengine/src/lvtextfm.cpp:1888
    #16 0x55e05f0a0b05 in LVFormatter::processParagraph(int, int, bool) coolreader/crengine/src/lvtextfm.cpp:3389
    #17 0x55e05f0aa1f5 in LVFormatter::splitParagraphs() coolreader/crengine/src/lvtextfm.cpp:4101
    #18 0x55e05f0aada9 in LVFormatter::format() coolreader/crengine/src/lvtextfm.cpp:4149
    #19 0x55e05f0606c7 in LFormattedText::Format(unsigned short, unsigned short, int, BlockFloatFootprint*) coolreader/crengine/src/lvtextfm.cpp:4246
    #20 0x55e05ec2f523 in ldomNode::renderFinalBlock(LVRef<LFormattedText>&, RenderRectAccessor*, int, BlockFloatFootprint*) coolreader/crengine/src/lvtinydom.cpp:16640
    #21 0x55e05f0d66dc in renderBlockElementEnhanced(FlowState*, ldomNode*, int, int, int) coolreader/crengine/src/lvrend.cpp:7121
    #22 0x55e05f0d536f in renderBlockElementEnhanced(FlowState*, ldomNode*, int, int, int) coolreader/crengine/src/lvrend.cpp:6833
    #23 0x55e05f0d536f in renderBlockElementEnhanced(FlowState*, ldomNode*, int, int, int) coolreader/crengine/src/lvrend.cpp:6833
    #24 0x55e05f0d536f in renderBlockElementEnhanced(FlowState*, ldomNode*, int, int, int) coolreader/crengine/src/lvrend.cpp:6833
    #25 0x55e05f0d536f in renderBlockElementEnhanced(FlowState*, ldomNode*, int, int, int) coolreader/crengine/src/lvrend.cpp:6833
    #26 0x55e05f0d536f in renderBlockElementEnhanced(FlowState*, ldomNode*, int, int, int) coolreader/crengine/src/lvrend.cpp:6833
    #27 0x55e05f0d9381 in renderBlockElement(LVRendPageContext&, ldomNode*, int, int, int, int, int*, int) coolreader/crengine/src/lvrend.cpp:7337
    #28 0x55e05f0d9547 in renderBlockElement(LVRendPageContext&, ldomNode*, int, int, int, int, int*) coolreader/crengine/src/lvrend.cpp:7354
    #29 0x55e05eb5f7ef in ldomDocument::render(LVRendPageList*, LVDocViewCallback*, int, int, bool, int, LVProtectedFastRef<LVFont>, int, LVFastRef<CRPropAccessor>) coolreader/crengine/src/lvtinydom.cpp:4583

SUMMARY: AddressSanitizer: heap-buffer-overflow coolreader/crengine/src/lvtextfm.cpp:1109 in LVFormatter::copyText(int, int)
Shadow bytes around the buggy address:
  0x0c0880124700: fa fa fd fd fd fd fd fa fa fa fd fd fd fd fd fd
  0x0c0880124710: fa fa fd fd fd fd fd fa fa fa fd fd fd fd fd fd
  0x0c0880124720: fa fa fd fd fd fd fd fd fa fa fd fd fd fd fd fd
  0x0c0880124730: fa fa fd fd fd fd fd fd fa fa fd fd fd fd fd fd
  0x0c0880124740: fa fa fd fd fd fd fd fa fa fa fd fd fd fd fd fa
=>0x0c0880124750: fa fa fd fd fd fd fd fa fa[fa]00 00 00 00 02 fa
  0x0c0880124760: fa fa 00 00 00 00 02 fa fa fa fa fa fa fa fa fa
  0x0c0880124770: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c0880124780: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c0880124790: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c08801247a0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
  Shadow gap:              cc
==9505==ABORTING

Of course, line numbers are different ...
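
For anyone reading the report above, the shadow dump can be cross-checked by hand. The sketch below assumes the default ASan shadow mapping on x86_64 Linux, `shadow = (addr >> 3) + 0x7fff8000`; other platforms use different offsets. The helper names (`shadowAddress`, `addressableBytes`, `regionSize`) are made up for illustration and are not part of ASan or crengine.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: default ASan shadow mapping on x86_64 Linux.
// One shadow byte describes one 8-byte granule of application memory.
constexpr std::uint64_t kShadowScale  = 3;
constexpr std::uint64_t kShadowOffset = 0x7fff8000ULL;

// Address of the shadow byte describing the granule that contains addr.
constexpr std::uint64_t shadowAddress(std::uint64_t addr) {
    return (addr >> kShadowScale) + kShadowOffset;
}

// Addressable application bytes encoded by one shadow byte:
// 0x00 means all 8, 0x01..0x07 means only the first k bytes,
// anything else (redzone, freed, ...) means none.
constexpr unsigned addressableBytes(std::uint8_t shadow) {
    return shadow == 0x00 ? 8u : (shadow <= 0x07 ? shadow : 0u);
}

// Total addressable bytes covered by a run of shadow bytes.
inline unsigned regionSize(const std::uint8_t* shadow, std::size_t count) {
    unsigned total = 0;
    for (std::size_t i = 0; i < count; ++i)
        total += addressableBytes(shadow[i]);
    return total;
}

// Values copied from the report above.
constexpr std::uint64_t kFaultAddr   = 0x604000963aceULL;
constexpr std::uint64_t kRegionStart = 0x604000963ad0ULL;

// The faulting address maps to the bracketed [fa] byte in the row
// starting at 0x0c0880124750 (index 9), i.e. the heap left redzone.
static_assert(shadowAddress(kFaultAddr) == 0x0c0880124759ULL, "fault shadow byte");
// The region itself starts at the next shadow byte (index 10).
static_assert(shadowAddress(kRegionStart) == 0x0c088012475aULL, "region shadow byte");
// "located 2 bytes to the left of 34-byte region"
static_assert(kRegionStart - kFaultAddr == 2, "distance to region start");
```

Feeding the five shadow bytes that follow the bracket (`00 00 00 00 02`) to `regionSize` gives 4 * 8 + 2 = 34, which matches the 34-byte region in the report.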

@poire-z
Contributor Author

poire-z commented Jul 15, 2020

OK, I get it. I have seen in the past that writing just one byte to the left of a buffer sometimes did not crash; I needed to write to the second byte to get a crash :)
Your fix is perfect, I'm picking it as part of #357.
