Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tables rendering: fixes and improvements #243

Merged
merged 12 commits into from
Dec 5, 2018

Conversation

poire-z
Copy link
Contributor

@poire-z poire-z commented Dec 2, 2018

It started with the 30 lines first commit, to support symbol fonts (details in koreader/koreader#4196 (comment)), and...:
image

image

and it ended up with 1500+ lines so we can get instead:
image

new_symbol2

Fixed all the buggy stuff I met in that journey (including a right border drawing bug, noticable in all the screenshots in #233: the right border was thicker - and now the various Vocalu-lo are no more cut - and the background issue noticed in #233 (comment)).

Main improvements and changes:

More adequate computation of column widths, so they don't go smaller than the longest word inside a cell. This should avoid words being cut, and a proper sizing when a cell content uses a different font than the cell itself.
Added getRenderedWidths() to get the min and max content width from any node. Used for now for sizing table cells, but could be used for other things (like floats).
Added support for table and cells width when specified with CSS (there was support for that when specified as HTML tag attributes, which worked only with plain HTML documents, not with EPUBs).
When no table width specified, defaults to "shrink to fit" (as most browsers do) and no more to 100%.
Implement vertical-align: middle/bottom for table cells. Closes koreader/koreader#2846
Details in the individual commit messages.

More comparisons (stuff look constrained in the After screenshots, because I used padding:0 for easier testing - we'd probably find some padding in regular books):

Before:
old1
After:
new1

Before:
old2
After:
new2

Before:
old3
After:
new3
(There must be a bug in the font with the first symbol glyph, as its width is different than its rendered width... the other symbols glyphs are ok.)

Some notes and questions:
Supporting cells specified widths is a bit questionable, as I feel it's always better when not, and just using the computed width according to min/max content width feels always good, and specified width can be boggus and change with DPI. We'll see if we need to remove that.

Unless tables are set with a specified width, they will now take only the width necessary to display cells content (shrink to fit) just like current browsers do.
This might be questionable: I feel like always taking 100% for unstyled tables is most often better than not on our devices (otherwise, smaller tables are stuck to the left, which feels strange with justified text... a 100% width table feels like a justified table :)

Got this feeling with most Wikipedia EPUBs, that looked before/after like:
image image
(there is actually a style="width:9em" on the first TD in a row, that's the reason they look aligned)

What do you think?
If we'd want to restore previous behaviour for unstyled tables, we could just add to epub.css: table { width: 100%; }.

I'll add a few Style tweaks so people can choose alternatives (ignore table and td widths, make tables full width, make tables shrink, center shrinked tables...) but we should choose the most adequate default.

This is a followup to #233, and change quite a bit of what @frankyifei introduced in #35 and #38.

There might be something left to do for CJK, but I can't judge.
I left some code related to CJK commented out in getRenderedWidths(), as logically, one CJK char being itself a word and candidate for wrap, we could get when sizing (in this example, there's a witdh=100% specified in the middle TD...)
image
but not doing it gives a better looking:
image
@frankyifei or other CJK readers: any feeling about that? Any minimal number of CJK chars to aggregate on a single line, or are there some kind of rules for deciding that?

Allow local or embedded fonts like symbol.ttf or
Wingding3.ttf to be used.
Fix infinite loop in measureText() when neither
current font nor fallback font have a glyph for
a char.
Issues mostly noticable in table cells:
- Really ignore leading spaces, so they don't cause a wrap
(and so a leading blank line) when the next word doesn't
fit in the available width.
- When a word exactly fits in the available width, don't
forget to check if the following char (even if it itself
doesn't fit) is a space, in which case it is a candidate
for wrap. Otherwise, such a word would be splitted by
the hyphenation rules (This also helps in normal text
when the last word fits exactly the line: it will avoid
hyphenating that last word, and reduce the amount of added
spacing between words because of that.)
- Fix some crash when some image gets resized to have
a zero width (can happen with some mix of CSS properties)
When consecutive lines or blocks with all page-break: avoid,
if they wouldn't fit on a page, some of them may be missing
in the computed pages (while still visible in scroll mode).
This was because StartPage() did set 'last=line', which
was wrong in the case we split on an earlier line (when
all current lines have pb avoid, we split on a past one
where page break was allowed we remembered (as 'next').
(Mostly noticable with tables with tall rows: currently,
a table with 3 rows has all pb avoid between rows).
This commit's real change is just the removal of 'last=line'
in StartPage().
Also moved the single 'last = line' that was at the end
of AddLine() to the proper logical place in each of the
if/else branches as it makes more sense there when reading.
Added many comments and debugging code (lvpagesplitter
may need a revisit.)
- Adds support for named value 'auto' in some length'ed
properties (width: and margin:, will only be supported
for now when applied to a TABLE element).
- Remove support of units in % for border* properties
according to CSS specs (in some cases, when the container
width is not yet known, it couldn't be computed anyway).
- Fixed computing of border named values (thin, medium,
thick) that I forgot to *256 when I added support for
dpi-based css units (koreader#213).
It was too short by 1px since koreader#211 (bottom border needed koreader#211 fix,
and I added in koreader#211 the same kind of fix for the right border as it
looked needed - but the more obvious error, easier seen
on the right than on the bottom, was already workarounded by
some removed '+1' in the their drawing code.
Now, they are all similar.
Add forgotten styles, and re-order in the same order
styles are defined in lvstyles.h, for easier futur
comparisons and updates.
Added more comments about how things work.

Adds rendering method 'erm_killed' as a last resort
fallback when there is no room to draw some element.
Might be needed in some situations when width < 0,
and used only with tables for now (was needed while
fixing tables as some bug made negative widths, but
it looks like I can't reproduce the need for it...
but well, let's keep it, might be handy someday).

Draw a little figure when erm_killed, so the user
can notice some content is missing.

Also adds optional SHOW_REND_METHOD option to
writeNodeEx() used to return HTML of selection (not
enabled in KOReader 'View HTML', but useful to me
when debugging).
Needed for proper sizing and positionning of some elements in
various contexts: will be used to properly size table cells,
but could be useful if we were to add support for 'float:'.

Estimate min and max content width of node when rendered,
the way it is mentionned in various CSS or HTML specs:
- maxWidth: width if it would be rendered on an infinite
  width area, breaking only on explicit <BR/>.
- minWidth: width with a break on all spaces (no hyphenation),
  so the width taken by the longest word or image.

This does mostly what renderBlockElement, renderFinalBlock and
lvtextfm.cpp do, but caring only about widths and horizontal
margin/border/padding and indent (with no width constraint, so no
line splitting and hyphenation - and we don't care about vertical
spacing and alignment).
(Many related changes that can't easily be splitted into
individual commits... So, describing them as they appear
top to bottom in this commit.)

LookupElem():
- Table cells (TD/TH): adds support for CSS properties
  vertical-align, text-align and width.
- In ldomDocumentWriterFilter(), used for plain HTML files,
  translate deprecated tag attributes width=, align= and
  valign= to their equivalent CSS properties.

PlaceCells():
- Fixed oversized colspan and rowspan by adjusting everything
  to the right amount of really used cols and rows.
- border-collapse: proper collapsing from TR/THEAD/TABLE (now
  correctly displays border from TRs when TDs don't have any).
  Proper collapsing from TDs to adjacent TDs is not implemented.
- Proper computation of cells min and max rendered width
  with the help of getRenderedWidths() (previously,
  some bad approximations were used by just computing
  the length of the whole plain text of the content with
  the TD cell font size).
- Use cells min and max widths to better compute final
  column widths. Do that as a 2nd pass for colspan cells
  while trying to keep single-cells column widths ratios.
  (previously, any colspan was adding too much to all the
  cells it spanned, with the result of all the columns
  having the same widths).
- Properly compute the assignable width for columns, and
  avoid modifying final column widths once computed
  (border-spacing was taken care of too late, and was
  reducing computed cells width, noticable on cells where
  the width has been set to its min-content width)
- Add optional shrink_to_fit parameter to reduce table
  width instead of distributing remaining space to cells.
  Used by default and when table has "width: auto".
  Previous behaviour can be obtained with "width: 100%"

renderCells():
- Never modify computed table cells widths.
- Clearer computation of elements' relative positions to
  their parent (RenderRectAccessor x/y/w/h)
- Fix TRs borders being drawn over cells content by just
  never drawing them: they should only be drawn when
  border-collapse to the cells, by the cells.
- Clearer page splitting (context.Addline()) by doing
  it in the flow as we met elements (instead of only at
  the end as it was done previously).
- Implement vertical-align: middle/bottom of cell content.
- (Moved wrapping function renderTable() from far below
  to just after the table rendering code).

renderBlockElement():
- Switch to erm_killed when erm_table has zero or negative
  width (just in case, hard to reproduce).
- Use shrink_to_fit or not depending on table width: style:
  - when unspecified or "width: auto": shrink_to_fit
  - when it's a length (in %, px or em): no shrink_to_fit
- Properly align (center or right) table when table's
  margin-left: and/or margin-right: is 'auto'.

DrawDocument():
- Don't draw background for TR and THEAD/TBODY/TFOOT:
  bgcolor is now spread to cells (TD) by setNodeStyle().
  (Firefox never draws the TR bgcolor outside of cells:
  the border-spacing stays with the bgcolor of the table).
- Don't draw borders for TR and THEAD/TBODY/TFOOT:
  they are never drawned by Firefox when no border-collapse,
  and when border-collapse, they were collapsed to the cells
  if needed, and will be drawn by the cells

ldomDocumentWriterFilter::OnAttribute():
- HTML autoclosing, with plain HTML documents: fix TR
  that were autoclosing the THEAD/TBODY/TFOOT they are
  part of (so these THEAD... styles were not applied
  to/inherited by any of their TRs)
@poire-z
Copy link
Contributor Author

poire-z commented Dec 2, 2018

Not to be merged/bumped before 2018.12 is released ! as there will probably be new bugs or glitches... :|

@Frenzie
Copy link
Member

Frenzie commented Dec 2, 2018

Oh, I forgot all about that. :-P

I'll probably toss in a MuPDF default scaleByDPI thingy (for EPUB) before I do, and also I don't really feel like it right now. :-)

@Eduardomb22
Copy link

Eduardomb22 commented Dec 2, 2018

This might be questionable: I feel like always taking 100% for unstyled tables is most often better than not on our devices (otherwise, smaller tables are stuck to the left, which feels strange with justified text... a 100% width table feels like a justified table :)

I have feeling this is better for wiki maybe this should be the default for wikipedia https://user-images.githubusercontent.com/24273478/49339210-f0949d00-f62e-11e8-8f2c-c2b29b1f3ca5.png

You are a monster. Thanks a lot for commit!!

@poire-z
Copy link
Contributor Author

poire-z commented Dec 2, 2018

There's also the thing that publishers expect the common rendering on the right (the new one), and most (well, it's the case of 2 of your books I tested with - 1 one of mine did not) will set in their publisher styles table { width: 100%; }, so it will look like the one on the left (the old one), as you can see when viewing your books in Calibre: tables take the full width.

Same KOReader Wikipedia EPUB displayed by Calibre:
calibre

It's a bit more messy with Wikipedia EPUBs, as tables are set in various different ways (classnames: infobox, infobox_v2, infobox_v3, navbox...), with sometimes the width set on the TABLE, sometimes on some TD, sometimes on a surrounding DIV (where we don't support width)...
So, as KOReader is the publisher of the wikipedia EPUBs :) we could set table { width: 100%; } in our wikipedia stylesheet.css. But with !important (impossible to get back the ones specified in wikiepdia html) or not (may need a style tweak to get the 100% again)?

I yet have to see how it look on other regular books, tables are not so common :)

So, 3 levels of tweaking:

  • our epub.css: either nothing, or table { width: 100%; }, or table { width: auto; margin-left: auto; margin-right: auto } to have small tables centered.
  • wikipedia stylesheet.css (not sure yet what I prefer)
  • style tweaks: for a wide choice of personal preferences

So, still thinking :)

@Frenzie
Copy link
Member

Frenzie commented Dec 3, 2018

or other CJK readers: any feeling about that? Any minimal number of CJK chars to aggregate on a single line, or are there some kind of rules for deciding that?

Is that a Wikipedia page? I'll just ask my standard question: what do the browsers do? (Because even if there's no written standard, there's often a de facto one.)

@poire-z
Copy link
Contributor Author

poire-z commented Dec 3, 2018

Is that a Wikipedia page? I'll just ask my standard question: what do the browsers do?

Yes, some random mixed chinese/english page I once found and now use for testing:
https://zh.wikipedia.org/wiki/Key_(%E9%81%8A%E6%88%B2%E5%93%81%E7%89%8C)

Browsers have a luxury we don't have, they can overflow :) Which is what firefox does:
image
In crengine, we try to keep specified widths and reduce others with unspecified widths (but if none: all columns) to the now-computed min-content-width.
In the 2nd screenshot, the full CJK-sequence-considered-as-a-single-word made that min-content-width, but even there, it was too much, so it was still further reduced and 1 CJK char ended up wrapped.
In the 1st screenshot, I said that a single CJK char make the min-content-width, so crengine went ahead: as it needs 100% for the 2nd cell, it reduced the 1st to the min it can: the longest CJK char width.
This example page/table is a bit twisted, but it gives an extreme sample case.

But OK, testing with just extracting some text from this table and putting it in a simple table with no width specified, Firefox does not mind wrapping down to a single CJK char:

image (right truncated)
image (right truncated)
image
but it seems to additionally try to keep final heights balanced when computing widths. crengine can't easily do that.

Same simpler table with this PR crengine, with the is_cjk stuff still commented out (so, full sequence of CJK considered as a unbreakable unit):
image

With the is_cjk uncommented (so, each CJK char considered a small unbreakable unit):
image

Same with some non-CJK slipped in:
image

That's what Firefox does.
We'll see and adapt when we get feedback from CJK users.
@poire-z
Copy link
Contributor Author

poire-z commented Dec 4, 2018

No more Travis CI auto-check ?
I'll bump this, and just add table {width: 100%;} to wikipedia.lua stylesheet.css, and a few style tweaks, and we'll see the feedback we get.

@poire-z
Copy link
Contributor Author

poire-z commented Dec 4, 2018

I'll add some style for BLOCKQUOTE in wikipedia.lua, because currently, there is none, and so P inside BLOCKQUOTE just look like any non-quote P.
In Wikipedia page, there is.

blockquote {
    overflow: hidden;
    margin: 1em 0;
    padding: 0 40px;
}

Thinking about using:

blockquote {
    margin: 0.5em 0;
    padding: 0 2em;
}

Should this has its place in our main epub.css instead?
(The padding 2em might looks a bit large, but it seems needed for a visual shift from the p { text-indent: 1.2em } if there is a single line P before or after like <p>John said:</p><blockquote><p>blah blah</p></blockquote>)

@Frenzie
Copy link
Member

Frenzie commented Dec 4, 2018

If that's not in the styles, I indeed think that'd be good to add. It's fairly standard.

@Frenzie
Copy link
Member

Frenzie commented Dec 4, 2018

Travis is probably just a hiccup.

Don't avoid wrap at end of line when ending on a space
preceded by a CH_PROP_AVOID_WRAP_AFTER char.
@poire-z
Copy link
Contributor Author

poire-z commented Dec 4, 2018

Ok, added.
And fixed a small issue that I just witnessed when image

image

Fixed, showing how BLOCKQUOTE is now rendered:
image

Illustration for the P 1.2em and need for BLOCKQUOTE 2em:
image

@poire-z
Copy link
Contributor Author

poire-z commented Dec 4, 2018

OK, switched to using margin-left/right instead of padding-left/right as suggested by Frenzie in a comment to the (now gone) epub.css previous commit.

@Frenzie
Copy link
Member

Frenzie commented Dec 4, 2018

If you want me to read over the actual code, that won't be for today. CSS is just something I can do at a glance. ;-)

@poire-z
Copy link
Contributor Author

poire-z commented Dec 4, 2018

Not forcing that on you :) I'll merge and bump all that tomorrow.

@frankyifei
Copy link
Contributor

For Chinese, one character is one word and there is no rules for minimal characters in a line. I don't know about Japanese and Korean. Actually it looks much better now after your change. There is a W3C format requirement for Japanese, which is most comprehensive. Japanese text layout

@poire-z
Copy link
Contributor Author

poire-z commented Dec 5, 2018

Thanks for the feedback!
So going on and merging.

@poire-z poire-z merged commit 0d5dc2d into koreader:master Dec 5, 2018
@poire-z
Copy link
Contributor Author

poire-z commented Aug 7, 2020

I guess that initial test case is a good start to refine it to an even smaller one :) And may be we won't find a smaller one, and it's the whole combination that causes the issue.
(Haven't had the time yet to investigate, but indeed, my gut feeling is that the cause might not be the autoclosing of table elements - but of something else, or bad interactions with what's before)

@Frenzie
Copy link
Member

Frenzie commented Aug 7, 2020

I guess that initial test case is a good start to refine it to an even smaller one :)

Absolutely, I intended it as a PSA as much as anything. 👍

15 years ago I occasionally reported bugs to Opera, which is when I was given some friendly pointers to improve my test cases. I liked to write minimal HTML, of this variety:

<!DOCTYPE html>
<title>test case</title>
<p style="color:blue">This should be blue but it's red! (Okay, obviously this never happened.)

And they told me that generally speaking I should first try to write it like this instead to ensure it's truly an issue about incorrect color parsing:

<!DOCTYPE html>
<html>
<head>
<title>test case</title>
</head>
<body>
<p style="color:blue">This should be blue but it's red! (Okay, obviously this never happened.)</p>
</body>
</html>

In other words, an ideal minimal testcase only includes the absolute minimum to reproduce the issue but it avoids going so minimal that it might introduce confounding issues.

The above is a somewhat badly chosen illustration since the "confounding issues" from my sample are the entire point. Perhaps it should be reversed for illustrative purposes instead:

<!DOCTYPE html>
<title>test case</title>
<p style="color:blue">This should render but it doesn't because the DOM thinks there's no BODY element!
(So what's that style=color:blue doing there? It's just distracting. :-) )

Btw, @a-kohout is excellent at providing minimal test cases. See koreader/koreader#3192 for example. :-)

@Frenzie
Copy link
Member

Frenzie commented Aug 7, 2020

If we remove some of the confounding/obscuring factors you still get a plenty weird result. The test case could still be slightly more minimal, but the following is roughly what I like to see in a test case. ^_^

<!DOCTYPE html>
<html>
<head>
<title>Table autoclosing</title>
</head>
<body>
<table>

<tr>
<td>
<b>Сферы производства</b>

<td>
<b>1890</b>

<td>
<b>1900</b>

<td>
<b>Прирост</b>

<tr>

<td>
Чугун, <i>т</i>

<td>
927 100

<td>
2 933 700

<td>
216%

</table>

</body>
</html>

Screenshot_2020-08-07_15-04-34

Vs browser (Firefox but doesn't matter):
Screenshot_2020-08-07_15-04-56

@Frenzie
Copy link
Member

Frenzie commented Aug 7, 2020

Comparatively, closing manually with a </tr> renders as expected:

Screenshot_2020-08-07_15-08-06

@poire-z
Copy link
Contributor Author

poire-z commented Aug 7, 2020

(Another thing that can be at play : Tables: complete incomplete tables #328 )

@virxkane
Copy link
Contributor

virxkane commented Aug 7, 2020

Indeed, you've completely lost me. If this document is not related to autoclosing at all why did you bring it up? :-/

Sure? please read again:

This is a slightly simplified snippet from a real problematic chm file.

Пайпс_Русская+революция_1.chm.zip

@virxkane
Copy link
Contributor

virxkane commented Aug 7, 2020

I just showed you how changes in one line broke the auto-closing of the <tr> tag.
I did not write here all the way as I came to this, here it is not appropriate. But it seems to me that this is the problem. Maybe I'm wrong, it happens :) But instead of trying to prove to me that I'm wrong, just look at this file "Пайпс_Русская+революция_1.chm.zip".

@virxkane
Copy link
Contributor

virxkane commented Aug 7, 2020

Comparatively, closing manually with a </tr> renders as expected:

Now go ahead and restore the string static const char * AC_TR[] = {"tr", "tr", "thead", "tfoot", "tbody", NULL};.
I not think that its the best solution, I just showed you the problem.

@poire-z
Copy link
Contributor Author

poire-z commented Aug 7, 2020

Disabling the COMPLETE_INCOMPLETE_TABLES flag (to not add mess to the mess).

When describing the algorithm:

[...] we check the current node and up its parent until we find one that is listed among the followup list of tags: ["li", "p"] (we stop at the first one found).

<tr>
<td>
<b>Прирост</b>
<tr>  <!-- This will close the preceeding TD, but not the first TR -->
<td>
Чугун, <i>т</i>

So, the next TR will close the TD, and will the be sibling of the TD, which is unexpected and some bit of the code decides to set rend_method = erm_killed:

// Check and deal with unproper children
if ( !is_proper ) { // Unproper child met
// printf("initTableRendMethods(%d): child %d is unproper\n", state, i);
// (20200626: not doing it when hasCacheFile() is no longer an issue)
if ( BLOCK_RENDERING_G(COMPLETE_INCOMPLETE_TABLES) && !(false && enode->getDocument()->hasCacheFile()) ) {
// We can insert a tabularBox element to wrap unproper elements
last_unproper = i;
if (first_unproper < 0)
first_unproper = i;
}
else {
if ( BLOCK_RENDERING_G(COMPLETE_INCOMPLETE_TABLES) && (false && enode->getDocument()->hasCacheFile()) ) {
enode->getDocument()->setBoxingWishedButPreventedByCache();
}
// Asked to not complete incomplete tables, or we can't insert
// tabularBox elements anymore
if ( !BLOCK_RENDERING_G(ENHANCED) ) {
// Legacy behaviour was to just make invisible internal-table
// elements that were not found in their proper internal-table
// container, but let other non-internal-table elements be
// (which might be rendered and drawn quite correctly when
// they are erm_final/erm_block, but won't be if erm_inline).
if ( d > css_d_table ) {
child->setRendMethod( erm_invisible );
}
}
else {
// When in enhanced mode, we let the ones that could
// be rendered and drawn quite correctly be. But we'll
// have the others drawn as erm_killed, showing a small
// symbol so users know some content is missing.
if ( d > css_d_table || d <= css_d_inline ) {
child->setRendMethod( erm_killed );
}
// Note that there are other situations where some content
// would not be shown when !COMPLETE_INCOMPLETE_TABLES, and
// for which we are not really able to set some node as
// erm_killed (for example, with TABLE > TABLE, the inner
// one will be rendered, but the outer one would have
// a height=0, and so the inner content will overflow
// its container and will not be drawn...)
}
}
}

and so the content of the other TRs are not shown.
image

So, for this case, I'd say it's a limitation of the AutoClose algorithm, that is stopping at the first parent in the AC_TR list, and won't close the preceeding row TR.

If you add to that the "complete incomplete table" stuff, it will wrap the standalone TR (that should not be there as a child of TR) into a TABLE-like tabularBox element:
image
which is what this code should do when it meets a standalong <TR> or <div style="display: table-row"'>

So, I'd say it's the AutoClose algorithm that should be fixed/enhanced, so to correctly close multiple unclosed elements (if that's even possible without taking wrong decisions). (note: I'm not interested in digging into that at all :/)
Then, ldomDocumentWriterFilter() used with HTML/CHM would produce a DOM without incomplete tables, and the "complete incomplete table" stuff wouldn't have to be involved.

Now go ahead and restore the string static const char * AC_TR[] = {"tr", "tr", "thead", "tfoot", "tbody", NULL};.
I not think that its the best solution

Indeed, with this AC_TR list, it won't close the TD, and won't stop there, and so it will properly close the TR.
So, let's just try this (which is not bad as the one with thead & co - but should work just the same):
static const char * AC_TR[] = {"tr", "tr", NULL};.
As we don't ask it to close any TD, it will close the TR.
And the ldomDocumentWriter::pop() I mentionned above will have closed the TD:
image

So, that looks like an easy solution - but I don't know it's side effects :)
I can't figure out (well, I actually don't want to think about it) what's the job of AutoClose vs pop(), how one should expect the other to do what it doesn't do, etc... :)

@Frenzie
Copy link
Member

Frenzie commented Aug 7, 2020

@virxkane I'm not trying to prove you wrong, unless you disagree about effective communication through minimal test cases. In that case I am. ^_^

However, your proposed fix only renders correctly in this particular case somewhat by accident. The resulting DOM is incorrect even if we feed it a fully fledged out table as included below. I would agree it seems like a reasonable workaround since thead, tfoot and tbody are mostly somewhat redundant, although if used for styling through CSS that'd fail to work as intended.

<!DOCTYPE html>
<html>
<head>
<title>Complete table</title>
<style>
thead {color:red}
tbody {color:blue}
</style>
</head>
<body>

<table>
<thead>
    <tr>
        <th>Сферы производства</th>
        <th>1890</th>
        <th>1900</th>
        <th>Прирост</th>
    </tr>
</thead>
<tbody>
    <tr>
        <td>Чугун, <i>т</i></td>
        <td>927 100</td>
        <td>2 933 700</td>
        <td>216%</td>
    </tr>
</tbody>
</table>

</body>
</html>

The intended rendering looks like:
Screenshot_2020-08-07_15-57-52

But since the thead and tbody are treated incorrectly in your suggested patch, it just remains black.
Screenshot_2020-08-07_15-58-20

By comparison, here's the correct DOM with the correct rendering as in the current code.
Screenshot_2020-08-07_15-59-04

I'd be inclined to agree that the damage is worse without your suggested workaround, but it's definitely a workaround, not a fix. :-)

@virxkane
Copy link
Contributor

virxkane commented Aug 7, 2020

@poire-z

So, for this case, I'd say it's a limitation of the AutoClose algorithm, that is stopping at the first parent in the AC_TR list, and won't close the preceeding row TR.

Seem like that, thank you.

@Frenzie

I'm not trying to prove you wrong, unless you disagree about effective communication through minimal test cases. In that case I am. ^_^

Ok. The next sample will be smaller :)

However, your proposed fix only renders correctly in this particular case somewhat by accident.

No, no, no. I don't proposed this fix, I show problem, I ask how to fix. Point.

@poire-z
Copy link
Contributor Author

poire-z commented Aug 7, 2020

So, anyway, if none of us want to think too much about that, we could agree on this scenario:

-static const char * AC_TR[] = {"tr", "tr", "thead", "tfoot", "tbody", NULL};
+static const char * AC_TR[] = {"tr", "tr", "td", NULL};

I did this in 2018 because having THEAD, TBODY... is wrong.
I added the "td" because it felt like the inverse correct solution, without too much thinking.
But logically, it shouldn't be needed: if there was an opening TD (and it must have one even for dirty HTML publishers that don't close anything - or no content would be shown), it will be closed by pop() when closing the TR. So no need to have in in AC_TR.
So, let's just remove the "td" :)

-static const char * AC_TR[] = {"tr", "tr", "td", NULL};
+static const char * AC_TR[] = {"tr", "tr", NULL};

But after that, we're not yet done with your initial test case :(

@poire-z
Copy link
Contributor Author

poire-z commented Aug 7, 2020

Same reason for the imbricated DD.
AC_DD[] = {"dd", "dd", "p", NULL};
A DD will close a DD - except if there was a P inside the previous DD: it will then close only the P, and will stay inside the previous DD.
While Firefox properly have them not-imbricated and at the same level:

Firefox inspector | Firefox | KOReader | KOReader DOM:
image

@poire-z
Copy link
Contributor Author

poire-z commented Aug 7, 2020

  <body>
<table border>
<tr>

renders well:
image

Just add a DD between BODY and TABLE, and boum:

  <body>
<dd>
<table border>
<tr>

image

image

Ok, further tests by removing stuff or adding closing tags to avoid triggering some auto-close:

  <body>
ABC
<dd>
<table border>
<tr>
<td width=148>
<dd><b>СÑ~DеÑ~@Ñ~K пÑ~@оизводÑ~AÑ~Bва</b></dd>
</td>
<td width=72>
<p align=center><b>1890</b></p>
</td>
</tr>
</table>
  </body>

It's the 2nd DD instide the table cell that causes the first one to be auto-closed - even if that 2nd one is properly closed:
image

So, all that mess is also caused by:
AC_DD[] = {"dd", "dd", "p", NULL};
When we meet the 2nd DD in the table cell TD: we go check all the parents until we find a DD or P, and we close all the elements met up to there: the TD, the TR, the TABLE, and the DD.
And what comes next is no more in a TABLE, and just a bunch of block elements.

Remove the DD in the all the table cells, and the TABLE will render well, properly inside that first DD.

I have no idea why it all looked fine in CoolReader 3.2.38...

@poire-z
Copy link
Contributor Author

poire-z commented Aug 10, 2020

I tried reading the specs but I finally gave up and went back at the original AutoClose() code and tried to make it do what we expect - and not fail on the problems noticed (nested elements, stopping too early) - may be somehow doing it in the spirits of what I got from the specs.

This seems to solve the issues noticed above.

--- a/crengine/src/lvtinydom.cpp
+++ b/crengine/src/lvtinydom.cpp
@@ -12952,24 +12952,46 @@ void ldomDocumentWriterFilter::appendStyle( const lChar16 * style )
 void ldomDocumentWriterFilter::AutoClose( lUInt16 tag_id, bool open )
 {
     lUInt16 * rule = _rules[tag_id];
     if ( !rule )
         return;
     if ( open ) {
         ldomElementWriter * found = NULL;
+        ldomElementWriter * found_fallback = NULL;
         ldomElementWriter * p = _currNode;
         while ( p && !found ) {
             lUInt16 id = p->_element->getNodeId();
             for ( int i=0; rule[i]; i++ ) {
                 if ( rule[i]==id ) {
-                    found = p;
-                    break;
+                    if ( i == 0 ) {
+                        // rule[0] should be the same as the element that we see opening
+                        // If we meet one, we're done auto-closing
+                        found = p;
+                        break;
+                    }
+                    else {
+                        // rule[1+] are other elements that we can meet and that we
+                        // should auto-close to become a sibling of them, but only
+                        // when no same-element (from rule[0] has been found.
+                        // Not sure if we should stop at the nearest one met,
+                        // or take the upper most one.
+                        found_fallback = p;
+                    }
                 }
             }
             p = p->_parent;
+            if ( p && p->_element->getNodeId() == el_table ) {
+                // Don't cross TABLE: a new DD inside a TABLE CELL should
+                // not auto-close a upper DD that contains this TABLE.
+                // This feels generic enough, so do it in all cases
+                break;
+            }
+        }
+        if ( !found && found_fallback ) {
+            found = found_fallback;
         }
         // found auto-close target
         if ( found != NULL ) {
             bool done = false;
             while ( !done && _currNode ) {
                 if ( _currNode == found )
                     done = true;

image

image

@Frenzie
Copy link
Member

Frenzie commented Aug 10, 2020

Possibly Good Enough™ and better either way. But it's much too warm to think clearly over here. :-)

@poire-z
Copy link
Contributor Author

poire-z commented Aug 10, 2020

Although it might still not be right.

static const char * AC_LI[] = {"li", "li", "p", NULL};
static const char * AC_UL[] = {"ul", "li", "p", NULL};  // ??
static const char * AC_OL[] = {"ol", "li", "p", NULL};  // ??
static const char * AC_DD[] = {"dd", "dd", "p", NULL};
static const char * AC_DL[] = {"dl", "dt", "p", NULL};
static const char * AC_DT[] = {"dt", "dt", "dd", "p", NULL};

(these are shifted one slot, so rule[0] is actually the 2nd element

I'm handling rule[0] should be the same as the element that we see opening which works for the LI and DD above - but not for UL.
A new UL should never close any upper UL I guess.
Should it close the current LI ? I don't think so, it can be a child UL > LI > UL
Should it close an upper LI ? I don't think so: UL > LI > DIV > P + LI
Should it close stop at the first upper LI ?
So, no rule for LI ?
Or juste remove the 2nd "lu" and let It close and stop at any upper P ?

@poire-z
Copy link
Contributor Author

poire-z commented Aug 11, 2020

Pfffffff. If we update the AC_* lists or the AutoClose() algorithm, we might have a totally different DOM, and previously made bookmarks/highlights would not work.
And our normalized XPointers and their migration only deal with the added "internal boxing" elements - so it's impossible to do any migration if we change the auto closing stuff as some elements will have new parents (which is possible even on documents that did render quite well and had no noticable issue).

So, that would mean a new DOM_VERSION - and keep using the old AC_ and AutoClose() function for older DOM versions, and switch to new ones for new books - with no migration.
With KOReader, it would be the default behaviour (to not upgrade) - but I think CoolReader always upgrade to the latest DOM version, right ?
And each time we'll fix that autoclose stuff because of a bad decision made now or some unforeseen case, we'll need to do that again.

If I trust some past comments of mine:

  • ldomDocumentWriterFilter and the AutoClose stuff is only used with HTML, CHM and PDB(html).
  • EPUB, FB2, RTF, WORD, plain text, PDB(txt) use ldomDocumentWriter which doesn't do AutoClose.

So, should we care about not breaking past highlights for these formats that I consider little-used document formats ?
Or can we just ignore this issue ? :/ :)

@Frenzie
Copy link
Member

Frenzie commented Aug 11, 2020

I'll drop these tests here https://github.com/html5lib/html5lib-tests/blob/71eebd59772d1d39aced0c0582ae9c09acf3ce6e/serializer/optionaltags.test

My view is similar to yours. Breaking it should be relatively low-impact. But if we do, unfortunately that makes the option of incremental improvements (i.e., continuous breakage) worse than it might have been.

@poire-z
Copy link
Contributor Author

poire-z commented Aug 12, 2020

So, I guess my first idea above is not right (assuming the second element is the same as the element and we should stop there).
I came up with this:

--- a/crengine/src/lvtinydom.cpp
+++ b/crengine/src/lvtinydom.cpp
@@ -12954,21 +12954,50 @@ void ldomDocumentWriterFilter::AutoClose( lUInt16 tag_id, bool open )
     lUInt16 * rule = _rules[tag_id];
     if ( !rule )
         return;
-    if ( open ) {
+    if ( open ) { // called by OnTagOpen()
         ldomElementWriter * found = NULL;
         ldomElementWriter * p = _currNode;
-        while ( p && !found ) {
-            lUInt16 id = p->_element->getNodeId();
+        lUInt16 id = p ? p->_element->getNodeId() : 0;
+        while ( p ) {
             for ( int i=0; rule[i]; i++ ) {
                 if ( rule[i]==id ) {
                     found = p;
-                    break;
+                    // We don't stop at the first one met: we continue checking
+                    // upper nodes that this node might close. But we might break
+                    // below when meeting one of the stop element.
                 }
             }
             p = p->_parent;
+            if ( !p )
+                break;
+            id = p->_element->getNodeId();
+            // Some hardcoded bits (to not complicate AC_* rules and their parsing...)
+            // for parent elements that should stop us.
+            if ( id == el_table ) {
+                // Never cross a TABLE (a new DD inside a TABLE CELL should
+                // not auto-close a upper DD that contains this TABLE).
+                // This feels generic enough, as a table has independant block
+                // rendering contexts for each of its cells.
+                break;
+            }
+            if ( id == el_ruby ) {
+                // RUBY should mostly containe ruby sub-element (and possible
+                // some other inline elements that don't auto close).
+                // So, we shouldn't need to check for tag_id: stop there
+                break;
+            }
+            if ( tag_id == el_li && (id == el_ul || id == el_ol) ) {
+                // A LI stops at its UL or OL container, so it does not
+                // close a upper LI
+                break;
+            }
+            if ( id == el_dl && (tag_id == el_dl || tag_id == el_dt || tag_id == el_dd ) ) {
+                // A DL, DT or DD stop at its DL container, so it does not close it
+                // not any upper DT or DD
+                break;
+            }
         }
-        // found auto-close target
-        if ( found != NULL ) {
+        if ( found != NULL ) { // found auto-close target
             bool done = false;
             while ( !done && _currNode ) {
                 if ( _currNode == found )

And this AC_ stuff I re-ordered and commented (with a few changes but not showing the diff as it would make it less readable):

// Most block container elements close a P.
// Most should not close themselves, as they can be nested
// When closing upper nodes, we stop at TABLE to not close them, as their content
// is an independant rendering context (this is hardcoded in AutoClose())
static const char * AC_P[]  = {"p", "p", NULL}; // a P close any P, so there can't be nested <P>s
static const char * AC_DIV[] = {"div", "p", NULL};

// (UL/OL used to have "li" there, but this was wrong as a UL/OL can be inside a LI, with nested lists)
static const char * AC_UL[] = {"ul", "p", NULL};
static const char * AC_OL[] = {"ol", "p", NULL};
// A LI should close a previous LI, but should not close an upper UL or OL, and should not
// go look upper them and close any upper LI (this is hardcoded in AutoClose())
static const char * AC_LI[] = {"li", "li", "p", NULL};

// Description list (DL = list, DT = term, DD = definition/description)
// None of these should close an upper DL containing them, and should not go look upper
// their containing DL and close any upper DD/DT (this is hardcoded in AutoClose())
// A new DL can happen in a DD, but not in a DT.
static const char * AC_DL[] = {"dl", "dt", "p", NULL};
// DT and DD close any previous term or definition
static const char * AC_DT[] = {"dt", "dt", "dd", "p", NULL};
static const char * AC_DD[] = {"dd", "dd", "dt", "p", NULL}; // (used to not close DT, but no reason not to)

// A TABLE should not close anything (except any previous P), because of nested tables
static const char * AC_TABLE[] = {"table", "p", NULL};
// Table rows container should close any previous rows container or standalone rows
static const char * AC_THEAD[] = {"thead", "tr", "thead", "tfoot", "tbody", NULL};
static const char * AC_TBODY[] = {"tbody", "tr", "thead", "tfoot", "tbody", NULL};
static const char * AC_TFOOT[] = {"tfoot", "tr", "thead", "tfoot", "tbody", NULL};
// A TR should close any previous TR (or unclosed previous TD, that would anyway
// be closed when closing its containing TR)
static const char * AC_TR[] = {"tr", "tr", "td", NULL};
// TH and TD close any previous TH or TD
static const char * AC_TH[] = {"th", "th", "td", NULL};
static const char * AC_TD[] = {"td", "td", "th", NULL};

// Ruby elements (often without a closing element and expecting auto closing
// according to various rules, sometimes contradictory...)
// When closing upper nodes, we stop at RUBY to not close upper ruby sub-elements
// and allow nested ruby (this is hardcoded in AutoClose())
// A new RBC and RTC close every other ruby sub-element
static const char * AC_RBC[] = {"rbc", "rbc", "rb", "rtc", "rt", NULL};
static const char * AC_RTC[] = {"rtc", "rtc", "rt", "rbc", "rb", NULL};
// A new RB closes every other ruby sub-element, except RBC (which may contain multiple RB)
static const char * AC_RB[] = {"rb", "rb", "rt", "rtc", NULL};
// A new RT closes every other ruby sub-element, except RTC (which may contain multiple RT)
static const char * AC_RT[] = {"rt", "rt", "rb", "rbc", NULL};
// A RP (ruby fallback parenthesis) closes ruby sub-element, except containers
static const char * AC_RP[] = {"rp", "rp", "rb", "rt", NULL};

// A new one of these should close any previous same one
static const char * AC_OPTION[] = {"option", "option", NULL};
static const char * AC_PARAM[] = {"param", "param", NULL};
static const char * AC_PRE[] = {"pre", "pre", NULL};

// Auto-closing (no sub-element accepted)
static const char * AC_HR[] = {"hr", NULL};
static const char * AC_BR[] = {"br", NULL};
static const char * AC_WBR[] = {"wbr", NULL};
static const char * AC_IMG[]= {"img", NULL};
static const char * AC_COL[] = {"col", NULL};
static const char * AC_BASE[] = {"base", NULL};
static const char * AC_LINK[] = {"link", NULL};
static const char * AC_META[] = {"meta", NULL};
static const char * AC_AREA[] = {"area", NULL};
static const char * AC_EMBED[] = {"embed", NULL};
static const char * AC_INPUT[] = {"input", NULL};
static const char * AC_SOURCE[] = {"source", NULL};
static const char * AC_TRACK[] = {"track", NULL};

Not much change to the original AC (except UL/OL not closing any LI, and DT also closing a DD, for symetry), I've just added comment to explain why these rules with what I know about these tags, and not by following the specs. The new stuff is the stop rules in the C code.
This too makes the CHM and the test autoclose HTML fine.

@poire-z
Copy link
Contributor Author

poire-z commented Aug 13, 2020

I'm still trying to understand the specs, and I need a bit of spec reading wits :)

There is list of tags that are considered "special", which looks mostly like all elements that are not inline (so, block and others found in head, etc...)
https://html.spec.whatwg.org/multipage/parsing.html#special

Can you look at these steps in the section https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inbody:
A start tag whose tag name is "li"
and the following
Any other end tag

They both are bout walking up the parents to close the matching start tag, but they mention:
For start LI: If node is in the special category, but is not an address, div, or p element, then jump to the step labeled done below.
For any other end tag: Otherwise, if node is in the special category, then this is a parse error; ignore the token, and return.

Do you understand (like I do) that all these "special" tags are stop tags when walking up parents? (so, with <span> <ul> <li> </span> , or with <dl> <dd> <ul> <li> </dd> the last one won't close the first one ? Does that make sense ?

Above the Any other end tag, there is:
An end tag whose tag name is one of: "address", "article", "aside", "blockquote", "button", "center", "details", "dialog", "dir", "div", "dl", "fieldset", "figcaption", "figure", "footer", "header", "hgroup", "listing", "main", "menu", "nav", "ol", "pre", "section", "summary", "ul"
These would then not have that limitation to stop at a special tag.
Does that make sense?

There's another set:
A start tag whose tag name is one of: "address", "article", "aside", "blockquote", "center", "details", "dialog", "dir", "div", "dl", "fieldset", "figcaption", "figure", "footer", "header", "hgroup", "main", "menu", "nav", "ol", "p", "section", "summary", "ul"
plus a few more specific ones, that all should close any previously still open P.

So, it looks like we could do with a few sets of elements, and if we order them correctly in fb2def.h, we could hardcode a few if (tag_id >= el_special_start && tag_id <= el_special_end) and do the closing instead of having tons of AC_ that all just close a P.

(There's lots of stuff we won't be able to do easily, like active formatting elements a, b, big, code, em, font, i, nobr, s, small, strike, strong, tt, and u, that when upper and closed inner by some mis-nested elements, should be resurrected and still applied to next content... or mis-nested elements adoptions agency stuff, and some other stacks, and specific modes for when parsing tables... but well :)

@Frenzie
Copy link
Member

Frenzie commented Aug 13, 2020

Do you understand (like I do) that all these "special" tags are stop tags when walking up parents? (so, with <span> <ul> <li> </span> , or with <dl> <dd> <ul> <li> </dd> the last one won't close the first one ? Does that make sense ?

I think so. I think that should be <span><ul><li></li></ul></span> and <dl><dd><ul><li></li></ul></dd></dl>.

@poire-z
Copy link
Contributor Author

poire-z commented Aug 15, 2020

Still trying to understant the specs...
A little thing saddened me:
image image

Everything block closes a P.
So, that would mean that inside a P, you can't have any fancy embedded (as I called them) float with some stuctural block elements (like DIV, HR, TABLE), but only plain inline stuff (like IMG, dropcaps in SPAN...).
So, to do anything fancy, you'd have to close the P, insert the fancy float as a block element, and re-open the P (the algo just closes it, it won't re-open it) - and so, have a line break even if you didn't want one.

Does that sound right? And if yes, what would be the alternative if I wanted a fancy float in the middle of some text, and I didn't want any break in my text? (I see: just not use a P - but then it breaks all the "semantic" idea of using P...) (or is it that in the "semantic" realm, it's a sin to insert some fancy foreign stuff in a P unit that should read without any foreign break?)

@poire-z
Copy link
Contributor Author

poire-z commented Aug 15, 2020

And nested (misnested or not) Hx tags only close the direct previous - but not if there's an intermediate in-between...:
image

@Frenzie
Copy link
Member

Frenzie commented Aug 15, 2020

(or is it that in the "semantic" realm, it's a sin to insert some fancy foreign stuff in a P unit that should read without any foreign break?)

As long as it's related (like images or whatever), that should be fine. But indeed they'll be auto-closed by other block-level elements.

@poire-z
Copy link
Contributor Author

poire-z commented Aug 19, 2020

I'm still working on that HMTL parser. I think I made sense of the specs...
A good part of the stuff like "close emplied end tags" in the algo is just to notice the HTML is bad and flag a "parser error" that the algo just ignore and go do normal work as if nothing was wrong (often just skipping these tags while closing up to the target node).
So, if we don't care about noticing bad HTML and just take the normal behaviour, it's simplified quite a bit.
Still some stuff I have to think about before proposing something, I'd rather get it right as it will need a new DOM_VERSION (I can keep the current behaviour for older dom versions, the new stuff plugs in rather nicely as an alternate algorithm).
Mentionning that again: with KOReader, we won't update the dom_version of previously opened books - but I think you do that with CoolReader. Might be not much of an issue with this one, as it will only affect plain HTML and CHM books.

And there's the lib.ru hacks in that parser to possibly try to solve.

(All that specs reading and work for so little personal benefit :), as most formats like FB2 or EPUB don't use that HTML parser... but better get something right than add more hacks :/ )

@virxkane
Copy link
Contributor

virxkane commented Aug 19, 2020

A good part of the stuff like "close emplied end tags" in the algo is just to notice the HTML is bad and flag a "parser error" that the algo just ignore and go do normal work as if nothing was wrong (often just skipping these tags while closing up to the target node).

Good news.

Mentionning that again: with KOReader, we won't update the dom_version of previously opened books - but I think you do that with CoolReader.

For me, this is still an open question. Now, in the program settings, we can choose 'legacy' or 'latest'. And this choice is saved for each book. If we remove this option, we will have to do something with the bookmarks saved in the database.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CSS: table: vertical-align; td: width unsupported
6 participants