More intelligent accumulation of text into para #9

LoneRifle · 2019-06-25T06:36:14Z

Our fork of pdf2md accumulates all text into a single contiguous paragraph in each block (2d6d889). While this is helpful in most cases, the approach does not distinguish between regular text and, say, headings, which should each be placed on their own line.

jenlky · 2019-09-24T13:17:24Z

Hi, I tried the library and got the results below. It managed to convert the chapter headers into bold text.

What do you mean by "approach does not distinguish between regular text and headings"? Do you mean the output should put "The Image of the Malays until the Time of Raffles" on one line? It currently does so inconsistently for some of the subsequent chapter headers in bold.

The Myth of the Lazy Native pdf

Markdown snippet

LoneRifle · 2019-09-30T02:45:26Z

Hi @jenlky, as you have observed, headings are properly recognised and converted. Where it doesn't do so well is with text that is bolded and on its own line.

Using this as an example...
pdf2md parsing demo.pdf

One gets...

# Headings Work 

And pdf2md can distinguish between normal text and the heading like so. 

## Sub-Headings Work 

And again, pdf2md can distinguish between this and the heading like so. But.... the moment you have something like... Some artificially generated heading created as normal text like this Then the heading will end up as part of the text
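One possible heuristic for the case above, as a minimal sketch only: treat a short, fully bold line with no trailing sentence punctuation as a pseudo-heading and flush the running paragraph around it. This is not pdf2md's code (pdf2md is JavaScript). The `Line` type, its `bold` flag, the 12-word limit, and the function names are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Line:
    """Hypothetical parsed line; not a pdf2md type."""
    text: str
    bold: bool = False  # assumed flag: True when the whole line is in a bold font


def looks_like_heading(line: Line, max_words: int = 12) -> bool:
    """Guess whether a line is a pseudo-heading: bold, short, no trailing punctuation."""
    text = line.text.strip()
    return (
        line.bold
        and 0 < len(text.split()) <= max_words
        and not text.endswith((".", ",", ";", ":"))
    )


def accumulate(lines: list[Line]) -> list[str]:
    """Join regular lines into paragraphs, keeping pseudo-headings as separate blocks."""
    blocks: list[str] = []
    current: list[str] = []
    for line in lines:
        text = line.text.strip()
        if not text:
            continue
        if looks_like_heading(line):
            # Close the paragraph collected so far before emitting the heading.
            if current:
                blocks.append(" ".join(current))
                current = []
            blocks.append(f"**{text}**")
        else:
            current.append(text)
    if current:
        blocks.append(" ".join(current))
    return blocks
```

With the demo text above, the bold line would come out as its own block between the two runs of regular text. The thresholds would likely need tuning against real documents.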

skepticalwaves · 2021-03-06T20:58:48Z

Following up on this topic: in the documents I've tested, paragraphs that break mid-sentence across pages don't get accumulated back into a single paragraph after conversion.

Ex:

Text:


 A second factor was the good natural defenses of Oyo-Ile’s ultimate site. After the kingdom’s early 


relocations, it finally returned to settle on the site that was destined to be its home until the 1830s.

In addition, the paragraph that breaks over the page has a one-space indent compared to other paragraphs.
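A rough sketch of how the page-break case might be handled, assuming the converter already yields one string per paragraph: merge a paragraph into the previous one when the previous one does not end in sentence-final punctuation and the next one starts with a lowercase letter. The function names and the punctuation set are assumptions for illustration, not part of pdf2md.

```python
from __future__ import annotations


def continues_previous(prev: str, nxt: str) -> bool:
    """Guess whether `nxt` continues the sentence left open at the end of `prev`."""
    # Ignore closing quotes and brackets when looking at the last character.
    tail = prev.rstrip("\"'”’)]")
    ends_open = not tail.endswith((".", "!", "?", ":"))
    return ends_open and nxt[:1].islower()


def merge_split_paragraphs(paragraphs: list[str]) -> list[str]:
    """Rejoin paragraphs that were split mid-sentence, e.g. by a page break."""
    merged: list[str] = []
    for para in paragraphs:
        text = para.strip()  # also drops a stray one-space indent
        if not text:
            continue
        if merged and continues_previous(merged[-1], text):
            merged[-1] = f"{merged[-1]} {text}"
        else:
            merged.append(text)
    return merged
```

Applied to the example above, the two fragments would be joined at "early relocations", and stripping each paragraph removes the one-space indent. The lowercase check is deliberately conservative, so a continuation that begins with a proper noun would be left unmerged.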
