
More intelligent accumulation of text into para #9

Open
LoneRifle opened this issue Jun 25, 2019 · 3 comments

Comments

@LoneRifle
Collaborator

Our fork of pdf2md accumulates all text in each block into a single contiguous paragraph (2d6d889). While this is helpful in most cases, the approach does not distinguish between regular text and, say, headings, which should each be placed on their own line.

@jenlky

jenlky commented Sep 24, 2019

Hi, I tried the library and got the results below. It managed to convert the chapter headers into bold text.

What do you mean by "approach does not distinguish between regular text and headings"? Do you mean the output should put a heading such as "The Image of the Malays until the Time of Raffles" on a single line? It currently does so inconsistently for some of the subsequent chapter headers in bold.

The Myth of the Lazy Native pdf

Markdown snippet

@LoneRifle
Collaborator Author

LoneRifle commented Sep 30, 2019

Hi @jenlky, as you have observed, headings are properly recognised and converted. Where it does less well is with text that is bolded and on its own line.

Using this as an example...
pdf2md parsing demo.pdf

One gets...

# Headings Work 

And pdf2md can distinguish between normal text and the heading like so. 

## Sub-Headings Work 

And again, pdf2md can distinguish between this and the heading like so. But.... the moment you have something like... Some artificially generated heading created as normal text like this Then the heading will end up as part of the text 
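One possible way to handle this case is sketched below, in Python for brevity. It is not pdf2md's actual code: the `Line` record, its `bold` flag and the `accumulate` function are hypothetical names, and the sketch assumes the parser can already tell whether a whole line is set in a bold font. Regular lines are joined as before, but a fully bold line flushes the current paragraph and is emitted as its own block.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """Hypothetical parsed line; not a pdf2md type."""
    text: str
    bold: bool = False  # assumed: True when the whole line uses a bold font


def accumulate(lines):
    """Join consecutive regular lines into paragraphs.

    A fully bold line ends the paragraph being built and becomes
    a block of its own, so it is never merged into body text.
    """
    blocks = []
    buffer = []
    for line in lines:
        text = line.text.strip()
        if not text:
            continue  # ignore empty lines
        if line.bold:
            if buffer:
                blocks.append(" ".join(buffer))
                buffer = []
            blocks.append(f"**{text}**")
        else:
            buffer.append(text)
    if buffer:
        blocks.append(" ".join(buffer))
    return blocks


demo = [
    Line("But the moment you have something like..."),
    Line("Some artificially generated heading", bold=True),
    Line("Then the heading will end up"),
    Line("as part of the text"),
]
for block in accumulate(demo):
    print(block)
    print()
```

Whether such a line should be emitted as bold text or promoted to a real heading level is a separate design choice; the sketch only keeps it out of the surrounding paragraph.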

@skepticalwaves

Following up on this topic: in the documents I've tested, paragraphs that break mid-sentence across pages are not accumulated back into a single paragraph after conversion.

Ex:
image

Text:


 A second factor was the good natural defenses of Oyo-Ile’s ultimate site. After the kingdom’s early 


relocations, it finally returned to settle on the site that was destined to be its home until the 1830s.

In addition, the paragraph that breaks over the page has a one-space indent compared to other paragraphs.
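A rough heuristic for rejoining such paragraphs is sketched below in Python. It is an assumption-laden illustration, not pdf2md's implementation: `merge_across_pages` and `SENTENCE_END` are hypothetical names, and the input is assumed to be a list of pages, each a list of paragraph strings. The first paragraph on a page is appended to the last paragraph of the previous page when that one does not end in sentence-final punctuation and the new one starts with a lowercase letter. Stripping whitespace also removes the stray one-space indent.

```python
import re

# Sentence-final punctuation, optionally followed by closing quotes or brackets.
SENTENCE_END = re.compile(r"[.!?:][\"'”’)\]]*$")


def merge_across_pages(pages):
    """Flatten pages into paragraphs, rejoining ones split by a page break.

    pages: list of pages, each a list of paragraph strings (assumed shape).
    """
    merged = []
    for page in pages:
        first_on_page = True
        for paragraph in page:
            text = paragraph.strip()  # also drops the stray leading space
            if not text:
                continue
            continues_previous = (
                first_on_page
                and bool(merged)
                and not SENTENCE_END.search(merged[-1])
                and text[0].islower()
            )
            if continues_previous:
                merged[-1] = f"{merged[-1]} {text}"
            else:
                merged.append(text)
            first_on_page = False
    return merged


pages = [
    [" A second factor was the good natural defenses of Oyo-Ile’s ultimate site. After the kingdom’s early "],
    ["relocations, it finally returned to settle on the site that was destined to be its home until the 1830s."],
]
for paragraph in merge_across_pages(pages):
    print(paragraph)
```

The heuristic will miss continuations that begin with a capitalised word (a proper noun, for example), so it is only a starting point.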
