More intelligent accumulation of text into para #9
Comments
Hi, I tried the library and got the results below. It managed to convert the chapter headers into bold text. What do you mean by "approach does not distinguish between regular text and headings"? Do you mean the output should look like "The Image of the Malays until the Time of Raffles" on one line? It currently does so inconsistently for some of the subsequent chapter headers in bold.
Hi @jenlky, as you have observed, headings are properly recognised and converted. Where it doesn't do so well is with text that is bolded and on its own line. Using this as an example, one gets:

# Headings Work
And pdf2md can distinguish between normal text and the heading like so.
## Sub-Headings Work
And again, pdf2md can distinguish between this and the heading like so.

But the moment you have something like:

Some artificially generated heading created as normal text like this

then the heading will end up as part of the text.
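One way to picture the behaviour being asked for is a heuristic that keeps a short, fully bold line out of the accumulated paragraph. The sketch below is only an illustration in Python, not pdf2md's code: the `Line` type, its `bold` flag, the `looks_like_heading` helper and the 12-word limit are all hypothetical assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Line:
    """Hypothetical model of one extracted text line (not a pdf2md type)."""
    text: str
    bold: bool = False  # assumed: True when the whole line uses a bold font


def looks_like_heading(line: Line, max_words: int = 12) -> bool:
    """Guess that a short, fully bold line without closing punctuation is a heading."""
    text = line.text.strip()
    if not text or not line.bold:
        return False
    return len(text.split()) <= max_words and not text.endswith((".", ",", ";", ":"))


def accumulate(lines: List[Line]) -> List[str]:
    """Join regular lines into paragraphs, keeping heading-like lines separate."""
    blocks: List[str] = []
    buffer: List[str] = []
    for line in lines:
        if looks_like_heading(line):
            if buffer:
                # Close the paragraph collected so far before emitting the heading.
                blocks.append(" ".join(buffer))
                buffer = []
            blocks.append(f"**{line.text.strip()}**")
        elif line.text.strip():
            buffer.append(line.text.strip())
    if buffer:
        blocks.append(" ".join(buffer))
    return blocks
```

A heuristic like this will misclassify short bold sentences that are not headings, so the word limit and punctuation check would need tuning against real documents.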
Following up on this topic: in the documents I've tested, paragraphs that break mid-sentence across pages don't get accumulated back into a single paragraph after conversion.
In addition, the paragraph that breaks over the page has a one-space indent compared to other paragraphs.
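A rough sketch of how paragraphs split by a page break might be rejoined, written in Python purely for illustration. It assumes the converter yields a list of paragraph strings per page; `merge_across_pages` and `ends_sentence` are hypothetical names, not part of pdf2md. The rule used here is a guess: merge only when the previous paragraph has no closing punctuation and the first paragraph on the new page starts with a lowercase letter.

```python
from typing import List


def ends_sentence(text: str) -> bool:
    """True if the text ends with ., ! or ?, ignoring trailing quotes and brackets."""
    return text.rstrip("\"')]").endswith((".", "!", "?"))


def merge_across_pages(pages: List[List[str]]) -> List[str]:
    """Flatten per-page paragraphs, rejoining a sentence split by a page break."""
    merged: List[str] = []
    for paragraphs in pages:
        first_on_page = True
        for paragraph in paragraphs:
            text = paragraph.strip()  # also drops the stray one-space indent
            if not text:
                continue
            if (first_on_page and merged
                    and not ends_sentence(merged[-1])
                    and text[0].islower()):
                # Looks like a continuation: append to the previous paragraph.
                merged[-1] = f"{merged[-1]} {text}"
            else:
                merged.append(text)
            first_on_page = False
    return merged
```

The lowercase check is deliberately conservative: it avoids gluing a new page's first paragraph onto a heading, but it misses continuations that begin with a proper noun.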
Our fork of pdf2md accumulates all text into a single contiguous paragraph in each block (2d6d889). While this is helpful in most cases, the approach does not distinguish between regular text and, say, headings, which should be placed on their own lines.