Text boldness cannot be detected (bug?) #522
-
Hi @ozturkib, thanks for your interest in the library. This PDF looks like an OCR-ed document, in which case the OCR software may not have assigned the bold font correctly. You won't find any Times-Bold usage on the first page, but you will on the fourth page.
-
Indeed, the PDF is OCR'ed: what you see are the scanned bitmaps of the original paper pages, with the OCR text drawn invisibly on top of them. Whether or not a bold font is used therefore does not necessarily relate in any way to the visible writing on the bitmap. Also, even in PDFs that are not OCR'ed, you cannot completely count on font names giving away style information. It is good style, and common, to keep the original names of the fonts used (which usually contain something like 'bold' for bold fonts). But on the one hand, fonts can be given anonymized names like A, B, C, ..., and on the other hand, there are poor man's bold techniques that make text in non-bold fonts look bold.
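As a rough illustration of one such poor man's bold technique, drawing the same glyph twice with a tiny offset, here is a minimal sketch that counts overprinted characters. It assumes character dictionaries shaped like the entries of pdfplumber's `page.chars` (keys `text`, `x0`, `top`). The function name `count_overprinted_chars` and the `tolerance` value are hypothetical choices, and the check does not catch other fake-bold methods such as filling and stroking the glyph outline.

```python
def count_overprinted_chars(chars, tolerance=1.0):
    """Count characters drawn almost exactly on top of an identical earlier one.

    `chars` is a list of dicts with "text", "x0" and "top" keys.
    `tolerance` is in PDF points and is an assumed, untuned value.
    """
    seen = []
    hits = 0
    for char in chars:
        for earlier in seen:
            same_glyph = char["text"] == earlier["text"]
            same_spot = (
                abs(char["x0"] - earlier["x0"]) <= tolerance
                and abs(char["top"] - earlier["top"]) <= tolerance
            )
            if same_glyph and same_spot:
                hits += 1
                break
        else:
            # Not a near-duplicate of anything seen so far: remember it.
            seen.append(char)
    return hits


# Made-up sample data: "A" is drawn twice, 0.3 pt apart, to look bold.
fake_bold = [
    {"text": "A", "x0": 10.0, "top": 5.0},
    {"text": "A", "x0": 10.3, "top": 5.0},
    {"text": "B", "x0": 18.0, "top": 5.0},
]
regular = [
    {"text": "A", "x0": 10.0, "top": 5.0},
    {"text": "B", "x0": 18.0, "top": 5.0},
]
```

A high count relative to the number of characters on a line would hint at double-struck text, but on an OCR'ed scan like this one the invisible text layer still says nothing about how the bitmap looks.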
-
Hello there,
The scenario below seems to be a bug on the pdfplumber side.
My algorithm depends heavily on detecting whether a font is bold, because I would like to label each line as title or body. I am detecting the boldness as follows. It works for the majority of files; however, it fails for a few of them. I placed an example at the bottom. The font name of the first line's text (just as an example) is Times-Roman, which does not indicate any boldness, unlike in other files. Here, I am expecting something like Times-Bold for titles. How can I tell whether the font is bold or not in this example document?
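For reference, a font-name-based check of the kind described here might look like the sketch below. It is an illustration only, not the code from the original post. It assumes character dictionaries shaped like the entries of pdfplumber's `page.chars` (keys `text` and `fontname`). The helper names, the `BOLD_HINTS` list and the 0.5 threshold are hypothetical.

```python
import re

# Substrings assumed to signal a bold face; this list is a guess, not exhaustive.
BOLD_HINTS = ("bold", "black", "heavy")


def strip_subset_prefix(fontname):
    """Drop the six-letter subset tag of embedded fonts, e.g. "ABCDEF+Times-Bold"."""
    return re.sub(r"^[A-Z]{6}\+", "", fontname)


def looks_bold(fontname):
    """Return True if the font name contains one of the assumed bold hints."""
    name = strip_subset_prefix(fontname).lower()
    return any(hint in name for hint in BOLD_HINTS)


def line_is_bold(chars, threshold=0.5):
    """Label a line bold when more than `threshold` of its visible chars look bold."""
    visible = [c for c in chars if c.get("text", "").strip()]
    if not visible:
        return False
    bold_count = sum(looks_bold(c.get("fontname", "")) for c in visible)
    return bold_count / len(visible) > threshold


# Made-up sample lines for illustration.
title_line = [
    {"text": "T", "fontname": "ABCDEF+Times-Bold"},
    {"text": " ", "fontname": "Times-Roman"},
    {"text": "i", "fontname": "Times-Bold"},
]
body_line = [{"text": "a", "fontname": "Times-Roman"}]
```

A check like this can only report what the font name says, so a title whose text layer uses Times-Roman is labeled as body text.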