Text boldness cannot be detected (bug?) #522

ozturkib · 2021-10-26T15:33:07Z

Hello there,
The scenario below seems to be a bug on the pdfplumber side.

My algorithm depends heavily on detecting whether a font is bold, because I want to label each line as a title or as body text. I detect boldness as shown below. This works for the majority of files, but it fails on a few of them. I have placed an example at the bottom. In that file, the font name of the first line's text (just as an example) is Times-Roman, which does not indicate any boldness, unlike in other files. I would expect something like Times-Bold for titles. How can I tell whether the font is bold in this example document?

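import io
import urllib.request

import pdfplumber

pdf_url = "https://file.io/X88UZHxCWTsp"

# Download the PDF into memory and open it with pdfplumber
pdf_bytes = urllib.request.urlopen(pdf_url).read()
with pdfplumber.open(io.BytesIO(pdf_bytes)) as pdf:
    for pdf_page in pdf.pages:
        # Keep each word's font name so it can be inspected below
        words = pdf_page.extract_words(x_tolerance=3, y_tolerance=10, extra_attrs=["fontname"])
        for word in words:
            if "Bold" in word["fontname"]:
                print(word, "is bold")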

samkit-jain · 2021-10-26T16:29:21Z

Collaborator

Hi @ozturkib, I appreciate your interest in the library. This PDF looks like an OCR'ed document, in which case the OCR software may not have assigned the bold font correctly. You won't find any Times-Bold usage on the first page, but you will on the fourth page.

2 replies

ozturkib Oct 26, 2021
Author

@samkit-jain thanks for your quick answer. Yes, my code also finds the bold font on the fourth page. What do you mean by an OCR'ed document? The PDF file does not have an embedded image in it; it contains text.

Do you have any suggestion for labeling all titles properly in this specific example, inside or outside pdfplumber?
If there is no way to label the titles properly, how can I handle this kind of file?
Thanks again

samkit-jain Oct 27, 2021
Collaborator

Nope, I have no other way to identify the bold text.

mkl-public · 2021-10-26T17:11:19Z

Indeed, the PDF is OCR'ed: what you see are the scanned bitmaps of the original paper pages, and the OCR software added text drawn invisibly on top of them. Thus, whether or not a bold font is used does not necessarily relate in any way to the visible writing on the bitmap.
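One rough way to flag such pages is to check whether a single image covers most of the page while extractable text is also present. The sketch below is only a heuristic under stated assumptions: the helper names `looks_like_scanned_page` and `scanned_pages` and the 0.85 coverage threshold are illustrative, not part of pdfplumber. It relies only on `page.width`, `page.height`, `page.images`, `page.chars`, and `page.page_number`.

```python
def looks_like_scanned_page(page_width, page_height, images, n_chars, min_coverage=0.85):
    """Heuristic: True if one image covers most of the page and text is present.

    `images` is a list of dicts with "x0", "x1", "top", "bottom" keys,
    which is the shape of the entries in pdfplumber's `page.images`.
    """
    page_width, page_height = float(page_width), float(page_height)
    page_area = page_width * page_height
    if page_area <= 0 or n_chars == 0:
        return False
    for img in images:
        # Clip the image box to the page before measuring its area.
        w = max(0.0, min(float(img["x1"]), page_width) - max(float(img["x0"]), 0.0))
        h = max(0.0, min(float(img["bottom"]), page_height) - max(float(img["top"]), 0.0))
        if (w * h) / page_area >= min_coverage:
            return True
    return False


def scanned_pages(path):
    """Return the page numbers of `path` that look like scans with an OCR text layer."""
    import pdfplumber  # imported lazily so the helper above has no dependencies

    with pdfplumber.open(path) as pdf:
        return [
            page.page_number
            for page in pdf.pages
            if looks_like_scanned_page(page.width, page.height, page.images, len(page.chars))
        ]
```

A scanner that stores the page as several image strips would not be caught by this check, so a negative result does not prove the page is not a scan.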

Also, even in PDFs that are not OCR'ed, you cannot fully count on font names to give away style information. It is good style, and common, to use the original names of the fonts (which usually contain something like 'bold' for bold fonts). But on the one hand, one can also use anonymized names like A, B, C, ... for one's fonts, and on the other hand, there are poor man's bold techniques that make text in non-bold fonts look bold.
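To make these two caveats concrete, here is a minimal sketch of two heuristics. The first checks a font name for weight hints after dropping a subset prefix such as `ABCDEF+`. The second looks for one poor man's bold variant, the same glyph drawn twice at a tiny offset. The function names, the hint list, and the 1-point offset are assumptions chosen for illustration; the char dicts use the `text`, `x0`, and `top` keys found in pdfplumber's `page.chars`. Neither check helps with anonymized font names, and the second does not cover bold simulated by stroking the glyph outline.

```python
import re

# Substrings that commonly appear in the names of heavy font weights (illustrative list).
BOLD_HINTS = ("bold", "black", "heavy", "demi")


def fontname_suggests_bold(fontname):
    """True if the font name, minus any subset prefix like 'ABCDEF+', hints at a bold weight."""
    base = re.sub(r"^[A-Z]{6}\+", "", fontname)
    return any(hint in base.lower() for hint in BOLD_HINTS)


def double_struck_chars(chars, max_offset=1.0):
    """Return chars that have an identical glyph drawn again within `max_offset` points.

    Quadratic in the number of chars, which is acceptable for a single page.
    """
    hits = []
    for i, a in enumerate(chars):
        if not a["text"].strip():
            continue  # ignore whitespace glyphs
        for b in chars[i + 1:]:
            if (
                a["text"] == b["text"]
                and abs(a["x0"] - b["x0"]) <= max_offset
                and abs(a["top"] - b["top"]) <= max_offset
            ):
                hits.append(a)
                break
    return hits
```

With pdfplumber, `double_struck_chars(page.chars)` could be run per page, and `fontname_suggests_bold(word["fontname"])` could replace a plain `'Bold' in word['fontname']` test.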

0 replies
