AssertionError: ('Unhandled', 6), only with some PDFs on some pages #994

Rellatte · 2023-09-20T12:48:30Z

Hello,

I am trying to read the text in some PDFs. Usually PDFplumber works very well and this is my code for it:
temp_pdf = pdfplumber.open(path)
temp_page = temp_pdf.pages[0]
temp_content = temp_page.extract_words()

Now I'm running into an AssertionError: ('Unhandled', 6) on a subset of PDFs.
This happens with most methods: extract_words, extract_table, extract_tables and extract_text, as well as to_csv, to_dict and to_json.
The function 'im = temp_page.to_image()' works and shows a good image of the pdf! Everything else seems to fail :/

Only the first page is not readable, the other 5+ are working.
The PDF can be opened with Adobe etc., and the text can be copied without problems.
I also can't find any difference between working and non-functioning PDFs, all are created with the same software and have the same layout.
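To narrow down which pages are affected, a small loop that catches the exception per page can help. The following is only a sketch: `find_failing_pages` is a hypothetical helper (not part of pdfplumber), and it assumes the error is raised when the extraction method is called on the page.

```python
from typing import Any, Callable, Iterable


def find_failing_pages(
    pages: Iterable[Any],
    extract: Callable[[Any], Any],
) -> list[tuple[int, str]]:
    """Return (page_index, error_repr) for every page where `extract` raises."""
    failures = []
    for index, page in enumerate(pages):
        try:
            extract(page)
        except Exception as exc:  # AssertionError is a subclass of Exception
            failures.append((index, repr(exc)))
    return failures


# Possible usage with pdfplumber (requires the library and a real file):
# with pdfplumber.open(path) as pdf:
#     print(find_failing_pages(pdf.pages, lambda p: p.extract_words()))
```

An empty list would mean every page was extracted without raising; each entry otherwise gives the zero-based page index and the exception text.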

Maybe someone can help me understand how or why these PDFs are not working. What could I do to a "working" PDF to make it throw the same error?

Any help is welcome :)

jsvine · 2023-09-21T02:55:23Z

Maintainer

Thanks for flagging. Two questions:

1. Do you still get the error if you load the PDF via pdfplumber.open(path, repair=True)?
2. If so, can you share the PDF? It will be much easier to debug if there's a concrete example to work with.


Rellatte · 2023-09-21T10:32:37Z

Author

Hello jsvine,

Thanks for your reply.

I tried, but for the life of me, it will not find Ghostscript :/
Maybe you know how I could make it work? The error always is: "Cannot find Ghostscript, which is required for repairs."
I have installed Ghostscript for Win 64-bit (running Win11), through the exe from the Ghostscript homepage.
It is in the Apps, and I can open a Ghostscript window. I can check the version within PowerShell (C:\Program Files\gs\gs10.02.0\bin> .\gswin64.exe --version).
I added it to PATH.
I then installed it a second time via pip from PowerShell (.\python.exe -m pip install ghostscript).
I am able to 'import ghostscript', and ghostscript then shows <module 'ghostscript' from 'C:\Program Files\Python311\Lib\site-packages\ghostscript\__init__.py'>.
I am using Jupyter Notebook to execute python code.
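One way to check whether the notebook kernel itself can see Ghostscript is to probe PATH from inside Python. This is a rough diagnostic sketch: the helper names and the list of executable names are assumptions for illustration, not taken from pdfplumber's source. Two points to keep in mind: a process normally inherits its environment when it starts, so a PATH change made afterwards may not be visible until Jupyter is restarted, and the pip package `ghostscript` appears to be a set of Python bindings rather than the executable itself.

```python
import os
import shutil

# Executable names to probe; assumed here, not confirmed against pdfplumber's source.
CANDIDATE_NAMES = ("gs", "gswin64c", "gswin32c", "gswin64")


def locate_ghostscript(names=CANDIDATE_NAMES, path=None):
    """Map each candidate name to the full path found on PATH, or None."""
    return {name: shutil.which(name, path=path) for name in names}


def path_entries_mentioning(fragment, path=None):
    """List PATH entries that contain `fragment` (case-insensitive)."""
    raw = os.environ.get("PATH", "") if path is None else path
    return [p for p in raw.split(os.pathsep) if fragment.lower() in p.lower()]


if __name__ == "__main__":
    print(locate_ghostscript())
    print(path_entries_mentioning("gs"))
```

If every value in the first dictionary is None, the kernel process cannot find Ghostscript by name, whatever a separate PowerShell window reports.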

Unfortunately, I cannot share a PDF because they are sensitive. I will try to make a dummy PDF, which I could share, but chances are slim, and that will take some time. If I get dummy PDFs, I can also try it on my personal PC, where I have admin rights.
Thanks again for your time.


jsvine Sep 21, 2023
Maintainer

Thanks, and sorry to hear about repair not working out. What if you try the repairing instructions here, and then try pdfplumber with the repaired version?

Rellatte Sep 22, 2023
Author

Hi jsvine,

Now I am sure we are in different timezones :)

It somehow works now, mainly thanks to you (and to some degree to ChatGPT).
I think my general problem was (or is) that the full path to the Ghostscript installation has to be given explicitly. "Just" adding it to PATH did not do anything.

Here is my "final" code solution, which I can run within Jupyter Notebook.

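import subprocess
import pdfplumber

# Full path to the Ghostscript console executable (adjust to your install)
path_to_ghostscript = r"C:\Program Files\gs\gs10.02.0\bin\gswin64c.exe"
path_to_pdf = "original.pdf"  # placeholder: path to the PDF that fails

# Rewrite the PDF with Ghostscript's pdfwrite device
command = [
    path_to_ghostscript,
    '-o', 'repaired.pdf',
    '-sDEVICE=pdfwrite',
    '-dPDFSETTINGS=/prepress',
    path_to_pdf,
]

subprocess.run(command, check=True)  # raises CalledProcessError if Ghostscript fails
pdf = pdfplumber.open('repaired.pdf')  # open the repaired PDF
wörter = pdf.pages[0].extract_words()  # extract the words ("wörter") from the first page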

This "fixes" the PDF so that the words can now be extracted with PDFplumber!

I would guess that is also the reason why "pdfplumber.open(path, repair=True)" did not find Ghostscript.
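The Ghostscript call above can be wrapped in a reusable function that first tries to find the executable by name and only then falls back to an explicit path. This is a minimal sketch under assumptions: `build_repair_command` and `repair_pdf` are hypothetical helper names, the Ghostscript flags are the ones from the code above, and the executable names tried are a guess.

```python
import shutil
import subprocess
from pathlib import Path


def build_repair_command(gs_executable, source_pdf, repaired_pdf):
    """Build the Ghostscript argument list used in the snippet above."""
    return [
        str(gs_executable),
        "-o", str(repaired_pdf),
        "-sDEVICE=pdfwrite",
        "-dPDFSETTINGS=/prepress",
        str(source_pdf),
    ]


def repair_pdf(source_pdf, repaired_pdf, gs_executable=None):
    """Rewrite `source_pdf` through Ghostscript and return the output path."""
    # Prefer an explicit path; otherwise try names that may be on PATH (assumed).
    exe = gs_executable or shutil.which("gswin64c") or shutil.which("gs")
    if exe is None:
        raise FileNotFoundError(
            "Ghostscript executable not found; pass gs_executable explicitly"
        )
    subprocess.run(build_repair_command(exe, source_pdf, repaired_pdf), check=True)
    return Path(repaired_pdf)


# Possible usage (requires Ghostscript and a real file):
# repaired = repair_pdf("original.pdf", "repaired.pdf",
#                       gs_executable=r"C:\Program Files\gs\gs10.02.0\bin\gswin64c.exe")
```

Some pdfplumber releases may also accept a Ghostscript path directly alongside repair=True (possibly a `gs_path` argument); that is worth checking in the documentation for the installed version before relying on a manual step.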

Thanks again for your time and help!!
Tob

jsvine Sep 22, 2023
Maintainer

Great, and thanks for following up. Yes, it seems that the issue was that the original PDF was malformed, and that repairing fixed this.
