Troubles with images extraction #1207

rerik · 2024-09-26T16:05:21Z

Describe the bug

It's a two-in-one problem.

First, the raw image data (for example, doc.pages[1].images[0]['stream'].rawdata) is broken. Opening it with PIL, PIL.Image.open(io.BytesIO(doc.pages[1].images[0]['stream'].rawdata)), raises UnidentifiedImageError('cannot identify image file <_io.BytesIO object at 0x7f62058f5df0>'). If I save the image bytes directly, the resulting file is broken and cannot be opened.

with open('image.jpg', 'wb') as file:
    file.write(doc.pages[1].images[0]['stream'].rawdata)

I've tried getting the raw bytes of this image with the pypdf library. Its output contains about twice as many bytes and can easily be saved, so it's not a fundamental problem with the image itself.
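A quick way to see why PIL cannot identify the bytes is to look at how they start. An image stream in a PDF is only a standalone image file when its filter is itself a file format, such as DCTDecode (JPEG); with FlateDecode the stream holds zlib-compressed pixel samples with no image header. The sketch below uses only the standard library; `sniff_image_format` is a hypothetical helper, not part of pdfplumber or PIL, and whether this explains the file in this report is an assumption.

```python
import zlib


def sniff_image_format(data: bytes) -> str:
    """Guess a container format from the leading magic bytes."""
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data[:4] in (b"II*\x00", b"MM\x00*"):
        return "tiff"
    # Anything else (including zlib-compressed samples) has no header
    # that PIL.Image.open() could recognise.
    return "unknown"


# Illustrative inputs, not taken from the PDF in this issue:
jpeg_like = b"\xff\xd8\xff\xe0" + b"\x00" * 8      # starts like a JPEG file
flate_like = zlib.compress(b"\x00" * 12)           # like a FlateDecode stream

print(sniff_image_format(jpeg_like))
print(sniff_image_format(flate_like))
```

Passing doc.pages[1].images[0]['stream'].rawdata to such a helper, and checking the 'Filter' entry in the stream's attrs, would show which case applies.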

Second, if I try to save a crop using this image's bbox, it misses the image.

doc.pages[1].crop((
    doc.pages[1].images[0]['x0'],
    doc.pages[1].images[0]['top'], 
    doc.pages[1].images[0]['x1'], 
    doc.pages[1].images[0]['bottom']
)).to_image(resolution=300).save('img.jpg')

This code saves [screenshot: actual output] instead of [screenshot: expected image].

Have you tried repairing the PDF?

Yes, I've tried. In that case it just crashes on opening:

Traceback (most recent call last):
  File "/home/alex/AlanNLP/qna/test.py/test_new_pdf_parser.py", line 60, in <module>
    result = parse(io.BytesIO(response.content), images_dir, images_url, SOURCE, pages_cache_file=PAGES_CACHE, print_progress=True)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/alex/AlanNLP/qna/src.py/pdf_parser_new/parser.py", line 397, in parse
    doc = pp.open(file, repair=True)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/alex/AlanNLP/venv/3.11/lib/python3.11/site-packages/pdfplumber/pdf.py", line 84, in open
    stream = _repair(
             ^^^^^^^^
  File "/home/alex/AlanNLP/venv/3.11/lib/python3.11/site-packages/pdfplumber/repair.py", line 58, in _repair
    raise Exception(f"{stderr.decode('utf-8')}")
Exception: GPL Ghostscript 9.50 (2019-10-15)
Copyright (C) 2019 Artifex Software, Inc.  All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
   **** Error: Incorrect object count in object stream.
               Output may be incorrect.
Error: /rangecheck in resolveobjectstream
Operand stack:
   (/tmp/gs_ujCNhG)   --nostringval--   --dict:1/100(L)--   2511   4207859   13   2511   3645   --dict:8/15(L)--   150   --nostringval--   163   --nostringval--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:4/4(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:7/7(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:5/5(L)--   --dict:8/8(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:5/5(L)--   --dict:3/3(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:8/8(L)--   --dict:3/3(L)--   --dict:8/8(L)--   --dict:8/8(L)--   --dict:4/4(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:8/8(L)--   --dict:4/4(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:8/8(L)--   --dict:8/8(L)--   --dict:4/4(L)--   --dict:4/4(L)--   --dict:5/5(L)--   --dict:8/8(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:8/8(L)--   --dict:4/4(L)--   --dict:4/4(L)--   --dict:4/4(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:7/7(L)--   --dict:4/4(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:8/8(L)--   --dict:3/3(L)--   
--dict:8/8(L)--   --dict:4/4(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:8/8(L)--   --dict:4/4(L)--   --dict:4/4(L)--   --dict:8/8(L)--   --dict:4/4(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:8/8(L)--   --dict:8/8(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:8/8(L)--   --dict:4/4(L)--   --dict:4/4(L)--   --dict:3/3(L)--   --dict:7/7(L)--   --dict:4/4(L)--   --dict:4/4(L)--   --dict:8/8(L)--   --dict:8/8(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:5/5(L)--   --dict:8/8(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:8/8(L)--   --dict:3/3(L)--   --dict:4/4(L)--   --dict:3/3(L)--   --dict:8/8(L)--   --dict:4/4(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:8/8(L)--   --dict:8/8(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:4/4(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:5/5(L)--   --dict:4/4(L)--
Execution stack:
   %interp_exit   .runexec2   --nostringval--   runpdf   --nostringval--   2   %stopped_push   --nostringval--   runpdf   runpdf   false   1   %stopped_push   1990   1   3   %oparray_pop   1989   1   3   %oparray_pop   1977   1   3   %oparray_pop   runpdf   1978   3   3   %oparray_pop   runpdf   runpdf   runpdf   runpdf   runpdf   runpdf   runpdf   runpdf
Dictionary stack:
   --dict:731/1123(ro)(G)--   --dict:1/20(G)--   --dict:80/200(L)--   --dict:80/200(L)--   --dict:135/256(ro)(G)--   --dict:315/325(ro)(G)--   --dict:29/32(L)--
Current allocation mode is local
GPL Ghostscript 9.50: Unrecoverable error, exit code 1

Environment

pdfplumber version: 0.11.4
Python version: 3.11.9
OS: Ubuntu 20.04.6 LTS on Windows 10 x86_64


jsvine · 2024-10-03T02:57:39Z

Hi @rerik, and thanks for your interest in pdfplumber. Can you share the PDF and a minimal Python script that reproduces the problem?

rerik · 2024-10-03T10:46:01Z

> Hi @rerik, and thanks for your interest in pdfplumber. Can you share the PDF and a minimal Python script that reproduces the problem?

Oh, I'm sorry, it's my bad. I was absolutely sure I gave the link to the target file: https://storage.googleapis.com/alan-ai-knowledge-base/isu-knowledge-base/Book/fundamentals-book-1-6.pdf

Minimal Python script to reproduce:

import io
import requests

import pdfplumber as pp


SOURCE = 'https://storage.googleapis.com/alan-ai-knowledge-base/isu-knowledge-base/Book/fundamentals-book-1-6.pdf'

response = requests.get(SOURCE)
doc = pp.open(io.BytesIO(response.content))
page = doc.pages[1]
image = page.images[0]

page.crop((
    image['x0'],
    image['top'], 
    image['x1'], 
    image['bottom']
)).to_image(resolution=300).save('img.jpg')
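The thread does not establish why the crop misses. One thing that can be ruled out cheaply is an image bbox that extends beyond the page's own bbox. The sketch below is only a sanity check under that assumption; `clamp_bbox` is a hypothetical helper, not a pdfplumber function, and the numbers are illustrative.

```python
def clamp_bbox(bbox, page_bbox):
    """Clip (x0, top, x1, bottom) to the page bbox; fail if they do not overlap."""
    x0, top, x1, bottom = bbox
    px0, ptop, px1, pbottom = page_bbox
    clamped = (max(x0, px0), max(top, ptop), min(x1, px1), min(bottom, pbottom))
    if clamped[0] >= clamped[2] or clamped[1] >= clamped[3]:
        raise ValueError(f"bbox {bbox} does not overlap page bbox {page_bbox}")
    return clamped


# Illustrative values only (a US Letter page and an oversized image bbox):
print(clamp_bbox((-5.0, 10.0, 700.0, 900.0), (0.0, 0.0, 612.0, 792.0)))

# With the script above, the check would be used as:
#   bbox = (image['x0'], image['top'], image['x1'], image['bottom'])
#   page.crop(clamp_bbox(bbox, page.bbox)).to_image(resolution=300).save('img.jpg')
```

If the clamped bbox equals the original one, the coordinates are at least inside the page and the cause lies elsewhere.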

jsvine · 2024-10-03T12:13:04Z

Thank you, this is very helpful. I can reproduce the issue, and will see if I can find a solution.

JiachengSun0520 · 2024-12-06T18:44:42Z

I have a similar problem when I try to read the stream using PIL.
import io

import pdfplumber
from PIL import Image

with pdfplumber.open("example.pdf") as pdf:
    # print(pdf.pages)
    page = pdf.pages[0]  # Extract the first page
    for page in pdf.pages:
        positions = []
        for im in page.images:
            p = (im['x0'], im['top'], im['x1'], im['bottom'])
            print(im)
            # get_data() returns the decoded stream bytes
            image_data = im['stream'].get_data()
            pil_image = Image.open(io.BytesIO(image_data))
            positions.append(p)
        print('positions', positions)

pil_image = Image.open(io.BytesIO(image_data))
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

PIL.UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x13ff37380>

This is the image that caused the problem:
{'x0': 12.0, 'y0': 28.32494, 'x1': 780.0, 'y1': 583.67504, 'width': 768.0, 'height': 555.3501, 'stream': <PDFStream(75): raw=241206, {'BitsPerComponent': 8, 'ColorSpace': PDFObjRef:76, 'Filter': /'FlateDecode', 'Height': 632, 'Interpolate': True, 'Length': 241206, 'Subtype': /'Image', 'Type': /'XObject', 'Width': 874}>, 'srcsize': (874, 632), 'imagemask': None, 'bits': 8, 'colorspace': [[/'ICCBased', <PDFStream(77): raw=3172, {'Alternate': /'DeviceRGB', 'Filter': /'FlateDecode', 'Length': 3172, 'N': 3}>]], 'mcid': None, 'tag': None, 'object_type': 'image', 'page_number': 8, 'top': 28.324960000000033, 'bottom': 583.67506, 'doctop': 4412.32496}

It looks like all PDF streams with 'Filter': /'FlateDecode' have this problem.
I may be wrong, but all streams in my test case with 'Filter': /'DCTDecode' are fine.

JiachengSun0520 · 2024-12-09T19:34:29Z

Solved it by using PIL.Image.frombytes().
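The comment above does not include the code that was used, so the following is only a sketch of how such a fix might look. It assumes 8-bit samples and that the decoded FlateDecode data is a plain grid of pixels; `infer_mode` and `flate_image_to_pil` are hypothetical helpers, and guessing the mode from the byte count is a heuristic, not something pdfplumber or PIL does for you.

```python
def infer_mode(data_len: int, width: int, height: int) -> str:
    """Infer a PIL mode for 8-bit samples from the decoded byte count."""
    pixels = width * height
    if pixels == 0 or data_len % pixels:
        raise ValueError("byte count is not a whole number of bytes per pixel")
    modes = {1: "L", 3: "RGB", 4: "CMYK"}
    components = data_len // pixels
    if components not in modes:
        raise ValueError(f"unsupported component count: {components}")
    return modes[components]


def flate_image_to_pil(im):
    """Build a PIL image from a pdfplumber image dict with raw pixel samples."""
    from PIL import Image  # third-party; imported lazily so infer_mode stays stdlib-only

    width, height = im["srcsize"]           # pixel size, e.g. (874, 632) above
    data = im["stream"].get_data()          # decoded (inflated) sample bytes
    mode = infer_mode(len(data), width, height)
    return Image.frombytes(mode, (width, height), data)


# For the image reported above (874 x 632, three colour components):
print(infer_mode(874 * 632 * 3, 874, 632))
```

Streams with 'Filter': /'DCTDecode' need none of this, since their bytes are already a JPEG file that Image.open() can read.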

rerik added the bug label Sep 26, 2024

jsvine added the awaiting-code-or-pdf label (Issues and PRs awaiting code and/or a PDF from issue/PR-author) Oct 3, 2024

mratanusarkar mentioned this issue Oct 19, 2024

need info working with page.images #1217

Open
