Troubles with images extraction #1207

Open
rerik opened this issue Sep 26, 2024 · 5 comments
Labels
awaiting-code-or-pdf Issues and PRs awaiting code and/or a PDF from issue/PR-author bug

Comments


rerik commented Sep 26, 2024

Describe the bug

This is a two-in-one problem.

First, the raw image data (for example, doc.pages[1].images[0]['stream'].rawdata) appears to be broken. Calling PIL.Image.open(io.BytesIO(doc.pages[1].images[0]['stream'].rawdata)) raises UnidentifiedImageError('cannot identify image file <_io.BytesIO object at 0x7f62058f5df0>'). If I write the image bytes directly to a file, the file is broken and cannot be opened.

with open('image.jpg', 'wb') as file:
    file.write(doc.pages[1].images[0]['stream'].rawdata)

I also extracted the raw bytes of this image with the pypdf library. That output contains roughly twice as many bytes and can be saved easily, so the problem is not in the image itself.
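A quick way to see what the extracted bytes actually are is to look at their leading signature. The sketch below is only a diagnostic aid: `sniff_format` is a hypothetical helper, and it is an assumption (not confirmed for this file) that the stream holds compressed pixel data rather than a complete JPEG.

```python
import zlib

JPEG_MAGIC = b"\xff\xd8\xff"
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"


def sniff_format(data: bytes) -> str:
    """Guess the container of a byte string from its first bytes."""
    if data.startswith(JPEG_MAGIC):
        return "jpeg"
    if data.startswith(PNG_MAGIC):
        return "png"
    # A zlib header has compression method 8 in the low nibble of byte 0,
    # and the first two bytes read as a big-endian number divisible by 31.
    if len(data) >= 2 and (data[0] & 0x0F) == 8 and (data[0] * 256 + data[1]) % 31 == 0:
        return "zlib"
    return "unknown"


# Simulated pixel data compressed with zlib, standing in for a Flate-encoded stream.
sample = zlib.compress(bytes(range(24)))
print(sniff_format(sample))
```

If the result is "zlib" or "unknown", writing the bytes to a .jpg file cannot produce a viewable image, whatever the extension says.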

Second, if I crop the page to this image's bbox and save the result, the crop misses the image.

doc.pages[1].crop((
    doc.pages[1].images[0]['x0'],
    doc.pages[1].images[0]['top'], 
    doc.pages[1].images[0]['x1'], 
    doc.pages[1].images[0]['bottom']
)).to_image(resolution=300).save('img.jpg')

This code saves [image: img_5] instead of [image: img_4].
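The cause of the misplaced crop is not established in this thread. One sanity check is to recompute the top-left-origin coordinates from the PDF's bottom-left-origin `y0`/`y1` and compare them with the reported `top`/`bottom`. The sketch below assumes an unrotated page whose lower-left corner sits at the origin; the helper names are hypothetical, and the page height of 612 is inferred from the image dictionary quoted later in the thread, not stated in it.

```python
def top_bottom_from_pdf_coords(y0, y1, page_height):
    """Convert bottom-left-origin y0/y1 into top-left-origin top/bottom."""
    return page_height - y1, page_height - y0


def crop_box(image, page_height, tol=0.01):
    """Return (x0, top, x1, bottom) plus whether it agrees with the reported values."""
    top, bottom = top_bottom_from_pdf_coords(image["y0"], image["y1"], page_height)
    consistent = (
        abs(top - image["top"]) <= tol and abs(bottom - image["bottom"]) <= tol
    )
    return (image["x0"], top, image["x1"], bottom), consistent


# Values copied from the image dictionary posted later in this thread.
image = {
    "x0": 12.0, "x1": 780.0,
    "y0": 28.32494, "y1": 583.67504,
    "top": 28.324960000000033, "bottom": 583.67506,
}
box, consistent = crop_box(image, page_height=612)
print(box, consistent)
```

If the recomputed and reported values disagree for the affected page, the coordinate conversion (for example, a page box that does not start at the origin) would be a reasonable place to look.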

Have you tried repairing the PDF?

Yes, I've tried. In that case it simply crashes on opening:

Traceback (most recent call last):
  File "/home/alex/AlanNLP/qna/test.py/test_new_pdf_parser.py", line 60, in <module>
    result = parse(io.BytesIO(response.content), images_dir, images_url, SOURCE, pages_cache_file=PAGES_CACHE, print_progress=True)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/alex/AlanNLP/qna/src.py/pdf_parser_new/parser.py", line 397, in parse
    doc = pp.open(file, repair=True)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/alex/AlanNLP/venv/3.11/lib/python3.11/site-packages/pdfplumber/pdf.py", line 84, in open
    stream = _repair(
             ^^^^^^^^
  File "/home/alex/AlanNLP/venv/3.11/lib/python3.11/site-packages/pdfplumber/repair.py", line 58, in _repair
    raise Exception(f"{stderr.decode('utf-8')}")
Exception: GPL Ghostscript 9.50 (2019-10-15)
Copyright (C) 2019 Artifex Software, Inc.  All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
   **** Error: Incorrect object count in object stream.
               Output may be incorrect.
Error: /rangecheck in resolveobjectstream
Operand stack:
   (/tmp/gs_ujCNhG)   --nostringval--   --dict:1/100(L)--   2511   4207859   13   2511   3645   --dict:8/15(L)--   150   --nostringval--   163   --nostringval--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:4/4(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:7/7(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:5/5(L)--   --dict:8/8(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:5/5(L)--   --dict:3/3(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:8/8(L)--   --dict:3/3(L)--   --dict:8/8(L)--   --dict:8/8(L)--   --dict:4/4(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:8/8(L)--   --dict:4/4(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:8/8(L)--   --dict:8/8(L)--   --dict:4/4(L)--   --dict:4/4(L)--   --dict:5/5(L)--   --dict:8/8(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:8/8(L)--   --dict:4/4(L)--   --dict:4/4(L)--   --dict:4/4(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:7/7(L)--   --dict:4/4(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:8/8(L)--   --dict:3/3(L)--   
--dict:8/8(L)--   --dict:4/4(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:8/8(L)--   --dict:4/4(L)--   --dict:4/4(L)--   --dict:8/8(L)--   --dict:4/4(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:8/8(L)--   --dict:8/8(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:8/8(L)--   --dict:4/4(L)--   --dict:4/4(L)--   --dict:3/3(L)--   --dict:7/7(L)--   --dict:4/4(L)--   --dict:4/4(L)--   --dict:8/8(L)--   --dict:8/8(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:5/5(L)--   --dict:8/8(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:8/8(L)--   --dict:3/3(L)--   --dict:4/4(L)--   --dict:3/3(L)--   --dict:8/8(L)--   --dict:4/4(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:8/8(L)--   --dict:8/8(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:4/4(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:5/5(L)--   --dict:4/4(L)--
Execution stack:
   %interp_exit   .runexec2   --nostringval--   runpdf   --nostringval--   2   %stopped_push   --nostringval--   runpdf   runpdf   false   1   %stopped_push   1990   1   3   %oparray_pop   1989   1   3   %oparray_pop   1977   1   3   %oparray_pop   runpdf   1978   3   3   %oparray_pop   runpdf   runpdf   runpdf   runpdf   runpdf   runpdf   runpdf   runpdf
Dictionary stack:
   --dict:731/1123(ro)(G)--   --dict:1/20(G)--   --dict:80/200(L)--   --dict:80/200(L)--   --dict:135/256(ro)(G)--   --dict:315/325(ro)(G)--   --dict:29/32(L)--
Current allocation mode is local
GPL Ghostscript 9.50: Unrecoverable error, exit code 1

Environment

  • pdfplumber version: 0.11.4
  • Python version: 3.11.9
  • OS: Ubuntu 20.04.6 LTS on Windows 10 x86_64
@rerik rerik added the bug label Sep 26, 2024
@jsvine jsvine added the awaiting-code-or-pdf Issues and PRs awaiting code and/or a PDF from issue/PR-author label Oct 3, 2024
jsvine (Owner) commented Oct 3, 2024

Hi @rerik, and thanks for your interest in pdfplumber. Can you share the PDF and a minimal Python script that reproduces the problem?

rerik (Author) commented Oct 3, 2024

> Hi @rerik, and thanks for your interest in pdfplumber. Can you share the PDF and a minimal Python script that reproduces the problem?

Oh, I'm sorry, my mistake. I was absolutely sure I had given the link to the target file: https://storage.googleapis.com/alan-ai-knowledge-base/isu-knowledge-base/Book/fundamentals-book-1-6.pdf

Minimal Python script to reproduce:

import io
import requests

import pdfplumber as pp


SOURCE = 'https://storage.googleapis.com/alan-ai-knowledge-base/isu-knowledge-base/Book/fundamentals-book-1-6.pdf'

response = requests.get(SOURCE)
doc = pp.open(io.BytesIO(response.content))
page = doc.pages[1]
image = page.images[0]

page.crop((
    image['x0'],
    image['top'], 
    image['x1'], 
    image['bottom']
)).to_image(resolution=300).save('img.jpg')

jsvine (Owner) commented Oct 3, 2024

Thank you, this is very helpful. I can reproduce the issue, and will see if I can find a solution.

@JiachengSun0520

I have a similar problem when I try to read the stream using PIL.

import io

import pdfplumber
from PIL import Image

with pdfplumber.open("example.pdf") as pdf:
    # print(pdf.pages)
    page = pdf.pages[0]  # Extract the first page
    for page in pdf.pages:
        positions = []
        for im in page.images:
            # Bounding box of the image in page coordinates
            p = (im['x0'], im['top'], im['x1'], im['bottom'])
            print(im)
            image_data = im['stream'].get_data()
            # This call raises UnidentifiedImageError for some streams
            pil_image = Image.open(io.BytesIO(image_data))
            positions.append(p)
        print('positions', positions)

pil_image = Image.open(io.BytesIO(image_data))
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

PIL.UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x13ff37380>

This is the image that caused the problem:
{'x0': 12.0, 'y0': 28.32494, 'x1': 780.0, 'y1': 583.67504, 'width': 768.0, 'height': 555.3501, 'stream': <PDFStream(75): raw=241206, {'BitsPerComponent': 8, 'ColorSpace': PDFObjRef:76, 'Filter': /'FlateDecode', 'Height': 632, 'Interpolate': True, 'Length': 241206, 'Subtype': /'Image', 'Type': /'XObject', 'Width': 874}>, 'srcsize': (874, 632), 'imagemask': None, 'bits': 8, 'colorspace': [[/'ICCBased', <PDFStream(77): raw=3172, {'Alternate': /'DeviceRGB', 'Filter': /'FlateDecode', 'Length': 3172, 'N': 3}>]], 'mcid': None, 'tag': None, 'object_type': 'image', 'page_number': 8, 'top': 28.324960000000033, 'bottom': 583.67506, 'doctop': 4412.32496}

It looks like all PDFStreams with 'Filter': /'FlateDecode' have this problem.
I may be wrong, but all streams in my test case with 'Filter': /'DCTDecode' are fine.
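That observation fits how PDF filters work: DCTDecode data is a complete JPEG, which PIL.Image.open can identify, while FlateDecode is only zlib compression around bare pixel rows with no image header. A rough sketch of choosing a loader from the filter names follows. `loader_hint` and its return values are hypothetical, and it assumes the filter names have already been extracted as plain strings in the order listed in the stream dictionary.

```python
# Filters whose decoded payload is a complete image file with its own header.
FILE_FILTERS = {"DCTDecode", "JPXDecode"}
# Filters that only compress; what remains after decoding is bare pixel rows.
PIXEL_FILTERS = {"FlateDecode", "LZWDecode", "RunLengthDecode"}


def loader_hint(filter_names):
    """Suggest how to load an image stream, given its PDF filter names.

    The last filter in the list is the innermost encoding, so it decides
    what kind of payload is left once the other filters are undone.
    """
    if not filter_names:
        return "frombytes"  # unfiltered stream: raw pixel rows
    last = filter_names[-1]
    if last in FILE_FILTERS:
        return "open"
    if last in PIXEL_FILTERS:
        return "frombytes"
    return "unknown"  # e.g. CCITTFaxDecode or JBIG2Decode need special handling


print(loader_hint(["FlateDecode"]))
print(loader_hint(["DCTDecode"]))
```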

@JiachengSun0520

Solved it by using PIL.Image.frombytes().
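The comment does not include the code that was used. The following is a sketch of what such a fix might look like, assuming an 8-bit image whose decoded stream holds unpadded pixel rows and whose color space maps to a simple PIL mode. The helper names and the mode table are hypothetical.

```python
# Components per pixel for a few 8-bit PIL modes (illustrative, not exhaustive).
MODE_COMPONENTS = {"L": 1, "RGB": 3, "CMYK": 4}


def expected_pixel_bytes(width, height, components=3, bits_per_component=8):
    """Byte length of pixel data whose rows are padded to whole bytes."""
    row_bytes = (width * components * bits_per_component + 7) // 8
    return row_bytes * height


def pixels_to_image(pixels, width, height, mode="RGB"):
    """Wrap decoded pixel rows in a PIL image (requires Pillow)."""
    if len(pixels) != expected_pixel_bytes(width, height, MODE_COMPONENTS[mode]):
        raise ValueError("pixel data does not match width, height, and mode")
    # Imported here so the size check above can run without Pillow installed.
    from PIL import Image

    return Image.frombytes(mode, (width, height), pixels)


# Hypothetical usage with a pdfplumber image dict `im`:
#   width, height = im["srcsize"]
#   img = pixels_to_image(im["stream"].get_data(), width, height, "RGB")
#   img.save("image.png")
```

For the 874 x 632 image shown above, an RGB interpretation would need 874 * 632 * 3 = 1,657,104 bytes of decoded data. A length mismatch suggests a different color space, bit depth, or an encoding that still has to be undone.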
