Troubles with images extraction #1207
Comments
Hi @rerik, and thanks for your interest in |
Oh, I'm sorry, my bad. I was absolutely sure I had given the link to the target file: https://storage.googleapis.com/alan-ai-knowledge-base/isu-knowledge-base/Book/fundamentals-book-1-6.pdf
Minimal Python script to reproduce:
import io

import requests
import pdfplumber as pp

SOURCE = 'https://storage.googleapis.com/alan-ai-knowledge-base/isu-knowledge-base/Book/fundamentals-book-1-6.pdf'
response = requests.get(SOURCE)
doc = pp.open(io.BytesIO(response.content))
page = doc.pages[1]
image = page.images[0]
# Crop the page to the image's bounding box and render the crop
page.crop((
    image['x0'],
    image['top'],
    image['x1'],
    image['bottom'],
)).to_image(resolution=300).save('img.jpg')
Thank you, this is very helpful. I can reproduce the issue, and will see if I can find a solution. |
I have a similar problem when I try to read the stream using PIL:
PIL.UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x13ff37380>
This is the image that caused the problem:
It looks like all PDF streams with 'Filter': /FlateDecode have this problem.
Solved it by using PIL.Image.frombytes().
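A minimal sketch of why PIL.Image.frombytes() can work where PIL.Image.open() fails, under stated assumptions: a /FlateDecode stream is taken to hold zlib-compressed raw pixel samples rather than an image file with a recognizable header. The 2x2 payload, the names sniff_format and decode_flate_rgb, and the RGB, 8 bits per component, no-predictor layout are all hypothetical stand-ins chosen for illustration; a real stream may use a different color space, bit depth, or a predictor.

```python
import zlib

# Hypothetical stand-in for a PDF image stream: 2x2 RGB pixels, 8 bits per
# component, compressed the way a /FlateDecode filter would store them.
WIDTH, HEIGHT = 2, 2
pixels = bytes([
    255, 0, 0,    0, 255, 0,
    0, 0, 255,    255, 255, 255,
])
rawdata = zlib.compress(pixels)  # plays the role of stream.rawdata

# A few image-file signatures that a format sniffer could recognize.
MAGIC = {b"\x89PNG\r\n\x1a\n": "png", b"\xff\xd8\xff": "jpeg", b"GIF8": "gif"}


def sniff_format(data: bytes):
    """Return a format name if data starts with a known signature, else None."""
    for magic, name in MAGIC.items():
        if data.startswith(magic):
            return name
    return None


def decode_flate_rgb(data: bytes, width: int, height: int) -> bytes:
    """Inflate the stream and check it holds exactly width*height RGB samples."""
    samples = zlib.decompress(data)
    expected = width * height * 3
    if len(samples) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(samples)}")
    return samples


# Neither the compressed nor the inflated bytes carry an image-file header,
# so there is nothing for a file-based opener to identify.
samples = decode_flate_rgb(rawdata, WIDTH, HEIGHT)

try:
    from PIL import Image  # optional: only needed to build the image object

    # frombytes() takes the mode and size explicitly, so no header is needed.
    img = Image.frombytes("RGB", (WIDTH, HEIGHT), samples)
except ImportError:
    img = None
```

With a real pdfplumber image, the decoded bytes would presumably come from the stream's get_data() method (pdfminer's PDFStream) instead of zlib.decompress(), and the mode and size would have to be read from the image's own attributes rather than hard-coded as they are here.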
Describe the bug
This is a two-in-one problem.
First, the image's raw data (for example, doc.pages[1].images[0]['stream'].rawdata) appears to be broken. Calling PIL.Image.open(io.BytesIO(doc.pages[1].images[0]['stream'].rawdata)) raises an error: UnidentifiedImageError('cannot identify image file <_io.BytesIO object at 0x7f62058f5df0>'). If I save the image bytes directly, the resulting file is broken and cannot be opened. I also tried getting the raw bytes of this image with the pypdf library: the result contains about twice as many bytes and can easily be saved, so it is not a fundamental problem with the image itself.
Second, if I try to save a crop using this image's bbox, the crop misses the image.
This code saves
Instead of
Have you tried repairing the PDF?
Yes, I tried. In that case it just crashes on opening:
Environment