Make sure to dereference MediaBox in /Pages #1027

Open
wants to merge 4 commits into
base: master

dhdaines
Contributor

@dhdaines dhdaines commented Jul 31, 2024

There was a regression in the latest release as noted in #1004 - object references are everywhere! Beware! Fixes #1004

There is a new test in test_tools_pdf2txt.py (hopefully it is getting run?) but also I tested it on Error.pdf from that issue.
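To illustrate the failure mode described above, here is a minimal, self-contained sketch of what "dereferencing MediaBox" means. It is not the actual patch to pdfminer/pdfpage.py: `ObjRef`, `resolve_all`, and `media_box` are hypothetical stand-ins written for this example. In pdfminer.six itself, the corresponding helper for following an indirect reference is `resolve1` from `pdfminer.pdftypes`.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class ObjRef:
    """Hypothetical stand-in for an indirect reference such as PDFObjRef."""

    objid: int
    table: Dict[int, Any]  # toy object table: object id -> stored value

    def resolve(self) -> Any:
        return self.table[self.objid]


def resolve_all(obj: Any) -> Any:
    """Follow indirect references until a direct object is reached."""
    while isinstance(obj, ObjRef):
        obj = obj.resolve()
    return obj


def media_box(attrs: Dict[str, Any]) -> list:
    """Return MediaBox as a plain list, dereferencing the box and its items."""
    box = resolve_all(attrs["MediaBox"])
    return [resolve_all(value) for value in box]


# Assumed scenario: the /Pages node stores MediaBox as a reference ("7 0 R")
# rather than as an inline array.
objects = {7: [0, 0, 612, 792]}
inherited = {"MediaBox": ObjRef(7, objects)}

try:
    list(inherited["MediaBox"])  # iterating the reference itself fails
except TypeError as exc:
    print("without dereferencing:", exc)

print("with dereferencing:", media_box(inherited))
```

The point of the sketch is only that any attribute inherited from /Pages may be an indirect reference, so it has to be resolved before it is iterated or unpacked.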

Checklist

  • [x] I have read CONTRIBUTING.md.
  • [x] I have added a concise human-readable description of the change to CHANGELOG.md.
  • [x] I have tested that this fix is effective or that this feature works.
  • [x] I have added docstrings to newly created methods and classes.
  • [x] I have updated the README.md and the readthedocs documentation. Or verified that this is not necessary.

pdfminer/pdfpage.py (outdated review thread, resolved)
@chenxi-briink

chenxi-briink commented Aug 14, 2024

@felixxm it would be great if this PR could be merged soon so other libraries like pdfplumber could depend on latest version of pdfminer.

Currently pdfplumber depends on 20231228, which throws TypeError: 'PDFObjRef' object is not iterable in some other situations when extracting text from pages. (Sorry, I couldn't provide more technical details since I'm not familiar with the internals.)

@felixxm
Contributor

felixxm commented Aug 14, 2024

> @felixxm it would be great if this PR could be merged soon so other libraries like pdfplumber could depend on latest version of pdfminer.

I'd love to see it merged, but I don't have such a power 🦸 I'm not a maintainer of this package.

@raunakdoesdev

+1 on this - would love to see it merged @pietermarsman

@blackelk

+1

@walterheck

@pietermarsman I would really like to see that release. Do you have any plans for it? I see you have had no GitHub activity since August, which is a little concerning for the maintainer of this repo :)

@dhdaines
Contributor Author

In the meantime (shameless self-promotion) you could try PAVÉS which implements a mostly pdfminer.six compatible interface and ought to be somewhat more robust (your mileage may vary). It is also considerably faster on large documents due to the ability to parallelize layout analysis across multiple CPUs.

Development

Successfully merging this pull request may close these issues.

TypeError: 'PDFObjRef' object is not iterable
7 participants