Make sure to dereference MediaBox in /Pages #1027
base: master
That is invalid PDF, but there are lots of invalid PDFs out there
@felixxm it would be great if this PR could be merged soon so that other libraries like pdfplumber could depend on the latest version of pdfminer. Currently pdfplumber depends on
I'd love to see it merged, but I don't have such a power 🦸 I'm not a maintainer of this package.
+1 on this - would love to see it merged @pietermarsman
+1
@pietermarsman I would really like to see that release; do you have any plans for it? I see you have had no GitHub activity since August, which is a little concerning for the maintainer of this repo :)
In the meantime (shameless self-promotion) you could try PAVÉS, which implements a mostly pdfminer.six-compatible interface and ought to be somewhat more robust (your mileage may vary). It is also considerably faster on large documents because it can parallelize layout analysis across multiple CPUs.
There was a regression in the latest release as noted in #1004 - object references are everywhere! Beware! Fixes #1004
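For readers unfamiliar with the failure mode, here is a minimal, self-contained sketch of what "dereferencing MediaBox" means. It is not pdfminer.six's actual code: the `Ref` class, the `resolve` and `media_box` helpers, and the `objects` table are all hypothetical stand-ins, assuming only that a /Pages dictionary may store MediaBox (or its individual numbers) as indirect object references rather than direct values.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class Ref:
    """Hypothetical stand-in for an indirect reference such as `5 0 R`."""
    objid: int


def resolve(obj: Any, objects: Dict[int, Any], max_hops: int = 16) -> Any:
    """Follow indirect references until a direct object is reached.

    Gives up after max_hops hops so a reference cycle cannot loop forever.
    """
    for _ in range(max_hops + 1):
        if not isinstance(obj, Ref):
            return obj
        obj = objects[obj.objid]
    raise ValueError("reference chain too deep (possible cycle)")


def media_box(pages_node: Dict[str, Any],
              objects: Dict[int, Any]) -> Optional[List[float]]:
    """Return MediaBox as four floats, resolving the array and each entry."""
    box = resolve(pages_node.get("MediaBox"), objects)
    if box is None:
        return None
    return [float(resolve(value, objects)) for value in box]


# Hypothetical object table: the box itself and one coordinate are indirect.
objects = {5: [0, 0, Ref(6), 792], 6: 612}
pages = {"Type": "Pages", "MediaBox": Ref(5)}
print(media_box(pages, objects))  # [0.0, 0.0, 612.0, 792.0]
```

Reading `pages["MediaBox"]` directly here would hand back a `Ref` instead of a list, which is the kind of surprise that an unresolved reference causes downstream.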
There is a new test in test_tools_pdf2txt.py (hopefully it is getting run?), but I also tested it on Error.pdf from that issue.
Checklist