Make sure to dereference MediaBox in /Pages #1027

dhdaines · 2024-07-31T19:58:50Z

There was a regression in the latest release, as noted in #1004: object references are everywhere! Beware! Fixes #1004.

There is a new test in test_tools_pdf2txt.py (hopefully it is getting run?) but also I tested it on Error.pdf from that issue.
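To illustrate the kind of dereferencing involved, here is a minimal sketch. It is not the code in pdfminer/pdfpage.py: `Ref`, `resolve`, and `inherited_mediabox` are hypothetical names, and plain dicts and lists stand in for PDF dictionaries and arrays. It assumes a MediaBox can be inherited from a /Pages ancestor and that both the array and its individual elements may be indirect references.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class Ref:
    """Hypothetical stand-in for an indirect object reference such as `12 0 R`."""

    objid: int


def resolve(obj: Any, objects: Dict[int, Any]) -> Any:
    """Follow indirect references until a direct object is reached."""
    while isinstance(obj, Ref):
        obj = objects[obj.objid]
    return obj


def inherited_mediabox(
    page: Dict[str, Any], objects: Dict[int, Any]
) -> Optional[List[float]]:
    """Walk up the /Parent chain and return the first MediaBox, fully dereferenced."""
    node = resolve(page, objects)
    while node is not None:
        if "MediaBox" in node:
            box = resolve(node["MediaBox"], objects)
            # Each element of the array may itself be an indirect reference.
            return [float(resolve(v, objects)) for v in box]
        node = resolve(node.get("Parent"), objects)
    return None


# Toy object table: the /Pages node (object 1) stores MediaBox as a reference
# to object 3, and one coordinate of that array is a reference to object 4.
objects = {
    1: {"Type": "Pages", "MediaBox": Ref(3)},
    2: {"Type": "Page", "Parent": Ref(1)},
    3: [0, 0, Ref(4), 792],
    4: 612,
}

print(inherited_mediabox(objects[2], objects))  # [0.0, 0.0, 612.0, 792.0]
```

Resolving at every level (the node, the array, and each element) is what keeps a stray reference from leaking into code that expects plain numbers.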

Checklist

[x] I have read CONTRIBUTING.md.
[x] I have added a concise human-readable description of the change to CHANGELOG.md.
[x] I have tested that this fix is effective or that this feature works.
[x] I have added docstrings to newly created methods and classes.
[x] I have updated the README.md and the readthedocs documentation. Or verified that this is not necessary.

pdfminer/pdfpage.py

That is invalid PDF, but there are lots of invalid PDFs out there

chenxi-briink · 2024-08-14T02:22:09Z

@felixxm it would be great if this PR could be merged soon so other libraries like pdfplumber could depend on latest version of pdfminer.

Currently pdfplumber depends on 20231228, which throws TypeError: 'PDFObjRef' object is not iterable in some other situations when extracting text from pages (sorry I can't provide more technical details since I'm not familiar with the internals).
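For readers unfamiliar with this error, the sketch below reproduces the same kind of failure with a toy class. The `PDFObjRef` defined here is only a stand-in written for this example, not pdfminer's class; it assumes nothing more than a reference object that wraps a target and is not itself iterable.

```python
class PDFObjRef:
    """Toy stand-in (NOT pdfminer's class): wraps a target object."""

    def __init__(self, target):
        self.target = target

    def resolve(self):
        """Return the object this reference points to."""
        return self.target


# A MediaBox stored as an indirect reference rather than a direct array.
mediabox = PDFObjRef([0, 0, 612, 792])

try:
    # Iterating over the reference itself fails: it is not a list.
    coords = [float(v) for v in mediabox]
except TypeError as exc:
    message = str(exc)
    print(message)  # 'PDFObjRef' object is not iterable

# Resolving first yields the underlying array, which iterates normally.
coords = [float(v) for v in mediabox.resolve()]
print(coords)  # [0.0, 0.0, 612.0, 792.0]
```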

felixxm · 2024-08-14T04:48:39Z

> @felixxm it would be great if this PR could be merged soon so other libraries like pdfplumber could depend on latest version of pdfminer.

I'd love to see it merged, but I don't have such a power 🦸 I'm not a maintainer of this package.

raunakdoesdev · 2024-11-13T01:00:34Z

+1 on this - would love to see it merged @pietermarsman

blackelk · 2024-11-21T19:52:19Z

+1

walterheck · 2024-11-28T15:18:14Z

@pietermarsman I would really like to see that release. Do you have any plans for it? I see you have had no GitHub activity since August, which is a little concerning for the maintainer of this repo :)

dhdaines · 2024-12-30T21:01:38Z

In the meantime (shameless self-promotion) you could try PAVÉS which implements a mostly pdfminer.six compatible interface and ought to be somewhat more robust (your mileage may vary). It is also considerably faster on large documents due to the ability to parallelize layout analysis across multiple CPUs.

fix: dereference MediaBox (fixes: pdfminer#1004)

ad101c1

dhdaines mentioned this pull request Jul 31, 2024

Update version of pdfminer-six to 20240706 jsvine/pdfplumber#1166

Open

felixxm reviewed Aug 1, 2024

dhdaines added 3 commits August 1, 2024 10:14

feat: be defensive against missing MediaBox

548c018

That is invalid PDF, but there are lots of invalid PDFs out there
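One way to be defensive here is to validate the value and fall back to a default, sketched below. This is an illustration under assumptions, not the code from this commit: `parse_mediabox` is a hypothetical helper, and the US Letter default (612 by 792 points) is a choice made for the example only.

```python
from typing import Any, Tuple

Rect = Tuple[float, float, float, float]

# Assumption for illustration: fall back to US Letter, in points.
DEFAULT_MEDIABOX: Rect = (0.0, 0.0, 612.0, 792.0)


def parse_mediabox(value: Any, default: Rect = DEFAULT_MEDIABOX) -> Rect:
    """Return a 4-number rectangle, or `default` if `value` is missing or malformed."""
    # A missing MediaBox arrives as None; anything that is not a
    # 4-element sequence is treated the same way.
    if not isinstance(value, (list, tuple)) or len(value) != 4:
        return default
    try:
        x0, y0, x1, y1 = (float(v) for v in value)
    except (TypeError, ValueError):
        # Elements that cannot be converted to numbers are also invalid.
        return default
    return (x0, y0, x1, y1)


print(parse_mediabox(None))              # (0.0, 0.0, 612.0, 792.0)
print(parse_mediabox([0, 0, 595, 842]))  # (0.0, 0.0, 595.0, 842.0)
```

The trade-off is that a silent fallback hides the fact that the PDF is invalid, so a real implementation might also want to log a warning.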

fix: resolve1 is redundant here, we know its a List

4cbee43

fix: further cleanup and extra super robustness

9fa01d2

rain01 approved these changes Nov 21, 2024

felixxm approved these changes Dec 30, 2024
