v0.10.0 #936

jsvine · 2023-07-16T22:31:13Z

[0.10.0] - 2023-07-16

Changed

Normalize color representation to tuple[float|int, ...] (#917). (57d51bb)
Replace Wand with pypdfium2 for page.to_image(...). (b049373)

Added

Add pdfplumber.repair(...) and .open(repair=True) (#824). (db6ae97)
Add Page.find_table(...) (#873). (3772af6)
Add quantize=True, colors=256, bits=8 arguments/defaults to PageImage.save(...). (b049373)
Extract and handle patterns + (some) color spaces. (97ca4b0)

Removed

Remove support for Python 3.7 (EOL'ed June 2023). (c9d24d5)
Remove vestigial 'font' and 'name' properties from PDF objects. (6d62054)

Fixed

Fix bug for re-crops that use relative=True (#914). (0de6da9)
Handle use_text_flow more consistently (#912). (b1db5b8)

As noted in #912, `use_text_flow` was not being handled consistently, as characters and words were being re-sorted without checking first if this parameter was set to `True`.

The rewrite of `char_begins_new_word` (a032019) mainly targets one edge case: text lines that backtrack when `use_text_flow` is set. It also aims to make the logic more explicit and easier to follow.
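To illustrate the consistency rule described above, here is a minimal sketch. It is not pdfplumber's actual code; `order_chars` and the sample characters are hypothetical. The only point is that re-sorting happens when `use_text_flow` is off and is skipped when it is on.

```python
from operator import itemgetter


def order_chars(chars, use_text_flow=False):
    """Return characters in processing order (hypothetical helper)."""
    if use_text_flow:
        # Keep the order in which the characters were emitted by the PDF.
        return list(chars)
    # Otherwise sort top-to-bottom, then left-to-right.
    return sorted(chars, key=itemgetter("top", "x0"))


# Sample characters, deliberately listed out of visual order.
chars = [
    {"text": "b", "x0": 20, "top": 10},
    {"text": "a", "x0": 10, "top": 10},
    {"text": "c", "x0": 5, "top": 30},
]

flow_text = "".join(c["text"] for c in order_chars(chars, use_text_flow=True))
sorted_text = "".join(c["text"] for c in order_chars(chars, use_text_flow=False))
```

With the flag set, the stream order survives untouched; without it, the characters are rearranged by position.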

When using relative=True for a re-crop, pdfplumber was passing the wrong bounding box to the cropping function. This commit fixes that bug and also refactors CroppedPage.__init__(...) for clarity and consistency's sake.
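A rough sketch of the coordinate translation involved, under the assumption that a relative bounding box is measured from the top-left corner of the page it is cropped from. `to_absolute_bbox` is a hypothetical helper, not the refactored `CroppedPage.__init__(...)` itself.

```python
def to_absolute_bbox(relative_bbox, parent_bbox):
    """Shift a bbox given relative to parent_bbox into absolute page coordinates."""
    x0, top, x1, bottom = relative_bbox
    # The parent's top-left corner is the origin for relative coordinates.
    origin_x, origin_y = parent_bbox[0], parent_bbox[1]
    return (x0 + origin_x, top + origin_y, x1 + origin_x, bottom + origin_y)


# A first crop, expressed relative to an already-cropped parent region.
parent = (100, 200, 400, 600)
first_crop = to_absolute_bbox((50, 50, 250, 350), parent)

# A re-crop must be offset by the first crop's bbox, not by the original parent's.
second_crop = to_absolute_bbox((10, 20, 110, 120), first_crop)
```

Offsetting the re-crop by the wrong box would silently move the region elsewhere on the page, which is the class of bug the commit message describes.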

This commit normalizes the type representation of `stroking_color` and `non_stroking_color` values. Thanks to @dhdaines for pointing out this inconsistency.

Previously, `pdfplumber` passed along `pdfminer.six`'s colors without normalization. Due to quirks in `pdfminer.six`'s color handling, those values could be floats, ints, lists, or tuples. This commit normalizes all color values (when non-None) into n-tuples, where `(val,)` represents grayscale, `(val, val, val)` represents RGB, and `(val, val, val, val)` represents CMYK.

This should solve the consistency issue, although it might break code that filters for non-tuple values, e.g., `[c for c in page.chars if c["non_stroking_color"] == [1, 0, 0]]`. Although breaking changes are unpleasant, I think the tradeoff for longer-term consistency is worth it.
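A minimal sketch of the normalization described above, written from the description rather than copied from the commit. The function below is an illustrative stand-in and may differ from pdfplumber's real implementation.

```python
from typing import Optional, Sequence, Tuple, Union

Number = Union[int, float]


def normalize_color(
    color: Union[None, Number, Sequence[Number]]
) -> Optional[Tuple[Number, ...]]:
    """Coerce a bare number, list, or tuple into an n-tuple; pass None through."""
    if color is None:
        return None
    if isinstance(color, (int, float)):
        # A bare number is treated as a one-component (grayscale) color.
        return (color,)
    return tuple(color)


gray = normalize_color(0.5)
rgb = normalize_color([1, 0, 0])
cmyk = normalize_color((0, 0, 0, 1))

# Why list-based filters stop matching: a tuple never equals a list in Python.
matches_old_filter = rgb == [1, 0, 0]
matches_new_filter = rgb == (1, 0, 0)
```

Code that previously compared against a list would need to compare against a tuple instead.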

Previously, `pdfplumber.Page` had these table-getting methods:

- `.find_tables(...)`
- `.extract_tables(...)`
- `.extract_table(...)`

For consistency/completeness's sake, this commit adds:

- `.find_table(...)`

... which, analogous to `.extract_table(...)`, returns the largest table on the page. Indeed, `.extract_table(...)` now uses `.find_table(...)` beneath the hood.

Thanks to @pdille for the suggestion in #864.
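One way "largest table" could be decided is sketched below. The assumption here is that "largest" means the most cells, with ties broken by position (topmost, then leftmost). `TableStub` and `pick_largest` are hypothetical stand-ins, not pdfplumber classes or methods.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

BBox = Tuple[float, float, float, float]


@dataclass
class TableStub:
    """Stand-in for a detected table: a bounding box plus its cell boxes."""
    bbox: BBox
    cells: List[BBox]


def pick_largest(tables: List[TableStub]) -> Optional[TableStub]:
    """Return the table with the most cells, or None if there are no tables."""
    if not tables:
        return None
    # Most cells first; ties go to the topmost, then the leftmost table.
    return min(tables, key=lambda t: (-len(t.cells), t.bbox[1], t.bbox[0]))


cell = (0, 0, 1, 1)
tables = [
    TableStub(bbox=(0, 300, 100, 400), cells=[cell] * 4),
    TableStub(bbox=(0, 50, 100, 150), cells=[cell] * 6),
    TableStub(bbox=(200, 50, 300, 150), cells=[cell] * 6),
]
largest = pick_largest(tables)
```

Returning `None` for a page without tables mirrors how a single-table lookup would need to behave when nothing is found.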

Inspired by #828.

The PDF reference allows for "colors" to be defined as a series of numbers and/or (much less commonly) patterns. (See p. 288 and section 4.6 here: https://ghostscript.com/~robin/pdf_reference17.pdf)

This commit separates out the pattern component of colors into its own attributes, `stroking_pattern` and `non_stroking_pattern`, so that it doesn't muddle the interpretation of standard colors' tuple-of-numbers representation.

This commit also adds code that attempts to fetch the `ncs`/`scs` color space of each object. Due to current limitations of pdfminer.six, however, the only such color space immediately available is the `ncs` (non-stroking color space) property of char objects.

This commit swaps out Wand (and its non-Python dependencies ImageMagick and Ghostscript) for pypdfium2 for PageImage rendering. This has some advantages:

- Less finicky: Wand often caused users problems, due to "MagickWand shared library not found" and "PolicyError: not authorized `PDF'" issues. By contrast, pypdfium2 seems (at least at first) to be more self-contained and not to require any system-tweaking.
- Faster: pypdfium2 appears to render images more quickly than Wand (see @cmdlineuser's tests in #899).
- More flexible: pypdfium2 appears to generate images with greater color depth. By default, pdfplumber quantizes those images so that they save/display compactly (in fact, with smaller file sizes than the previous code); this commit also adds parameters to retain all/more of the original, more detailed colors.

Thanks to @cmdlineuser in #899 for the suggestion.

This commit adds convenience methods to repair PDFs on the fly and/or to write repaired PDFs to disk. Currently, this does so via Ghostscript, using the method we've asked many users to try by following the instructions at https://superuser.com/questions/278562/how-can-i-fix-repair-a-corrupted-pdf-file. Now, hopefully, this saves folks a few steps.
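For context, a sketch of what a Ghostscript-based repair step can look like. The flags below follow the commonly cited `pdfwrite` approach and are an assumption here; they are not necessarily the exact arguments pdfplumber passes. `build_repair_command` and `repair_pdf` are hypothetical names.

```python
import shutil
import subprocess
from typing import List


def build_repair_command(executable: str, infile: str, outfile: str) -> List[str]:
    """Assemble a Ghostscript call that rewrites infile as a fresh PDF."""
    return [
        executable,
        "-o", outfile,                # output path (also implies batch mode)
        "-sDEVICE=pdfwrite",          # re-emit the document through the PDF writer
        "-dPDFSETTINGS=/prepress",    # assumed quality preset
        infile,
    ]


def repair_pdf(infile: str, outfile: str) -> None:
    """Run Ghostscript if it is on PATH (assumed to be named `gs`)."""
    executable = shutil.which("gs")
    if executable is None:
        raise RuntimeError("Ghostscript executable not found on PATH")
    subprocess.run(build_repair_command(executable, infile, outfile), check=True)


command = build_repair_command("gs", "corrupted.pdf", "repaired.pdf")
```

Separating command construction from execution keeps the part that needs Ghostscript installed small and makes the arguments easy to inspect.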

codecov · 2023-07-16T22:31:30Z

Codecov Report

Merging #936 (28c0afc) into stable (ae676ae) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##            stable      #936   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           17        18    +1     
  Lines         1532      1585   +53     
=========================================
+ Hits          1532      1585   +53

Impacted Files	Coverage Δ
pdfplumber/__init__.py	`100.00% <100.00%> (ø)`
pdfplumber/_version.py	`100.00% <100.00%> (ø)`
pdfplumber/display.py	`100.00% <100.00%> (ø)`
pdfplumber/page.py	`100.00% <100.00%> (ø)`
pdfplumber/pdf.py	`100.00% <100.00%> (ø)`
pdfplumber/repair.py	`100.00% <100.00%> (ø)`
pdfplumber/utils/text.py	`100.00% <100.00%> (ø)`

RitchieP and others added 28 commits May 2, 2023 14:46

Update README.md

8bf5121

Fixed a small typo

Merge pull request #877 from RitchieP/patch-1

5aaabcd

Update README.md

Handle use_text_flow more consistently

b1db5b8

Rewrite char_begins_new_word for ease & edge cases

a032019

Add another test for use_text_flow

474f74c

Fix bug for re-crops that use relative=True (#914)

0de6da9

Remove vestigial 'font' and 'name' properties

6d62054

Move expanded notes on colors to docs/colors.md

ea5e275

Add Python 3.11 to supported versions

788857e

Remove tox from setup.cfg

df0e027

Add repair notes to docs and bug report template

3ab6649

Update examples

de65bd8

Update CHANGELOG.md for v0.10.0

0cc1047

Bump to v0.10.0

70c25ba

Merge branch 'develop' of github.com:jsvine/pdfplumber into develop

fbc6fac

Fix tuple/Tuple typesig for earlier Python versions

10ae47d

Fix PNG size tests for other platforms

b5c268d

Update pandas version in requirements-dev

00efbf0

Update .github/workflows/tests.yml

34070b4

Remove support for Python 3.7

c9d24d5

Update CHANGELOG.md

9aa57de

Update CITATION.cff

ab9164b

Fix missing trailing parens in CHANGELOG.md

28c0afc

jsvine merged commit 00386ad into stable Jul 16, 2023
