Releases · jsvine/pdfplumber

16 Jul 22:37

jsvine

v0.10.0

00386ad

v0.10.0

Changed

Normalize color representation to tuple[float|int, ...] (#917). (57d51bb)
Replace Wand with pypdfium2 for page.to_image(...). (b049373)
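
The color normalization noted above can be pictured with a small sketch. It assumes that color values might previously have arrived as `None`, a single number, or a list; the helper name `normalize_color_sketch` is hypothetical and is not the library's own function.

```python
from typing import Optional, Sequence, Tuple, Union

Number = Union[int, float]
Color = Tuple[Number, ...]


def normalize_color_sketch(
    color: Union[None, Number, Sequence[Number]]
) -> Optional[Color]:
    """Hypothetical helper: coerce a color value into a tuple of numbers."""
    if color is None:
        return None  # no color set on the object
    if isinstance(color, (int, float)):
        return (color,)  # a single (e.g. grayscale) component
    return tuple(color)  # e.g. an RGB or CMYK list


print(normalize_color_sketch(0.5))
print(normalize_color_sketch([1, 0, 0]))
```

With a single representation, downstream code can compare or hash colors without first checking their type.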

Added

Add pdfplumber.repair(...) and .open(repair=True) (#824). (db6ae97)
Add Page.find_table(...) (#873). (3772af6)
Add quantize=True, colors=256, bits=8 arguments/defaults to PageImage.save(...). (b049373)
Extract and handle patterns + (some) color spaces. (97ca4b0)

Removed

Remove support for Python 3.7 (EOL'ed June 2023). (c9d24d5)
Remove vestigial 'font' and 'name' properties from PDF objects. (6d62054)

Fixed

Fix bug for re-crops that use relative=True (#914). (0de6da9)
Handle use_text_flow more consistently (#912). (b1db5b8)

13 Apr 12:58

jsvine

v0.9.0

255eaac

v0.9.0

Changed

Make word segmentation (via WordExtractor.char_begins_new_word(...)) more explicit and rigorous; should help in catching edge-cases in the future. (6acd580 + ebb93ea + #840)
Use curve_edge objects (instead of just line and rect_edge objects) in default table-detection strategy. (6f6b465 + #858)
By default, expand ligatures into their constituent letters (e.g., ﬃ to ffi), and add the expand_ligatures boolean parameter to text-extraction methods. (86e935d + #598)
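
As a rough illustration of ligature expansion, the sketch below maps a handful of Unicode ligature characters to their letters. The mapping and the function name `expand_ligatures_sketch` are assumptions made for this example; the set of ligatures the library actually handles may differ.

```python
# Hypothetical mapping for illustration; written with escapes for clarity:
# U+FB00 ff, U+FB01 fi, U+FB02 fl, U+FB03 ffi, U+FB04 ffl
LIGATURES_SKETCH = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}


def expand_ligatures_sketch(text: str, expand_ligatures: bool = True) -> str:
    """Replace each known ligature character with its constituent letters."""
    if not expand_ligatures:
        return text  # leave ligature characters untouched
    return "".join(LIGATURES_SKETCH.get(ch, ch) for ch in text)


print(expand_ligatures_sketch("o\ufb03ce"))
```

Passing `expand_ligatures=False` in this sketch returns the text unchanged, which mirrors the idea of an opt-out boolean.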

Added

Add Page.extract_text_lines(...) method. (4b37397 + #852)
Add main_group, return_groups, return_chars parameters to Page.search(...). (4b37397)
Add .curve_edges property to PDF and Page. (6f6b465)
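
To make the new search parameters more concrete, here is a minimal sketch of a regex search over char objects. It assumes each char is a dict with a one-character `text` plus `x0`, `top`, `x1`, and `bottom` keys, and it guesses at the semantics: `main_group` picks the regex group used for the text and bounding box, while `return_groups` and `return_chars` toggle extra keys. The function `search_sketch` is hypothetical and is not the library's implementation.

```python
import re


def search_sketch(chars, pattern, main_group=0, return_groups=True, return_chars=True):
    """Search the joined char text and report a bounding box per match."""
    # Assumes one character per char object, so string indices map to chars.
    text = "".join(c["text"] for c in chars)
    results = []
    for m in re.finditer(pattern, text):
        start, end = m.span(main_group)
        matched = chars[start:end]
        # Skip groups that did not match, or that matched only whitespace.
        if not matched or not m.group(main_group).strip():
            continue
        result = {
            "text": m.group(main_group),
            "x0": min(c["x0"] for c in matched),
            "top": min(c["top"] for c in matched),
            "x1": max(c["x1"] for c in matched),
            "bottom": max(c["bottom"] for c in matched),
        }
        if return_groups:
            result["groups"] = m.groups()
        if return_chars:
            result["chars"] = matched
        results.append(result)
    return results


# Fake chars for the string "ID: 42", each 5 units wide on one line.
demo_chars = [
    {"text": ch, "x0": i * 5, "x1": i * 5 + 5, "top": 0, "bottom": 10}
    for i, ch in enumerate("ID: 42")
]
demo_results = search_sketch(demo_chars, r"ID: (\d+)", main_group=1)
print(demo_results[0]["text"], demo_results[0]["x0"], demo_results[0]["x1"])
```

Choosing a group other than 0 narrows the reported text and box to the captured portion, here the digits rather than the whole "ID: 42" match.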

Fixed

Fix handling of bytes-typed fontnames. (9441ff7 + #461 + #842)
Fix handling of whitespace-only and empty results of Page.search(...). (6f6b465 + #853)

14 Feb 03:05

jsvine

v0.8.0

b6847ad

v0.8.0

Changed

Change the (still experimental) Page/utils.extract_text(layout=True) approach so that it pads, to the degree necessary, the ends of lines with spaces and the end of the text with blank lines to achieve better mimicry of page layout. (d3662de)
Refactor handling of pts attribute and, in doing so, deprecate the curve_obj["points"] attribute, and fix PageImage.draw_line(...)'s handling of diagonal lines. (216bedd)
Breaking change: In Page.extract_table[s](...), keep_blank_chars must now be passed as text_keep_blank_chars, for consistency's sake. (c4e1b29)
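
The padding behavior described for layout mode can be sketched in a few lines. This is only an illustration under assumptions: the helper `pad_layout_sketch` is hypothetical, the target size is given directly in characters, and trailing blank lines are filled with spaces, which may not match what the library emits.

```python
def pad_layout_sketch(lines, width_chars, height_chars):
    """Right-pad each line to width_chars and append blank lines up to height_chars."""
    padded = [line.ljust(width_chars) for line in lines]
    missing = max(0, height_chars - len(padded))
    padded += [" " * width_chars] * missing  # blank lines at the end of the text
    return padded


demo_page = pad_layout_sketch(["ab", "c"], width_chars=4, height_chars=3)
print("\n".join(repr(line) for line in demo_page))
```

Every output line then has the same length, so the text block keeps the proportions of the page even where the page is empty.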

Added

Add Page.extract_table[s](...) support for all Page.extract_text(...) keyword arguments. (c4e1b29)
Add height and width keyword arguments to Page.to_image(...). (#798 + 93f7dbd)
Add layout_width, layout_width_chars, layout_height, and layout_height_chars parameters to Page/utils.extract_text(layout=True). (d3662de)
Add CITATION.cff. (#755) [h/t @joaoccruz]

Fixed

Fix simple edge-case for when page rotation is (incorrectly) set to None. (#811) [h/t @toshi1127]

Development Changes

Convert utils.py into utils/ submodules. Retains same interface, just an improvement in organization. (6351d97)
Fix typing hints to include io.BytesIO. (d4107f6) [h/t @conitrade-as]
Refactor text-extraction utilities, paving way for better consistency across various entrypoints to text extraction (e.g., via utils.extract_text(...), via Page.extract_text(...), via Page.extract_table(...)). (3424b57)

Contributors

toshi1127, joaoccruz, and conitrade-as

22 Nov 18:03

jsvine

v0.7.6

f6741d3

v0.7.6

Changed

Bump pinned pdfminer.six version to 20221105. (e63a038)

Fixed

Restore text attribute to .textboxhorizontal/etc., regression introduced in 9587cc7 / v0.6.2. (8a0c126)
Fix lru_cache usage, which is discouraged for class methods due to garbage-collection issues. (e3142a0)
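
The garbage-collection concern with `lru_cache` on methods can be reproduced with the standard library alone. In the sketch below, the class names `Leaky` and `PerInstance` are hypothetical; the per-instance cache is one common workaround and is not necessarily how the commit above resolved the issue.

```python
import gc
import weakref
from functools import lru_cache


class Leaky:
    @lru_cache(maxsize=None)  # one cache shared by the class; keys include self
    def compute(self, x):
        return x * 2


class PerInstance:
    def __init__(self):
        # The cache is stored on the instance, so it is freed together with it.
        self.compute = lru_cache(maxsize=None)(self._compute)

    def _compute(self, x):
        return x * 2


def is_collected(cls):
    """Create an instance, use its cache, drop it, and check whether it was freed."""
    obj = cls()
    obj.compute(1)
    ref = weakref.ref(obj)
    del obj
    gc.collect()
    return ref() is None


print(is_collected(Leaky), is_collected(PerInstance))
```

In the first class the shared cache holds a strong reference to every instance it has seen, so those instances outlive their last user-visible reference.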

Development Changes

Upgrade nbexec development requirement from 0.1.0 to 0.2.0. (30dac25)

01 Oct 13:50

jsvine

v0.7.5

5aca57c

v0.7.5

Added

Add PageImage.show() as alias for PageImage.annotated.show(). (#715 + 5c7787b)

Fixed

Fixed issue where the py.typed file was not included in the PyPI distribution. (#698 + #703 + 6908487) [h/t @jhonatan-lopes]
Reinstated the ability to call utils.cluster_objects(...) with any hashable value (str, int, tuple, etc.) as the key_fn parameter, reverting breaking change in 58b1ab1. (#691 + 1e97656) [h/t @jfuruness]
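
Accepting either a callable or a plain hashable key can be sketched as follows. The functions `cluster_values_sketch` and `cluster_objects_sketch` are hypothetical simplifications written for this example, assuming dict-like objects and a one-dimensional tolerance; they are not the library's code.

```python
from operator import itemgetter


def cluster_values_sketch(values, tolerance):
    """Group sorted numbers; a new group starts when the gap exceeds tolerance."""
    groups = []
    for v in sorted(values):
        if groups and v - groups[-1][-1] <= tolerance:
            groups[-1].append(v)
        else:
            groups.append([v])
    return groups


def cluster_objects_sketch(objs, key_fn, tolerance):
    """Cluster objects by a numeric key; key_fn may be a callable or a lookup key."""
    if not callable(key_fn):
        key_fn = itemgetter(key_fn)  # e.g. "top" becomes obj["top"]
    value_to_cluster = {}
    distinct = {key_fn(o) for o in objs}
    for i, group in enumerate(cluster_values_sketch(distinct, tolerance)):
        for v in group:
            value_to_cluster[v] = i
    clusters = {}
    for o in objs:
        clusters.setdefault(value_to_cluster[key_fn(o)], []).append(o)
    return [clusters[i] for i in sorted(clusters)]


demo_words = [
    {"text": "a", "top": 10},
    {"text": "b", "top": 10.5},
    {"text": "c", "top": 30},
]
by_key = cluster_objects_sketch(demo_words, "top", tolerance=1)
by_fn = cluster_objects_sketch(demo_words, lambda w: w["top"], tolerance=1)
print([[w["text"] for w in cluster] for cluster in by_key])
```

Both calls produce the same clusters, which is the convenience the entry above describes as reinstated.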

Development Changes

Update Wand version in requirements.txt from >=0.6.7 to >=0.6.10. (#713 + 3457d79)

Contributors

jfuruness and jhonatan-lopes

20 Jul 22:38

jsvine

v0.7.4

11f8ce3

v0.7.4

Added

Add utils.outside_bbox(...) and Page.outside_bbox(...) methods, which are the inverse of utils.within_bbox(...) and Page.within_bbox(...). (#369 + 3ab1cc4)
Add strict=True/False parameter to Page.crop(...), Page.within_bbox(...), and Page.outside_bbox(...); default is True, while False bypasses the test_proposed_bbox(...) check. (#421 + 71ad60f)
Add more guidance to exception when .to_image(...) raises PIL.Image.DecompressionBombError. (#413 + b6ff9e8)
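
The relationship between the within and outside filters, and the effect of a strict check, can be sketched with plain dicts. All names here (`is_within`, `is_outside`, `check_bbox_sketch`) are hypothetical, and the sketch assumes bounding boxes in `(x0, top, x1, bottom)` order, with objects that straddle the box edge belonging to neither filter.

```python
def is_within(obj, bbox):
    """True if obj lies entirely inside bbox."""
    x0, top, x1, bottom = bbox
    return obj["x0"] >= x0 and obj["top"] >= top and obj["x1"] <= x1 and obj["bottom"] <= bottom


def is_outside(obj, bbox):
    """True if obj does not overlap bbox at all."""
    x0, top, x1, bottom = bbox
    return obj["x1"] <= x0 or obj["x0"] >= x1 or obj["bottom"] <= top or obj["top"] >= bottom


def check_bbox_sketch(bbox, page_bbox, strict=True):
    """With strict=True, reject a bbox that is not fully inside the page."""
    as_obj = dict(zip(("x0", "top", "x1", "bottom"), bbox))
    if strict and not is_within(as_obj, page_bbox):
        raise ValueError(f"Bounding box {bbox} is not fully within {page_bbox}")
    return bbox


demo_objs = [
    {"name": "inside", "x0": 10, "top": 10, "x1": 20, "bottom": 20},
    {"name": "outside", "x0": 60, "top": 60, "x1": 70, "bottom": 70},
    {"name": "straddling", "x0": 40, "top": 40, "x1": 60, "bottom": 60},
]
demo_bbox = (0, 0, 50, 50)
print([o["name"] for o in demo_objs if is_within(o, demo_bbox)])
print([o["name"] for o in demo_objs if is_outside(o, demo_bbox)])
```

In this sketch, passing `strict=False` simply skips the containment check, which is useful when a bounding box slightly overshoots the page.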

Fixed

Fix PageImage conversions for PDFs with cmyk colorspaces; convert them to rgb earlier in the process. (28330da)

18 Jul 14:50

jsvine

v0.7.3

f9c1f61

v0.7.3

Fixed

Quick fix for transparency issue in visual debugging mode. (b98dd7c)

18 Jul 03:12

jsvine

v0.7.2

12daa3b

v0.7.2

Added

Add split_at_punctuation parameter to .extract_words(...) and .extract_text(...). (#682) [h/t @lolipopshock]
Add README.md link to @hbh112233abc's Chinese translation of README.md. (#674)

Changed

Change .to_image(...)'s approach, preferring to composite with a white background instead of removing the alpha channel. (1cd1f9a)
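
The difference between dropping the alpha channel and compositing over white can be shown per pixel. This is a pure-Python sketch of the arithmetic only, assuming 8-bit RGBA tuples; the function names are hypothetical and the library performs this step through its imaging backend rather than code like this.

```python
def drop_alpha(pixel):
    """Discard the alpha channel and keep the raw RGB values."""
    r, g, b, _a = pixel
    return (r, g, b)


def composite_over_white(pixel):
    """Blend an 8-bit RGBA pixel over a white background."""
    r, g, b, a = pixel

    def blend(channel):
        # Integer form of: channel * alpha + 255 * (1 - alpha), rounded.
        return (channel * a + 255 * (255 - a) + 127) // 255

    return (blend(r), blend(g), blend(b))


transparent_black = (0, 0, 0, 0)
print(drop_alpha(transparent_black), composite_over_white(transparent_black))
```

A fully transparent pixel often stores black RGB values, so dropping alpha paints it black, while compositing shows the white page behind it.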

Fixed

Fix bug in LayoutEngine.calculate(...) when processing char objects with len>1 representations, such as ligatures. (#683)

Contributors

hbh112233abc and lolipopshock

27 May 18:59

jsvine

v0.7.0

cec6e0f

v0.7.0

Added

Add "matrix" property to char objects, representing the current transformation matrix. (ae6f99e)
Add pdfplumber.ctm submodule with class CTM, to calculate scale, skew, and translation of a current transformation matrix obtained from a char's "matrix" property. (ae6f99e)
Add page.search(...), an experimental feature that allows you to search a page's text via regular expressions and non-regex strings, returning the text, any regex matches, the bounding box coordinates, and the char objects themselves. (#201 + 58b1ab1)
Add --include-attrs/--exclude-attrs to CLI (and corresponding params to .to_json(...), .to_csv(...), and Serializer). (4deac25)
Add py.typed for PEP 561 compatibility and detection of typing hints by mypy. (ca795d1) [h/t @jhonatan-lopes]
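
For readers unfamiliar with transformation matrices, the sketch below shows one conventional way to read scale, skew, and translation out of a six-value matrix `(a, b, c, d, e, f)`. The class `CTMSketch` and its property names are assumptions made for illustration; consult the pdfplumber.ctm module itself for the actual interface.

```python
import math
from typing import NamedTuple


class CTMSketch(NamedTuple):
    """Hypothetical stand-in for a matrix taken from a char's "matrix" property."""

    a: float
    b: float
    c: float
    d: float
    e: float
    f: float

    @property
    def scale_x(self) -> float:
        return math.hypot(self.a, self.b)

    @property
    def scale_y(self) -> float:
        return math.hypot(self.c, self.d)

    @property
    def skew_x(self) -> float:
        # Angle of the y-axis basis vector, relative to vertical, in degrees.
        return math.degrees(math.atan2(self.d, self.c)) - 90

    @property
    def skew_y(self) -> float:
        # Angle of the x-axis basis vector, relative to horizontal, in degrees.
        return math.degrees(math.atan2(self.b, self.a))

    @property
    def translation(self):
        return (self.e, self.f)


demo_ctm = CTMSketch(2, 0, 0, 3, 10, 20)
print(demo_ctm.scale_x, demo_ctm.scale_y, demo_ctm.translation)
```

A pure scaling matrix like the one above has zero skew on both axes, while a rotated char would show equal nonzero skew values.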

Changed

Bump pinned pdfminer.six version to 20220524. (486cea8)

Removed

Remove utils.collate_chars(...), the old name (and then alias) for utils.extract_text(...). (24f3532)

Fixed

Fix IndexError bug for .extract_text(layout=True) on pages without text. (#658 + ad3df11) [h/t @ethanscorey]

Contributors

ethanscorey and jhonatan-lopes

06 May 18:08

jsvine

v0.6.2

3191c25

v0.6.2

The main news about this version is that it introduces type annotations, and enforces them via mypy --strict. It also fills in the few remaining gaps in the library's test coverage (although all parts of the library could still use stronger tests). See CHANGELOG.md for details.
