Releases: jsvine/pdfplumber

v0.10.0

16 Jul 22:37
00386ad

Changed

  • Normalize color representation to tuple[float|int, ...] (#917). (57d51bb)
  • Replace Wand with pypdfium2 for page.to_image(...). (b049373)

Added

  • Add pdfplumber.repair(...) and .open(repair=True) (#824). (db6ae97)
  • Add Page.find_table(...) (#873). (3772af6)
  • Add quantize=True, colors=256, bits=8 arguments/defaults to PageImage.save(...). (b049373)
  • Extract and handle patterns + (some) color spaces. (97ca4b0)
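
A minimal usage sketch of two of the additions above, `.open(repair=True)` and `Page.find_table(...)`. The function name `extract_largest_table` and the choice to look only at the first page are assumptions for illustration. The repair step relies on an external Ghostscript installation, so `repair` defaults to `False` here.

```python
from typing import List, Optional


def extract_largest_table(path: str, repair: bool = False) -> Optional[List[list]]:
    """Return the rows of the table found on the first page, or None.

    Hypothetical helper; `path` is any PDF file you supply.
    """
    import pdfplumber  # imported lazily; needs pdfplumber >= 0.10.0

    with pdfplumber.open(path, repair=repair) as pdf:
        page = pdf.pages[0]
        # find_table() returns a single Table object, or None if no table is found.
        table = page.find_table()
        return table.extract() if table is not None else None
```

A call such as `extract_largest_table("example.pdf")` (with a hypothetical file name) would return a list of rows, each row a list of cell strings or `None` values.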

Removed

Fixed

  • Fix bug for re-crops that use relative=True (#914). (0de6da9)
  • Handle use_text_flow more consistently (#912). (b1db5b8)

v0.9.0

13 Apr 12:58
255eaac

Changed

  • Make word segmentation (via WordExtractor.char_begins_new_word(...)) more explicit and rigorous; should help in catching edge-cases in the future. (6acd580 + ebb93ea + #840)
  • Use curve_edge objects (instead of just line and rect_edge objects) in default table-detection strategy. (6f6b465 + #858)
  • By default, expand ligatures into their constituent letters (e.g., ﬃ to ffi), and add the expand_ligatures boolean parameter to text-extraction methods. (86e935d + #598)
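
To make the ligature change concrete, here is a small sketch of what expanding ligatures means: each single-codepoint ligature is replaced by its letters. The mapping and the name `expand_ligatures_sketch` are assumptions for illustration; the set of ligatures pdfplumber actually handles may differ.

```python
# Unicode presentation-form ligatures and the letters they stand for.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb06": "st",
}


def expand_ligatures_sketch(text: str) -> str:
    """Replace each known ligature character with its letters (hypothetical helper)."""
    return "".join(LIGATURES.get(ch, ch) for ch in text)


print(expand_ligatures_sketch("o\ufb03ce"))  # prints "office"
```

Expanded text is easier to search and compare, since a query for "office" would otherwise miss a word stored with a single ligature character.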

Added

  • Add Page.extract_text_lines(...) method. (4b37397 + #852)
  • Add main_group, return_groups, return_chars parameters to Page.search(...). (4b37397)
  • Add .curve_edges property to PDF and Page. (6f6b465)

Fixed

  • Fix handling of bytes-typed fontnames. (9441ff7 + #461 + #842)
  • Fix handling of whitespace-only and empty results of Page.search(...). (6f6b465 + #853)

v0.8.0

14 Feb 03:05

Changed

  • Change the (still experimental) Page/utils.extract_text(layout=True) approach so that it pads, to the degree necessary, the ends of lines with spaces and the end of the text with blank lines to achieve better mimicry of page layout. (d3662de)
  • Refactor handling of pts attribute and, in doing so, deprecate the curve_obj["points"] attribute, and fix PageImage.draw_line(...)'s handling of diagonal lines. (216bedd)
  • Breaking change: In Page.extract_table[s](...), keep_blank_chars must now be passed as text_keep_blank_chars, for consistency's sake. (c4e1b29)
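
For code written against earlier versions, the breaking rename above can be handled with a small shim. This is a sketch under the assumption that table settings are passed around as a plain dict; `migrate_table_settings` is a hypothetical helper, not part of pdfplumber.

```python
from typing import Any, Dict


def migrate_table_settings(settings: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy with keep_blank_chars renamed to text_keep_blank_chars."""
    migrated = dict(settings)
    if "keep_blank_chars" in migrated:
        value = migrated.pop("keep_blank_chars")
        # Do not clobber a new-style key the caller already supplied.
        migrated.setdefault("text_keep_blank_chars", value)
    return migrated


old = {"vertical_strategy": "lines", "keep_blank_chars": True}
print(migrate_table_settings(old))
```

The returned dict can then be passed as the table settings to `Page.extract_table(...)` or `Page.extract_tables(...)`.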

Added

  • Add Page.extract_table[s](...) support for all Page.extract_text(...) keyword arguments. (c4e1b29)
  • Add height and width keyword arguments to Page.to_image(...). (#798 + 93f7dbd)
  • Add layout_width, layout_width_chars, layout_height, and layout_height_chars parameters to Page/utils.extract_text(layout=True). (d3662de)
  • Add CITATION.cff. (#755) [h/t @joaoccruz]
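
The padding behavior described for extract_text(layout=True) in this release can be pictured with the sketch below: lines are right-padded with spaces to a fixed width, and blank lines are appended until a fixed height is reached. The function `pad_layout_sketch` and its parameters `width_chars` and `height_chars` are hypothetical; they only loosely mirror the character-based layout parameters and are not pdfplumber's implementation.

```python
from typing import List


def pad_layout_sketch(lines: List[str], width_chars: int, height_chars: int) -> str:
    """Pad text to a width_chars x height_chars grid of characters."""
    # Pad each line on the right with spaces up to the target width.
    padded = [line.ljust(width_chars) for line in lines]
    # Append space-filled lines until the target height is reached.
    while len(padded) < height_chars:
        padded.append(" " * width_chars)
    return "\n".join(padded)


text = pad_layout_sketch(["Title", "  body"], width_chars=8, height_chars=3)
print(repr(text))
```

Every line of the result has the same length, so the text block keeps a rectangular shape that approximates the page.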

Fixed

  • Fix simple edge-case for when page rotation is (incorrectly) set to None. (#811) [h/t @toshi1127]

Development Changes

  • Convert utils.py into utils/ submodules. Retains same interface, just an improvement in organization. (6351d97)
  • Fix typing hints to include io.BytesIO. (d4107f6) [h/t @conitrade-as]
  • Refactor text-extraction utilities, paving way for better consistency across various entrypoints to text extraction (e.g., via utils.extract_text(...), via Page.extract_text(...), via Page.extract_table(...)). (3424b57)

v0.7.6

22 Nov 18:03

Changed

  • Bump pinned pdfminer.six version to 20221105. (e63a038)

Fixed

Development Changes

  • Upgrade nbexec development requirement from 0.1.0 to 0.2.0. (30dac25)

v0.7.5

01 Oct 13:50

Added

  • Add PageImage.show() as alias for PageImage.annotated.show(). (#715 + 5c7787b)

Fixed

  • Fixed issue where py.typed file was not included in PyPI distribution. (#698 + #703 + 6908487) [h/t @jhonatan-lopes]
  • Reinstated the ability to call utils.cluster_objects(...) with any hashable value (str, int, tuple, etc.) as the key_fn parameter, reverting breaking change in 58b1ab1. (#691 + 1e97656) [h/t @jfuruness]
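
The key_fn behavior reinstated above can be sketched as follows: a non-callable key is turned into an item lookup, so a plain key such as "top" and an equivalent function give the same clusters. The clustering rule here (sort, then start a new cluster whenever the gap to the previous value exceeds the tolerance) is a simplification assumed for illustration, and `cluster_objects_sketch` is a hypothetical name.

```python
from operator import itemgetter
from typing import Any, List


def cluster_objects_sketch(objs: List[dict], key_fn: Any, tolerance: float) -> List[List[dict]]:
    """Group objects whose key values lie within `tolerance` of their neighbor."""
    # Accept either a callable or a plain hashable key such as "top".
    getter = key_fn if callable(key_fn) else itemgetter(key_fn)
    clusters: List[List[dict]] = []
    last_value = None
    for obj in sorted(objs, key=getter):
        value = getter(obj)
        if last_value is not None and value - last_value <= tolerance:
            clusters[-1].append(obj)
        else:
            clusters.append([obj])
        last_value = value
    return clusters


chars = [{"top": 10.0}, {"top": 10.4}, {"top": 30.0}]
by_key = cluster_objects_sketch(chars, "top", tolerance=1)
by_fn = cluster_objects_sketch(chars, lambda c: c["top"], tolerance=1)
print(by_key == by_fn)  # prints True
```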

Development Changes

  • Update Wand version in requirements.txt from >=0.6.7 to >=0.6.10. (#713 + 3457d79)

v0.7.4

20 Jul 22:38

Added

  • Add utils.outside_bbox(...) and Page.outside_bbox(...) methods, which are the inverse of utils.within_bbox(...) and Page.within_bbox(...). (#369 + 3ab1cc4)
  • Add strict=True/False parameter to Page.crop(...), Page.within_bbox(...), and Page.outside_bbox(...); default is True, while False bypasses the test_proposed_bbox(...) check. (#421 + 71ad60f)
  • Add more guidance to exception when .to_image(...) raises PIL.Image.DecompressionBombError. (#413 + b6ff9e8)
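
To illustrate the relationship between the within_bbox and outside_bbox filters added above, here is a simplified sketch over plain dicts with x0, top, x1, and bottom keys. It assumes "within" means entirely inside the box and "outside" means no overlap at all, so an object straddling the boundary lands in neither result. The function names and the treatment of edge-touching objects are assumptions, not pdfplumber's code.

```python
from typing import Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom)


def within_bbox_sketch(objs: List[Dict], bbox: BBox) -> List[Dict]:
    """Keep objects that lie entirely inside bbox."""
    x0, top, x1, bottom = bbox
    return [
        o for o in objs
        if o["x0"] >= x0 and o["top"] >= top and o["x1"] <= x1 and o["bottom"] <= bottom
    ]


def outside_bbox_sketch(objs: List[Dict], bbox: BBox) -> List[Dict]:
    """Keep objects that do not overlap bbox at all."""
    x0, top, x1, bottom = bbox
    return [
        o for o in objs
        if o["x1"] <= x0 or o["x0"] >= x1 or o["bottom"] <= top or o["top"] >= bottom
    ]


objs = [
    {"name": "inside", "x0": 10, "top": 10, "x1": 20, "bottom": 20},
    {"name": "outside", "x0": 200, "top": 200, "x1": 210, "bottom": 210},
    {"name": "straddling", "x0": 40, "top": 40, "x1": 60, "bottom": 60},
]
box = (0, 0, 50, 50)
print([o["name"] for o in within_bbox_sketch(objs, box)])   # prints ['inside']
print([o["name"] for o in outside_bbox_sketch(objs, box)])  # prints ['outside']
```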

Fixed

  • Fix PageImage conversions for PDFs with cmyk colorspaces; convert them to rgb earlier in the process. (28330da)

v0.7.3

18 Jul 14:50

Fixed

  • Quick fix for transparency issue in visual debugging mode. (b98dd7c)

v0.7.2

18 Jul 03:12

Added

Changed

  • Change .to_image(...)'s approach, preferring to composite with a white background instead of removing the alpha channel. (1cd1f9a)
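
The difference between the two approaches named above can be seen on a single pixel. The sketch below uses plain integer arithmetic on 8-bit RGBA values rather than an imaging library; `composite_over_white` and `drop_alpha` are hypothetical helpers that illustrate the general technique, not pdfplumber's code.

```python
from typing import Tuple

RGBA = Tuple[int, int, int, int]
RGB = Tuple[int, int, int]


def drop_alpha(rgba: RGBA) -> RGB:
    """Discard the alpha channel, keeping the raw color components."""
    return rgba[:3]


def composite_over_white(rgba: RGBA) -> RGB:
    """Blend one 8-bit RGBA pixel over an opaque white background."""
    r, g, b, a = rgba

    def blend(c: int) -> int:
        # Standard "over" operator with a white (255) backdrop.
        return round((c * a + 255 * (255 - a)) / 255)

    return (blend(r), blend(g), blend(b))


transparent_black = (0, 0, 0, 0)
print(drop_alpha(transparent_black))            # prints (0, 0, 0)
print(composite_over_white(transparent_black))  # prints (255, 255, 255)
```

A fully transparent pixel whose stored color is black turns black when the alpha channel is simply dropped, but renders as white, as intended, when composited.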

Fixed

  • Fix bug in LayoutEngine.calculate(...) when processing char objects with len>1 representations, such as ligatures. (#683)

v0.7.0

27 May 18:59

Added

  • Add "matrix" property to char objects, representing the current transformation matrix. (ae6f99e)
  • Add pdfplumber.ctm submodule with class CTM, to calculate scale, skew, and translation of a current transformation matrix obtained from a char's "matrix" property. (ae6f99e)
  • Add page.search(...), an experimental feature that allows you to search a page's text via regular expressions and non-regex strings, returning the text, any regex matches, the bounding box coordinates, and the char objects themselves. (#201 + 58b1ab1)
  • Add --include-attrs/--exclude-attrs to CLI (and corresponding params to .to_json(...), .to_csv(...), and Serializer). (4deac25)
  • Add py.typed for PEP561 compatibility and detection of typing hints by mypy. (ca795d1) [h/t @jhonatan-lopes]
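
For readers unfamiliar with transformation matrices, the sketch below shows one common way to derive scale, skew, and translation from the six numbers of a 2D affine matrix, assumed here to be in the usual PDF order (a, b, c, d, e, f). The class `CTMSketch` is a hypothetical stand-in written for illustration; the attribute names and formulas of the real pdfplumber.ctm.CTM class may differ.

```python
import math
from typing import NamedTuple


class CTMSketch(NamedTuple):
    a: float
    b: float
    c: float
    d: float
    e: float
    f: float

    @property
    def scale_x(self) -> float:
        # Length of the transformed x unit vector (a, b).
        return math.hypot(self.a, self.b)

    @property
    def scale_y(self) -> float:
        # Length of the transformed y unit vector (c, d).
        return math.hypot(self.c, self.d)

    @property
    def skew_x(self) -> float:
        # Angle of the y axis away from vertical, in degrees.
        return math.degrees(math.atan2(self.d, self.c)) - 90

    @property
    def skew_y(self) -> float:
        # Angle of the x axis away from horizontal, in degrees.
        return math.degrees(math.atan2(self.b, self.a))

    @property
    def translation_x(self) -> float:
        return self.e

    @property
    def translation_y(self) -> float:
        return self.f


m = CTMSketch(2, 0, 0, 3, 10, 20)
print(m.scale_x, m.scale_y, m.translation_x, m.translation_y)
```

For this example matrix, which scales without rotating or shearing, both skew angles come out at (approximately) zero.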

Changed

  • Bump pinned pdfminer.six version to 20220524. (486cea8)

Removed

  • Remove utils.collate_chars(...), the old name (and then alias) for utils.extract_text(...). (24f3532)

Fixed

v0.6.2

06 May 18:08

The main news about this version is that it introduces type annotations, and enforces them via mypy --strict. It also fills in the few remaining gaps in the library's test coverage (although all parts of the library could still use stronger tests). See CHANGELOG.md for details.