Releases: jsvine/pdfplumber
v0.10.0
Changed
- Normalize color representation to `tuple[float|int, ...]` (#917). (57d51bb)
- Replace Wand with pypdfium2 for `page.to_image(...)`. (b049373)
Added
- Add `pdfplumber.repair(...)` and `.open(repair=True)` (#824). (db6ae97)
- Add `Page.find_table(...)` (#873). (3772af6)
- Add `quantize=True`, `colors=256`, `bits=8` arguments/defaults to `PageImage.save(...)`. (b049373)
- Extract and handle patterns + (some) color spaces. (97ca4b0)
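
As a rough illustration of the two main additions above, the sketch below wraps `pdfplumber.open(..., repair=True)` and `Page.find_table(...)`. The helper names `open_with_repair` and `largest_table_rows` are hypothetical, and the sketch assumes `find_table` returns either a table object with an `.extract()` method or `None`. Repair is understood to depend on an external Ghostscript install, so treat that step as environment-dependent.

```python
from typing import Any, List, Optional


def open_with_repair(path: str) -> Any:
    """Open a PDF, asking pdfplumber to repair it first (v0.10.0+)."""
    # Imported lazily so the other helper can be used without pdfplumber.
    import pdfplumber

    return pdfplumber.open(path, repair=True)


def largest_table_rows(page: Any, table_settings: Optional[dict] = None) -> List[list]:
    """Return the rows of the table found by Page.find_table(...), or []."""
    table = page.find_table(table_settings or {})
    # Assumption: no table on the page is signalled by None.
    return table.extract() if table is not None else []
```

A typical call would be `largest_table_rows(open_with_repair("example.pdf").pages[0])`, where `example.pdf` is a placeholder path.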
Removed
- Remove support for Python 3.7 (EOL'ed June 2023). (c9d24d5)
- Remove vestigial 'font' and 'name' properties from PDF objects. (6d62054)
Fixed
v0.9.0
Changed
- Make word segmentation (via `WordExtractor.char_begins_new_word(...)`) more explicit and rigorous; should help in catching edge-cases in the future. (6acd580 + ebb93ea + #840)
- Use `curve_edge` objects (instead of just `line` and `rect_edge` objects) in default table-detection strategy. (6f6b465 + #858)
- By default, expand ligatures into their constituent letters (e.g., `ﬃ` to `ffi`), and add the `expand_ligatures` boolean parameter to text-extraction methods. (86e935d + #598)
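
To make the ligature change concrete, here is a minimal sketch of what "expanding" means. The mapping and the `expand_ligatures` function below are illustrative only; they are not pdfplumber's own table or implementation, and they cover only a small subset of Latin ligatures.

```python
# Illustrative subset of single-codepoint Latin ligatures.
LIGATURES = {
    "\ufb00": "ff",   # ﬀ
    "\ufb01": "fi",   # ﬁ
    "\ufb02": "fl",   # ﬂ
    "\ufb03": "ffi",  # ﬃ
    "\ufb04": "ffl",  # ﬄ
}


def expand_ligatures(text: str) -> str:
    """Replace each known ligature character with its constituent letters."""
    return "".join(LIGATURES.get(ch, ch) for ch in text)
```

With the library itself, passing `expand_ligatures=False` to a text-extraction method should keep the original ligature characters.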
Added
- Add `Page.extract_text_lines(...)` method. (4b37397 + #852)
- Add `main_group`, `return_groups`, `return_chars` parameters to `Page.search(...)`. (4b37397)
- Add `.curve_edges` property to `PDF` and `Page`. (6f6b465)
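
The new `Page.search(...)` parameters might be used along these lines. This is a sketch under assumptions: the helper `find_totals` and the pattern are hypothetical, and it assumes each match is a dictionary with `text`, `groups`, and bounding-box keys (`x0`, `top`, `x1`, `bottom`), with `main_group` selecting which capture group supplies the text and position.

```python
from typing import Any, Dict, List


def find_totals(page: Any) -> List[Dict[str, Any]]:
    """Search a page for 'Total: <digits>' and summarize each match."""
    matches = page.search(
        r"Total: (\d+)",
        regex=True,
        main_group=1,        # report text/position of the first capture group
        return_groups=True,  # include the regex groups in each match
        return_chars=False,  # skip the (potentially large) char lists
    )
    return [
        {
            "text": m["text"],
            "groups": m["groups"],
            "bbox": (m["x0"], m["top"], m["x1"], m["bottom"]),
        }
        for m in matches
    ]
```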
Fixed
v0.8.0
Changed
- Change the (still experimental) `Page/utils.extract_text(layout=True)` approach so that it pads, to the degree necessary, the ends of lines with spaces and the end of the text with blank lines to achieve better mimicry of page layout. (d3662de)
- Refactor handling of `pts` attribute and, in doing so, deprecate the `curve_obj["points"]` attribute, and fix `PageImage.draw_line(...)`'s handling of diagonal lines. (216bedd)
- Breaking change: In `Page.extract_table[s](...)`, `keep_blank_chars` must now be passed as `text_keep_blank_chars`, for consistency's sake. (c4e1b29)
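
For code written against earlier versions, the breaking rename can be handled with a small shim. The function `migrate_table_settings` below is hypothetical, and it assumes the option lives in the settings dictionary handed to `Page.extract_table[s](...)`.

```python
def migrate_table_settings(settings: dict) -> dict:
    """Return a copy of `settings` with keep_blank_chars renamed for v0.8.0+."""
    migrated = dict(settings)  # leave the caller's dictionary untouched
    if "keep_blank_chars" in migrated:
        old_value = migrated.pop("keep_blank_chars")
        # If the new key is already present, it takes precedence.
        migrated.setdefault("text_keep_blank_chars", old_value)
    return migrated
```

The result can then be passed on, for example as `page.extract_table(migrate_table_settings(old_settings))`.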
Added
- Add `Page.extract_table[s](...)` support for all `Page.extract_text(...)` keyword arguments. (c4e1b29)
- Add `height` and `width` keyword arguments to `Page.to_image(...)`. (#798 + 93f7dbd)
- Add `layout_width`, `layout_width_chars`, `layout_height`, and `layout_height_chars` parameters to `Page/utils.extract_text(layout=True)`. (d3662de)
- Add CITATION.cff. (#755) [h/t @joaoccruz]
Fixed
- Fix simple edge-case for when page rotation is (incorrectly) set to `None`. (#811) [h/t @toshi1127]
Development Changes
- Convert `utils.py` into `utils/` submodules. Retains same interface, just an improvement in organization. (6351d97)
- Fix typing hints to include `io.BytesIO`. (d4107f6) [h/t @conitrade-as]
- Refactor text-extraction utilities, paving way for better consistency across various entrypoints to text extraction (e.g., via `utils.extract_text(...)`, via `Page.extract_text(...)`, via `Page.extract_table(...)`). (3424b57)
v0.7.6
Changed
- Bump pinned `pdfminer.six` version to `20221105`. (e63a038)
Fixed
- Restore `text` attribute to `.textboxhorizontal`/etc., regression introduced in `9587cc7`/`v0.6.2`. (8a0c126)
- Fix `lru_cache` usage, which is discouraged for class methods due to garbage-collection issues. (e3142a0)
Development Changes
- Upgrade `nbexec` development requirement from `0.1.0` to `0.2.0`. (30dac25)
v0.7.5
Added
Fixed
- Fixed issue where `py.typed` file was not included in PyPI distribution. (#698 + #703 + 6908487) [h/t @jhonatan-lopes]
- Reinstated the ability to call `utils.cluster_objects(...)` with any hashable value (`str`, `int`, `tuple`, etc.) as the `key_fn` parameter, reverting breaking change in 58b1ab1. (#691 + 1e97656) [h/t @jfuruness]
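
The "callable or hashable key" behavior described above can be sketched in plain Python. The function `cluster_by` is a simplified stand-in written for illustration, not `utils.cluster_objects(...)` itself. It assumes objects are dictionaries (or other subscriptable values) and groups consecutive sorted values whose gap is within `tolerance`.

```python
from operator import itemgetter
from typing import Any, Callable, Hashable, List, Union


def cluster_by(
    objs: List[Any],
    key_fn: Union[Hashable, Callable[[Any], float]],
    tolerance: float,
) -> List[List[Any]]:
    """Group objects whose key values lie within `tolerance` of a neighbor."""
    # A non-callable key (e.g. "top") is treated as an item lookup.
    get = key_fn if callable(key_fn) else itemgetter(key_fn)
    clusters: List[List[Any]] = []
    for obj in sorted(objs, key=get):
        if clusters and get(obj) - get(clusters[-1][-1]) <= tolerance:
            clusters[-1].append(obj)
        else:
            clusters.append([obj])
    return clusters
```

Both `cluster_by(chars, "top", 1)` and `cluster_by(chars, lambda c: c["top"], 1)` give the same grouping in this sketch.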
Development Changes
v0.7.4
Added
- Add `utils.outside_bbox(...)` and `Page.outside_bbox(...)` methods, which are the inverse of `utils.within_bbox(...)` and `Page.within_bbox(...)`. (#369 + 3ab1cc4)
- Add `strict=True/False` parameter to `Page.crop(...)`, `Page.within_bbox(...)`, and `Page.outside_bbox(...)`; default is `True`, while `False` bypasses the `test_proposed_bbox(...)` check. (#421 + 71ad60f)
- Add more guidance to exception when `.to_image(...)` raises `PIL.Image.DecompressionBombError`. (#413 + b6ff9e8)
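
The relationship between the two filters can be illustrated with a small pure-Python sketch. The functions `within` and `outside` below are hypothetical stand-ins, not the library's code. They assume objects carry `x0`, `top`, `x1`, and `bottom` keys, that "within" means entirely inside the box, that "outside" means no overlap with it, and that objects merely touching an edge count as outside.

```python
from typing import Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom)
Obj = Dict[str, float]


def _overlaps(obj: Obj, bbox: BBox) -> bool:
    """True if the object's box and `bbox` share any area."""
    x0, top, x1, bottom = bbox
    return (
        min(obj["x1"], x1) > max(obj["x0"], x0)
        and min(obj["bottom"], bottom) > max(obj["top"], top)
    )


def within(objs: List[Obj], bbox: BBox) -> List[Obj]:
    """Keep objects that lie entirely inside `bbox`."""
    x0, top, x1, bottom = bbox
    return [
        o for o in objs
        if o["x0"] >= x0 and o["top"] >= top and o["x1"] <= x1 and o["bottom"] <= bottom
    ]


def outside(objs: List[Obj], bbox: BBox) -> List[Obj]:
    """Keep objects that do not overlap `bbox` at all."""
    return [o for o in objs if not _overlaps(o, bbox)]
```

Under these assumptions, an object that straddles the box boundary is returned by neither function.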
Fixed
- Fix `PageImage` conversions for PDFs with `cmyk` colorspaces; convert them to `rgb` earlier in the process. (28330da)
v0.7.3
v0.7.2
Added
- Add `split_at_punctuation` parameter to `.extract_words(...)` and `.extract_text(...)`. (#682) [h/t @lolipopshock]
- Add README.md link to @hbh112233abc's Chinese translation of README.md. (#674)
Changed
- Change `.to_image(...)`'s approach, preferring to composite with a white background instead of removing the alpha channel. (1cd1f9a)
Fixed
- Fix bug in `LayoutEngine.calculate(...)` when processing char objects with len>1 representations, such as ligatures. (#683)
v0.7.0
Added
- Add `"matrix"` property to `char` objects, representing the current transformation matrix. (ae6f99e)
- Add `pdfplumber.ctm` submodule with class `CTM`, to calculate scale, skew, and translation of a current transformation matrix obtained from a `char`'s `"matrix"` property. (ae6f99e)
- Add `page.search(...)`, an experimental feature that allows you to search a page's text via regular expressions and non-regex strings, returning the text, any regex matches, the bounding box coordinates, and the char objects themselves. (#201 + 58b1ab1)
- Add `--include-attrs`/`--exclude-attrs` to CLI (and corresponding params to `.to_json(...)`, `.to_csv(...)`, and `Serializer`). (4deac25)
- Add `py.typed` for PEP561 compatibility and detection of typing hints by mypy. (ca795d1) [h/t @jhonatan-lopes]
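
To show what "scale, skew, and translation" of a transformation matrix can mean, here is a sketch that decomposes a six-value matrix `(a, b, c, d, e, f)` with the standard library. The `MatrixInfo` type and `describe_matrix` function are hypothetical, and the formulas are one conventional decomposition; the library's `CTM` class may use different names or sign conventions.

```python
import math
from typing import NamedTuple, Sequence


class MatrixInfo(NamedTuple):
    scale_x: float
    scale_y: float
    skew_x: float  # degrees
    skew_y: float  # degrees
    translation_x: float
    translation_y: float


def describe_matrix(matrix: Sequence[float]) -> MatrixInfo:
    """Decompose a PDF-style (a, b, c, d, e, f) matrix into readable parts."""
    a, b, c, d, e, f = matrix
    return MatrixInfo(
        scale_x=math.hypot(a, b),
        scale_y=math.hypot(c, d),
        skew_x=math.degrees(math.atan2(d, c)) - 90,
        skew_y=math.degrees(math.atan2(b, a)),
        translation_x=e,
        translation_y=f,
    )
```

In practice the input would come from a char's `"matrix"` value, for example `describe_matrix(page.chars[0]["matrix"])`.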
Changed
- Bump pinned `pdfminer.six` version to `20220524`. (486cea8)
Removed
- Remove `utils.collate_chars(...)`, the old name (and then alias) for `utils.extract_text(...)`. (24f3532)
Fixed
- Fix `IndexError` bug for `.extract_text(layout=True)` on pages without text. (#658 + ad3df11) [h/t @ethanscorey]
v0.6.2
The main news about this version is that it introduces type annotations, and enforces them via `mypy --strict`. It also fills in the few remaining gaps in the library's test coverage (although all parts of the library could still use stronger tests). See CHANGELOG.md for details.