v0.10.0 #936
Fixed a small typo
Update README.md
As noted in #912, `use_text_flow` was not being handled consistently, as characters and words were being re-sorted without checking first if this parameter was set to `True`.
The main edge case was `use_text_flow` on text lines that then backtracked. This rewrite also aims to make the logic more explicit and easier to follow.
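To illustrate the idea (not pdfplumber's actual code), here is a minimal sketch of checking the flag before re-sorting. The helper name `order_chars` and the sort key are assumptions made for this example; only the `use_text_flow` parameter comes from the commit message.

```python
from operator import itemgetter


def order_chars(chars, use_text_flow=False):
    """Return chars in processing order (hypothetical helper, for illustration)."""
    if use_text_flow:
        # Trust the order in which the PDF emitted the characters,
        # even when a line backtracks to a smaller x0.
        return list(chars)
    # Otherwise fall back to a positional sort: top-to-bottom, then left-to-right.
    return sorted(chars, key=itemgetter("top", "x0"))


# A two-character "line" that backtracks: "b" is emitted first but sits to the right.
chars = [
    {"text": "b", "x0": 20, "top": 10},
    {"text": "a", "x0": 10, "top": 10},
]

flow_text = "".join(c["text"] for c in order_chars(chars, use_text_flow=True))
sorted_text = "".join(c["text"] for c in order_chars(chars))
```

With the flag set, the stream order is preserved; without it, the characters are re-sorted by position.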
When using relative=True for a re-crop, pdfplumber was passing the wrong bounding box to the cropping function. This commit fixes that bug and also refactors CroppedPage.__init__(...) for clarity and consistency's sake.
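As a rough sketch of what a relative re-crop has to get right, the example below translates a bounding box given relative to a parent crop into absolute page coordinates. The function name `to_absolute_bbox` is hypothetical, and the `(x0, top, x1, bottom)` ordering is assumed to match pdfplumber's bounding-box convention; this is not the library's implementation.

```python
def to_absolute_bbox(relative_bbox, parent_bbox):
    """Shift a bbox expressed relative to parent_bbox's top-left corner
    into absolute coordinates (hypothetical helper, for illustration)."""
    x0, top, x1, bottom = relative_bbox
    parent_x0, parent_top, _, _ = parent_bbox
    return (x0 + parent_x0, top + parent_top, x1 + parent_x0, bottom + parent_top)


# First crop, expressed relative to a parent region of the page.
parent = (100, 200, 400, 600)
first_crop = to_absolute_bbox((10, 20, 110, 120), parent)

# Re-crop: the offsets must come from the first crop's bbox,
# not from the original parent's bbox.
second_crop = to_absolute_bbox((5, 5, 50, 50), first_crop)
```

The point of the second call is that a re-crop is relative to the already-cropped region, which is the kind of mix-up the commit describes.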
This commit normalizes the type representation of `stroking_color` and `non_stroking_color` values. Thanks to @dhdaines for pointing out this inconsistency. Previously, `pdfplumber` passed along `pdfminer.six`'s colors without normalization. Due to quirks in `pdfminer.six`'s color handling, this meant that those values could be floats, ints, lists, or tuples. This commit normalizes all color values (when non-None) into n-tuples, where (val,) represents grayscale colors, (val, val, val) represents RGB, and (val, val, val, val) represents CMYK colors. This should solve the consistency issue, although it might cause breaking changes to code that filters for non-tuple values, e.g., `[c for c in page.chars if c["non_stroking_color"] == [1, 0, 0]]`. Although breaking changes are unpleasant, I think the tradeoff for longer-term consistency is worth it.
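The normalization rule described above can be sketched in a few lines. This is an illustrative version under the assumption that inputs are `None`, a single number, or a list/tuple of numbers; the function name `normalize_color` is chosen for the example and need not match pdfplumber's internals.

```python
from numbers import Number


def normalize_color(color):
    """Coerce a color value into a tuple of numbers, or None (illustrative sketch)."""
    if color is None:
        return None
    if isinstance(color, Number):
        # A bare number is treated as a grayscale value.
        return (color,)
    # Lists and tuples become tuples of the same length.
    return tuple(color)


gray = normalize_color(0.5)
rgb = normalize_color([1, 0, 0])
cmyk = normalize_color((0, 0, 0, 1))

# After normalization, filters should compare against tuples rather than lists.
chars = [
    {"text": "x", "non_stroking_color": normalize_color([1, 0, 0])},
    {"text": "y", "non_stroking_color": normalize_color(0)},
]
red_chars = [c for c in chars if c["non_stroking_color"] == (1, 0, 0)]
```

Code that previously compared against `[1, 0, 0]` would need to compare against `(1, 0, 0)` instead, since a list never equals a tuple in Python.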
Previously, `pdfplumber.Page` had these table-getting methods:
- `.find_tables(...)`
- `.extract_tables(...)`
- `.extract_table(...)`

For consistency/completeness's sake, this commit adds:
- `.find_table(...)`

... which, analogous to `.extract_table(...)`, returns the largest table on the page. Indeed, `.extract_table(...)` now uses `.find_table(...)` beneath the hood.

Thanks to @pdille for the suggestion, here: #864 (reply in thread)
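For a sense of what "largest table" selection might look like, here is a small self-contained sketch. It assumes that "largest" means the most cells, with ties broken by position (topmost, then leftmost); that criterion, the `FakeTable` stand-in, and the `largest_table` helper are all assumptions for illustration rather than pdfplumber's confirmed behavior.

```python
from dataclasses import dataclass


@dataclass
class FakeTable:
    """Stand-in for a table object with a bbox and a list of cells."""
    bbox: tuple  # (x0, top, x1, bottom)
    cells: list


def largest_table(tables):
    """Pick the table with the most cells; break ties by top, then x0.
    Returns None when there are no tables."""
    if not tables:
        return None
    return min(tables, key=lambda t: (-len(t.cells), t.bbox[1], t.bbox[0]))


small = FakeTable(bbox=(0, 0, 50, 50), cells=[None] * 4)
big = FakeTable(bbox=(0, 100, 200, 300), cells=[None] * 12)
chosen = largest_table([small, big])
```

Returning `None` for a page without tables keeps the single-table helper easy to call without a separate emptiness check.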
Inspired by #828.

The PDF reference allows for "colors" to be defined as a series of numbers and/or (much less commonly) patterns. (See p. 288 and section 4.6 here: https://ghostscript.com/~robin/pdf_reference17.pdf)

This commit separates out the pattern component of colors into their own attributes, `stroking_pattern` and `non_stroking_pattern`, so that they don't muddle the interpretation of standard colors' tuple-of-numbers representation.

This commit also adds code that attempts to fetch the `ncs`/`scs` color space of each object. Due to current limitations of pdfminer.six, however, the only such color space immediately available is the `ncs` (non-stroking color space) property of char objects.
This commit swaps out Wand (and its non-Python dependencies ImageMagick and Ghostscript) for pypdfium2 for PageImage rendering. This has some advantages:
- Less finicky: Wand often caused users problems, due to "MagickWand shared library not found" and "PolicyError: not authorized `PDF'" issues. By contrast, pypdfium2 seems (at least at first) to be more self-contained and not to require any system-tweaking.
- Faster: pypdfium2 appears to render images more quickly than Wand (see @cmdlineuser's tests in #899).
- More flexible: pypdfium2 appears to generate images with greater color depth. By default, pdfplumber quantizes those images so that they save/display compactly (in fact, with smaller file sizes than the previous code); this commit also adds parameters to retain all/more of the original, more detailed colors.

Thanks to @cmdlineuser in #899 for the suggestion.
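A hedged usage sketch of the quantization options follows. The `quantize`, `colors`, and `bits` arguments to `PageImage.save(...)` are the ones named in this release's changelog; the wrapper `save_preview`, its `full_color` flag, and the `resolution=150` value are illustrative choices. The function takes any pdfplumber-style page object, so it does not import pdfplumber itself.

```python
def save_preview(page, dest, full_color=False):
    """Render a page and save it, either quantized (compact) or with
    the renderer's full color depth (hypothetical wrapper)."""
    image = page.to_image(resolution=150)
    if full_color:
        # Skip quantization to keep the more detailed colors.
        image.save(dest, quantize=False)
    else:
        # The defaults listed in the changelog, spelled out explicitly.
        image.save(dest, quantize=True, colors=256, bits=8)
    return dest
```

Typical use would be something like `save_preview(pdf.pages[0], "page-1.png")` inside a `with pdfplumber.open(...) as pdf:` block, assuming pdfplumber and pypdfium2 are installed.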
This commit adds convenience methods to repair PDFs on the fly and/or to write repaired PDFs to disk. Currently, this does so via Ghostscript, using the method we've asked many users to try by following the instructions at https://superuser.com/questions/278562/how-can-i-fix-repair-a-corrupted-pdf-file. Now, hopefully, this saves folks a few steps.
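For readers curious what a Ghostscript-based repair looks like when done by hand, here is a minimal sketch. The options used (`-o`, `-sDEVICE=pdfwrite`, `-dPDFSETTINGS=/prepress`) are real Ghostscript options, but whether pdfplumber invokes Ghostscript with exactly these arguments is an assumption; the helper names `build_repair_command` and `repair_pdf` are hypothetical.

```python
import shutil
import subprocess


def build_repair_command(src, dst, gs_path="gs"):
    """Build a Ghostscript command that rewrites src into dst via pdfwrite."""
    return [
        gs_path,
        "-o", str(dst),              # output file (also implies batch mode)
        "-sDEVICE=pdfwrite",         # re-emit the document as a fresh PDF
        "-dPDFSETTINGS=/prepress",   # quality preset; an assumption for this sketch
        str(src),
    ]


def repair_pdf(src, dst, gs_path="gs"):
    """Run Ghostscript to write a repaired copy of src to dst."""
    if shutil.which(gs_path) is None:
        raise FileNotFoundError(f"Ghostscript executable not found: {gs_path}")
    subprocess.run(build_repair_command(src, dst, gs_path), check=True)
    return dst


command = build_repair_command("corrupted.pdf", "repaired.pdf")
```

With the convenience methods from this commit, the same effect should be reachable through `pdfplumber.repair(...)` or `pdfplumber.open(..., repair=True)` without assembling the command manually.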
Codecov Report
@@ Coverage Diff @@
## stable #936 +/- ##
=========================================
Coverage 100.00% 100.00%
=========================================
Files 17 18 +1
Lines 1532 1585 +53
=========================================
+ Hits 1532 1585 +53
[0.10.0] - 2023-07-16
Changed
- Normalize color values to `tuple[float|int, ...]` (#917). (57d51bb)
Added
- Add `pdfplumber.repair(...)` and `.open(repair=True)` (#824). (db6ae97)
- Add `quantize=True`, `colors=256`, `bits=8` arguments/defaults to `PageImage.save(...)`. (b049373)
Removed
Fixed
- Handle `use_text_flow` more consistently (#912). (b1db5b8)