Custom dedupe_chars char properties #1114
Comments
Thanks for the suggestion, @felix-hh. Are you able to share a version of the PDF redacted with https://github.com/JoshData/pdf-redactor? Or another PDF that demonstrates the same issue?
Hi @jsvine I made a good-faith attempt at redacting the pdf with the tool but the footer text is not redacted and can still be extracted. This is a problem because the footer identifies the data source which is proprietary. I also do not know how to reproduce the issue with my own pdf.
Thanks @felix-hh. For new features, I like/want to have unit tests for them, which requires a PDF demonstrating a failing example. Could you use a tool (e.g., Adobe Acrobat, Preview, etc.) to manually redact the footer text?
Following up on #71: I have run into duplicated characters that have the same text but different properties (i.e., fonts). Unfortunately, I can't share the file because it is private. I'd like to request an option to pass custom properties to the deduplication function (please let me know if this is already available!)
Here's a sketch of the proposed code changes:
pdfplumber/pdfplumber/utils/text.py
Line 789 in 147f2c4
The interfaces exposing this should also be updated.
The end result looks like
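To make the request concrete, here is a rough standalone sketch of deduplication with a configurable set of compared properties. It is not pdfplumber's implementation and not the patch proposed in this issue: the function name `dedupe_chars_sketch`, the `extra_attrs` parameter, and the sample character dicts are all hypothetical, and the matching is a simple pairwise tolerance check, assuming each char is a dict with `text`, `x0`, and `doctop` keys.

```python
from collections import defaultdict


def dedupe_chars_sketch(chars, tolerance=1, extra_attrs=("fontname", "size")):
    """Drop chars that repeat an earlier char at (nearly) the same position.

    Hypothetical helper: two chars count as duplicates only if their text and
    every property named in `extra_attrs` match, and their x0/doctop values
    are within `tolerance` of each other.
    """
    seen = defaultdict(list)  # comparison key -> positions already kept
    unique = []
    for char in chars:
        key = (char["text"],) + tuple(char.get(attr) for attr in extra_attrs)
        is_duplicate = any(
            abs(char["x0"] - x0) <= tolerance
            and abs(char["doctop"] - doctop) <= tolerance
            for x0, doctop in seen[key]
        )
        if not is_duplicate:
            seen[key].append((char["x0"], char["doctop"]))
            unique.append(char)
    return unique


# Made-up data: the same "A" drawn twice with different fonts.
chars = [
    {"text": "A", "fontname": "Helvetica", "size": 10, "x0": 50.0, "doctop": 100.0},
    {"text": "A", "fontname": "Helvetica-Bold", "size": 10, "x0": 50.2, "doctop": 100.1},
    {"text": "B", "fontname": "Helvetica", "size": 10, "x0": 57.0, "doctop": 100.0},
]

strict = dedupe_chars_sketch(chars)                 # font is part of the key
loose = dedupe_chars_sketch(chars, extra_attrs=())  # compare text and position only
```

With the font included in the comparison, both copies of "A" survive; passing an empty `extra_attrs` collapses them into one, which is the behavior requested above.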