Custom dedupe_chars char properties #1114

Open
felix-hh opened this issue Mar 14, 2024 · 4 comments
Labels
feature-request All feature requests receive this label initially, can be upgraded to "enhancement"

Comments

felix-hh commented Mar 14, 2024

Following up on #71: I have run into duplicated characters that have the same text but different properties (e.g., fonts). Unfortunately, I can't share the file because it is private. I'd like to request an optional parameter for passing custom properties to the deduplication function (please let me know if this is already available some other way!).

Here's a sketch of the proposed code changes:

from operator import itemgetter
from typing import List, Optional

def dedupe_chars(
    chars: T_obj_list,
    tolerance: T_num = 1,
    char_properties: Optional[List[str]] = None,
) -> T_obj_list:
    """
    Removes duplicate chars: those sharing the same values for
    `char_properties` (by default text, fontname, size, and upright) and
    the same positioning (within `tolerance`) as other characters in the set.
    """
    # Previously hard-coded:
    # key = itemgetter("fontname", "size", "upright", "text")
    if char_properties is None:
        char_properties = ["fontname", "size", "upright", "text"]
    key = itemgetter(*char_properties)

   <... more code>

The interfaces exposing this should also be updated.

The end result looks like

print(section.dedupe_chars(tolerance=0.1, char_properties=['text']).extract_text())
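
To illustrate the behavior I'm after without the private file, here is a minimal, standalone sketch. It does not use pdfplumber and is not the library's implementation: `dedupe_chars_sketch`, `DEFAULT_PROPERTIES`, and the sample chars are made up for illustration, and the position check is a simple pairwise comparison rather than the library's clustering. It only assumes that chars are dicts with `x0` and `doctop` keys.

```python
from itertools import groupby
from operator import itemgetter
from typing import Any, Dict, List, Optional

Char = Dict[str, Any]

# Hypothetical default, mirroring the properties named in the sketch above.
DEFAULT_PROPERTIES = ["fontname", "size", "upright", "text"]


def dedupe_chars_sketch(
    chars: List[Char],
    tolerance: float = 1,
    char_properties: Optional[List[str]] = None,
) -> List[Char]:
    """Drop chars that match an earlier char on `char_properties` and sit
    within `tolerance` of it in both x0 and doctop."""
    props = DEFAULT_PROPERTIES if char_properties is None else char_properties
    key = itemgetter(*props)

    kept: List[Char] = []
    # sorted() is stable, so chars with equal keys keep their original order
    # and the earliest char of each duplicate set is the one that survives.
    for _, group in groupby(sorted(chars, key=key), key=key):
        group_kept: List[Char] = []
        for char in group:
            is_duplicate = any(
                abs(char["x0"] - other["x0"]) <= tolerance
                and abs(char["doctop"] - other["doctop"]) <= tolerance
                for other in group_kept
            )
            if not is_duplicate:
                group_kept.append(char)
        kept.extend(group_kept)

    # Restore the original ordering of the surviving chars.
    order = {id(c): i for i, c in enumerate(chars)}
    return sorted(kept, key=lambda c: order[id(c)])


# Made-up sample: the same "A" drawn twice at nearly the same spot,
# once in a regular font and once in a bold font.
sample = [
    {"text": "A", "fontname": "Helvetica", "size": 10, "upright": True,
     "x0": 10.0, "doctop": 20.0},
    {"text": "A", "fontname": "Helvetica-Bold", "size": 10, "upright": True,
     "x0": 10.05, "doctop": 20.0},
    {"text": "B", "fontname": "Helvetica", "size": 10, "upright": True,
     "x0": 16.0, "doctop": 20.0},
]

default_result = dedupe_chars_sketch(sample, tolerance=0.1)
text_only_result = dedupe_chars_sketch(
    sample, tolerance=0.1, char_properties=["text"]
)
```

With the default properties, the two "A" chars differ in `fontname`, so both are kept; with `char_properties=["text"]`, the second "A" is treated as a duplicate of the first and dropped.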
felix-hh added the feature-request label Mar 14, 2024
Owner

jsvine commented Mar 15, 2024

Thanks for the suggestion, @felix-hh. Are you able to share a version of the PDF redacted with https://github.com/JoshData/pdf-redactor? Or another PDF that demonstrates the same issue?

Author

felix-hh commented Mar 16, 2024

Hi @jsvine, I made a good-faith attempt at redacting the PDF with the tool, but the footer text is not redacted and can still be extracted. This is a problem because the footer identifies the data source, which is proprietary. I also do not know how to reproduce the issue with a PDF of my own.

Author

felix-hh commented Mar 16, 2024

Let me know if there is some other way I can help. I am happy to open a pull request for the change and verify that it works on my end.

Here are some screenshots, in case they help:
Redacted PDF screenshot:
image

What the output of extract_text looks like:
image

Owner

jsvine commented Mar 25, 2024

Thanks @felix-hh. For new features, I like/want to have unit tests for them, which requires a PDF demonstrating a failing example. Could you use a tool (e.g., Adobe Acrobat, Preview, etc.) to manually redact the footer text?
