Custom dedupe_chars char properties #1114

felix-hh · 2024-03-14T00:33:24Z

Following up on #71: I have run into duplicated characters that share the same text but differ in other properties (i.e., fonts). Unfortunately, I can't share the file because it is private. I am requesting an optional parameter for passing custom properties to the deduplication function (please let me know if this is already available!).

Here's a sketch of the proposed code changes:

pdfplumber/pdfplumber/utils/text.py

Line 789 in 147f2c4

key = itemgetter("fontname", "size", "upright", "text")

def dedupe_chars(
    chars: T_obj_list,
    tolerance: T_num = 1,
    char_properties: Optional[List[str]] = None,
) -> T_obj_list:
    """
    Removes duplicate chars — those sharing the same text, fontname, size,
    and positioning (within `tolerance`) as other characters in the set.
    """
    # key = itemgetter("fontname", "size", "upright", "text")
    char_properties = char_properties if char_properties is not None else ["fontname", "size", "upright", "text"]
    key = itemgetter(*char_properties)

    # <... more code>

The interfaces exposing this should also be updated.

The end result would look like:

print(section.dedupe_chars(tolerance=0.1, char_properties=['text']).extract_text())
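To illustrate why the choice of key properties matters, here is a minimal standalone sketch. It does not use pdfplumber and is not its implementation: `dedupe_chars_sketch`, the `Char` alias, and the sample character dicts are hypothetical, and the position check (comparing `x0` and `top` within `tolerance`) is a simplifying assumption.

```python
from typing import Any, Dict, List, Optional, Sequence, Tuple

Char = Dict[str, Any]  # hypothetical stand-in for a pdfplumber char dict

DEFAULT_PROPERTIES = ("fontname", "size", "upright", "text")


def dedupe_chars_sketch(
    chars: List[Char],
    tolerance: float = 1,
    char_properties: Optional[Sequence[str]] = None,
) -> List[Char]:
    """Simplified illustration only: drop chars that match an earlier char
    on every property in `char_properties` and sit within `tolerance` of it."""
    props = tuple(char_properties) if char_properties is not None else DEFAULT_PROPERTIES

    def key(char: Char) -> Tuple[Any, ...]:
        return tuple(char[p] for p in props)

    kept: List[Char] = []
    seen: Dict[Tuple[Any, ...], List[Char]] = {}
    for char in chars:
        bucket = seen.setdefault(key(char), [])
        # Assumption: "same position" means x0 and top both within tolerance.
        is_duplicate = any(
            abs(char["x0"] - other["x0"]) <= tolerance
            and abs(char["top"] - other["top"]) <= tolerance
            for other in bucket
        )
        if not is_duplicate:
            bucket.append(char)
            kept.append(char)
    return kept


# Two overlapping "A" glyphs that differ only in font, followed by a "B".
chars = [
    {"text": "A", "fontname": "Helvetica", "size": 10, "upright": True, "x0": 10.0, "top": 20.0},
    {"text": "A", "fontname": "Helvetica-Bold", "size": 10, "upright": True, "x0": 10.05, "top": 20.0},
    {"text": "B", "fontname": "Helvetica", "size": 10, "upright": True, "x0": 16.0, "top": 20.0},
]

default_result = dedupe_chars_sketch(chars, tolerance=0.1)
text_only_result = dedupe_chars_sketch(chars, tolerance=0.1, char_properties=["text"])

print("".join(c["text"] for c in default_result))    # both "A" glyphs survive
print("".join(c["text"] for c in text_only_result))  # the second "A" is dropped
```

With the default key, the two "A" glyphs count as distinct because their `fontname` values differ. With `char_properties=["text"]`, they collapse into one.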

jsvine · 2024-03-15T20:31:07Z

Thanks for the suggestion, @felix-hh. Are you able to share a version of the PDF redacted with https://github.com/JoshData/pdf-redactor? Or another PDF that demonstrates the same issue?

felix-hh · 2024-03-16T21:19:01Z

Hi @jsvine, I made a good-faith attempt at redacting the PDF with the tool, but the footer text is not redacted and can still be extracted. This is a problem because the footer identifies the data source, which is proprietary. I also do not know how to reproduce the issue with a PDF of my own.

felix-hh · 2024-03-16T21:19:57Z

Let me know if there is some other way I can help. I am happy to provide a pull request for the change verifying that it works on my end.

Here are some screenshots, if that helps:
Redacted PDF screenshot:

What the output of extract_text looks like:

jsvine · 2024-03-25T15:20:50Z

Thanks @felix-hh. For new features, I like/want to have unit tests for them, which requires a PDF demonstrating a failing example. Could you use a tool (e.g., Adobe Acrobat, Preview, etc.) to manually redact the footer text?

felix-hh added the "feature-request" label (All feature requests receive this label initially, can be upgraded to "enhancement") on Mar 14, 2024

QuentinAndre11 mentioned this issue Jun 27, 2024

add ignore_char_properties arg in dedupe_chars #1161

Merged
