Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add ignore_char_properties arg in dedupe_chars #1161

Merged
merged 5 commits into from
Jul 6, 2024

Conversation

QuentinAndre11
Copy link

See #1114

More precise parameters could be used for char matching in dedupe_chars (a list of list of similar fontnames, a tolerance for the size...) but for my use case directly ignoring these keys was sufficient and it was easier to test.

PS: It is my first contribution so I did not know if I had to add a new version in the Changelog or append it to the last one.

@jsvine
Copy link
Owner

jsvine commented Jun 27, 2024

Hi @QuentinAndre11, and thanks for your interest in pdfplumber! I'm curious: What advantage do you see to this approach vs. having the list of properties be directly configurable? E.g., instead of this in the current pdfplumber code:

def dedupe_chars(chars: T_obj_list, tolerance: T_num = 1) -> T_obj_list:
    """
    Removes duplicate chars — those sharing the same text, fontname, size,
    and positioning (within `tolerance`) as other characters in the set.
    """
    key = itemgetter("fontname", "size", "upright", "text")

... doing this:

def dedupe_chars(
  chars: T_obj_list, 
  tolerance: T_num = 1, 
  properties: Tuple[str, ...] = tuple("fontname", "size", "upright", "text")
) -> T_obj_list:
    """
    Removes duplicate chars — those sharing the same text, fontname, size,
    and positioning (within `tolerance`) as other characters in the set.
    """
    key = itemgetter(*properties)

@QuentinAndre11
Copy link
Author

QuentinAndre11 commented Jun 27, 2024

Hi ! I thought text and upright properties aren't properties that we would want to change at all. The only case I can think of for the text would be visually close letters like I and l but why on earth would we want to deduplicate that ? Same goes for upright-proof letters like o but it also does not seem a relevant feature.
So because we have mandatory properties, it made more sense for me to "tolerate" some props than to directly give the keys. Does it make sense ?

Ps: the tests are failing because of pandas, I guess I can't do anything about it?

@QuentinAndre11
Copy link
Author

QuentinAndre11 commented Jun 28, 2024

If needed, I can actually replace the kwargs from dedupe_chars(tolerate) to dedupe_chars(tolerate_x: T_num = 1, tolerate_y: T_num = 1, ignore_font: bool = False, tolerate_size: T_num = 0), it's not that much additional work @jsvine
Edit: we need to keep tolerate for backward compatibility, so tolerate: Union[T_num, Tuple[T_num, T_num]] = 1

@jsvine
Copy link
Owner

jsvine commented Jul 2, 2024

Thanks, @QuentinAndre11. This is a good point:

I thought text and upright properties aren't properties that we would want to change at all.

Given that there's no real reason to change them, what about this updated definition?:

def dedupe_chars(
  chars: T_obj_list, 
  tolerance: T_num = 1, 
  extra_attrs: Optional[Tuple[str, ...]] = tuple("fontname", "size")
) -> T_obj_list:
    """
    Removes duplicate chars — those sharing the same text, fontname, size,
    and positioning (within `tolerance`) as other characters in the set.
    """
    key = itemgetter(*("upright", "text"), *(extra_attrs or tuple()))

Then, all you'd have to do for your particular use-case is call .dedupe_chars(extra_attrs=None). At the same time, it would enable folks to be even more strict, if they preferred, by passing a more extensive list of attributes.

Another advantage is that the parameter matches the name (and idea) of the .extract_words(extra_attrs=...).

@QuentinAndre11
Copy link
Author

@jsvine Works for me (and my use case), so I implemented it your way !

@jsvine jsvine changed the base branch from stable to develop July 6, 2024 19:40
@jsvine
Copy link
Owner

jsvine commented Jul 6, 2024

Thanks! Pushed some documentation tweaks and a fix for the failing tests (we just needed to upgrade pandas).

@jsvine
Copy link
Owner

jsvine commented Jul 6, 2024

Looks like there's still an issue with Python 3.8 (because the latest pandas isn't available on it), but that version is reaching its end of life, so we'll remove 3.8 from the test suite after merging. Thanks again for the contribution!

@jsvine jsvine merged commit 923707a into jsvine:develop Jul 6, 2024
4 of 5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants