fix: Improve OCR results, stricten criteria before dropping bitmap areas #719

Merged
cau-git merged 1 commit into main from cau/ocr-dropout-fixes
Jan 10, 2025

Conversation

cau-git
Contributor

@cau-git cau-git commented Jan 9, 2025

This introduces multiple changes to address problems where parts of the content in scanned documents fail to convert.

  • The code for detecting OCR rectangles now merges nearby, almost-connected bitmap rectangles into one.
  • OCR rectangles are no longer dropped individually when they don't pass the size threshold; they are dropped only if the sum of all OCR rectangles falls below the size threshold.
  • EasyOCR confidence threshold lowered from 0.65 to 0.5 for better recall.
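
The first two changes can be illustrated with a minimal sketch. This is not the code from this PR: the `Rect` type, the `gap` tolerance, the `min_coverage` fraction and all function names are hypothetical, and the sketch assumes that "size" means rectangle area relative to the page area.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Rect:
    """Hypothetical axis-aligned rectangle (left, top, right, bottom)."""
    l: float
    t: float
    r: float
    b: float

    @property
    def area(self) -> float:
        return max(0.0, self.r - self.l) * max(0.0, self.b - self.t)


def is_near(a: Rect, b: Rect, gap: float) -> bool:
    # True if the rectangles overlap or are at most `gap` apart on both axes.
    return (a.l - gap <= b.r and b.l - gap <= a.r
            and a.t - gap <= b.b and b.t - gap <= a.b)


def union(a: Rect, b: Rect) -> Rect:
    # Smallest rectangle that contains both inputs.
    return Rect(min(a.l, b.l), min(a.t, b.t), max(a.r, b.r), max(a.b, b.b))


def merge_nearby(rects: List[Rect], gap: float) -> List[Rect]:
    # Repeatedly merge any pair of nearby rectangles until none are left.
    merged = list(rects)
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if is_near(merged[i], merged[j], gap):
                    merged[i] = union(merged[i], merged[j])
                    del merged[j]
                    changed = True
                    break
            if changed:
                break
    return merged


def select_ocr_rects(rects: List[Rect], page_area: float,
                     min_coverage: float, gap: float) -> List[Rect]:
    # Apply the size threshold to the total area, not to each rectangle.
    merged = merge_nearby(rects, gap)
    total_area = sum(r.area for r in merged)
    if total_area < min_coverage * page_area:
        return []
    return merged
```

With this aggregate check, a small bitmap region that would fail the threshold on its own is still passed to OCR as long as the page's bitmap regions together are large enough.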

Issue resolved by this Pull Request:
Resolves #641 and others

Checklist:

  • Documentation has been updated, if necessary.
  • Examples have been added, if necessary.
  • Tests have been added, if necessary.

mergify bot commented Jan 9, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
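
As a quick local check, the title rule above can be tried with Python's `re` module. This is only a sketch: it assumes the `~=` operator behaves like a regular-expression search, and the helper name `is_conventional_title` is made up for illustration.

```python
import re

# Pattern copied from the merge protection rule above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional_title(title: str) -> bool:
    # Hypothetical helper: True if the title starts with a conventional commit prefix.
    return CONVENTIONAL_TITLE.search(title) is not None


print(is_conventional_title(
    "fix: Improve OCR results, stricten criteria before dropping bitmap areas"
))
```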

Contributor

@dolfim-ibm dolfim-ibm left a comment

LGTM

Collaborator

@nikos-livathinos nikos-livathinos left a comment

Looks good to me.

@cau-git cau-git merged commit 5a060f2 into main Jan 10, 2025
9 checks passed
@cau-git cau-git deleted the cau/ocr-dropout-fixes branch January 10, 2025 09:38
Development

Successfully merging this pull request may close these issues.

Help in debugging conversion of a PDF to text
3 participants