Issue in extracting text with regional languages. #973

Jisan09 · 2023-08-26T02:55:21Z


Hi, I have this PDF, and when I call `page.extract_text()` on it I get garbled text as output. Is there a way to get the proper text, or to skip saving when the extracted text is not proper?

Here is a sample of the output I got:

gp
p. o. o cudr,.1  so I zozsI   pn0h.08,03.2023 euu$puqsfrofl  edlofluq sn6fl  elL6iDL  6lur-rqe6ir  roEOtrb  ono'lg& ot'lqooir  Geofl $q$6lsn6ir6r6i) elgDd6sr6ur  guu$puqoirofl GonTeil
guu$pL+6irdfl&  Gongrb  gg11oroh I gl$dl6Do  oeuiir.r-o  Guoeunonh,  GsnoroJ uoenir.r-orb, afli, guorrf) r-noirr.on&  305A,  orflennnirgfl&dl  sr6lD6D,
u6Dpru dl$pnroeuofl  &'.L@p6q  dlr-rirg, rlonGo@,
Gonrurbq$qjTh ronorr-r-rb  -641 004.

jsvine · 2023-09-11T16:42:48Z

Maintainer

Hi @Jisan09, the issue here seems to be that this is a scanned PDF, and that the OCR step (converting the page image to text) did not succeed well, even before you start working with the file in pdfplumber. One way to check this is to open the PDF in a standard PDF viewer, select some text, copy it to your clipboard, and then paste it into a text editor. If the pasted text is garbled in the same way, the problem lies in the PDF's embedded text layer rather than in pdfplumber's extraction.
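On the second half of the question (skipping the save when the text is not proper), one possible approach is a rough script check: measure what fraction of the extracted letters fall in the Unicode block of the language you expect, and skip pages that score low. The sketch below is only an illustration, not something pdfplumber provides. The helper names `script_ratio` and `extract_plausible_pages`, the `min_ratio` threshold of 0.5, and the default Tamil block (U+0B80 to U+0BFF) are all assumptions; swap in the block for your own language.

```python
def script_ratio(text, start=0x0B80, end=0x0BFF):
    """Fraction of alphabetic characters whose code point lies in [start, end].

    The default range is the Unicode Tamil block, chosen here purely as an
    illustrative assumption. Returns 0.0 when the text has no letters.
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    in_script = sum(1 for ch in letters if start <= ord(ch) <= end)
    return in_script / len(letters)


def extract_plausible_pages(pdf_path, min_ratio=0.5):
    """Return {page_number: text} for pages that pass the script check.

    Hypothetical helper: min_ratio is an arbitrary starting point and
    should be tuned on your own documents.
    """
    import pdfplumber  # third-party; imported here so script_ratio works without it

    kept = {}
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # Some pdfplumber versions return None for pages with no text.
            text = page.extract_text() or ""
            if script_ratio(text) >= min_ratio:
                kept[page.page_number] = text
    return kept
```

A page like the sample above, where the text layer contains only Latin letters and symbols, would score 0.0 and be skipped. This only detects a bad text layer; it does not repair it. Getting correct text would still require re-running OCR on the scanned pages.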
