Problems extracting complete set of text from a pdf #491
-
Dear Jeremy,

Firstly, many thanks for creating and sharing pdfplumber, which has been great to use. I wonder if you can help with a specific problem I am encountering?

When I extract a single page from a PDF, the system works very well. However, when I try to extract multiple pages, or a whole document, the code only extracts a single page. I am using the code below to extract the text from page indexes 220-222. If you could kindly explain why I am not extracting the text over this set of pages, that would be most appreciated.

Many thanks for any advice you can give.

import pdfplumber
-
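Hi @nigelkiernan
Appreciate your interest in the library. Allow me to resolve your query on behalf of Jeremy.
The reason the text is getting extracted from a single page is that you are using [220-222]. Python evaluates 220-222 as the arithmetic expression -2, so pdf.pages[220-222] refers to the second-last page. To slice the list, you need a colon, as in [220:222]. To fix it, you can refer to the following code, which converts 1-based page numbers into a slice:
import pdfplumber
pdf = pdfplumber.open("XXXX.pdf")
start_page = 220
end_page = 222
# List positions are 0-based and the slice end is exclusive,
# so this covers pages 220 to 222 inclusive.
for page in pdf.pages[start_page - 1:end_page]:
    # Do operations on the page, for example:
    text = page.extract_text()
    print(text)
pdf.close()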
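The difference between an index like [220-222] and a slice like [219:222] can be checked without any PDF at all. The sketch below uses a plain Python list as a stand-in for pdf.pages; the `pages` list and its "page N" labels are made up for illustration and are not part of pdfplumber.

```python
# Stand-in for pdf.pages: a plain list labelled with 1-based page numbers.
pages = [f"page {n}" for n in range(1, 301)]

# 220-222 is evaluated as arithmetic first, so this is pages[-2],
# i.e. the second-last item of the list.
single = pages[220 - 222]

# A slice uses a colon. The start is inclusive, the end is exclusive,
# and positions are 0-based, so this covers pages 220, 221 and 222.
start_page, end_page = 220, 222
selected = pages[start_page - 1:end_page]

print(single)    # page 299
print(selected)  # ['page 220', 'page 221', 'page 222']
```

The same rules should apply to pdf.pages, since the answer above treats it as an ordinary list.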
-
Hi Samkit
Thank you so much for your help with this! I really appreciate you coming back - it's super kind of you.
Many thanks to you and Jeremy for this programme - it's been an absolute Godsend.
Kind regards,
Nigel
…________________________________
From: Samkit Jain ***@***.***>
Sent: 17 August 2021 17:44
To: jsvine/pdfplumber ***@***.***>
Cc: PG-Kiernan, Nigel ***@***.***>; Mention ***@***.***>
Subject: Re: [jsvine/pdfplumber] Problems extracting complete set of text from a pdf (#491)
Hi @nigelkiernan <https://github.com/nigelkiernan>, please use the following code to get the text in a dataframe:
pd.DataFrame({"Text": [p.extract_text() for p in pdf.pages]}, columns=["Text"])
Notice that I am using {"Text": [p.extract_text() for p in pdf.pages]} instead of pdf.pages.
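As a rough sketch of why the list comprehension matters: a DataFrame built from pdf.pages would hold page objects, whereas the comprehension calls extract_text() on each page and stores the resulting strings. In the example below, `FakePage`, `collect_page_text` and `to_dataframe` are hypothetical names introduced only for illustration, and the `or ""` fallback assumes extract_text() may return None or an empty string for a page without text, which may depend on the pdfplumber version.

```python
def collect_page_text(pages):
    """Return a column dict with one text entry per page."""
    # Assumption: extract_text() can yield None (or "") for a page with
    # no extractable text, so normalise such results to "".
    return {"Text": [(page.extract_text() or "") for page in pages]}


def to_dataframe(pages):
    """Build a one-column DataFrame; requires pandas to be installed."""
    import pandas as pd  # imported lazily so the helper above has no dependency
    return pd.DataFrame(collect_page_text(pages), columns=["Text"])


class FakePage:
    """Hypothetical stand-in for a pdfplumber page, used only in this demo."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


demo_pages = [FakePage("first page"), FakePage(None), FakePage("third page")]
columns = collect_page_text(demo_pages)
print(columns)
```

With a real document, passing pdf.pages (or a slice of it) to `to_dataframe` would be expected to give one row per page.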