Problems extracting complete set of text from a pdf #491

nigelkiernan · 2021-08-15T16:24:28Z

nigelkiernan
Aug 15, 2021

Dear Jeremy,

Firstly, many thanks for creating and sharing pdfplumber, which has been great to use. I wonder if you can help with a specific problem I am encountering? When I extract a single page from a PDF, it works very well. However, when I try to extract multiple pages, or a whole document, the code only extracts a single page.

I am using the code below to extract the text from page indexes 220-222. If you could kindly share why I am not extracting the text across this set of pages, that would be most appreciated.

Many thanks for any advice you can give
Kind regards

import pdfplumber
pdf = pdfplumber.open("XXXX.pdf")
wc_2018 = pdf.pages[220-222]
wc_doc_test = pdf.pages[220-222].to_image(resolution = 150)
extract_test = wc_doc_test.draw_rects(wc_2018.extract_words())
word_list = wc_2018.extract_words(x_tolerance=3, y_tolerance=3, keep_blank_chars=False, use_text_flow=True, horizontal_ltr=True, vertical_ttb=True, extra_attrs=[])


samkit-jain · 2021-08-16T11:17:42Z

samkit-jain
Aug 16, 2021
Collaborator

Hi @nigelkiernan, thanks for your interest in the library. Allow me to answer your query on behalf of Jeremy. The text is being extracted from a single page because you are using [220-222]. Python evaluates 220-222 as arithmetic, which gives -2, so pdf.pages[-2] refers to the second-to-last page. To slice the list, you need a colon, as in [220:222]. To fix it, you can refer to the following code:

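import pdfplumber

pdf = pdfplumber.open("XXXX.pdf")

# 1-based page numbers, inclusive
start_page = 220
end_page = 222

# pdf.pages is a zero-indexed list, so subtract 1 from the start page
for page in pdf.pages[start_page-1:end_page]:
    # Do operations on the page, e.g. extract its text
    text = page.extract_text()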
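
To make the difference between subtraction and slicing concrete, here is a minimal sketch that uses a plain list of labels as a stand-in for pdf.pages, so it runs without a PDF. The helper name select_pages and the 224-page length are assumptions for illustration only; they are not part of pdfplumber.

```python
def select_pages(pages, start_page, end_page):
    """Return pages start_page..end_page (1-based, inclusive) from a list."""
    if start_page < 1 or end_page < start_page:
        raise ValueError("expected 1 <= start_page <= end_page")
    # Lists are zero-indexed and slices exclude the stop index,
    # so page N lives at index N - 1 and the stop index is end_page.
    return pages[start_page - 1:end_page]


# Stand-in for pdf.pages: 224 labels instead of real Page objects (assumed length).
pages = [f"page {n}" for n in range(1, 225)]

print(pages[220 - 222])               # a single item: index -2, the second-to-last page
print(pages[220:222])                 # indexes 220 and 221, i.e. pages 221 and 222
print(select_pages(pages, 220, 222))  # pages 220, 221 and 222
```

Note that a bare [220:222] treats the numbers as zero-based indexes and stops before index 222, which is why the answer's code uses start_page-1 when working with 1-based page numbers.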

4 replies

nigelkiernan Aug 16, 2021
Author

Hi samkit-jain,
Thanks for coming back, that worked nicely. One (basic) follow-up question, if I may: how do I now turn the output from this for loop into a pandas DataFrame so that I can manipulate it?

The code steps I have used are below. If you can kindly help with this last piece, that would be super helpful - thank you!

import pdfplumber as pdfp
pdf = pdfp.open('XXXXX')
for page in pdf.pages:
    print(page.extract_text())

samkit-jain Aug 17, 2021
Collaborator

Hi @nigelkiernan Could you please elaborate on what you mean by turning the output into a Pandas dataframe? Do you want the dataframe to be like

page_number | text

or something else?

nigelkiernan Aug 17, 2021
Author

Hi Samkit, thanks for your help here and the discussion.

The pdf document I am operating on in pdfplumber is an annual report. When this block of code is run, pdfplumber extracts the whole of the text of the annual report into a Jupyter notebook I'm working on. Just for clarity, the code block again is this:

import pdfplumber as pdfp
pdf = pdfp.open('Annual-Report.pdf')
for page in pdf.pages:
    print(page.extract_text())

This prints the annual report text in the Jupyter notebook. I'm then trying to turn this into a pandas DataFrame so that I can perform analysis on the text. I'm currently using the code below.

df = pd.DataFrame(pdf.pages, columns=["Text"])

However when I use this I get this empty DataFrame (below), when I am looking to get a DataFrame full of the extracted text.
Apologies for the elementary question, any help you can give is appreciated!
Kind regards

df

	Text
Page:1
Page:2
Page:3
Page:4
Page:5
...
Page:220
Page:221
Page:222
Page:223
Page:224

samkit-jain Aug 17, 2021
Collaborator

Hi @nigelkiernan Please use the following code to get the text in a dataframe

pd.DataFrame({"Text": [p.extract_text() for p in pdf.pages]}, columns=["Text"])

Notice that I am using {"Text": [p.extract_text() for p in pdf.pages]} instead of pdf.pages.
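
If the page_number | text layout mentioned earlier in the thread is what is wanted, a sketch along the following lines might help. The function names pages_to_records and pdf_to_dataframe are hypothetical helpers made up for this example, and the `or ""` guard assumes that extract_text() may return None or an empty string for a page without text, depending on the pdfplumber version.

```python
def pages_to_records(pages):
    """Build one {"page_number", "text"} record per page."""
    records = []
    for number, page in enumerate(pages, start=1):
        # Assumption: extract_text() may return None for a page with no text.
        text = page.extract_text() or ""
        records.append({"page_number": number, "text": text})
    return records


def pdf_to_dataframe(path):
    """Open the PDF at `path` and return a DataFrame with one row per page."""
    # Imported here so pages_to_records stays usable without these packages.
    import pandas as pd
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        return pd.DataFrame(pages_to_records(pdf.pages))
```

With pandas and pdfplumber installed, calling df = pdf_to_dataframe("Annual-Report.pdf") should give one row per page, with the page number in one column and the extracted text in the other.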

nigelkiernan · 2021-08-19T07:04:08Z

nigelkiernan
Aug 19, 2021
Author

Hi Samkit,
Thank you so much for your help with this! I really appreciate you coming back - it's super kind of you. Many thanks to you and Jeremy for this programme - it's been an absolute Godsend.
Kind regards,
Nigel

