Facing problem when extract table by using"text" strategy #543

youpengbo2018 · 2021-11-19T02:54:58Z


2021年1-4月份主要经济指标.pdf
Hi, this is the code I used to extract the table. After running it, I found that I cannot get the full data for the final row. The row comes out as ['None','None','None','18.2','5459','27'], but the result I need is ['公路运输','万吨','1701.0','18.2','5459','27']. Could you help me fix it?
import pdfplumber
import pandas as pd
from decimal import Decimal

pd.set_option('display.max_columns', 1000)
pd.set_option('display.width', 100)
pd.set_option('display.max_colwidth', 100)


def analyze_pdf(file_path):
    result = ""
    with pdfplumber.open(file_path) as pdf:
        for i in range(pdf.pages[-1].page_number):
            page = pdf.pages[i]  # read the PDF page by index
            table_settings = {
                "vertical_strategy": "lines",
                "explicit_vertical_lines": [
                    Decimal(page.width) * Decimal(0.03),
                    Decimal(page.width) * Decimal(0.36),
                    Decimal(page.width) * Decimal(0.56),
                    Decimal(page.width) * Decimal(0.691),
                    Decimal(page.width) * Decimal(0.781),
                    Decimal(page.width) * Decimal(0.91),
                ],
                "explicit_horizontal_lines": [Decimal(page.height) * Decimal(0.001)],
                "horizontal_strategy": "text",
                # "intersection_x_tolerance": 30,
            }
            for table in page.extract_tables(table_settings=table_settings):
                # print(page.horizontal_edges)
                print(page.width)
                print(page.bbox)
                # print(table[1:])
                print("table", table)
                df = pd.DataFrame(table, columns=['指标', '1', '单位', "本月", '同比增长', '1-本月', '同比增长(%)'])
                del df['1']
                print(df)
                df1 = df[~(df['指标'] == 'None')]
                print(df1)
                df.to_csv(r'F:\work\沈阳数据\text.csv')
                # for r in table[1:]:
                #     print(type(r), r)
                #     result += r
                #     result += '\t'.join(r)
                #     result += '\t'.join('%s' % id for id in r)
                #     result += '\n'

    return result


if __name__ == '__main__':
    file_Path = r'F:\work\沈阳数据\2021年1-4月份主要经济指标.pdf'
    print(analyze_pdf(file_Path))
    # text_save(r'F:\work\沈阳数据\2021年1-4月份主要经济指标.txt', analyze_pdf(file_Path))

samkit-jain (Collaborator) · 2021-11-20T19:15:41Z

Hi @youpengbo2018, I appreciate your interest in the library. When you run

im = page.to_image(resolution=200)
im.draw_lines(page.curves+page.edges)

you'll notice that there are some hidden horizontal and vertical line separators.
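
If an image is inconvenient, the same separators can also be inspected as numbers. The helper below is a minimal sketch, not part of pdfplumber: `edge_positions` is a hypothetical name, and it assumes each edge is a dict with an "orientation" key ("h" or "v") plus "top" and "x0" coordinates, which is how I understand the entries of page.edges to be shaped.

```python
from collections import defaultdict


def edge_positions(edges, ndigits=1):
    """Group distinct edge coordinates by orientation.

    Assumption: every edge is a dict with "orientation" ("h" or "v"),
    "top" and "x0" keys. Edges with any other orientation are skipped.
    """
    positions = defaultdict(set)
    for edge in edges:
        if edge.get("orientation") == "h":
            # Horizontal separators are identified by their distance from the top.
            positions["h"].add(round(float(edge["top"]), ndigits))
        elif edge.get("orientation") == "v":
            # Vertical separators are identified by their x position.
            positions["v"].add(round(float(edge["x0"]), ndigits))
    return {key: sorted(values) for key, values in positions.items()}
```

Calling `edge_positions(page.edges)` should then list where the hidden row and column separators sit, which makes it easier to judge whether they are usable as explicit lines.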

You can use those lines with the "explicit" table extraction strategy (via the explicit_* settings) to extract the table correctly. A sample implementation is as follows:

import pdfplumber


def get_vertical_lines(page):
    """
    Run table extraction using the default lines strategy and get the vertical lines
    from the first row.
    """
    tables = page.find_tables(
        table_settings={"vertical_strategy": "lines", "horizontal_strategy": "lines"}
    )
    first_row = tables[0].rows[0]

    return [cell[0] for cell in first_row.cells] + [first_row.cells[-1][2]]


def get_horizontal_lines(page):
    """
    Get the coordinates of all the horizontal lines.
    """
    return [page.height - edge["y0"] for edge in page.horizontal_edges]


pdf = pdfplumber.open("file.pdf")
page = pdf.pages[0]
table = page.extract_table(
    table_settings={
        "vertical_strategy": "explicit",
        "explicit_vertical_lines": get_vertical_lines(page),
        "horizontal_strategy": "explicit",
        "explicit_horizontal_lines": get_horizontal_lines(page),
    }
)

for row in table:
    print(row)

The result is:

['', '国民经', '济主要\n（4月）', '指标', '', '']
['', '单位', '本月', '同比增长（%）', '1-本月', '同比增长（%）']
['规模以上工业增加值', '亿元', '', '', '', '28.0']
['#装备制造业增加值', '亿元', '', '', '', '45.1']
['规模以上工业产销率', '%', '99.2', '-0.5', '101.4', '2.8']
['规模以上工业利润总额(1-上月)', '亿元', '', '', '162.3', '240.1']
['规模以上工业利税总额(1-上月)', '亿元', '', '', '256.8', '175.3']
['固定资产投资', '亿元', '', '-2.5', '', '8.0']
['#工业固定资产投资', '亿元', '', '30.3', '', '0.8']
['#房地产开发投资', '亿元', '', '-9.4', '', '14.2']
['商品房销售面积', '万平方米', '', '-20.5', '294.8', '-2.5']
['限额以上消费品零售额', '亿元', '136.9', '11.1', '527.1', '18.7']
['#网上商品零售额', '亿元', '30.7', '8.8', '134.6', '11.6']
['实际利用外资（商务部口径）', '亿美元', '1.17', '-48.2', '3.5', '27.2']
['进出口总额', '亿美元', '16.6', '167.8', '64.5', '63.2']
['出口', '亿美元', '5.0', '52.8', '18.6', '49.2']
['进口', '亿美元', '11.6', '297.1', '46.0', '69.7']
['进出口总额', '亿元', '108.0', '147.4', '418.4', '51.9']
['出口', '亿元', '32.8', '41.8', '120.8', '38.8']
['进口', '亿元', '75.2', '266.4', '297.7', '57.9']
['一般公共预算收入', '亿元', '71.9', '16.7', '277.2', '20.7']
['#税收收入', '亿元', '59.5', '20.1', '218.5', '17.7']
['一般公共预算支出', '亿元', '80.3', '5.0', '288.2', '-12.2']
['金融机构本外币存款余额', '亿元', '19 278.1', '4.3', '', '']
['金融机构本外币贷款余额', '亿元', '18 650.4', '6.0', '', '']
['全社会用电量', '亿千瓦时', '29.2', '5.6', '134.1', '11.0']
['#工业', '亿千瓦时', '11.6', '-4.6', '55.6', '5.3']
['#制造业', '亿千瓦时', '8.4', '9.3', '35.1', '29.0']
['货运总量', '万吨', '1 731.9', '17.8', '5 582.9', '26.5']
['#铁路运输', '万吨', '30.5', '-2.9', '122.5', '8.2']
['#公路运输', '万吨', '1 701.0', '18.2', '5 459.0', '27.0']
['', '', '', '', '', '—  1']

This is very specific to the PDF you shared and may not be a plug-and-play solution, but I hope it has pointed you in the right direction. You can modify the above code to suit your needs better.
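
As one example of such a modification, the extracted rows could be tidied before further use. This is only a sketch under assumptions drawn from the output above: `clean_cell` and `clean_table` are hypothetical helpers, not pdfplumber functions, and they assume that numbers use a single space as the thousands separator (as in '1 731.9') and that rows with an empty first cell (title fragments, the header row, the page footer) are not wanted.

```python
import re

# Matches numbers grouped by spaces, e.g. '1 731.9' or '19 278.1'.
SPACED_NUMBER = re.compile(r"-?\d{1,3}( \d{3})+(\.\d+)?")


def clean_cell(cell):
    """Return a normalized string for one extracted cell."""
    if cell is None:
        return ""
    # Join text that wrapped onto a second line inside the cell.
    text = cell.replace("\n", "").strip()
    # Remove the thousands separator only when the whole cell is a number.
    if SPACED_NUMBER.fullmatch(text):
        text = text.replace(" ", "")
    return text


def clean_table(rows, label_col=0):
    """Clean every cell and drop rows whose label column is empty.

    Note: this also drops the header row, because its first cell is empty.
    """
    cleaned = [[clean_cell(cell) for cell in row] for row in rows]
    return [row for row in cleaned if row[label_col]]
```

With the table from the sample above, `clean_table(table)` would keep only the indicator rows, and a value such as '1 701.0' would become '1701.0'.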


youpengbo2018 (Author) · 2021-11-24T14:28:43Z

Thank you! It helps me a lot.
