Hi!
I have code that extracts tables from PDF files. To identify the tables I am using layoutparser, so I need to convert image coordinates into PDF coordinates. The PDF is converted into images with pdf2image, the layout model runs on each page image to get the block coordinates and types, the image size is obtained with Pillow, and the PDF page size is obtained with PyPDF2. With these, the conversion is done using the following equation for all four box coordinates (x1, y1, x2, y2), with the y coordinates using the heights instead of the widths:
x1 = image_box_x_1 * pdf_width / image_width
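As a minimal sketch of that scaling applied to all four coordinates: the helper name `image_box_to_pdf` and the example sizes are hypothetical, and the sketch assumes the image and the PDF page share the same origin and orientation, which is exactly the assumption in question here.

```python
def image_box_to_pdf(box, image_size, pdf_size):
    """Scale an (x1, y1, x2, y2) box from image pixels to PDF units."""
    x1, y1, x2, y2 = box
    image_width, image_height = image_size
    pdf_width, pdf_height = pdf_size
    # Horizontal coordinates scale by the width ratio,
    # vertical coordinates by the height ratio.
    scale_x = pdf_width / image_width
    scale_y = pdf_height / image_height
    return (x1 * scale_x, y1 * scale_y, x2 * scale_x, y2 * scale_y)


# Hypothetical example: a 1700x2200 px render of a 612x792 pt page.
print(image_box_to_pdf((850, 1100, 1700, 2200), (1700, 2200), (612.0, 792.0)))
```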
The code is the following:
from pdf2image import convert_from_path


def find_blocks_layoutparser(file_path: str, pdf, model):
    page_list = convert_from_path(file_path)
    block_boxes = []
    extracted_blocks = {}
    # Run the layout model on every rendered page
    for page_index, page in enumerate(page_list):
        page.save(f'page{page_index}.jpg')
        # Detect all blocks in the page
        layout = model.detect(page)
        boxes = []
        width, height = page.size
        pdf_page = pdf.pages[page_index]
        pdf_size = pdf_page.mediabox
        pdf_width = pdf_size[2] - pdf_size[0]
        pdf_height = pdf_size[3] - pdf_size[1]
        for entry in layout:
            # Scale the bounding box from image pixels to PDF units
            x1 = entry.block.x_1 / width * float(pdf_width)
            x2 = entry.block.x_2 / width * float(pdf_width)
            y1 = entry.block.y_1 / height * float(pdf_height)
            y2 = entry.block.y_2 / height * float(pdf_height)
            boxes.append([x1, y1, x2, y2])
The rectangles obtained are the following:
This is the first PDF where I have had this problem; every earlier test worked. Since the block coordinates are correct for each page image (I verified this), I think the problem lies in the conversion of the PDF to an image. Does anyone have an idea how to solve this?
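One check that might narrow this down, as a sketch only: if the page is rendered at a single DPI with no size override, the image should have the same aspect ratio as the box it was rendered from. A mismatch with the media box would suggest that the page is rotated or was rendered from a different box, in which case a plain scale cannot line up. The helper name `aspect_ratios_match` and the 1% tolerance are assumptions for illustration.

```python
def aspect_ratios_match(image_size, pdf_size, tolerance=0.01):
    """Return True if image and PDF page have the same width/height ratio.

    `tolerance` is a relative tolerance that absorbs pixel rounding.
    """
    image_width, image_height = image_size
    pdf_width, pdf_height = pdf_size
    image_ratio = image_width / image_height
    pdf_ratio = pdf_width / pdf_height
    return abs(image_ratio - pdf_ratio) <= tolerance * pdf_ratio


# Hypothetical sizes: a portrait render matches a portrait media box,
# while a landscape render of the same media box does not.
print(aspect_ratios_match((1700, 2200), (612.0, 792.0)))
print(aspect_ratios_match((2200, 1700), (612.0, 792.0)))
```

Inside the loop above this could be called with `page.size` and `(float(pdf_width), float(pdf_height))` for each page.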
Thanks in advance!