Extract_tables setting for the specific strategy #966
Replies: 3 comments 9 replies
Looks like you may need to use the actual size/column names as reference points.

from bisect import bisect_right
from operator import itemgetter

import pandas as pd

# `page` is assumed to be an already-opened pdfplumber Page containing the tables

# column names: 36, 36 ½, 37, ..., 45 ½, 46
size_cols = []
for n in range(36, 46):
    size_cols.append(f'{n}')
    size_cols.append(f'{n} ½')
size_cols.append('46')

qty_cols = ['Qty', 'Price', 'Discount', 'Total row']

# find "header" rows
sizes = page.search(' '.join(f'({col})' for col in size_cols))
qtys = page.search(' '.join(f'({col})' for col in qty_cols))

# build "column" lines
explicit_vertical_lines = []
for rows in [sizes, qtys]:
    bbox = rows[0]['x0'], rows[0]['top'], rows[0]['x1'], rows[0]['bottom']
    crop = page.crop(bbox)
    for col in rows[0]['groups']:
        line = crop.search(col)[0]['x0']
        explicit_vertical_lines.append(line)

# close the last column with the right-most character edge on the page
right = max(page.chars, key=itemgetter('x1'))['x1']
explicit_vertical_lines.append(right)

words = sorted(page.extract_words(), key=itemgetter('top'))

# use first "word" in line after each header row as bottom line
rows = []
for size in sizes:
    # bisect's key= argument needs Python 3.10+
    idx = bisect_right(words, size['top'], key=itemgetter('top'))
    bbox = page.bbox[0], size['bottom'], page.bbox[2], words[idx]['bottom']
    crop = page.crop(bbox)
    row = crop.extract_table(dict(
        explicit_vertical_lines = explicit_vertical_lines,
        horizontal_strategy = "text",
        vertical_strategy = "explicit"
    ))[1]
    rows.append(row)

>>> pd.DataFrame(rows, columns = size_cols + qty_cols)
36 36 ½ 37 37 ½ 38 38 ½ 39 39 ½ 40 40 ½ 41 41 ½ 42 42 ½ 43 43 ½ 44 44 ½ 45 45 ½ 46 Qty Price Discount Total row
0 2 4 4 4 14 14 4 2 2 50 €162,50 €8.125,00
1 2 2 2 6 14 14 4 2 2 48 €162,50 €7.800,00
2 1 2 2 1 6 €162,50 €975,00
3 2 2 4 6 6 2 2 24 €162,50 €3.900,00

The name/description part could be done in a separate step.
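The question asks for zeros where a size has no quantity, while the blank cells above come back as empty strings. A minimal sketch of that conversion follows; `to_quantities` is a hypothetical helper name, not part of pdfplumber, and the sample row is row 0 with its blanks placed according to the expected output quoted in the question.

```python
def to_quantities(cells):
    """Convert extracted cell strings to ints, treating blank or None cells as 0."""
    return [int(cell) if cell and cell.strip() else 0 for cell in cells]


# Size cells of row 0, with '' for the columns that have no quantity
row = ['', '', '2', '', '4', '', '4', '', '4', '', '14',
       '', '14', '', '4', '', '2', '', '2', '', '']

quantities = to_quantities(row)
print(quantities)
```

This only applies to the size columns; the price columns (for example €162,50) would need their own parsing.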
Thank you very much! I now understand the cause of the failure: I had not noticed that, across the several pages of the document, the size range is not always the same.
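If the size range differs from page to page, one option is to derive the column names from each page's own header text instead of hard-coding 36 to 46. The following is only a sketch under assumptions: `size_columns` and `SIZE_PATTERN` are hypothetical names, the header is assumed to be available as a single string, and half sizes are assumed to be written as a number, a space, then `½`.

```python
import re

# A whole size such as "36", optionally followed by " ½" for the half size
SIZE_PATTERN = re.compile(r'\d+(?: ½)?')


def size_columns(header_text):
    """Split a size header line into its individual column labels."""
    return SIZE_PATTERN.findall(header_text)


header = ('36 36 ½ 37 37 ½ 38 38 ½ 39 39 ½ 40 40 ½ 41 41 ½ '
          '42 42 ½ 43 43 ½ 44 44 ½ 45 45 ½ 46')

cols = size_columns(header)
print(len(cols), cols[:3], cols[-1])
```

The resulting list could stand in for the hard-coded `size_cols`, provided the header text is read separately for each page.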
@cmdlineluser, thank you for your time. I have verified that across the many .pdf files there are several errors (lines with huge values, strange results, ...). I then found this approach.
I used the same strategy to retrieve the item values and finally got this result. **What do you think? Can it be optimized?** My goal is to standardize as much as possible. P.S. Consider that I started studying pdfplumber (a wonderful library) only a few days ago.
Hi all, I have this PDF and I'm trying to extract a table from it. What is the best strategy to get the table? I am not able to get the individual values in the table. For example, in the first table, with the header "Quantity per size", I need to get ['36', '36 ½', '37', '37 ½', '38', '38 ½', '39', '39 ½', '40', '40 ½', '41', '41 ½', '42', '42 ½', '43', '43 ½', '44', '44 ½', '45', '45 ½', '46'] and, for the second line, [0, 0, 2, 0, 4, 0, 4, 0, 4, 0, 14, 0, 14, 0, 4, 0, 2, 0, 2, 0, 0].
My final result would be: KFA10-001, Khatarina 001 Black, ['36', '36 ½', '37', '37 ½', '38', '38 ½', '39', '39 ½', '40', '40 ½', '41', '41 ½', '42', '42 ½', '43', '43 ½', '44', '44 ½', '45', '45 ½', '46'] and, for the second line, [0, 0, 2, 0, 4, 0, 4, 0, 4, 0, 14, 0, 14, 0, 4, 0, 2, 0, 2, 0, 0], 50, €162,50, €8.125,00
In my opinion, the idea is to isolate the smallest area around the values via cropping and to use the x0 position of each word as a vertical line via explicit_vertical_lines, which will give back empty strings for the "blank" cells.
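The underlying idea of mapping values to columns by horizontal position can be illustrated without a PDF. The sketch below is not pdfplumber's own algorithm: `assign_to_columns` is a hypothetical helper, the word dicts only mimic the `text`/`x0`/`x1` keys that `extract_words()` returns, the coordinates are invented, and using the centre of each value (rather than its x0) is an assumption meant to tolerate centred or right-aligned numbers.

```python
from bisect import bisect_right


def assign_to_columns(header_words, value_words):
    """Place each value under the header column whose left edge precedes the value's centre."""
    headers = sorted(header_words, key=lambda w: w['x0'])
    lefts = [w['x0'] for w in headers]   # left edge of every column
    cells = [''] * len(headers)          # blank cells stay as empty strings
    for word in value_words:
        centre = (word['x0'] + word['x1']) / 2
        idx = bisect_right(lefts, centre) - 1
        if idx >= 0:                     # ignore anything left of the first column
            cells[idx] = word['text']
    return cells


# Invented coordinates for four header cells and one value under "37"
header_words = [
    {'text': '36', 'x0': 10, 'x1': 20},
    {'text': '36 ½', 'x0': 30, 'x1': 45},
    {'text': '37', 'x0': 50, 'x1': 60},
    {'text': '37 ½', 'x0': 70, 'x1': 85},
]
value_words = [{'text': '2', 'x0': 53, 'x1': 57}]

cells = assign_to_columns(header_words, value_words)
print(cells)  # ['', '', '2', '']
```

Note that real `extract_words()` output would likely split a label such as "36 ½" into two words, so the header entries here are assumed to be merged already.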
But with every attempt I always get a similar layout: ['Quantity per size', None, 'Qty', 'Price', 'Discount', 'Total row'], ['36 36 ½ 37 37 ½ 38 38 ½ 39 39 ½ 40 40 ½ 41 41 ½ 42 42 ½ 43 43 ½ 44 44 ½ 45 45 ½ 46\n2 4 4 4 14 14 4 2 2', None, '50', '€162,50', '', '€8.125,00']
Can you help me work out how to do this?