Extract_tables setting for the specific strategy #966

Nugnes · 2023-08-10T15:22:14Z


Hi all, I have this PDF and I'm trying to extract a table from it. What is the best strategy to get the table? I can't get the individual values in the table. For example, in the first table, with the header "Quantity per size", I need to get [36, 36 ½, 37, 37 ½, 38, 38 ½, 39, 39 ½, 40, 40 ½, 41, 41 ½, 42, 42 ½, 43, 43 ½, 44, 44 ½, 45, 45 ½, 46] for the first line and [0, 0, 2, 0, 4, 0, 4, 0, 4, 0, 14, 0, 14, 0, 4, 0, 2, 0, 2, 0, 0] for the second line.

My final result would be: KFA10-001, Khatarina 001 Black, [36, 36 ½, 37, 37 ½, 38, 38 ½, 39, 39 ½, 40, 40 ½, 41, 41 ½, 42, 42 ½, 43, 43 ½, 44, 44 ½, 45, 45 ½, 46], [0, 0, 2, 0, 4, 0, 4, 0, 4, 0, 14, 0, 14, 0, 4, 0, 2, 0, 2, 0, 0], 50, €162,50, €8.125,00

In my opinion, the idea is to isolate the smallest area around the values by cropping, then use the x0 position of each word as a vertical line via explicit_vertical_lines, which will give back empty strings for the "blank" cells.

But on every attempt I always get a similar layout: ['Quantity per size', None, 'Qty', 'Price', 'Discount', 'Total row'], ['36 36 ½ 37 37 ½ 38 38 ½ 39 39 ½ 40 40 ½ 41 41 ½ 42 42 ½ 43 43 ½ 44 44 ½ 45 45 ½ 46\n2 4 4 4 14 14 4 2 2', None, '50', '€162,50', '', '€8.125,00']

Can you help me work out how to do this?

cmdlineluser · 2023-08-10T21:19:52Z


Looks like you may need to use the actual size/column names as reference points.

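from bisect import bisect_right  # the key= argument used below needs Python 3.10+
from operator import itemgetter

import pandas as pd

# `page` is assumed to be a pdfplumber Page, e.g. pdfplumber.open(path).pages[0]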

size_cols = []
for n in range(36, 46):
    size_cols.append(f'{n}')
    size_cols.append(f'{n} ½')
size_cols.append('46')

qty_cols  = ['Qty', 'Price', 'Discount', 'Total row']

# find "header" rows
sizes = page.search(' '.join(f'({col})' for col in size_cols))
qtys  = page.search(' '.join(f'({col})' for col in qty_cols))

# build "column" lines
explicit_vertical_lines = []
for rows in [sizes, qtys]:
    bbox = rows[0]['x0'], rows[0]['top'], rows[0]['x1'], rows[0]['bottom']
    crop = page.crop(bbox)
    for col in rows[0]['groups']:
        line = crop.search(col)[0]['x0']
        explicit_vertical_lines.append(line)
       
right = max(page.chars, key=itemgetter('x1'))['x1']
explicit_vertical_lines.append(right)


words = sorted(page.extract_words(), key=itemgetter('top'))

# use first "word" in line after each header row as bottom line
rows = []
for size in sizes:
    idx = bisect_right(words, size['top'], key=itemgetter('top'))
    bbox = page.bbox[0], size['bottom'], page.bbox[2], words[idx]['bottom']
    crop = page.crop(bbox)
    row = crop.extract_table(dict(
       explicit_vertical_lines = explicit_vertical_lines, 
       horizontal_strategy = "text", 
       vertical_strategy = "explicit"
    ))[1]
    rows.append(row)

>>> pd.DataFrame(rows, columns = size_cols + qty_cols)
  36 36 ½ 37 37 ½ 38 38 ½ 39 39 ½ 40 40 ½  41 41 ½  42 42 ½ 43 43 ½ 44 44 ½ 45 45 ½ 46 Qty    Price Discount  Total row
0          2       4       4       4       14       14       4       2       2          50  €162,50           €8.125,00
1          2       2       2       6       14       14       4       2       2          48  €162,50           €7.800,00
2                                  1        2        2       1                           6  €162,50             €975,00
3                  2       2       4        6        6       2       2                  24  €162,50           €3.900,00

The name/description part could be done in a separate step.
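
For the per-size list asked for in the question, each extracted row still needs its blank cells turned into zeros. Here is a minimal sketch of that step, assuming the row has the same shape as the rows above (21 size cells followed by the 4 qty cells, with '' or None for blanks). The `quantities` helper is a made-up name, and the sample row is the first row of the output typed out by hand rather than read from the PDF.

```python
size_cols = []
for n in range(36, 46):
    size_cols.append(f'{n}')
    size_cols.append(f'{n} ½')
size_cols.append('46')

# First row of the output above, typed by hand (not extracted here).
filled = {'37': '2', '38': '4', '39': '4', '40': '4', '41': '14',
          '42': '14', '43': '4', '44': '2', '45': '2'}
row = [filled.get(col, '') for col in size_cols] + ['50', '€162,50', '', '€8.125,00']


def quantities(row, n_sizes):
    """Hypothetical helper: blank ('' or None) size cells become 0."""
    return [int(cell) if cell else 0 for cell in row[:n_sizes]]


qtys = quantities(row, len(size_cols))
print(qtys)
# The per-size quantities should add up to the Qty cell.
print(sum(qtys) == int(row[len(size_cols)]))
```

The same list comprehension can be applied to every entry of `rows` before building the DataFrame.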

4 replies

Nugnes Aug 14, 2023
Author

Great! Suppose the size_cols are not always in the same layout, even within the same PDF, like this:

In your opinion, what is the correct approach?

cmdlineluser Aug 14, 2023

Does the 36 - 46 remain constant?

If so, you could modify the code to make the ½ entries optional.

Nugnes Aug 14, 2023
Author

It seems so; 36-46 looks constant.

I'm trying to modify your code for this purpose, like this:

        size_cols_int = []
        for n in range(36, 46):
            size_cols_int.append(f'{n}')
        size_cols_int.append('46')

        sizes_1 = page.search(' '.join(f'({col})' for col in size_cols))
        sizes_2 = page.search(' '.join(f'({col})' for col in size_cols_int))
        if sizes_1 is None:
            sizes=sizes_2
        else:
            sizes=sizes_1

but I get this error:

    bbox = rows[0]['x0'], rows[0]['top'], rows[0]['x1'], rows[0]['bottom']
           ~~~~^^^
IndexError: list index out of range

cmdlineluser Aug 14, 2023

Well, there are at least 2 problems.

The first is the pattern needs to be changed:

>>> ' '.join(f'({col})' for col in size_cols)
'(36) (36 ½) (37) (37 ½) (38) (38 ½) (39) (39 ½) (40) (40 ½) (41) (41 ½) (42) (42 ½) (43) (43 ½) (44) (44 ½) (45) (45 ½) (46)'

You would need to make the ½ columns optional:

'(36) (?:(36 ½) )?(37) (?:(37 ½) )?(38) (?:(38 ½) )?(39) (?:(39 ½) )?(40) (?:(40 ½) )?(41) (?:(41 ½) )?(42) (?:(42 ½) )?(43) (?:(43 ½) )?(44) (?:(44 ½) )?(45) (?:(45 ½) )?(46)'

It's probably simplest just to hardcode the pattern in this case.
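
If hardcoding feels brittle, the same string can also be generated. This is only a sketch: `build_size_pattern` is a made-up helper name, and the check below uses plain `re` on hand-written header strings rather than `page.search` on a real page.

```python
import re


def build_size_pattern(first=36, last=46):
    """Whole sizes are required; each '<n> ½' between them is optional."""
    parts = []
    for n in range(first, last):
        parts.append(f'({n}) ')
        parts.append(f'(?:({n} ½) )?')
    parts.append(f'({last})')
    return ''.join(parts)


pattern = build_size_pattern()

# Two hand-written header lines: one with the ½ sizes, one without.
with_halves = ' '.join(f'{n} {n} ½' for n in range(36, 46)) + ' 46'
whole_only = ' '.join(str(n) for n in range(36, 47))

m_halves = re.search(pattern, with_halves)
m_whole = re.search(pattern, whole_only)

# Groups for the skipped ½ columns come back as None, so filter them out
# before using the groups as column names.
present = [g for g in m_whole.groups() if g is not None]
print(present)
```

Note that this pattern still expects every whole size from 36 to 46 to be present; a header that starts at 37 would not match.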

The second problem is the code assumed all columns were in the same place, so it only uses the first match positions to create vertical lines.

for rows in [sizes, qtys]:
    rows[0]

However, now you need to create the vertical lines for each size/qty row separately as they may differ.
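
A rough sketch of what "separately" could look like, working on plain dicts shaped like the output of `extract_words()` ('text', 'x0', 'x1' keys). The `header_vlines` name and all coordinates are invented for illustration; on a real page the words would come from a crop of each header row.

```python
def header_vlines(header_words):
    """Hypothetical helper: vertical line positions for ONE header row."""
    words = sorted(header_words, key=lambda w: w['x0'])
    # A lone '½' belongs to the size before it, so it does not start a column.
    lines = [w['x0'] for w in words if w['text'] != '½']
    # Close the last column at the right edge of the row.
    lines.append(max(w['x1'] for w in words))
    return lines


# Invented coordinates for two header rows with different layouts.
row_a = [
    {'text': '36', 'x0': 20.0, 'x1': 30.0},
    {'text': '36', 'x0': 35.0, 'x1': 45.0},
    {'text': '½', 'x0': 46.0, 'x1': 50.0},
    {'text': '37', 'x0': 55.0, 'x1': 65.0},
]
row_b = [
    {'text': '37', 'x0': 20.0, 'x1': 30.0},
    {'text': '38', 'x0': 40.0, 'x1': 50.0},
]

# One list of lines per header row instead of reusing rows[0] everywhere.
lines_per_row = [header_vlines(row) for row in (row_a, row_b)]
print(lines_per_row)
```

Each list would then be passed as `explicit_vertical_lines` when extracting the table for its own row, instead of one shared list built from the first match.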

Nugnes · 2023-08-14T14:13:20Z

Author

Thank you very much! I now understand the cause of the failure: I had not noticed that, across the pages of the document, the range is not always the same.

4 replies

cmdlineluser Aug 15, 2023

Can you post a PDF version of that page?

Nugnes Aug 15, 2023
Author

Of course: PDF version.
Your strategy is perfect for me. I'd like to standardize the pattern, because I have many PDFs with a structure like "n sizes / quantity per size". I would like to define patterns, hardcoded if necessary, and match each quantity to its size. Obviously the columns must contain all the sizes, with the respective quantities for each item.

Thank you for your amazing support.

cmdlineluser Aug 15, 2023

Well, this is what came to mind.

Use the horizontal lines as section markers (I've used "E-Mail Customer" as a starting point - i.e. only lines below this string)

We then check for the column names, crop sizes/qty into 2 parts and parse both as tables.

import itertools
import pdfplumber
from operator import itemgetter 

pdf = pdfplumber.open('Downloads/shoe-sizes.pdf')

columns = 'Qty', 'Price', 'Discount', 'Total row'

for page in pdf.pages:
    email = page.search('E-Mail Customer')[0]['bottom']
    
    hlines = sorted(page.horizontal_edges, key=itemgetter('top'))
    idx = next(idx for idx, line in enumerate(hlines) if line['top'] > email)

    items = []
    item  = []
    
    for top, bottom in itertools.pairwise(hlines[idx:]):
    
        height = bottom['top'] - top['bottom']
        if height == 0: 
           continue
           
        bbox = [ page.bbox[0], top['top'], page.bbox[2], bottom['bottom'] ]
        try:
            crop = page.crop(bbox)
        except ValueError:
            # seems to be some odd line on the page with huge (wrong?) values;
            # skip it so a stale `crop` from the previous iteration is not reused
            continue
            
        text = crop.extract_text()
        
        if 'Quantity per size' in text:
            # split at 'Qty ...'
            split = crop.search(' '.join(columns))[0]
            
            bbox[0] = split['x0']
            
            qty = page.crop(bbox)
            
            explicit_vertical_lines = [ qty.search(col)[0]['x0']  for col in columns ]
            right = max(qty.chars, key=itemgetter('x1'))['x1']
            
            explicit_vertical_lines.append(right)
            
            qty = qty.extract_table(dict(
                explicit_vertical_lines = explicit_vertical_lines,
                horizontal_strategy = "text"
            ))
            
            # extract sizes table
            bbox[:3] = page.bbox[0], split['bottom'], split['x0']
            
            sizes = page.within_bbox(bbox)
            
            explicit_vertical_lines = [ word['x1'] for word in sizes.search(r'\d{2}(?: ½)?') ]
            
            sizes = sizes.extract_table(dict(
               explicit_vertical_lines = explicit_vertical_lines,
               horizontal_strategy = 'text',
            ))
            
            print(item)
            print(dict(zip(qty[0], qty[2])))
            print(dict(zip(sizes[0], sizes[2])))
            
            item = []
            
        elif text: 
            item.append(text)

['NBUW111FA09', 'Minaar\nBLK0001 Black']
{'Qty': '34', 'Price': '€145,83', 'Discount': '', 'Total row': '€4.958,22'}
{'36': '', '37': '', '38': '2', '39': '4', '40': '2', '41': '10', '42': '10', '43': '4', '44': '2', '45': '', '46': ''}
['NBUW137LE05', 'Rozes\nMTY0001 Aloe']
{'Qty': '6', 'Price': '€162,50', 'Discount': '', 'Total row': '€975,00'}
{'36 ½': '', '37': '', '37 ½': '', '38': '', '38 ½': '', '39': '', '39 ½': '', '40': '1', '40 ½': '', '41': '1', '41 ½': '', '42': '1', '42 ½': '', '43': '1', '43 ½': '', '44': '1', '44 ½': '', '45': '1', '45 ½': '', '46': ''}
['NBUW139LE07', 'Fedaia\nMTY0001 Aloe']
{'Qty': '12', 'Price': '€104,17', 'Discount': '', 'Total row': '€1.250,04'}
{'37': '', '38': '', '39': '', '40': '2', '41': '4', '42': '4', '43': '2', '44': '', '45': '', '46': ''}
['NBUW140LE08', 'Fedaia\nBLK0001 Black']
{'Qty': '40', 'Price': '€104,17', 'Discount': '', 'Total row': '€4.166,80'}
{'37': '', '38': '2', '39': '4', '40': '5', '41': '9', '42': '10', '43': '6', '44': '3', '45': '1', '46': ''}
['NBUW149LE17', 'Fedaia\nMTY0001 Gray']
{'Qty': '8', 'Price': '€104,17', 'Discount': '', 'Total row': '€833,36'}
{'37': '', '38': '', '39': '', '40': '1', '41': '1', '42': '2', '43': '2', '44': '1', '45': '1', '46': ''}
['NFA10-001', 'Neal\n001 Black']
{'Qty': '30', 'Price': '€120,83', 'Discount': '', 'Total row': '€3.624,90'}
{'36': '', '37': '', '38': '', '39': '', '40': '3', '41': '3', '42': '9', '43': '9', '44': '3', '45': '3', '46': ''}
['SAFA10-001', 'Rozes\n001 Black']
{'Qty': '6', 'Price': '€162,50', 'Discount': '', 'Total row': '€975,00'}
{'36 ½': '', '37': '', '37 ½': '', '38': '', '38 ½': '', '39': '', '39 ½': '', '40': '1', '40 ½': '', '41': '1', '41 ½': '', '42': '1', '42 ½': '', '43': '1', '43 ½': '', '44': '1', '44 ½': '', '45': '1', '45 ½': '', '46': ''}

Nugnes Aug 16, 2023
Author

Great, this approach is very nice, but I prefer the first one you shared because it would be more efficient for me. It seems more applicable to the many similar PDFs that I have to transform, so I'm going to study how you did it and work out how I can standardize on the code you suggested...

Nugnes · 2023-08-17T18:19:41Z

Author

@cmdlineluser, thank you for your time. I have verified that many of the .pdf files contain several errors (lines with huge values, strange results...). I then settled on this approach:

Via crop, I isolate the rectangle with the size and quantity information, locating it by detecting the horizontal lines.

Via crop, I split it again into the columns of interest.
Use the start of each word as a vline.

I used the same strategy to retrieve the item values and finally got this result.

What do you think? Can it be optimized?

My goal is to standardize as much as possible.

P.S. Consider that I started studying pdfplumber (wonderful library) only a few days ago.



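import itertools

import pandas as pd
import pdfplumber

# `doc`, `save_path` and `auto_adjust_xlsx_column_width` are defined elsewhere (not shown in this post)

coll=['35', '35 ½', '36', '36 ½', '37', '37 ½', '38', '38 ½', '39', '39 ½', '40', '40 ½', '41', '41 ½', '42', '42 ½', '43', '43 ½', '44', '44 ½', '45', '45 ½', '46']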
coll_prouct_name=['CPF-1', 'Nome', 'CPF-2', 'Colore','Qty','Price']
df = pd.DataFrame(columns=coll+coll_prouct_name) # empty dataframe

with pdfplumber.open(doc) as pdf:
    
    for page in pdf.pages:
        #page = pdf.pages[0]

        for rect in page.rects :
            
            bbox = pdfplumber.utils.obj_to_bbox(rect)
            #print('real bbox{}'.format(bbox ))
            #bbox=(15.9, 208.39999999999998, 577.9, 336.29999999999995)
            product_area=page.crop(bbox)
            #print('product_area------>{}'.format(product_area.vertical_edges))
          
            width =   product_area.bbox[3]-product_area.bbox[1] 
            if width < 50: 
                break

            
            # First thick vertical  line 
            product_line = int(next(
                line['x0'] for line in page.vertical_edges 
                if  line['orientation'] == 'v'
                and line['linewidth'] == 1 
                and line['height'] > 80
            ))
            #print('product_line------>{}'.format(product_line))
            bottom = product_area.search('Quantity per size')[0]['top'] -20
            #print('Quantity_size_top------>{}'.format(bottom))
        
            # print(product_line, 
            #       product_area.bbox[1],
            #       product_area.bbox[2],
            #       bottom , sep='\n')
            
            bbox=(product_line, 
                product_area.bbox[1],
                product_area.bbox[2],
                bottom )
            
            product=product_area.crop(bbox)
            text = product.extract_words()
            prouct_name  = [word['text'] for word in product.extract_words()]
            
            #print('product------>{}'.format(product.vertical_edges))
            
            # im = product.to_image(resolution = 400)
            # im.draw_rects(product.edges, stroke_width=5)
            # im.show()

            hlines = [
                line['top'] for line in product_area.edges 
                    if  line['orientation'] == 'h'
                    and line ['object_type']=='line'
                    and line['stroking_color'] == (0, 0, 0) 
                    and line['width'] > product_area.width / 1.25
                ]
            # Make sure our lines are sorted from top -> bottom
            hlines = sorted(set(hlines))
            #print('hlines------>{}'.format(hlines))
        
            for top, bottom in itertools.pairwise(hlines):
                    
                row = product_area.crop(
                    (product_area.bbox[0], top, product_area.bbox[-2], bottom)
                )
                text = row.extract_text()
                if 'Quantity per size' not in text:

                    lines = row.extract_text_lines()
                    # Find the vertical column dividers
                    vlines = {}
                    for line in row.vertical_edges:
                        if line['object_type'] == 'line':
                            vlines[line['x0']] = line
                    vlines = sorted(line['x0'] for line in vlines.values()) + [row.width]      
                    # h = open('lines' + '.json', "w")
                    # json.dump(lines, h, indent=2, sort_keys=False)
                    # h.close()

                    # extract columns 
                    col1 = row.crop((product_area.bbox[0], top, vlines[0], bottom))
                    col2 = row.crop((vlines[0], top, vlines[1], bottom))
                    col3 = row.crop((vlines[1], top, vlines[2], bottom))
                    
                    # im = col3.to_image(resolution = 400)
                    # im.draw_rects(col3.extract_words())
                    # im.show()

                    lines = col1.extract_text_lines()
                    # lines 1-2 are the values 
                    #bbox->(22.449999999999967, 314.33259999999996, 383.9776, 332.3658)
                    bbox = lines[0]['x0'], lines[0]['top'], lines[-2]['x1'], lines[-1]['bottom']
                    #print('bbox------>{}'.format(bbox))
                    values = col1.crop(bbox)
                    
                    # im = values.to_image(resolution = 400)
                    # im.draw_rects(values.extract_words())
                    # im.show()

                    # use the right edge (x1) of each size label as vline
                    explicit_vertical_lines = [ word['x1'] for word in values.search(r'\d{2}(?: ½)?') ]
                    #print('vlines------>{}'.format(vlines))
                    
                    table = values.extract_tables(dict(
                    explicit_vertical_lines = explicit_vertical_lines,
                    horizontal_strategy = 'text',
                    ))
            
               
                    
                    print('prouct_name------>{}'.format(prouct_name))
                    print('table------>{}'.format(table[0]))
                    print('Sizes------>{}'.format(table[0][0]))
                    print('Qtys------>{}'.format(table[0][-1]))
                    print(f'{col2.extract_text()=}')
                    print(f'{col3.extract_text()=}')
                    
                    rows=(dict(zip(table[0][0], table[0][-1])))

                    data=pd.DataFrame(rows, columns =  coll, index=[0]) 
                     
                    data['CPF-1']= prouct_name[0] 
                    data['Nome']= prouct_name[1] 
                    data['CPF-2']= prouct_name[2] 
                    data['Colore']= prouct_name[3] 
                    data['Qty']= col2.extract_text()
                    data['Price']= col3.extract_text()
                    
                    data[coll] = data[coll].apply(pd.Series).fillna('').replace('', '0').astype(int)
                    data['Qty'] = data['Qty'].apply(pd.Series).fillna('').replace('', '0').astype(int)
                    #replace € symbol
                    data['Price'] =data["Price"].str.replace("€","")
                    #convert string to float and set decimal point
                    data['Price'] =data["Price"].apply(lambda x: float(x.split()[0].replace(',', '.')))
                    #data['Price'] = data['Price'].apply(pd.Series).fillna('').replace('', '').astype('float64')
                    df = df._append(data, ignore_index = True)

        
# Export dataset to XLSX
with pd.ExcelWriter(save_path+"SLAM.xlsx") as writer:
    df.to_excel(writer, sheet_name="SLAM")
    auto_adjust_xlsx_column_width(df, writer, sheet_name="SLAM", margin=0)

1 reply

cmdlineluser Aug 18, 2023

I'm not sure if I can provide a useful response.

As for the code, nothing stands out apart from the pandas parts:

Creating an empty dataframe and trying to append to it is kind of an "anti-pattern" - it's usually recommended to gather the data first and create the dataframe in a single step.
The .apply stuff can probably be done using .str methods.
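
To make those two points concrete, here is a small sketch, not a drop-in replacement for the code above. The two records reuse values printed earlier in this thread (the Fedaia rows), the column list is shortened to the whole sizes, and `pd.to_numeric` is just one possible way to handle the blank cells.

```python
import pandas as pd

size_cols = [str(n) for n in range(36, 47)]
money_cols = ['Price', 'Total row']

# 1) Gather one plain dict per item inside the loops...
records = [
    {'40': '2', '41': '4', '42': '4', '43': '2',
     'Qty': '12', 'Price': '€104,17', 'Total row': '€1.250,04'},
    {'40': '1', '41': '1', '42': '2', '43': '2', '44': '1', '45': '1',
     'Qty': '8', 'Price': '€104,17', 'Total row': '€833,36'},
]

# ...and build the DataFrame once, after the loops have finished.
df = pd.DataFrame(records, columns=size_cols + ['Qty'] + money_cols)

# Missing or blank quantity cells become 0.
count_cols = size_cols + ['Qty']
df[count_cols] = (
    df[count_cols].apply(pd.to_numeric, errors='coerce').fillna(0).astype(int)
)

# 2) '€1.250,04' -> 1250.04 with .str methods instead of .apply(lambda ...).
for col in money_cols:
    df[col] = (
        df[col]
        .str.replace('€', '', regex=False)
        .str.replace('.', '', regex=False)   # drop thousands separator
        .str.replace(',', '.', regex=False)  # decimal comma -> point
        .astype(float)
    )

print(df)
```

Building the frame in one step also avoids the private `df._append` call.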

For myself, looking at the same PDF on 2 different days can result in 2 completely different approaches.

I don't know all the different formats/layouts you're working with, so if the current approach works - there's not really anything else useful I can add to the conversation.
