Inconsistent table detection due to colours / layout #1119

jameskohjunwei · 2024-04-11T02:56:29Z


Thank you for creating this library. I've been a user for a while, and I'm here to ask a specific question about a PDF I came across.

I've got a PDF where rows are visually segmented by color (a faint purple every alternate row). I assume pdfplumber detects this color to identify rows for extraction. The issue arises when some pages start without this purple row segmentation - in those cases, detection is missed, and the rows don't get extracted. Is there a solution to this without explicitly hardcoding horizontal lines?

PDF here jpy_statement1-redacted_removed.pdf

Any assistance would be greatly appreciated. Thank you!

Image 1: example of a page whose first rows have no purple background; these rows are missed by the detection.

Image 2: example of a page that starts WITH a purple row; its rows are picked up successfully.

My code below:

import pandas as pd
import requests
import pdfplumber
import re
import numpy as np
# import gspread
import csv
import time
from datetime import datetime
import warnings
warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt

transactions = []

pdf = pdfplumber.open("your path to my pdf here")

# Inspect the first page to debug which lines the table finder detects
page = pdf.pages[0]

# Locate the header ("Balance") and footer ("Remarks") text that bound the table
top_sections = [page.search(r'\bBalance\b')]
bottom_sections = [page.search(r'\bRemarks\b')]

# Crop the page using the top and bottom coordinates
cropped_page = page.crop((28.35, 236.45799999999997, page.width, 724.518))
im = cropped_page.to_image()

im.reset().debug_tablefinder({
    "vertical_strategy": "explicit",
    "horizontal_strategy": "lines",
    "explicit_vertical_lines": [30, 140, 330, 410, 485, 560],
})

df_transactions = pd.DataFrame(transactions, columns=['date', 'description', 'money out', 'money in', 'balance'])
df_transactions

df_transactions.to_csv('youtrip_jpy_txn_1.csv', index=False)
print("CSV file has been exported successfully.")

jsvine · 2024-04-14T22:18:31Z

Maintainer

Happy to hear that pdfplumber has been generally useful for you. The issue you're running into seems to be that the missed rows have no explicit graphical indication; only the rows backed by purple rectangles are delineated. I'd suggest trying to use the text on the page (perhaps via page.extract_words(...) or page.search(...)) to identify the vertical positions at which the table and its rows begin/end, and then passing those positions to explicit_horizontal_lines (the row counterpart of the explicit_vertical_lines you are already using for the columns).
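A minimal sketch of that idea, under some assumptions that are not confirmed by the thread: every transaction row is assumed to start with a word (e.g. the date) inside the first column, and the helper names `row_boundaries` and `extract_rows`, the `pad` and `tol` values, and the use of the first two explicit vertical lines as the first-column bounds are all hypothetical. Because the missing boundaries are row boundaries, the sketch passes them as `explicit_horizontal_lines`, with `horizontal_strategy` set to `"explicit"`.

```python
def row_boundaries(words, col_x0, col_x1, pad=2.0, tol=3.0):
    """Return sorted y-positions for horizontal lines, one above each row.

    `words` is a list of dicts shaped like the output of pdfplumber's
    page.extract_words() (keys used here: "x0", "top", "bottom").
    Assumption: a new row starts wherever a word begins inside the
    first column, i.e. col_x0 <= x0 < col_x1.
    """
    anchors = sorted(w["top"] for w in words if col_x0 <= w["x0"] < col_x1)

    # Merge anchors that sit on the same text line (tops within `tol` points).
    tops = []
    for top in anchors:
        if not tops or top - tops[-1] > tol:
            tops.append(top)
    if not tops:
        return []

    # One line slightly above each row start ...
    lines = [top - pad for top in tops]
    # ... plus a closing line just below the lowest word.
    lines.append(max(w["bottom"] for w in words) + pad)
    return lines


def extract_rows(cropped_page, vertical_lines):
    """Extract a table from a (cropped) pdfplumber page using text-derived rows."""
    words = cropped_page.extract_words()
    horizontal_lines = row_boundaries(words, vertical_lines[0], vertical_lines[1])
    if len(horizontal_lines) < 2:
        return []  # the "explicit" strategy needs at least two lines
    return cropped_page.extract_table({
        "vertical_strategy": "explicit",
        "horizontal_strategy": "explicit",
        "explicit_vertical_lines": vertical_lines,
        "explicit_horizontal_lines": horizontal_lines,
    })
```

With the cropped page from the question, a call might look like `extract_rows(cropped_page, [30, 140, 330, 410, 485, 560])`. Anchoring on the first column rather than on every text line keeps a description that wraps onto a second line inside its row; if some rows have an empty first column, a different anchor column would be needed.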


jameskohjunwei Apr 15, 2024
Author

Thank you for responding, @jsvine. I'll give that a try.
