Skip to content

Latest commit

 

History

History
230 lines (165 loc) · 10.3 KB

README.md

File metadata and controls

230 lines (165 loc) · 10.3 KB

abn-amro-statement-parser

Parser for the Dutch ABN AMRO bank transactions.

This project supports the TXT*.TAB files, which are tab-separated values. This project may still be useful while parsing the other available file formats.

This project also supports the Statement-*.pdf ICS credit card statements. (ICS is the credit card company that provides ABN-AMRO-branded credit cards.)

How to use this project

Consider using a virtual environment to keep everything tidy. Then:

pip install git+https://github.com/denilsonsa/abn-amro-statement-parser.git

Alternatively, you can manually download any module from the abnamroparser/ directory and place it wherever you want.

from abnamroparser import tsvparser

# You can iterate over the transactions of a certain file:
with open("TXT_SAMPLE.TAB") as f:
    for transaction in tsvparser.read_tsv(f):
        print(
            "{} {!s:12} {:32} {}".format(
                transaction.date.isoformat(),
                transaction.amount,
                transaction.desc["type"],
                transaction.desc.get("Naam", "?"),
            )
        )

# Or you can easily convert it to a JSON file:
import json
with open("bank_transactions.json", "w") as f:
    json.dump(
        tsvparser.convert_tsv_to_json_like("TXT_SAMPLE.TAB"),
        f,
        indent=2,
        sort_keys=True,
    )
import glob
from abnamroparser import icspdfparser

# Parsing ICS credit card statements:
for pdf_filename in sorted(glob.glob("Statement-*.pdf")):
    for transaction in icspdfparser.read_ics_pdf(pdf_filename):
        print(
            "{} {!s:12} {:24} {}".format(
                transaction.date.isoformat(),
                transaction.amount,
                transaction.descriptions[0],
                transaction.descriptions[1],
            )
        )

# Or you can easily convert it to a JSON file:
import json
from itertools import chain
with open("credit_card_transactions.json", "w") as f:
    json.dump(
        [
            transaction.as_json_like
            for transaction in chain.from_iterable(
                icspdfparser.read_ics_pdf(pdf_filename)
                for pdf_filename in sorted(glob.glob("Statement-*.pdf"))
            )
        ],
        f,
        indent=2,
        sort_keys=True,
    )

I encourage you to take a look at the source-code. It's full of doctests, so it should be easy to learn.

About the available file formats from ABN AMRO

You can download your transactions from the ABN AMRO internet banking website:

  1. Login to the ABN AMRO Internet Banking website.
  2. At the top three tabs, go to Self service.
  3. Choose Download statements as the topic.
  4. Click on Download transactions. (And then click on the yellow button to open the actual page.)
    • Don't go to Download account statements, or to Download other statements, as those pages only allow downloading PDF files.
  5. Select the period you want to download.
  6. Select the format of the file you want to download.
    • For the purposes of this repository, please select TXT.
    • Personally, I think it's a good idea to download all available formats and save them for the future.
  7. Check which accounts should be included.
  8. Click on the yellow download button.

The ABN AMRO website offers the following file formats:

PDF

Sample filename: mutov123456789_01012022-31122022.pdf

Basic PDF file, well formatted to look pretty in a paper statement.

TXT

Sample filename: TXT231122235959.TAB

The date and time of the download is part of the filename, following the TXT%y%m%d%H%M%S.TAB format.

It is a tab-separated file without any headers. The columns are the same as the XLS Excel file, but in a slightly different order.

Use tsvparser to parse this kind of file.

MT940

Sample filename: MT940231122235959.STA

The date and time of the download is part of the filename, following the MT940%y%m%d%H%M%S.STA format.

It is a plain text file in MT940 format.

Use the mt-940 (PyPI, docs) library to parse it.

XLS

Sample filename: XLS231122235959.xls

The date and time of the download is part of the filename, following the XLS%y%m%d%H%M%S.xls format.

It is an Excel file with one transaction per row. It is equivalent to the TXT file, but with the columns in a slightly different order.

The first row has the headers:

  • accountNumber - always the same
  • mutationcode - the three-letter currency code
  • transactiondate - in format YYYYMMDD
  • valuedate - in format YYYYMMDD
  • startsaldo - number
  • endsaldo - number
  • amount - number
  • description - string, exactly the same as in the TXT format

CAMT. 053

Sample filename: 2012345678_012345678901.zip

It is a ZIP file full of XML files with this naming schema: 2012345678_AAAAAAAAA_DDMMYY000000.xml, where AAAAAAAAA is the account number and DDMMYY is the date. Each file contains all the transactions for that date.

I have not investigated much this format. It's probably a good machine-readable format, if you manage to figure out the meaning behind the several XML elements.

However, despite being an XML of arbitrary length, some XML elements have an limit on the size of their strings, so some fields might be truncated. (e.g. <Nm>…</Nm> seems to be limited to 24 characters, while /NAME/ from <AddtlNtryInf>…</AddtlNtryInf> contains the full name.)

Comparison between the formats

PDF TXT MT940 XLS CAMT. 053
Plain text No Yes Yes No Zipped XML
Human-readable Yes Mostly Almost Yes Hell, no!
Machine-readble No Yes Yes Mostly Yes
Easy to concatenate No Yes Yes Mostly Depends
Description case Mixed Mixed ALL CAPS Mixed Mixed
Space every 32/64 chars - Yes Newline No No
Truncated string fields No? No? No? No? Yes (some)
May benefit from this project's parse_description() - Yes Probably Yes Yes
  • Because each transaction is broken into multiple lines, MT940 is hard to manipulate by hand with a plain text editor.
  • MT940 having the description in ALL CAPS makes it uglier and less readable. That was the main reason why I decided to stop using MT940 and write my own TXT parser.
  • CAMT. 053 is very convoluted and hard to work with (unless you use specialized software). Even then, some fields may be truncated.
  • It is always hard to extract information from PDF files.
  • Having to deal with additional space characters in the TXT files is annoying.
  • XLS is a proprietary binary format requiring specialized software.

Thus, there is no single "best" format, it depends on your needs:

  • TXT is the best if you want to work with simple plain text files.
  • XLS is the best all-around format if you don't mind reading from a proprietary binary format.
  • CAMT. 053 is adequate if you want machine-readable files without proprietary formats.

About the available file formats from ICS

The Creditcard Online self-service website only provides the monthly credit card statements in PDF format.

Sample filename: Statement-01234567890-2023-11.pdf

Use icspdfparser to parse this kind of file. The text in those files is easily extractable thanks to the pypdf library.

Rant about PDF file sizes

Over time, the PDF files grew in size.

average file size 1 page 2 pages 3 pages
Before 2023-02 446 KiB 887 KiB ?
After 2023-03 1.8 MiB 3.5 MiB ?

If we extract the JPG images from those PDF files, we can easily see the difference:

$ pdfimages -j Statement-01234567890-2023-02.pdf OLD
$ pdfimages -j Statement-01234567890-2023-03.pdf NEW
$ ls -1Ush OLD* NEW*
444K OLD-000.jpg
1,8M NEW-000.jpg
$ identifi OLD* NEW*
JPEG TrueColor 8-bit srgb  3.0 (JPEG 90) 2480x3507 pixels 8.7MP (210x297mm 300dpi) TopLeft orientation 450961 bytes: OLD-000.jpg
JPEG TrueColor 8-bit srgb  3.0 (JPEG 100) 2480x3507 pixels 8.7MP (875x1237mm 72dpi) 1782810 bytes: NEW-000.jpg

Where identifi is a fancy wrapper around ImageMagick's identify.

So, for each 3KB of useful text we are receiving from half megabyte to almost two megabytes… per page! And what's that image? It's just a full-page background. It's such a huge waste of space. We would need a decade of plain text statements to come close to the size of a single month.

And why did the size increase? Because they changed the logo at the top, which means someone had to re-create the JPEG image, and the image was exported in the maximum possible quality, even though there is no perceived quality difference to the human eye.

I'm pretty sure the PDF would be noticeably smaller if they used vector graphics. They could even get rid of the subtle circle pattern in the background, as I don't think it serves any real purpose. Just for comparison, the Internet Banking can generate plain boring PDF bank statements (that are hard to parse), and a full year of those statements is around 2 MiB, and those statements have many more pages and more transactions than each credit card statement.

Project status

I created this code for my own uses. I'm sharing with the world because I believe more people are in the same situation as me, and more people certainly want to parse their own bank account statements.

That said, I don't plan on improving this code much. I'll fix it and update it as much as I need for my own statements. Simple pull requests are welcome.

I cannot promise the API will be stable.

I'm not planning on releasing this on PyPI. You are free to fork this project and create a well-maintained version yourself.