Out of memory error when running threads #142

lime-n · 2023-02-11T20:03:10Z


I have a large table (~1 million rows) and I am trying out threading. The following script runs perfectly well on a table with 40k rows, but on the large table it runs out of memory in what looks like a memory leak. What causes the memory issue, and how can I best overcome it? I find that it is related to the values of max_rows and row_iter. How should I choose them so that each thread takes one chunk of the data and the chunks combined cover the whole table?

import logging
import threading
import oracledb

NUM_THREADS = 4
SQL =  'SELECT /*+ ENABLE_PARALLEL_DML PARALLEL(AUTO) */ * FROM DF offset :rowoffset rows fetch next :maxrows rows only'
MAX_ROWS = 1190108
def start_workload(fn):
    def wrapped(self, threads, *args, **kwargs):
        assert isinstance(threads, int)
        assert threads > 0

        ts = []
        for i in range(threads):
            new_args = (self, i, *args)
            t = threading.Thread(target=fn, args=new_args, kwargs=kwargs)
            t.start()
            ts.append(t)
        for t in ts:
            t.join()
    return wrapped


class TEST:
    def __init__(self, *args):
        self._pool = oracledb.create_pool(user = args[0], password = args[1], port=1521,host="localhost", service_name="service", min=NUM_THREADS, max=NUM_THREADS)
        self._batchsize = batchsize
    @start_workload
    def do_query(self, tn, count):
        with self._pool.acquire() as connection:
            with connection.cursor() as cursor:
                max_rows = count
                row_iter = max_rows/self._batchsize
                print(max_rows, row_iter)
                cursor.arraysize = row_iter
                cursor.prefetchrows = 100
                cursor.execute(SQL, dict(rowoffset=(tn*row_iter), maxrows=max_rows))
                list_of_fetches = []
                while True:
                    rows = cursor.fetchmany()
                    list_of_fetches.append(rows)
                    if not rows:
                        break
                print(len(list_of_fetches[0]))

if __name__ == '__main__':
    result = TEST(username, password)
    import time
    start=time.time()
    result.do_query(NUM_THREADS,MAX_ROWS)
    end=time.time()
    print('Total Time: %s' % (end-start))

cjbj · 2023-02-20T02:28:34Z

Maintainer

Check your calculations and make sure that each thread is only selecting a unique range of data. When I tried your example, I saw threads were selecting sets of the same rows. The first thread selected everything. The 2nd thread selected all rows except the first batch. The 3rd thread selected all rows except the first two batches. I did have to set batchsize, which was missing, and I used a smaller table, so maybe I wasn't testing what you are.
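To illustrate the kind of calculation meant here, the following is a minimal sketch of splitting a row count into non-overlapping (offset, count) ranges, one per thread. The helper name partition_rows is hypothetical and not part of python-oracledb. Each thread would bind its own pair to :rowoffset and :maxrows in the SQL from the question. An ORDER BY on a unique key is probably also needed so that every thread sees the same row order, but that is an assumption about your table.

```python
def partition_rows(total_rows, num_threads):
    """Return a list of (offset, count) pairs, one per thread.

    The ranges are contiguous, do not overlap, and together cover
    rows 0 .. total_rows - 1. Any remainder is spread over the
    first few threads, one extra row each.
    """
    base, extra = divmod(total_rows, num_threads)
    ranges = []
    offset = 0
    for tn in range(num_threads):
        count = base + (1 if tn < extra else 0)
        ranges.append((offset, count))
        offset += count
    return ranges


# Thread tn would use ranges[tn] as (rowoffset, maxrows).
ranges = partition_rows(1190108, 4)
for tn, (rowoffset, maxrows) in enumerate(ranges):
    print(tn, rowoffset, maxrows)
```

With this split, thread tn fetches only maxrows rows starting at rowoffset, instead of every thread asking for the full row count.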

If memory is an issue, you may want to keep prefetchrows at its default value and tune only arraysize. Since you seem to be holding the whole dataset in memory, you could use fetchall(), which may remove some of the list-appending overhead.
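If holding everything in memory turns out to be the problem, the alternative is to process each batch as it arrives and then let it go. The sketch below shows that pattern with the standard DB-API fetchmany() call. It uses the sqlite3 module and a made-up table purely as a stand-in so that it runs anywhere. The helper iter_batches is hypothetical, and with python-oracledb you would set cursor.arraysize before calling execute().

```python
import sqlite3


def iter_batches(cursor, batch_size):
    """Yield lists of rows from an executed cursor, batch_size rows at a time.

    Only one batch is referenced at a time, so memory use is bounded
    by the batch size rather than by the size of the result set.
    """
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        yield rows


# Stand-in database: an in-memory SQLite table with 1000 rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df (id INTEGER PRIMARY KEY, val TEXT)")
conn.executemany(
    "INSERT INTO df (id, val) VALUES (?, ?)",
    [(i, f"row{i}") for i in range(1000)],
)

cur = conn.cursor()
cur.execute("SELECT id, val FROM df ORDER BY id")

total = 0
batch_sizes = []
for batch in iter_batches(cur, 256):
    # Process the batch here (write to a file, aggregate, ...) instead
    # of appending it to a list that grows with the table.
    batch_sizes.append(len(batch))
    total += len(batch)

print(total, batch_sizes)
```

Note that the loop stops before doing anything with the empty batch, whereas the loop in the question appends the final empty list before breaking.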

You may be interested in https://medium.com/oracledevs/selecting-from-an-oracle-database-table-in-parallel-using-python-31ecaa2c28c8. There are a lot of variables that affect whether a parallel extraction is faster, including all the memory management you are doing in Python. What performance benefit are you seeing?

If you still have problems, can you share why you think there is a leak and how you are measuring it?
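As one possible way of measuring, Python's standard tracemalloc module can report peak allocation for a piece of code. The sketch below compares holding all rows against handling them batch by batch, using synthetic tuples rather than a database. The function names are hypothetical. tracemalloc only sees memory allocated through Python's allocators, so buffers held by a driver's native code may not show up, and the process's resident size from OS tools may tell a different story.

```python
import tracemalloc


def peak_bytes(fn, *args):
    """Run fn(*args) and return the peak traced allocation in bytes."""
    tracemalloc.start()
    try:
        fn(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def hold_all(n):
    # Keep every row alive at once, like appending all fetches to a list.
    rows = [(i, f"row{i}") for i in range(n)]
    return len(rows)


def stream(n, batch=1000):
    # Keep only one batch alive at a time.
    count = 0
    for start in range(0, n, batch):
        rows = [(i, f"row{i}") for i in range(start, min(start + batch, n))]
        count += len(rows)
    return count


peak_all = peak_bytes(hold_all, 100_000)
peak_stream = peak_bytes(stream, 100_000)
print(f"hold all: {peak_all} bytes, stream: {peak_stream} bytes")
```

If memory grows in step with the number of rows held, that points to the dataset simply being large rather than to a leak.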
