[BUG] cudaErrorInvalidDevice in ParquetChunkedReader with a small number of executor cores #11565
Comments
I am aware of #11215 and rapidsai/cudf#16186, but I am not entirely sure whether these are the same issue.
What version of the plugin jar are you running with?
@mattahrens I've been using a branch in my repo which is based on
Just tested.
It turns out that the issue is not about the GPU device setting (thanks to @abellina's help with debugging). It's about memory, especially the so-called "reserve" memory. Because the GPU memory size per executor is fixed in my setting, the total amount of GPU memory used by all executors grows as more executors are created. The amount assigned to each executor was 3.5G and my machine has 24G of total GPU memory. When the executor core count was set to 9, 7 executors were created (because my machine has 64 CPU cores), which would have left very little memory for GPU kernel execution, if it did not exhaust memory outright. This likely caused some kernel failure. So the failure is legitimate, as we cannot proceed with query processing in this case, but the error message should be improved.
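To make the arithmetic above concrete, here is a small sketch using the numbers reported in this thread (64 CPU cores, 24G of GPU memory, 3.5G per executor). It assumes the executor count is simply the CPU core count divided by spark.executor.cores, rounded down, which matches the 6 and 7 executors observed; the function names are made up for illustration.

```python
# Worked arithmetic for the numbers reported in this issue.
# Assumption: executors = floor(cpu_cores / executor_cores), which matches
# the observed counts (6 executors at 10 cores, 7 executors at 9 cores).

def executor_count(cpu_cores: int, executor_cores: int) -> int:
    """Number of executors that fit on the machine (hypothetical helper)."""
    return cpu_cores // executor_cores


def gpu_headroom_g(total_g: float, per_executor_g: float, executors: int) -> float:
    """GPU memory left after every executor takes its fixed allocation.

    A negative value means the executors together ask for more memory
    than the GPU has.
    """
    return total_g - per_executor_g * executors


CPU_CORES = 64          # from the report
GPU_TOTAL_G = 24.0      # from the report
PER_EXECUTOR_G = 3.5    # from the report

for cores in (10, 9):
    n = executor_count(CPU_CORES, cores)
    left = gpu_headroom_g(GPU_TOTAL_G, PER_EXECUTOR_G, n)
    print(f"spark.executor.cores={cores}: {n} executors, {left:+.1f}G headroom")
```

With 10 cores per executor the six executors leave 3G free on the GPU; with 9 cores the seven executors would need 24.5G, half a gigabyte more than the card has.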
@jihoonson Are you running multiple Spark processes sharing a single GPU? Was this on purpose?
@revans2 yes I was, but it was not on purpose. It happened rather accidentally. I wanted to limit the executor process count to 1, but
@abellina and @jlowe do we want to update GpuDeviceManager so that
With reserve, in general, I'd be interested to know if there is a way to make the error more user friendly as that's the part that was really confusing here. I think
I'm personally OK if a spark.rapids.memory.gpu.allocSize setting causes a crash when it goes above the max fraction (or otherwise conflicts with other settings). Essentially you'd have to increase the max alloc fraction and/or reduce the reserve amount configs to allocate beyond the normal upper limit, even with an explicit allocSize setting.
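The fail-fast behavior described here could look roughly like the sketch below. This is not the plugin's code: the function name, the parameter names, the formula for the upper limit (total memory times the max fraction, minus the reserve) and the sample numbers are all assumptions for illustration. Only the spark.rapids.memory.gpu.allocSize name comes from the comment above.

```python
GIB = 1024 ** 3


def check_alloc_size(alloc_size: int, total: int,
                     max_alloc_fraction: float, reserve: int) -> int:
    """Reject an explicit allocation size that exceeds the assumed limit.

    Assumed limit (hypothetical formula): total * max_alloc_fraction - reserve.
    """
    limit = int(total * max_alloc_fraction) - reserve
    if alloc_size > limit:
        raise ValueError(
            f"spark.rapids.memory.gpu.allocSize ({alloc_size} bytes) exceeds the "
            f"limit of {limit} bytes; raise the max alloc fraction or lower the "
            f"reserve to allocate this much"
        )
    return alloc_size


# Hypothetical numbers: a 24 GiB GPU, max fraction 1.0, 1 GiB reserve.
ok = check_alloc_size(int(3.5 * GIB), 24 * GIB, 1.0, 1 * GIB)

try:
    check_alloc_size(int(23.5 * GIB), 24 * GIB, 1.0, 1 * GIB)
    rejected = False
except ValueError as e:
    rejected = True
    print(e)
```

A check like this only looks at one process's own settings, so it would not by itself catch several executors sharing one GPU, but it would turn a conflicting configuration into a readable error at startup.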
The other problem is that there is a race condition on startup. We check the reserve based on the memory that is currently free on the GPU. We don't check it after the pool has been initialized, so if there are multiple tasks trying to grab memory at the same time, we might get into trouble with detecting reserved memory. That said, some checks that are racy are better than no checks at all. Another thing that concerns me is that I have seen the async allocator treat the pool size as a suggestion more than a hard limit. We limit the total number of bytes that can be allocated, but the async pool is not the one that is doing the limiting. If there are a lot of threads/streams, then fragmentation between the sub-pools used for each stream can make it grow beyond the limit we set. This might be the cause of a similar failure we saw in a customer's query. It might be that we actually did run out of GPU memory. Not sure though.
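The startup race can be illustrated with a deterministic toy model, shown below. It is only a sketch of the check-then-allocate pattern described above, not the plugin's logic: the function names and the 1G reserve are assumptions, while the 24G total and the seven 3.5G requests are the numbers from this issue.

```python
def racy_startup(free_g, requests_g, reserve_g):
    """Model every process checking the reserve against the same snapshot.

    Each process reads the free memory before any pool has been created,
    so each one sees the full amount and passes its own check.
    """
    snapshot = free_g
    accepted = [r for r in requests_g if snapshot - r >= reserve_g]
    remaining = free_g - sum(accepted)
    return accepted, remaining


def reserve_holds_after_init(remaining_g, reserve_g):
    """The check that is currently skipped: re-verify once pools exist."""
    return remaining_g >= reserve_g


RESERVE_G = 1.0  # hypothetical reserve size
accepted, remaining = racy_startup(24.0, [3.5] * 7, RESERVE_G)

print(len(accepted), "requests passed the pre-allocation check")
print("memory left after all pools are created:", remaining)
print("reserve still intact:", reserve_holds_after_init(remaining, RESERVE_G))
```

In this model all seven requests pass the up-front check, yet the memory left afterwards is below the reserve, which only a check made after pool initialization would notice.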
Describe the bug
A simple join query fails with the error below.
Steps/Code to reproduce bug
Here is the data and query I'm using.
Per my observation so far, this query fails when spark.executor.cores is set to less than 10. It works fine otherwise. When the executor core count was set to 10, there were 6 executors created since my machine has 64 CPU cores. When the executor core count was 9, there were 7 executors created.
Expected behavior
Ideally, the query should run successfully even with a small number of executor cores. Or, if this is an edge case in which the plugin cannot execute the query, then the query should fail with a more user-friendly error message.
Environment details (please complete the following information)