How does Burn solve WGPU parent device being lost due to timeout? #1469
-
Hi! I am very interested in the design of Burn's compute model, and I have a very specific question: as far as I know, in WGPU, if too many tasks are running or a single kernel runs for too long (about 3-5 seconds?), the program panics due to "parent device being lost". How does Burn deal with this limit?
Replies: 2 comments
-
Tagging @nathanielsimard and @louisfd
-
We have our own queue on top of the wgpu device's queue, where we aggregate pipelines into a single `ComputePassDescriptor`, which probably helps stabilize the device. However, we do not have special error handling for the case where the parent device is lost because of a long-running kernel, though I have never actually run into that problem.

You can look at how we handle compute tasks here: https://github.com/tracel-ai/burn/blob/main/crates/burn-wgpu/src/compute/server.rs
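To make the batching idea concrete, here is a minimal sketch of a queue that collects tasks and hands them over in groups. It is not Burn's actual code: `ComputeTask`, `BatchingQueue`, `max_tasks`, and the `submitted` log are hypothetical names, and the "device" is simulated with a plain vector so the example runs without a GPU. In a real wgpu backend, `flush` would presumably record every pending dispatch inside one compute pass on a command encoder and submit it once.

```rust
/// Hypothetical stand-in for one dispatch (a pipeline plus its workgroup count).
#[derive(Debug, Clone, PartialEq)]
pub struct ComputeTask {
    pub kernel: String,
    pub workgroups: u32,
}

/// Hypothetical queue that sits on top of the device queue and batches tasks.
pub struct BatchingQueue {
    /// Tasks registered but not yet handed to the device.
    pending: Vec<ComputeTask>,
    /// Flush automatically once this many tasks are pending.
    max_tasks: usize,
    /// Simulated device: each inner Vec stands for one submitted compute pass.
    pub submitted: Vec<Vec<ComputeTask>>,
}

impl BatchingQueue {
    pub fn new(max_tasks: usize) -> Self {
        assert!(max_tasks > 0, "max_tasks must be at least 1");
        Self {
            pending: Vec::new(),
            max_tasks,
            submitted: Vec::new(),
        }
    }

    /// Queue a task; flush when the batch is full.
    pub fn register(&mut self, task: ComputeTask) {
        self.pending.push(task);
        if self.pending.len() >= self.max_tasks {
            self.flush();
        }
    }

    /// Hand all pending tasks to the simulated device as a single batch.
    /// Does nothing when there is no pending work.
    pub fn flush(&mut self) {
        if self.pending.is_empty() {
            return;
        }
        let batch = std::mem::take(&mut self.pending);
        self.submitted.push(batch);
    }
}
```

With `max_tasks` set to 2, registering three tasks and then calling `flush` produces two batches: one with the first two tasks and one with the third. Note that batching of this kind only bounds how much work goes into each submission; it does not shorten a single kernel that itself runs for several seconds.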