How does Burn solve WGPU parent device being lost due to timeout? #1469

cryscan · 2024-03-13T06:36:34Z

Hi! I am very interested in the design of Burn's compute model, and I have a very specific question:

As far as I know, in WGPU, if too many tasks are running at once or a single kernel runs for too long (about 3-5 seconds?), the program panics due to "parent device being lost". How does Burn deal with this limit?

Answered by nathanielsimard

antimora (Collaborator) · 2024-03-13T16:55:19Z

Tagging @nathanielsimard and @louisfd

nathanielsimard (Maintainer) · 2024-03-13T19:57:58Z

We have our own queue on top of the wgpu device's queue, where we aggregate pipelines into a single compute pass (one ComputePassDescriptor), which probably helps keep the device stable. However, we have no special error handling for the case where the parent device is lost because of a long-running kernel, though I have never actually run into that problem.

You can look at how we handle compute tasks here: https://github.com/tracel-ai/burn/blob/main/crates/burn-wgpu/src/compute/server.rs
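
To illustrate the batching idea described above, here is a minimal sketch in plain Rust with no wgpu dependency. It is not Burn's actual code: `Task`, `SubmittedPass`, `BatchingQueue`, and the `max_tasks` threshold are all hypothetical names chosen for this example, and the "submission" is only simulated by moving tasks into a list.

```rust
/// Hypothetical stand-in for one compute dispatch (pipeline plus workgroup count).
#[derive(Debug, Clone, PartialEq)]
struct Task {
    pipeline: String,
    workgroups: u32,
}

/// Hypothetical stand-in for one compute pass handed to the device queue.
#[derive(Debug, PartialEq)]
struct SubmittedPass {
    tasks: Vec<Task>,
}

/// Buffers tasks and submits them together once `max_tasks` have accumulated.
struct BatchingQueue {
    pending: Vec<Task>,
    max_tasks: usize,
    submitted: Vec<SubmittedPass>,
}

impl BatchingQueue {
    fn new(max_tasks: usize) -> Self {
        assert!(max_tasks > 0, "max_tasks must be at least 1");
        Self {
            pending: Vec::new(),
            max_tasks,
            submitted: Vec::new(),
        }
    }

    /// Registers a task and flushes automatically when the batch is full.
    fn register(&mut self, task: Task) {
        self.pending.push(task);
        if self.pending.len() >= self.max_tasks {
            self.flush();
        }
    }

    /// Submits all pending tasks as a single pass. A wgpu-based version would
    /// presumably record every dispatch into one compute pass here and then
    /// submit the command buffer; this sketch only moves the tasks.
    fn flush(&mut self) {
        if self.pending.is_empty() {
            return;
        }
        let tasks = std::mem::take(&mut self.pending);
        self.submitted.push(SubmittedPass { tasks });
    }
}

fn main() {
    let mut queue = BatchingQueue::new(3);
    for i in 0..7 {
        queue.register(Task {
            pipeline: format!("kernel_{}", i),
            workgroups: 64,
        });
    }
    queue.flush(); // submit whatever is left over

    for (n, pass) in queue.submitted.iter().enumerate() {
        println!("pass {}: {} task(s)", n, pass.tasks.len());
        for task in &pass.tasks {
            println!("  {} ({} workgroups)", task.pipeline, task.workgroups);
        }
    }
}
```

In a design like this, the threshold trades submission overhead against how much work reaches the device in one go: a smaller `max_tasks` means more frequent, smaller submissions. Whether that actually avoids a lost device for a single long-running kernel is not something this sketch can show.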
