DrJit possibly overflows some OptiX or CUDA internal data structure #210
Ok, I managed to simplify the problem into a self-contained snippet. This code runs into the issue after 1091 iterations on a 4090 with CUDA 12.3, which takes about 20 minutes to reproduce.

```python
import drjit as dr
import mitsuba as mi
import numpy as np

mi.set_variant("cuda_ad_rgb")

def dr_backward(output, output_grad):
    # Manually propagate a custom output gradient backward through the AD graph.
    dr.set_grad(output, output_grad)
    dr.enqueue(dr.ADMode.Backward, output)
    dr.traverse(output, dr.ADMode.Backward)

dr.set_log_level(dr.LogLevel.Info)

vert_count = 40_000
face_count = 38_000
verts = np.random.uniform(0, 1, [vert_count, 3])
faces = np.random.randint(0, vert_count, [face_count, 3])

mesh = mi.Mesh(
    'mesh',
    vertex_count=vert_count,
    face_count=face_count,
    has_vertex_normals=True,
    has_vertex_texcoords=True,
    props=mi.Properties(),
)
mesh_params = mi.traverse(mesh)
mesh_params['vertex_positions'] = np.ravel(verts)
mesh_params['faces'] = np.ravel(faces)
mesh_params.update()

scene = mi.load_dict({
    'type': 'scene',
    'integrator': {'type': 'direct'},
    'emitter': {'type': 'constant'},
    'shape': mesh,
    'sensor': {
        'type': 'perspective',
        'to_world': mi.ScalarTransform4f.look_at(
            [0, 0, 5], [0, 0, 0], [0, 1, 0]
        ),
        'film': {
            'type': 'hdrfilm',
            'width': 512,
            'height': 512,
            'pixel_format': 'rgb',
        },
    },
})
params = mi.traverse(scene)

for i in range(10000):
    print(f'Iteration {i}')
    # Fresh vertex data every iteration; each iteration recompiles the kernels
    # (see the linked issue below).
    new_verts = mi.Float(np.ravel(np.random.uniform(0, 1, [vert_count, 3])))
    dr.enable_grad(new_verts)
    params['shape.vertex_positions'] = new_verts
    params.update()
    image = mi.render(scene, params, spp=4)
    grad_image = np.random.uniform(0, 1e-4, image.shape)
    dr_backward(image, grad_image)
    dr.grad(new_verts)
```
Somewhat related: mitsuba-renderer/mitsuba3#1033 is what causes the recompilation on each iteration here, but the program shouldn't crash regardless. Using the snippet in that issue, the problem can be reproduced slightly faster, in about 10 minutes or roughly 3300 iterations.
Reading through some Mitsuba issues, this is likely also the cause of mitsuba-renderer/mitsuba3#703.
Hello @futscdav, thank you for reporting this bug and posting a reproducer.
I've been running into

```
Program hit CUDA_ERROR_ILLEGAL_ADDRESS (error 700) due to "an illegal memory access was encountered"
```

when trying to optimize some larger cases. I don't have a simple piece of code to reproduce this, since my working code is rather large. This could be related to issue #125.

Running under `CUDA_LAUNCH_BLOCKING=1` and `compute-sanitizer`, I've managed to produce a little bit of context about what might be happening. In the provided stack trace, the top frames are as follows:

This happens after roughly ~320 forward/backward passes and is 100% reproducible with my setup; it does not occur randomly. Anecdotally, the total number of compiled OptiX kernel ops in those passes (as reported by `dr.LogLevel.Info` with cache misses) is a small smidge over 8 million, which could be interesting, or it could be completely coincidental. The reason I think it's overflowing some internal data structure is that if I manually call … after each optimization step, the error seems to disappear. If I'm right, a reproducing case could be as simple as writing something that causes a large number of compilations to add up.
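The specific call is elided above. As an illustration of the kind of per-step cleanup being described, here is a minimal sketch; the use of `dr.flush_kernel_cache()` and `dr.flush_malloc_cache()` is an assumption about the available DrJit API rather than the reporter's exact call, and `repro.py` is a hypothetical script name:

```python
import drjit as dr

# Assumed workaround sketch, not the reporter's exact call: discard compiled
# kernels and cached device allocations at the end of every optimization step
# so that per-compilation bookkeeping cannot keep accumulating.
#
# To localize the faulting launch, the text above describes running the script
# synchronously under compute-sanitizer, e.g. (script name hypothetical):
#   CUDA_LAUNCH_BLOCKING=1 compute-sanitizer python repro.py
def end_of_step_cleanup():
    dr.flush_kernel_cache()   # drop all compiled kernels
    dr.flush_malloc_cache()   # release cached GPU allocations
```

In the reproducer above, something like this would be invoked at the end of each loop iteration, at the cost of recompiling and reallocating on the next one.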
Relatedly, is there a write-up of when to expect cache misses? So far I've observed that I get a cache miss whenever the geometry in the scene changes.
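Not from this issue itself, but as background on one well-known source of cache misses in DrJit (separate from the geometry changes observed here): Python scalars are baked into the traced kernel as literal constants, so changing their value produces a different kernel, whereas wrapping them with `dr.opaque()` keeps the value in device memory and lets the compiled kernel be reused. A minimal sketch, assuming the CUDA variant is available:

```python
import drjit as dr
import mitsuba as mi

mi.set_variant("cuda_ad_rgb")
Float = mi.Float

x = dr.arange(Float, 1024)

# Each distinct Python scalar is embedded in the kernel as a literal,
# so every iteration below compiles a new kernel (cache miss).
for scale in [1.0, 2.0, 3.0]:
    dr.eval(x * scale)

# dr.opaque() stores the value as a device variable instead of a literal,
# so the same kernel is reused across iterations (cache hits after the first).
for scale in [1.0, 2.0, 3.0]:
    s = dr.opaque(Float, scale)
    dr.eval(x * s)
```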