DrJit possibly overflows some OptiX or CUDA internal data structure #210

Open
futscdav opened this issue Dec 27, 2023 · 4 comments

futscdav commented Dec 27, 2023

I've been running into Program hit CUDA_ERROR_ILLEGAL_ADDRESS (error 700) due to "an illegal memory access was encountered" when trying to optimize some larger cases. I don't have a simple piece of code to reproduce this yet, since my working code is rather large. This could be related to issue #125.

Running under CUDA_LAUNCH_BLOCKING=1 and compute-sanitizer, I've managed to gather a bit of context on what might be happening. In the resulting stack trace, the top frames are as follows:

========= Program hit CUDA_ERROR_ILLEGAL_ADDRESS (error 700) due to "an illegal memory access was encountered" on CUDA API call to cuEventRecord.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame: [0x63a44b]
=========                in /usr/lib/x86_64-linux-gnu/libnvoptix.so.1
=========     Host Frame:jitc_optix_launch(ThreadState*, Kernel const&, unsigned int, void const*, unsigned int) [0x6f6b1968]
=========                in .../my_opt_problem
=========     Host Frame:jitc_run(ThreadState*, ScheduledGroup) [0x6f68d193]
=========                in .../my_opt_problem
=========     Host Frame:jitc_eval(ThreadState*) [0x6f68daa8]
=========                in .../my_opt_problem
=========     Host Frame:jitc_var_gather(unsigned int, unsigned int, unsigned int) [0x6f65bcb0]
=========                in .../my_opt_problem
=========     Host Frame:jit_var_gather [0x6f6aba4b]
...

This happens after roughly 320 forward/backward passes and is 100% reproducible with my setup; it does not occur at random. Anecdotally, the total number of compiled OptiX kernel ops across those passes (as reported by dr.LogLevel.Info for cache misses) is just over 8 million, which may be relevant or may be entirely coincidental. The reason I think it's overflowing some internal data structure is that if I manually call

dr.sync_thread()
dr.flush_kernel_cache()
dr.flush_malloc_cache()  # Edit: this is not required for the error to disappear.
dr.sync_thread()

after each optimization step, the error disappears. If I'm right, a reproducer could be as simple as a script that triggers a large enough number of kernel compilations to add up over time.
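
For reference, this is roughly how the workaround sits in my loop (a minimal sketch; optimize_step is a hypothetical stand-in for my actual forward/backward pass):

import drjit as dr

def optimize_step():
  # Hypothetical stand-in for the actual forward render + backward pass.
  ...

for step in range(10_000):
  optimize_step()
  # Workaround: flush DrJit's caches after every optimization step.
  dr.sync_thread()
  dr.flush_kernel_cache()
  # dr.flush_malloc_cache()  # not required for the error to disappear
  dr.sync_thread()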

Relatedly, is there a writeup of when to expect cache misses? So far I've observed that I get a cache miss whenever the geometry in the scene changes.


futscdav commented Dec 27, 2023

OK, I managed to simplify the problem into a self-contained snippet. This code runs into the issue after 1091 iterations on a 4090 with CUDA 12.3, which takes about 20 minutes to reproduce.

import drjit as dr
import mitsuba as mi
import numpy as np

mi.set_variant("cuda_ad_rgb")

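# Manual backward pass: seed the output gradient, then propagate it backward through the AD graph.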
def dr_backward(output, output_grad):
  dr.set_grad(output, output_grad)
  dr.enqueue(dr.ADMode.Backward, output)
  dr.traverse(output, dr.ADMode.Backward)

dr.set_log_level(dr.LogLevel.Info)
vert_count = 40_000
face_count = 38_000

verts = np.random.uniform(0, 1, [vert_count, 3])
faces = np.random.randint(0, vert_count, [face_count, 3])

mesh = mi.Mesh(
    'mesh',
    vertex_count=vert_count,
    face_count=face_count,
    has_vertex_normals=True,
    has_vertex_texcoords=True,
    props=mi.Properties(),
)
mesh_params = mi.traverse(mesh)
mesh_params['vertex_positions'] = np.ravel(verts)
mesh_params['faces'] = np.ravel(faces)
mesh_params.update()

scene = mi.load_dict({
    'type': 'scene',
    'integrator': {'type': 'direct'},
    'emitter': {'type': 'constant'},
    'shape': mesh,
    'sensor': {
        'type': 'perspective',
        'to_world': mi.ScalarTransform4f.look_at(
            [0, 0, 5], [0, 0, 0], [0, 1, 0]
        ),
        'film': {
            'type': 'hdrfilm',
            'width': 512,
            'height': 512,
            'pixel_format': 'rgb',
        },
    },
})

params = mi.traverse(scene)
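# Upload new random vertex positions every iteration; each update triggers a kernel recompilation (cache miss).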
for i in range(10000):
  print(f'Iteration {i}')
  new_verts = mi.Float(np.ravel(np.random.uniform(0, 1, [vert_count, 3])))
  dr.enable_grad(new_verts)
  params['shape.vertex_positions'] = new_verts
  params.update()
  image = mi.render(scene, params, spp=4)
  grad_image = np.random.uniform(0, 1e-4, image.shape)
  dr_backward(image, grad_image)
  dr.grad(new_verts)

@futscdav (Author)

Somewhat related: mitsuba-renderer/mitsuba3#1033 is what causes the recompilation on each iteration here, but the program still shouldn't crash. Using the snippet from that issue, the problem can be reproduced slightly faster: in about 10 minutes, or roughly 3300 iterations.

@futscdav (Author)

Reading through some Mitsuba issues, this is likely also the cause of mitsuba-renderer/mitsuba3#703.


merlinND commented Feb 2, 2024

Hello @futscdav,

Thank you for reporting this bug and posting a reproducer.
Could you please try running your test again with the latest Mitsuba master, which includes a fix for a similar-sounding crash? (mitsuba-renderer/drjit-core#78)
