DrJit possibly overflows some OptiX or CUDA internal data structure #210

Open
futscdav opened this issue Dec 27, 2023 · 4 comments

futscdav commented Dec 27, 2023

I've been running into Program hit CUDA_ERROR_ILLEGAL_ADDRESS (error 700) due to "an illegal memory access was encountered" when trying to optimize some larger cases. I don't have a simple piece of code to reproduce this yet, since my working code is rather large. This could be related to issue #125.

Running under CUDA_LAUNCH_BLOCKING=1 and compute-sanitizer, I've managed to gather a bit of context on what might be happening. In the resulting stack trace, the top frames are as follows:

========= Program hit CUDA_ERROR_ILLEGAL_ADDRESS (error 700) due to "an illegal memory access was encountered" on CUDA API call to cuEventRecord.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame: [0x63a44b]
=========                in /usr/lib/x86_64-linux-gnu/libnvoptix.so.1
=========     Host Frame:jitc_optix_launch(ThreadState*, Kernel const&, unsigned int, void const*, unsigned int) [0x6f6b1968]
=========                in .../my_opt_problem
=========     Host Frame:jitc_run(ThreadState*, ScheduledGroup) [0x6f68d193]
=========                in .../my_opt_problem
=========     Host Frame:jitc_eval(ThreadState*) [0x6f68daa8]
=========                in .../my_opt_problem
=========     Host Frame:jitc_var_gather(unsigned int, unsigned int, unsigned int) [0x6f65bcb0]
=========                in .../my_opt_problem
=========     Host Frame:jit_var_gather [0x6f6aba4b]
...

This happens after roughly 320 forward/backward passes and is 100% reproducible with my setup; it does not occur at random. Anecdotally, the total number of compiled OptiX kernel ops across those passes (as reported by dr.LogLevel.Info for cache misses) is just over 8 million, which may be relevant or may be entirely coincidental. The reason I think it's overflowing some internal data structure is that if I manually call

dr.sync_thread()
dr.flush_kernel_cache()
dr.flush_malloc_cache()  # Edit: this is not required for the error to disappear.
dr.sync_thread()

after each optimization step, the error disappears. If I'm right, a reproducer could be as simple as a script that triggers a large enough number of kernel compilations to add up over time.
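
For reference, this is roughly how the workaround sits in my loop (a minimal sketch; optimize_step is a hypothetical stand-in for my actual forward/backward pass):

import drjit as dr

def optimize_step():
  # Hypothetical stand-in for the actual forward render + backward pass.
  ...

for step in range(10_000):
  optimize_step()
  # Workaround: flush DrJit's caches after every optimization step.
  dr.sync_thread()
  dr.flush_kernel_cache()
  # dr.flush_malloc_cache()  # not required for the error to disappear
  dr.sync_thread()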

Relatedly, is there a writeup of when to expect cache misses? So far I've observed that I get a cache miss whenever the geometry in the scene changes.


futscdav commented Dec 27, 2023

OK, I managed to simplify the problem into a self-contained snippet. This code runs into the issue after 1091 iterations on a 4090 with CUDA 12.3, which takes about 20 minutes to reproduce.

import drjit as dr
import mitsuba as mi
import numpy as np

mi.set_variant("cuda_ad_rgb")

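# Manual backward pass: seed the output gradient, then propagate it backward through the AD graph.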
def dr_backward(output, output_grad):
  dr.set_grad(output, output_grad)
  dr.enqueue(dr.ADMode.Backward, output)
  dr.traverse(output, dr.ADMode.Backward)

dr.set_log_level(dr.LogLevel.Info)
vert_count = 40_000
face_count = 38_000

verts = np.random.uniform(0, 1, [vert_count, 3])
faces = np.random.randint(0, vert_count, [face_count, 3])

mesh = mi.Mesh(
    'mesh',
    vertex_count=vert_count,
    face_count=face_count,
    has_vertex_normals=True,
    has_vertex_texcoords=True,
    props=mi.Properties(),
)
mesh_params = mi.traverse(mesh)
mesh_params['vertex_positions'] = np.ravel(verts)
mesh_params['faces'] = np.ravel(faces)
mesh_params.update()

scene = mi.load_dict({
    'type': 'scene',
    'integrator': {'type': 'direct'},
    'emitter': {'type': 'constant'},
    'shape': mesh,
    'sensor': {
        'type': 'perspective',
        'to_world': mi.ScalarTransform4f.look_at(
            [0, 0, 5], [0, 0, 0], [0, 1, 0]
        ),
        'film': {
            'type': 'hdrfilm',
            'width': 512,
            'height': 512,
            'pixel_format': 'rgb',
        },
    },
})

params = mi.traverse(scene)
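# Upload new random vertex positions every iteration; each update triggers a kernel recompilation (cache miss).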
for i in range(10000):
  print(f'Iteration {i}')
  new_verts = mi.Float(np.ravel(np.random.uniform(0, 1, [vert_count, 3])))
  dr.enable_grad(new_verts)
  params['shape.vertex_positions'] = new_verts
  params.update()
  image = mi.render(scene, params, spp=4)
  grad_image = np.random.uniform(0, 1e-4, image.shape)
  dr_backward(image, grad_image)
  dr.grad(new_verts)

@futscdav (Author)

Somewhat related: mitsuba-renderer/mitsuba3#1033 is what causes the recompilation on each iteration here, but the program still shouldn't crash. Using the snippet from that issue, the problem can be reproduced slightly faster: in about 10 minutes, or roughly 3300 iterations.

@futscdav (Author)

Reading through some Mitsuba issues, this is likely also the cause of mitsuba-renderer/mitsuba3#703.


merlinND commented Feb 2, 2024

Hello @futscdav,

Thank you for reporting this bug and posting a reproducer.
Could you please try running your test again with the latest Mitsuba master, which includes a fix for a similar-sounding crash? (mitsuba-renderer/drjit-core#78)
