From bbb9b84d4679922920c260fcd6ec01ef479a2fdc Mon Sep 17 00:00:00 2001
From: Wei Wei
Date: Thu, 25 Jul 2024 14:46:20 -0700
Subject: [PATCH] add eager mode check for NaN and Inf (#1015)

Summary:
Pull Request resolved: https://github.com/facebookincubator/AITemplate/pull/1015

This diff includes some debug-tool improvements.

While debugging the IG_CTR MC proposal, we noticed that some new snapshots produced NaN in their results and wanted to find the root cause. With this diff we can run with `--run-accuracy-check`, which runs the generate merge + load merge through pybind. However, that path did not check the eager-mode run; this diff adds that check. The random inputs are created the same way as in the load merge. Results attached: P1494263214

```
CUDA_VISIBLE_DEVICES=5 TORCH_COMPILE_DEBUG=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 \
  buck2 run mode/opt-split-dwarf mode/inplace \
  -c fbcode.platform010_cuda_version=12 -c fbcode.nvcc_arch=h100 \
  caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- \
  --model-path=/data/local/models/581303767/85/gpu_lowering/input.predictor.disagg.gpu.merge \
  --lower-backend="AOT_INDUCTOR" --run-accuracy-check \
  --debug_operator_range="1397,1397" --generate_sample_inputs=False \
  --min_acc_module_size=0 --disable-multiple-batch-run 2>&1 | tee aot.log
```

To enable the per-layer print, add `--dispatch-print`; it will print each layer's output and check whether it contains NaN or Inf.
Reviewed By: hl475, chenyang78

Differential Revision: D60150435

fbshipit-source-id: 2e9efcf7d9563dc5d84c6dcefe472db246b68548
---
 fx2ait/fx2ait/ait_splitter.py | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/fx2ait/fx2ait/ait_splitter.py b/fx2ait/fx2ait/ait_splitter.py
index b0f1a02d9..5eb6a04c6 100644
--- a/fx2ait/fx2ait/ait_splitter.py
+++ b/fx2ait/fx2ait/ait_splitter.py
@@ -172,9 +172,7 @@ def __init__(
         settings = AITSplitterSettings()
         if not operator_support:
             if settings.debug_operator_range:
-                min_range, max_range = tuple(
-                    int(x) for x in settings.debug_operator_range.split(",")
-                )
+                min_range, max_range = settings.debug_operator_range
                 operator_support = _range_operator_support(
                     module=module,
                     start=min_range,
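For context on the one-line code change above: the splitter previously parsed `settings.debug_operator_range` as a `"min,max"` string at the use site, and after this patch it unpacks the value directly, which implies the settings object now holds an already-parsed tuple. The sketch below is a hypothetical, simplified stand-in for `AITSplitterSettings` (the real class lives in fx2ait and is not shown in this diff) illustrating that assumed shape:

```python
# Hypothetical sketch, NOT the real AITSplitterSettings: we assume the
# settings class parses the "min,max" string (e.g. from
# --debug_operator_range="1397,1397") into an int tuple at construction,
# which is why the splitter can unpack it directly after this patch.
from typing import Optional, Tuple


class AITSplitterSettings:
    """Simplified stand-in for the real settings class (assumed shape)."""

    def __init__(self, debug_operator_range: str = "") -> None:
        self.debug_operator_range: Optional[Tuple[int, int]] = None
        if debug_operator_range:
            # Parse "min,max" once here instead of at every use site.
            lo, hi = (int(x) for x in debug_operator_range.split(","))
            self.debug_operator_range = (lo, hi)


settings = AITSplitterSettings("1397,1397")
if settings.debug_operator_range:
    # Post-patch unpacking, as in the diff above.
    min_range, max_range = settings.debug_operator_range
    print(min_range, max_range)  # 1397 1397
```

Centralizing the parse in the settings object keeps the splitter free of string handling and lets a malformed range fail early, at settings construction, rather than deep inside the splitter.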