RuntimeError when running with CPU only #167

Open · rasmith opened this issue Apr 23, 2020 · 1 comment

rasmith commented Apr 23, 2020

I am getting a runtime error when trying to run on a MacBook Pro using only the CPU. The output ends up looking something like this:

/opt/anaconda3/envs/relighting_video_capture/lib/python3.7/site-packages/torch/utils/checkpoint.py:25: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
[000] v n l2:0.0318 l1:0.1383  etap:0:0:0.00 eta:1:12:36.55 4 0
Process Process-4:8 l1:0.1383  etap:0:0:0.00 eta:1:12:36.55 4 1 
Traceback (most recent call last):
  File "/opt/anaconda3/envs/relighting_video_capture/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/opt/anaconda3/envs/relighting_video_capture/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "./train_and_evaluate_tasks.py", line 32, in run_trainer
    t.train()
  File "/Users/randallsmith/projects/git/relighting/trainers/gan_trainer.py", line 318, in train
    self.task_cfg)
  File "/Users/randallsmith/projects/git/relighting/trainers/gan_trainer.py", line 99, in perform_gan_step
    loss_generator.backward()
  File "/opt/anaconda3/envs/relighting_video_capture/lib/python3.7/site-packages/torch/tensor.py", line 195, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/opt/anaconda3/envs/relighting_video_capture/lib/python3.7/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/opt/anaconda3/envs/relighting_video_capture/lib/python3.7/site-packages/torch/autograd/function.py", line 77, in apply
    return self._forward_cls.backward(self, *args)
  File "/opt/anaconda3/envs/relighting_video_capture/lib/python3.7/site-packages/torch/autograd/function.py", line 189, in wrapper
    outputs = fn(ctx, *args)
  File "/opt/anaconda3/envs/relighting_video_capture/lib/python3.7/site-packages/inplace_abn/functions.py", line 112, in backward
    y_act, dy_act, weight, bias, ctx.eps, ctx.activation, ctx.activation_param)
RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead. (view at /Users/distiller/project/conda/conda-bld/pytorch_1579022061893/work/aten/src/ATen/native/TensorShape.cpp:1175)
frame #0: c10::Error::Error(c10::SourceLocation, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) + 135 (0x10d279267 in libc10.dylib)
frame #1: at::native::view(at::Tensor const&, c10::ArrayRef<long long>) + 827 (0x11862462b in libtorch.dylib)
frame #2: at::CPUType::(anonymous namespace)::view(at::Tensor const&, c10::ArrayRef<long long>) + 61 (0x1188457ed in libtorch.dylib)
frame #3: c10::detail::wrap_kernel_functor_unboxed_<c10::detail::WrapRuntimeKernelFunctor_<at::Tensor (*)(at::Tensor const&, c10::ArrayRef<long long>), at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::ArrayRef<long long> > >, at::Tensor (at::Tensor const&, c10::ArrayRef<long long>)>::call(c10::OperatorKernel*, at::Tensor const&, c10::ArrayRef<long long>) + 24 (0x11883cef8 in libtorch.dylib)
frame #4: at::Tensor c10::KernelFunction::callUnboxedOnly<at::Tensor, at::Tensor const&, c10::ArrayRef<long long> >(at::Tensor const&, c10::ArrayRef<long long>) const + 63 (0x1182a607f in libtorch.dylib)
frame #5: std::__1::result_of<at::Tensor (ska::flat_hash_map<c10::TensorTypeId, c10::KernelFunction, std::__1::hash<c10::TensorTypeId>, std::__1::equal_to<c10::TensorTypeId>, std::__1::allocator<std::__1::pair<c10::TensorTypeId, c10::KernelFunction> > > const&)>::type c10::LeftRight<ska::flat_hash_map<c10::TensorTypeId, c10::KernelFunction, std::__1::hash<c10::TensorTypeId>, std::__1::equal_to<c10::TensorTypeId>, std::__1::allocator<std::__1::pair<c10::TensorTypeId, c10::KernelFunction> > > >::read<at::Tensor c10::Dispatcher::doCallUnboxedOnly<at::Tensor, at::Tensor const&, c10::ArrayRef<long long> >(c10::DispatchTable const&, c10::LeftRight<ska::flat_hash_map<c10::TensorTypeId, c10::KernelFunction, std::__1::hash<c10::TensorTypeId>, std::__1::equal_to<c10::TensorTypeId>, std::__1::allocator<std::__1::pair<c10::TensorTypeId, c10::KernelFunction> > > > const&, at::Tensor const&, c10::ArrayRef<long long>) const::'lambda'(ska::flat_hash_map<c10::TensorTypeId, c10::KernelFunction, std::__1::hash<c10::TensorTypeId>, std::__1::equal_to<c10::TensorTypeId>, std::__1::allocator<std::__1::pair<c10::TensorTypeId, c10::KernelFunction> > > const&)>(at::Tensor&&) const + 175 (0x1182a5fcf in libtorch.dylib)
frame #6: std::__1::result_of<at::Tensor (c10::DispatchTable const&)>::type c10::LeftRight<c10::DispatchTable>::read<at::Tensor c10::Dispatcher::callUnboxedOnly<at::Tensor, at::Tensor const&, c10::ArrayRef<long long> >(c10::OperatorHandle const&, at::Tensor const&, c10::ArrayRef<long long>) const::'lambda'(c10::DispatchTable const&)>(at::Tensor&&) const + 115 (0x1182a5eb3 in libtorch.dylib)
frame #7: at::Tensor::view(c10::ArrayRef<long long>) const + 341 (0x1182a2ad5 in libtorch.dylib)
frame #8: torch::autograd::VariableType::(anonymous namespace)::view(at::Tensor const&, c10::ArrayRef<long long>) + 1566 (0x11ad0fdae in libtorch.dylib)
frame #9: c10::detail::wrap_kernel_functor_unboxed_<c10::detail::WrapRuntimeKernelFunctor_<at::Tensor (*)(at::Tensor const&, c10::ArrayRef<long long>), at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::ArrayRef<long long> > >, at::Tensor (at::Tensor const&, c10::ArrayRef<long long>)>::call(c10::OperatorKernel*, at::Tensor const&, c10::ArrayRef<long long>) + 24 (0x11883cef8 in libtorch.dylib)
frame #10: at::Tensor c10::KernelFunction::callUnboxedOnly<at::Tensor, at::Tensor const&, c10::ArrayRef<long long> >(at::Tensor const&, c10::ArrayRef<long long>) const + 63 (0x11f28decf in _backend.cpython-37m-darwin.so)
frame #11: std::__1::result_of<at::Tensor (ska::flat_hash_map<c10::TensorTypeId, c10::KernelFunction, std::__1::hash<c10::TensorTypeId>, std::__1::equal_to<c10::TensorTypeId>, std::__1::allocator<std::__1::pair<c10::TensorTypeId, c10::KernelFunction> > > const&)>::type c10::LeftRight<ska::flat_hash_map<c10::TensorTypeId, c10::KernelFunction, std::__1::hash<c10::TensorTypeId>, std::__1::equal_to<c10::TensorTypeId>, std::__1::allocator<std::__1::pair<c10::TensorTypeId, c10::KernelFunction> > > >::read<at::Tensor c10::Dispatcher::doCallUnboxedOnly<at::Tensor, at::Tensor const&, c10::ArrayRef<long long> >(c10::DispatchTable const&, c10::LeftRight<ska::flat_hash_map<c10::TensorTypeId, c10::KernelFunction, std::__1::hash<c10::TensorTypeId>, std::__1::equal_to<c10::TensorTypeId>, std::__1::allocator<std::__1::pair<c10::TensorTypeId, c10::KernelFunction> > > > const&, at::Tensor const&, c10::ArrayRef<long long>) const::'lambda'(ska::flat_hash_map<c10::TensorTypeId, c10::KernelFunction, std::__1::hash<c10::TensorTypeId>, std::__1::equal_to<c10::TensorTypeId>, std::__1::allocator<std::__1::pair<c10::TensorTypeId, c10::KernelFunction> > > const&)>(at::Tensor&&) const + 168 (0x11f28de18 in _backend.cpython-37m-darwin.so)
frame #12: std::__1::result_of<at::Tensor (c10::DispatchTable const&)>::type c10::LeftRight<c10::DispatchTable>::read<at::Tensor c10::Dispatcher::callUnboxedOnly<at::Tensor, at::Tensor const&, c10::ArrayRef<long long> >(c10::OperatorHandle const&, at::Tensor const&, c10::ArrayRef<long long>) const::'lambda'(c10::DispatchTable const&)>(at::Tensor&&) const + 118 (0x11f28dd06 in _backend.cpython-37m-darwin.so)
frame #13: at::Tensor::view(c10::ArrayRef<long long>) const + 97 (0x11f28dac1 in _backend.cpython-37m-darwin.so)
frame #14: normalize_shape(at::Tensor const&) + 145 (0x11f28da31 in _backend.cpython-37m-darwin.so)
frame #15: std::__1::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor> backward_reduce_impl<float, (Activation)0>(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, float, float) + 576 (0x11f28a1e0 in _backend.cpython-37m-darwin.so)
frame #16: backward_reduce_cpu(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, float, Activation, float) + 194 (0x11f275d62 in _backend.cpython-37m-darwin.so)
frame #17: backward_reduce(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, float, Activation, float) + 783 (0x11f28f73f in _backend.cpython-37m-darwin.so)
frame #18: void pybind11::cpp_function::initialize<std::__1::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor> (*&)(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, float, Activation, float), std::__1::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor>, at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, float, Activation, float, pybind11::name, pybind11::scope, pybind11::sibling, char [32]>(std::__1::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor> (*&)(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, float, Activation, float), std::__1::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor> (*)(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, float, Activation, float), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, char const (&) [32])::'lambda'(pybind11::detail::function_call&)::operator()(pybind11::detail::function_call&) const + 109 (0x11f2ad70d in _backend.cpython-37m-darwin.so)
frame #19: pybind11::cpp_function::dispatcher(_object*, _object*, _object*) + 3088 (0x11f29d960 in _backend.cpython-37m-darwin.so)
<omitting python frames>
frame #32: torch::autograd::PyNode::apply(std::__1::vector<at::Tensor, std::__1::allocator<at::Tensor> >&&) + 578 (0x1176fb5f2 in libtorch_python.dylib)
frame #33: torch::autograd::Node::operator()(std::__1::vector<at::Tensor, std::__1::allocator<at::Tensor> >&&) + 464 (0x11aed48e0 in libtorch.dylib)
frame #34: torch::autograd::Engine::evaluate_function(std::__1::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&) + 1381 (0x11aecc495 in libtorch.dylib)
frame #35: torch::autograd::Engine::thread_main(std::__1::shared_ptr<torch::autograd::GraphTask> const&, bool) + 532 (0x11aecb364 in libtorch.dylib)
frame #36: torch::autograd::Engine::thread_init(int) + 152 (0x11aecb118 in libtorch.dylib)
frame #37: torch::autograd::python::PythonEngine::thread_init(int) + 44 (0x1176f5cec in libtorch_python.dylib)
frame #38: void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void (torch::autograd::Engine::*)(int), torch::autograd::Engine*, int> >(void*) + 66 (0x11aed8a72 in libtorch.dylib)
frame #39: _pthread_start + 148 (0x7fff67ec7e65 in libsystem_pthread.dylib)
frame #40: thread_start + 15 (0x7fff67ec383b in libsystem_pthread.dylib)

I removed any calls to view() and used contiguous() to see if I could force the tensors to be contiguous, but it still reaches backward_reduce() and crashes there. Is there any known issue with running this on a CPU-only machine? I'm doing this because I want to do some trial runs on my local machine before trying it out on the remote machine.
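
For reference, the view-versus-reshape behavior the error message describes can be reproduced outside the library; the sketch below only illustrates the general PyTorch rule, not the inplace_abn internals:

```python
import torch

x = torch.randn(2, 3, 4, 5)
t = x.permute(0, 2, 3, 1)        # permute produces a non-contiguous tensor

# t.view(2, -1)                  # would raise: "view size is not compatible with
#                                #  input tensor's size and stride ..."

y = t.reshape(2, -1)             # reshape copies when the layout requires it
z = t.contiguous().view(2, -1)   # equivalent workaround: make it contiguous first
```

Judging from the traceback (frames #13–#15), the failing view() call happens inside normalize_shape() in the compiled _backend, which would explain why adding contiguous() in the Python calling code doesn't help.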

ducksoup (Contributor) commented Apr 23, 2020

I tried reproducing the issue on my machine without success (I just ran a forward / backward sequence on CPU). Can you please come up with a minimal reproducing example for me to debug?

PS: we don't have access to macOS machines, so if this is something macOS-specific I'm afraid I won't be able to help you.
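
A minimal CPU-only forward / backward pass along these lines might be a starting point for such a reproducer. This is only a sketch: it assumes the InPlaceABN module exported by this package with its default arguments, and it would need to be adapted to match the layer configuration actually used in the relighting project.

```python
import torch
from inplace_abn import InPlaceABN

# Small CPU-only forward/backward pass through a single InPlaceABN layer.
torch.manual_seed(0)
abn = InPlaceABN(8)                                  # 8 feature channels, default activation
x = torch.randn(2, 8, 16, 16, requires_grad=True)

y = abn(x)
loss = y.mean()
loss.backward()                                      # the reported crash happens during backward
print(x.grad.shape)
```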
