You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
One node with 8 GPUs - finetuning notebook error = RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!
#32
Open
ludovicdenoyer opened this issue
Dec 7, 2024
· 0 comments
When executing the finetuning notebook on a node with 8 GPUs, I get the following error:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_mm)
It does not happen with 1 or 2 GPUs.... It happens with Lora and QLora
== Complete stacktrace
Traceback (most recent call last):
File "/workspace/hai/agent_research/hai_web_agent/web_task/training/finetuning_smol_vlm/Smol_VLM_FT.py", line 210, in
trainer.train()
File "/miniconda3/envs/hai/lib/python3.11/site-packages/transformers/trainer.py", line 2164, in train
return inner_training_loop(
^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/transformers/trainer.py", line 2522, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/transformers/trainer.py", line 3655, in training_step
loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/transformers/trainer.py", line 3709, in compute_loss
outputs = model(**inputs)
^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/accelerate/utils/operations.py", line 823, in forward
return model_forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/accelerate/utils/operations.py", line 811, in call
return convert_to_fp32(self.model_forward(*args, **kwargs))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/torch/amp/autocast_mode.py", line 44, in decorate_autocast
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/peft/peft_model.py", line 812, in forward
return self.get_base_model()(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/accelerate/hooks.py", line 170, in new_forward
output = module._old_forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/transformers/models/idefics3/modeling_idefics3.py", line 1196, in forward
outputs = self.model(
^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/transformers/models/idefics3/modeling_idefics3.py", line 1000, in forward
image_hidden_states = self.vision_model(
^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/transformers/models/idefics3/modeling_idefics3.py", line 728, in forward
encoder_outputs = self.encoder(
^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/transformers/models/idefics3/modeling_idefics3.py", line 511, in forward
layer_outputs = self._gradient_checkpointing_func(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/torch/_compile.py", line 32, in inner
return disable_fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 632, in _fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/torch/utils/checkpoint.py", line 489, in checkpoint
return CheckpointFunction.apply(function, preserve, *args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/torch/autograd/function.py", line 575, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/torch/utils/checkpoint.py", line 264, in forward
outputs = run_function(*args)
^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/transformers/models/idefics3/modeling_idefics3.py", line 430, in forward
hidden_states, attn_weights = self.self_attn(
^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/transformers/models/idefics3/modeling_idefics3.py", line 300, in forward
key_states = self.k_proj(hidden_states)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/peft/tuners/lora/layer.py", line 587, in forward
result = result + self.lora_magnitude_vector[active_adapter](
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/peft/tuners/lora/dora.py", line 70, in forward
lora_result = lora_B(lora_A(x))
^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/torch/nn/modules/linear.py", line 125, in forward
return F.linear(input, self.weight, self.bias)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_mm)
The text was updated successfully, but these errors were encountered:
Hi,
When executing the finetuning notebook on a node with 8 GPUs, I get the following error:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_mm)
It does not happen with 1 or 2 GPUs.... It happens with Lora and QLora
== Complete stacktrace
Traceback (most recent call last):
File "/workspace/hai/agent_research/hai_web_agent/web_task/training/finetuning_smol_vlm/Smol_VLM_FT.py", line 210, in
trainer.train()
File "/miniconda3/envs/hai/lib/python3.11/site-packages/transformers/trainer.py", line 2164, in train
return inner_training_loop(
^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/transformers/trainer.py", line 2522, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/transformers/trainer.py", line 3655, in training_step
loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/transformers/trainer.py", line 3709, in compute_loss
outputs = model(**inputs)
^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/accelerate/utils/operations.py", line 823, in forward
return model_forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/accelerate/utils/operations.py", line 811, in call
return convert_to_fp32(self.model_forward(*args, **kwargs))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/torch/amp/autocast_mode.py", line 44, in decorate_autocast
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/peft/peft_model.py", line 812, in forward
return self.get_base_model()(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/accelerate/hooks.py", line 170, in new_forward
output = module._old_forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/transformers/models/idefics3/modeling_idefics3.py", line 1196, in forward
outputs = self.model(
^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/transformers/models/idefics3/modeling_idefics3.py", line 1000, in forward
image_hidden_states = self.vision_model(
^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/transformers/models/idefics3/modeling_idefics3.py", line 728, in forward
encoder_outputs = self.encoder(
^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/transformers/models/idefics3/modeling_idefics3.py", line 511, in forward
layer_outputs = self._gradient_checkpointing_func(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/torch/_compile.py", line 32, in inner
return disable_fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 632, in _fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/torch/utils/checkpoint.py", line 489, in checkpoint
return CheckpointFunction.apply(function, preserve, *args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/torch/autograd/function.py", line 575, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/torch/utils/checkpoint.py", line 264, in forward
outputs = run_function(*args)
^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/transformers/models/idefics3/modeling_idefics3.py", line 430, in forward
hidden_states, attn_weights = self.self_attn(
^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/transformers/models/idefics3/modeling_idefics3.py", line 300, in forward
key_states = self.k_proj(hidden_states)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/peft/tuners/lora/layer.py", line 587, in forward
result = result + self.lora_magnitude_vector[active_adapter](
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/peft/tuners/lora/dora.py", line 70, in forward
lora_result = lora_B(lora_A(x))
^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/hai/lib/python3.11/site-packages/torch/nn/modules/linear.py", line 125, in forward
return F.linear(input, self.weight, self.bias)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_mm)
The text was updated successfully, but these errors were encountered: