
[Doc] Explicitly state that InternVL 2.5 is supported #10978

Merged
2 commits merged into vllm-project:main on Dec 7, 2024

Conversation

@DarkLight1337 (Member) commented Dec 7, 2024

The model architecture of InternVL2.5 is the same as InternVL2 except for a different LM backbone. We have already implemented dynamic LM loading for this model, so no further changes are needed to support it in vLLM.

I have tested the 4B model locally (vllm serve OpenGVLab/InternVL2_5-4B) and it seems to be working fine.
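As a rough illustration (not part of this PR), the server started by vllm serve OpenGVLab/InternVL2_5-4B can be queried through the OpenAI-compatible API; the port (8000 by default) and the image URL below are placeholders:

from openai import OpenAI

# Minimal sketch for querying the OpenAI-compatible server started by
# `vllm serve OpenGVLab/InternVL2_5-4B`. Assumptions: the default port
# 8000 and a placeholder image URL.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="OpenGVLab/InternVL2_5-4B",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/sample.jpg"}},
        ],
    }],
    max_tokens=64,
)
print(response.choices[0].message.content)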

@DarkLight1337 added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Dec 7, 2024

github-actions bot commented Dec 7, 2024

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which covers a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

mergify bot added the documentation label (Improvements or additions to documentation) on Dec 7, 2024
@DarkLight1337 (Member, Author) commented Dec 7, 2024

I'm not completely sure where to get the stop tokens, though. The link in examples/offline_inference_vision_language.py doesn't go to the tokenizer config. @Isotr0py, can you help update the example scripts?

@Isotr0py (Collaborator) commented Dec 7, 2024

You can get the stop tokens here: https://huggingface.co/OpenGVLab/InternVL2_5-4B/blob/9e3bfef341bf84ca3efed094ea6c598e6b34f527/conversation.py#L335-L391

It seems they deleted the stop tokens from the README.

@DarkLight1337 (Member, Author) commented:

I see, let me update the link then, thanks for pointing that out!

@DarkLight1337 (Member, Author) commented Dec 7, 2024

So basically, I should include the stop_str and sep from each template type that is listed there?

@Isotr0py (Collaborator) left a comment:

LGTM!

@Isotr0py (Collaborator) commented Dec 7, 2024

So basically, I should include the stop_str and sep from each template type that is listed there?

Yes, I think we just need stop_str. The current stop_str should be fine, because InternVL2.5 uses Qwen2.5 as its LLM backbone, whose stop_str is already included.
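For illustration, the stop strings could be wired into the offline example roughly like this (a sketch, not the exact contents of examples/offline_inference_vision_language.py; the stop token list is an assumption gathered from the conversation templates linked above, with <|im_end|> covering the Qwen2.5 backbone):

from transformers import AutoTokenizer
from vllm import SamplingParams

tokenizer = AutoTokenizer.from_pretrained("OpenGVLab/InternVL2_5-4B",
                                          trust_remote_code=True)

# Assumed stop_str values from the InternVL conversation templates;
# "<|im_end|>" is the one used by the Qwen2.5 backbone of InternVL2.5.
stop_tokens = ["<|endoftext|>", "<|im_start|>", "<|im_end|>", "<|end|>"]
stop_token_ids = [
    tid for tid in (tokenizer.convert_tokens_to_ids(t) for t in stop_tokens)
    if tid is not None  # skip tokens missing from this tokenizer's vocab
]

sampling_params = SamplingParams(temperature=0.0,
                                 max_tokens=64,
                                 stop_token_ids=stop_token_ids)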

Signed-off-by: DarkLight1337 <[email protected]>
@Isotr0py Isotr0py enabled auto-merge (squash) December 7, 2024 15:12
@Isotr0py Isotr0py merged commit 1c768fe into vllm-project:main Dec 7, 2024
34 checks passed
@DarkLight1337 DarkLight1337 deleted the internvl-25 branch December 7, 2024 16:59
sleepwalker2017 pushed a commit to sleepwalker2017/vllm that referenced this pull request Dec 13, 2024
BKitor pushed a commit to BKitor/vllm that referenced this pull request Dec 30, 2024
@LaoWangGB commented:

[screenshot of the error attached]
When I deploy InternVL2.5-78B with vllm=0.6.6, this error occurs.

The model architecture of InternVL2.5 is the same as InternVL2 except for a different LM backbone. We have already implemented dynamic LM loading for this model, so no further changes are needed to support it in vLLM.

I have tested the 4B model locally (vllm serve OpenGVLab/InternVL2_5-4B) and it seems to be working fine.

@DarkLight1337 (Member, Author) commented:

Can you show the full stack trace?

@LaoWangGB commented:

(RayWorkerWrapper pid=781) INFO 01-19 22:13:17 model_runner.py:1415] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing gpu_memory_utilization or switching to eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.
(RayWorkerWrapper pid=781) INFO 01-19 22:13:34 custom_all_reduce.py:224] Registering 5635 cuda graph addresses
(RayWorkerWrapper pid=780) udf-pod-36828-0-ef5a96ed49c9356d:780:5932 [1] NCCL INFO Channel 14/1 : 1[1] -> 0[0] via P2P/IPC/read [repeated 147x across cluster]
(RayWorkerWrapper pid=780) udf-pod-36828-0-ef5a96ed49c9356d:780:5918 [1] NCCL INFO Connected all trees [repeated 5x across cluster]
(RayWorkerWrapper pid=780) udf-pod-36828-0-ef5a96ed49c9356d:780:5918 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512 [repeated 5x across cluster]
(RayWorkerWrapper pid=780) udf-pod-36828-0-ef5a96ed49c9356d:780:5918 [1] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer [repeated 5x across cluster]
(RayWorkerWrapper pid=780) udf-pod-36828-0-ef5a96ed49c9356d:780:780 [1] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2, using internal tuner instead. [repeated 2x across cluster]
(RayWorkerWrapper pid=780) udf-pod-36828-0-ef5a96ed49c9356d:780:5918 [1] NCCL INFO ncclCommInitRank comm 0x557cda636e90 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 51000 commId 0xac8a6ac37794e18d - Init COMPLETE [repeated 5x across cluster]
(RayWorkerWrapper pid=780) udf-pod-36828-0-ef5a96ed49c9356d:780:5918 [1] NCCL INFO Using non-device net plugin version 0 [repeated 2x across cluster]
(RayWorkerWrapper pid=780) udf-pod-36828-0-ef5a96ed49c9356d:780:5918 [1] NCCL INFO Using network Socket [repeated 2x across cluster]
(RayWorkerWrapper pid=780) udf-pod-36828-0-ef5a96ed49c9356d:780:5918 [1] NCCL INFO ncclCommInitRank comm 0x557cda636e90 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 51000 commId 0xac8a6ac37794e18d - Init START [repeated 2x across cluster]
(RayWorkerWrapper pid=780) udf-pod-36828-0-ef5a96ed49c9356d:780:5918 [1] NCCL INFO Setting affinity for GPU 1 to 1f7fff,00000000,001f7fff [repeated 2x across cluster]
(RayWorkerWrapper pid=780) udf-pod-36828-0-ef5a96ed49c9356d:780:5918 [1] NCCL INFO NVLS multicast support is not available on dev 1 [repeated 2x across cluster]
(RayWorkerWrapper pid=780) udf-pod-36828-0-ef5a96ed49c9356d:780:5918 [1] NCCL INFO comm 0x557cda636e90 rank 1 nRanks 4 nNodes 1 localRanks 4 localRank 1 MNNVL 0 [repeated 2x across cluster]
(RayWorkerWrapper pid=780) udf-pod-36828-0-ef5a96ed49c9356d:780:5918 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [repeated 2x across cluster]
(RayWorkerWrapper pid=780) udf-pod-36828-0-ef5a96ed49c9356d:780:5918 [1] NCCL INFO P2P Chunksize set to 524288 [repeated 2x across cluster]
(RayWorkerWrapper pid=780) nnel 05/0 : 1[1] -> 0[0] via P2P/IPC/read
(RayWorkerWrapper pid=780) udf-pod-36828-0-ef5a96ed49c9356d:780:5918 [1] NCCL INFO
(RayWorkerWrapper pid=780) udf-pod-36828-0-ef5a96ed49c9356d:780:5918 [1] NCCL INFO Connected all rings [repeated 2x across cluster]
(RayWorkerWrapper pid=780) Channel 12/0 : 1[1] -> 2[2] via P2P/IPC/read
(RayWorkerWrapper pid=780) INFO 01-19 22:13:14 worker.py:241] Memory profiling takes 8.98 seconds [repeated 2x across cluster]
(RayWorkerWrapper pid=780) INFO 01-19 22:13:14 worker.py:241] the current vLLM instance can use total_gpu_memory (79.35GiB) x gpu_memory_utilization (0.90) = 71.41GiB [repeated 2x across cluster]
(RayWorkerWrapper pid=780) INFO 01-19 22:13:14 worker.py:241] model weights take 39.52GiB; non_torch_memory takes 1.47GiB; PyTorch activation peak memory takes 3.94GiB; the rest of the memory reserved for KV Cache is 26.48GiB. [repeated 2x across cluster]
(RayWorkerWrapper pid=769) INFO 01-19 22:13:17 model_runner.py:1415] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing gpu_memory_utilization or switching to eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage. [repeated 2x across cluster]
INFO 01-19 22:13:35 custom_all_reduce.py:224] Registering 5635 cuda graph addresses
(RayWorkerWrapper pid=781) INFO 01-19 22:13:35 model_runner.py:1535] Graph capturing finished in 18 secs, took 0.42 GiB
INFO 01-19 22:13:35 model_runner.py:1535] Graph capturing finished in 18 secs, took 0.42 GiB
INFO 01-19 22:13:35 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 30.14 seconds
INFO 01-19 22:13:36 chat_utils.py:333] Detected the chat template content format to be 'string'. You can set --chat-template-content-format to override this.
INFO 01-19 22:13:37 preprocess.py:215] Your model uses the legacy input pipeline instead of the new multi-modal processor. Please note that the legacy pipeline will be removed in a future release. For more details, see: https://github.com//issues/10114

0%| | 0/61 [00:00<?, ?it/s]
INFO 01-19 22:13:40 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250119-221340.pkl...
INFO 01-19 22:13:40 model_runner_base.py:149] Completed writing input of failed execution to /tmp/err_execute_model_input_20250119-221340.pkl.
ERROR 01-19 22:13:40 worker_base.py:467] Error executing method execute_model. This might cause deadlock in distributed execution.
ERROR 01-19 22:13:40 worker_base.py:467] Traceback (most recent call last):
ERROR 01-19 22:13:40 worker_base.py:467] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
ERROR 01-19 22:13:40 worker_base.py:467] return func(*args, **kwargs)
ERROR 01-19 22:13:40 worker_base.py:467] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1747, in execute_model
ERROR 01-19 22:13:40 worker_base.py:467] output: SamplerOutput = self.model.sample(
ERROR 01-19 22:13:40 worker_base.py:467] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/internvl.py", line 772, in sample
ERROR 01-19 22:13:40 worker_base.py:467] return self.language_model.sample(logits, sampling_metadata)
ERROR 01-19 22:13:40 worker_base.py:467] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 496, in sample
ERROR 01-19 22:13:40 worker_base.py:467] next_tokens = self.sampler(logits, sampling_metadata)
ERROR 01-19 22:13:40 worker_base.py:467] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 01-19 22:13:40 worker_base.py:467] return self._call_impl(*args, **kwargs)
ERROR 01-19 22:13:40 worker_base.py:467] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 01-19 22:13:40 worker_base.py:467] return forward_call(*args, **kwargs)
ERROR 01-19 22:13:40 worker_base.py:467] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/sampler.py", line 258, in forward
ERROR 01-19 22:13:40 worker_base.py:467] logits = _apply_min_tokens_penalty(logits, sampling_metadata)
ERROR 01-19 22:13:40 worker_base.py:467] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/sampler.py", line 380, in _apply_min_tokens_penalty
ERROR 01-19 22:13:40 worker_base.py:467] logits[tuple(zip(*logits_to_penalize))] = -float("inf")
ERROR 01-19 22:13:40 worker_base.py:467] RuntimeError: Could not infer dtype of NoneType
ERROR 01-19 22:13:40 worker_base.py:467]
ERROR 01-19 22:13:40 worker_base.py:467] The above exception was the direct cause of the following exception:
ERROR 01-19 22:13:40 worker_base.py:467]
ERROR 01-19 22:13:40 worker_base.py:467] Traceback (most recent call last):
ERROR 01-19 22:13:40 worker_base.py:467] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 459, in execute_method
ERROR 01-19 22:13:40 worker_base.py:467] return executor(*args, **kwargs)
ERROR 01-19 22:13:40 worker_base.py:467] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 343, in execute_model
ERROR 01-19 22:13:40 worker_base.py:467] output = self.model_runner.execute_model(
ERROR 01-19 22:13:40 worker_base.py:467] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 01-19 22:13:40 worker_base.py:467] return func(*args, **kwargs)
ERROR 01-19 22:13:40 worker_base.py:467] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner_base.py", line 152, in _wrapper
ERROR 01-19 22:13:40 worker_base.py:467] raise type(err)(
ERROR 01-19 22:13:40 worker_base.py:467] RuntimeError: Error in model execution (input dumped to /tmp/err_execute_model_input_20250119-221340.pkl): Could not infer dtype of NoneType
[Default][2025-01-19 22:13:40.503][INFO]:Error in model execution (input dumped to /tmp/err_execute_model_input_20250119-221340.pkl): Could not infer dtype of NoneType
(RayWorkerWrapper pid=780) INFO 01-19 22:13:34 custom_all_reduce.py:224] Registering 5635 cuda graph addresses [repeated 2x across cluster]
(RayWorkerWrapper pid=780) INFO 01-19 22:13:35 model_runner.py:1535] Graph capturing finished in 18 secs, took 0.42 GiB [repeated 2x across cluster]
[rank0]:[W119 22:13:44.396387951 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())

Main env:
torch=2.5.1
vllm=0.6.6
vllm-flash-attn=2.6.1
transformers=4.46.3

Main args:
tp=4
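
(As a side note, the input that vLLM dumped to /tmp/err_execute_model_input_20250119-221340.pkl can be inspected to see what reached the sampler; a minimal sketch, assuming vLLM is installed in the same environment so the pickled objects can be resolved:)

import pickle

# Load the model input that vLLM saved for the failed execute_model call.
# Unpickling needs vLLM importable, since the object references vLLM's
# internal dataclasses.
with open("/tmp/err_execute_model_input_20250119-221340.pkl", "rb") as f:
    dumped_input = pickle.load(f)

print(type(dumped_input))
print(dumped_input)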

@LaoWangGB commented:

Can you show the full stack trace?

Additionally, I used the same code and environment to run inference with InternVL2_5-8B successfully. Maybe the vision part and the language part (Qwen2) are not well matched, so a wrong tensor is delivered from the vision part to the language part.

@Isotr0py (Collaborator) commented:

Maybe the vision part and the language part (Qwen2) are not well matched, so a wrong tensor is delivered from the vision part to the language part.

@LaoWangGB Can you try the InternVL2.5-26B model as well? If it also occurs on 26B, this might be the case.
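
For reference, a minimal offline comparison sketch might look like the following (assumptions: 4 GPUs matching the tp=4 setting above, a local placeholder image, and the chat template applied via the tokenizer; only the model name is swapped between the 26B and 78B runs):

from PIL import Image
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Swap between "OpenGVLab/InternVL2_5-26B" and "OpenGVLab/InternVL2_5-78B"
# to compare the two LM backbones under identical settings.
model_name = "OpenGVLab/InternVL2_5-78B"

llm = LLM(model=model_name, tensor_parallel_size=4, trust_remote_code=True)

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
messages = [{"role": "user", "content": "<image>\nDescribe the image."}]
prompt = tokenizer.apply_chat_template(messages,
                                       tokenize=False,
                                       add_generation_prompt=True)

image = Image.open("test.jpg")  # placeholder local image

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)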

@LaoWangGB commented Jan 20, 2025

Maybe the vision part and the language part (Qwen2) are not well matched, so a wrong tensor is delivered from the vision part to the language part.

@LaoWangGB Can you try the InternVL2.5-26B model as well? If it also occurs on 26B, this might be the case.

InternVL2.5-26B works well, so the problem seems to stem from the difference in language model backbones. Did you test 78B successfully?
