
[Doc] Explicitly state that InternVL 2.5 is supported #10978

Merged
2 commits merged into vllm-project:main on Dec 7, 2024

Conversation

@DarkLight1337 (Member) commented Dec 7, 2024

The model architecture of InternVL2.5 is the same as InternVL2 except for a different LM backbone. We have already implemented dynamic LM loading for this model, so no further changes are needed to support it in vLLM.

I have tested the 4B model locally (vllm serve OpenGVLab/InternVL2_5-4B) and it seems to be working fine.
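As a rough illustration (not part of this PR), the server started by vllm serve OpenGVLab/InternVL2_5-4B can be queried through the OpenAI-compatible API; the port (8000 by default) and the image URL below are placeholders:

from openai import OpenAI

# Minimal sketch for querying the OpenAI-compatible server started by
# `vllm serve OpenGVLab/InternVL2_5-4B`. Assumptions: the default port
# 8000 and a placeholder image URL.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="OpenGVLab/InternVL2_5-4B",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/sample.jpg"}},
        ],
    }],
    max_tokens=64,
)
print(response.choices[0].message.content)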

@DarkLight1337 added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Dec 7, 2024

github-actions bot commented Dec 7, 2024

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which covers a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

mergify bot added the documentation label (Improvements or additions to documentation) on Dec 7, 2024
@DarkLight1337 (Member, Author) commented Dec 7, 2024

I'm not completely sure where to get the stop tokens, though. The link in examples/offline_inference_vision_language.py doesn't go to the tokenizer config. @Isotr0py, can you help update the example scripts?

@Isotr0py (Collaborator) commented Dec 7, 2024

You can get the stop tokens here: https://huggingface.co/OpenGVLab/InternVL2_5-4B/blob/9e3bfef341bf84ca3efed094ea6c598e6b34f527/conversation.py#L335-L391

It seems they deleted the stop tokens from the README.

@DarkLight1337 (Member, Author) commented:

I see, let me update the link then, thanks for pointing that out!

@DarkLight1337 (Member, Author) commented Dec 7, 2024

So basically, I should include the stop_str and sep from each template type that is listed there?

@Isotr0py (Collaborator) left a comment:

LGTM!

@Isotr0py (Collaborator) commented Dec 7, 2024

So basically, I should include the stop_str and sep from each template type that is listed there?

Yes, I think we just need stop_str. The current stop_str should be fine, because InternVL2.5 uses Qwen2.5 as its LLM backbone, whose stop_str is already included.
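For illustration, the stop strings could be wired into the offline example roughly like this (a sketch, not the exact contents of examples/offline_inference_vision_language.py; the stop token list is an assumption gathered from the conversation templates linked above, with <|im_end|> covering the Qwen2.5 backbone):

from transformers import AutoTokenizer
from vllm import SamplingParams

tokenizer = AutoTokenizer.from_pretrained("OpenGVLab/InternVL2_5-4B",
                                          trust_remote_code=True)

# Assumed stop_str values from the InternVL conversation templates;
# "<|im_end|>" is the one used by the Qwen2.5 backbone of InternVL2.5.
stop_tokens = ["<|endoftext|>", "<|im_start|>", "<|im_end|>", "<|end|>"]
stop_token_ids = [
    tid for tid in (tokenizer.convert_tokens_to_ids(t) for t in stop_tokens)
    if tid is not None  # skip tokens missing from this tokenizer's vocab
]

sampling_params = SamplingParams(temperature=0.0,
                                 max_tokens=64,
                                 stop_token_ids=stop_token_ids)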

Signed-off-by: DarkLight1337 <[email protected]>
@Isotr0py Isotr0py enabled auto-merge (squash) December 7, 2024 15:12
@Isotr0py Isotr0py merged commit 1c768fe into vllm-project:main Dec 7, 2024
34 checks passed
@DarkLight1337 DarkLight1337 deleted the internvl-25 branch December 7, 2024 16:59
sleepwalker2017 pushed a commit to sleepwalker2017/vllm that referenced this pull request Dec 13, 2024
BKitor pushed a commit to BKitor/vllm that referenced this pull request Dec 30, 2024
@LaoWangGB commented:

[screenshot of the error attached]
When I deploy InternVL2.5-78B with vllm=0.6.6, this error occurs.

The model architecture of InternVL2.5 is the same as InternVL2 except for a different LM backbone. We have already implemented dynamic LM loading for this model, so no further changes are needed to support it in vLLM.

I have tested the 4B model locally (vllm serve OpenGVLab/InternVL2_5-4B) and it seems to be working fine.

@DarkLight1337 (Member, Author) commented:

Can you show the full stack trace?

@LaoWangGB commented:

(RayWorkerWrapper pid=781) INFO 01-19 22:13:17 model_runner.py:1415] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing gpu_memory_utilization or switching to eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.
(RayWorkerWrapper pid=781) INFO 01-19 22:13:34 custom_all_reduce.py:224] Registering 5635 cuda graph addresses
(RayWorkerWrapper pid=780) udf-pod-36828-0-ef5a96ed49c9356d:780:5932 [1] NCCL INFO Channel 14/1 : 1[1] -> 0[0] via P2P/IPC/read [repeated 147x across cluster]
(RayWorkerWrapper pid=780) udf-pod-36828-0-ef5a96ed49c9356d:780:5918 [1] NCCL INFO Connected all trees [repeated 5x across cluster]
(RayWorkerWrapper pid=780) udf-pod-36828-0-ef5a96ed49c9356d:780:5918 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512 [repeated 5x across cluster]
(RayWorkerWrapper pid=780) udf-pod-36828-0-ef5a96ed49c9356d:780:5918 [1] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer [repeated 5x across cluster]
(RayWorkerWrapper pid=780) udf-pod-36828-0-ef5a96ed49c9356d:780:780 [1] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2, using internal tuner instead. [repeated 2x across cluster]
(RayWorkerWrapper pid=780) udf-pod-36828-0-ef5a96ed49c9356d:780:5918 [1] NCCL INFO ncclCommInitRank comm 0x557cda636e90 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 51000 commId 0xac8a6ac37794e18d - Init COMPLETE [repeated 5x across cluster]
(RayWorkerWrapper pid=780) udf-pod-36828-0-ef5a96ed49c9356d:780:5918 [1] NCCL INFO Using non-device net plugin version 0 [repeated 2x across cluster]
(RayWorkerWrapper pid=780) udf-pod-36828-0-ef5a96ed49c9356d:780:5918 [1] NCCL INFO Using network Socket [repeated 2x across cluster]
(RayWorkerWrapper pid=780) udf-pod-36828-0-ef5a96ed49c9356d:780:5918 [1] NCCL INFO ncclCommInitRank comm 0x557cda636e90 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 51000 commId 0xac8a6ac37794e18d - Init START [repeated 2x across cluster]
(RayWorkerWrapper pid=780) udf-pod-36828-0-ef5a96ed49c9356d:780:5918 [1] NCCL INFO Setting affinity for GPU 1 to 1f7fff,00000000,001f7fff [repeated 2x across cluster]
(RayWorkerWrapper pid=780) udf-pod-36828-0-ef5a96ed49c9356d:780:5918 [1] NCCL INFO NVLS multicast support is not available on dev 1 [repeated 2x across cluster]
(RayWorkerWrapper pid=780) udf-pod-36828-0-ef5a96ed49c9356d:780:5918 [1] NCCL INFO comm 0x557cda636e90 rank 1 nRanks 4 nNodes 1 localRanks 4 localRank 1 MNNVL 0 [repeated 2x across cluster]
(RayWorkerWrapper pid=780) udf-pod-36828-0-ef5a96ed49c9356d:780:5918 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [repeated 2x across cluster]
(RayWorkerWrapper pid=780) udf-pod-36828-0-ef5a96ed49c9356d:780:5918 [1] NCCL INFO P2P Chunksize set to 524288 [repeated 2x across cluster]
(RayWorkerWrapper pid=780) nnel 05/0 : 1[1] -> 0[0] via P2P/IPC/read
(RayWorkerWrapper pid=780) udf-pod-36828-0-ef5a96ed49c9356d:780:5918 [1] NCCL INFO
(RayWorkerWrapper pid=780) udf-pod-36828-0-ef5a96ed49c9356d:780:5918 [1] NCCL INFO Connected all rings [repeated 2x across cluster]
(RayWorkerWrapper pid=780) Channel 12/0 : 1[1] -> 2[2] via P2P/IPC/read
(RayWorkerWrapper pid=780) INFO 01-19 22:13:14 worker.py:241] Memory profiling takes 8.98 seconds [repeated 2x across cluster]
(RayWorkerWrapper pid=780) INFO 01-19 22:13:14 worker.py:241] the current vLLM instance can use total_gpu_memory (79.35GiB) x gpu_memory_utilization (0.90) = 71.41GiB [repeated 2x across cluster]
(RayWorkerWrapper pid=780) INFO 01-19 22:13:14 worker.py:241] model weights take 39.52GiB; non_torch_memory takes 1.47GiB; PyTorch activation peak memory takes 3.94GiB; the rest of the memory reserved for KV Cache is 26.48GiB. [repeated 2x across cluster]
(RayWorkerWrapper pid=769) INFO 01-19 22:13:17 model_runner.py:1415] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing gpu_memory_utilization or switching to eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage. [repeated 2x across cluster]
INFO 01-19 22:13:35 custom_all_reduce.py:224] Registering 5635 cuda graph addresses
(RayWorkerWrapper pid=781) INFO 01-19 22:13:35 model_runner.py:1535] Graph capturing finished in 18 secs, took 0.42 GiB
INFO 01-19 22:13:35 model_runner.py:1535] Graph capturing finished in 18 secs, took 0.42 GiB
INFO 01-19 22:13:35 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 30.14 seconds
INFO 01-19 22:13:36 chat_utils.py:333] Detected the chat template content format to be 'string'. You can set --chat-template-content-format to override this.
INFO 01-19 22:13:37 preprocess.py:215] Your model uses the legacy input pipeline instead of the new multi-modal processor. Please note that the legacy pipeline will be removed in a future release. For more details, see: https://github.com//issues/10114

0%| | 0/61 [00:00<?, ?it/s]
INFO 01-19 22:13:40 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250119-221340.pkl...
INFO 01-19 22:13:40 model_runner_base.py:149] Completed writing input of failed execution to /tmp/err_execute_model_input_20250119-221340.pkl.
ERROR 01-19 22:13:40 worker_base.py:467] Error executing method execute_model. This might cause deadlock in distributed execution.
ERROR 01-19 22:13:40 worker_base.py:467] Traceback (most recent call last):
ERROR 01-19 22:13:40 worker_base.py:467] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
ERROR 01-19 22:13:40 worker_base.py:467] return func(*args, **kwargs)
ERROR 01-19 22:13:40 worker_base.py:467] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1747, in execute_model
ERROR 01-19 22:13:40 worker_base.py:467] output: SamplerOutput = self.model.sample(
ERROR 01-19 22:13:40 worker_base.py:467] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/internvl.py", line 772, in sample
ERROR 01-19 22:13:40 worker_base.py:467] return self.language_model.sample(logits, sampling_metadata)
ERROR 01-19 22:13:40 worker_base.py:467] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 496, in sample
ERROR 01-19 22:13:40 worker_base.py:467] next_tokens = self.sampler(logits, sampling_metadata)
ERROR 01-19 22:13:40 worker_base.py:467] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 01-19 22:13:40 worker_base.py:467] return self._call_impl(*args, **kwargs)
ERROR 01-19 22:13:40 worker_base.py:467] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 01-19 22:13:40 worker_base.py:467] return forward_call(*args, **kwargs)
ERROR 01-19 22:13:40 worker_base.py:467] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/sampler.py", line 258, in forward
ERROR 01-19 22:13:40 worker_base.py:467] logits = _apply_min_tokens_penalty(logits, sampling_metadata)
ERROR 01-19 22:13:40 worker_base.py:467] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/sampler.py", line 380, in _apply_min_tokens_penalty
ERROR 01-19 22:13:40 worker_base.py:467] logits[tuple(zip(*logits_to_penalize))] = -float("inf")
ERROR 01-19 22:13:40 worker_base.py:467] RuntimeError: Could not infer dtype of NoneType
ERROR 01-19 22:13:40 worker_base.py:467]
ERROR 01-19 22:13:40 worker_base.py:467] The above exception was the direct cause of the following exception:
ERROR 01-19 22:13:40 worker_base.py:467]
ERROR 01-19 22:13:40 worker_base.py:467] Traceback (most recent call last):
ERROR 01-19 22:13:40 worker_base.py:467] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 459, in execute_method
ERROR 01-19 22:13:40 worker_base.py:467] return executor(*args, **kwargs)
ERROR 01-19 22:13:40 worker_base.py:467] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 343, in execute_model
ERROR 01-19 22:13:40 worker_base.py:467] output = self.model_runner.execute_model(
ERROR 01-19 22:13:40 worker_base.py:467] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 01-19 22:13:40 worker_base.py:467] return func(*args, **kwargs)
ERROR 01-19 22:13:40 worker_base.py:467] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner_base.py", line 152, in _wrapper
ERROR 01-19 22:13:40 worker_base.py:467] raise type(err)(
ERROR 01-19 22:13:40 worker_base.py:467] RuntimeError: Error in model execution (input dumped to /tmp/err_execute_model_input_20250119-221340.pkl): Could not infer dtype of NoneType
[Default][2025-01-19 22:13:40.503][INFO]:Error in model execution (input dumped to /tmp/err_execute_model_input_20250119-221340.pkl): Could not infer dtype of NoneType
(RayWorkerWrapper pid=780) INFO 01-19 22:13:34 custom_all_reduce.py:224] Registering 5635 cuda graph addresses [repeated 2x across cluster]
(RayWorkerWrapper pid=780) INFO 01-19 22:13:35 model_runner.py:1535] Graph capturing finished in 18 secs, took 0.42 GiB [repeated 2x across cluster]
[rank0]:[W119 22:13:44.396387951 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())

Main env:
torch=2.5.1
vllm=0.6.6
vllm-flash-attn=2.6.1
transformers=4.46.3

Main args:
tp=4
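
(As a side note, the input that vLLM dumped to /tmp/err_execute_model_input_20250119-221340.pkl can be inspected to see what reached the sampler; a minimal sketch, assuming vLLM is installed in the same environment so the pickled objects can be resolved:)

import pickle

# Load the model input that vLLM saved for the failed execute_model call.
# Unpickling needs vLLM importable, since the object references vLLM's
# internal dataclasses.
with open("/tmp/err_execute_model_input_20250119-221340.pkl", "rb") as f:
    dumped_input = pickle.load(f)

print(type(dumped_input))
print(dumped_input)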

@LaoWangGB commented:

Can you show the full stack trace?

Additionally, I used the same code and environment to run inference with InternVL2_5-8B successfully. Maybe the vision part and the language part (Qwen2) are not well matched, so a wrong tensor is delivered from the vision part to the language part.

@Isotr0py (Collaborator) commented:

Maybe the vision part and the language part (Qwen2) are not well matched, so a wrong tensor is delivered from the vision part to the language part.

@LaoWangGB Can you try the InternVL2.5-26B model as well? If it also occurs on 26B, this might be the case.
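
For reference, a minimal offline comparison sketch might look like the following (assumptions: 4 GPUs matching the tp=4 setting above, a local placeholder image, and the chat template applied via the tokenizer; only the model name is swapped between the 26B and 78B runs):

from PIL import Image
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Swap between "OpenGVLab/InternVL2_5-26B" and "OpenGVLab/InternVL2_5-78B"
# to compare the two LM backbones under identical settings.
model_name = "OpenGVLab/InternVL2_5-78B"

llm = LLM(model=model_name, tensor_parallel_size=4, trust_remote_code=True)

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
messages = [{"role": "user", "content": "<image>\nDescribe the image."}]
prompt = tokenizer.apply_chat_template(messages,
                                       tokenize=False,
                                       add_generation_prompt=True)

image = Image.open("test.jpg")  # placeholder local image

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)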

@LaoWangGB commented Jan 20, 2025

Maybe the vision part and the language part (Qwen2) are not well matched, so a wrong tensor is delivered from the vision part to the language part.

@LaoWangGB Can you try the InternVL2.5-26B model as well? If it also occurs on 26B, this might be the case.

InternVL2.5-26B works well, so the problem seems to stem from the difference in language model backbones. Did you test 78B successfully?
