[Bugfix][VLM] Fix mixed-modality inference backward compatibility for V0 #12313
Conversation
Signed-off-by: Roger Wang <[email protected]>
Repro script:

from vllm import LLM, SamplingParams
from vllm.assets.image import ImageAsset
from vllm.assets.video import VideoAsset


def get_llm():
    model_name = "llava-hf/llava-onevision-qwen2-7b-ov-hf"
    llm = LLM(
        model=model_name,
        max_num_seqs=5,
        max_num_batched_tokens=32768,
        enable_prefix_caching=False,
    )
    return llm


def get_multi_modal_input(modality):
    if modality == "image":
        # Input image and question
        image = ImageAsset("cherry_blossom") \
            .pil_image.convert("RGB")
        img_question = "What is the content of this image?"
        return {
            "data": image,
            "question": img_question,
        }

    if modality == "video":
        # Input video and question
        video = VideoAsset(name="sample_demo_1.mp4",
                           num_frames=4).np_ndarrays
        vid_question = "Why is this video funny?"
        return {
            "data": video,
            "question": vid_question,
        }

    msg = f"Modality {modality} is not supported."
    raise ValueError(msg)


if __name__ == "__main__":
    modalities = ["image", "video", "image"]

    inputs = []
    for modality in modalities:
        if modality == "image":
            placeholder = "<image>"
        elif modality == "video":
            placeholder = "<video>"

        mm_input = get_multi_modal_input(modality)
        data = mm_input["data"]
        question = mm_input["question"]
        prompt = f"<|im_start|>user {placeholder}\n{question}<|im_end|> \
<|im_start|>assistant\n"

        inputs.append(
            {
                "prompt": prompt,
                "multi_modal_data": {
                    modality: data
                },
            }
        )

    llm = get_llm()
    params = SamplingParams(max_tokens=16, temperature=0.0)
    outputs = llm.generate(inputs, sampling_params=params)

    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)

Running this on main:
This branch:
Also FYI @HwwwwwwwH since you're working on MiniCPM-O, this PR might be relevant to you.
Overall LGTM, just a nit, PTAL!
Signed-off-by: Roger Wang <[email protected]>
Signed-off-by: Roger Wang <[email protected]>
This is a follow-up of #12259. When multiple modalities are involved, embedding merging is based on different assumptions in V0 and V1, so get_input_embeddings does not work with V0 and produces a bug when a batch contains multiple modalities. Fortunately, since embedding generation and the LM forward pass are executed together inside the VLM forward pass, this gives us a clean way to separate the logic of the two code paths.
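To illustrate the idea, here is a minimal toy sketch (the class, method names, and token IDs below are made up for illustration and are not the actual vLLM interfaces): the model checks whether precomputed inputs_embeds were passed in; if not (the V0-style path), it builds and merges the multimodal embeddings itself right before running the language model.

import torch
import torch.nn as nn


class ToyMixedModalityModel(nn.Module):
    """Toy model whose forward() dispatches between a V0-style and a
    V1-style embedding path (illustrative only, not vLLM code)."""

    def __init__(self, vocab_size=128, hidden_size=16, image_token_id=127):
        super().__init__()
        self.image_token_id = image_token_id
        self.embed_tokens = nn.Embedding(vocab_size, hidden_size)
        self.language_model = nn.Linear(hidden_size, vocab_size)

    def get_input_embeddings(self, input_ids, mm_embeds=None):
        # V1-style: the runner calls this up front and passes the result
        # to forward() as inputs_embeds.
        embeds = self.embed_tokens(input_ids)
        if mm_embeds is not None:
            embeds[input_ids == self.image_token_id] = mm_embeds
        return embeds

    def forward(self, input_ids=None, inputs_embeds=None, mm_embeds=None):
        if inputs_embeds is None:
            # V0-style: embedding generation and merging happen here,
            # inside the model forward, right before the LM forward pass.
            inputs_embeds = self.get_input_embeddings(input_ids, mm_embeds)
        return self.language_model(inputs_embeds)


if __name__ == "__main__":
    model = ToyMixedModalityModel()
    ids = torch.tensor([1, 2, 127, 3])   # 127 is the image placeholder token
    mm = torch.randn(1, 16)              # one fake "image" embedding

    # V0-style call: token IDs and multimodal embeddings go in together.
    logits_v0 = model(input_ids=ids, mm_embeds=mm)

    # V1-style call: embeddings are precomputed, forward() only runs the LM.
    logits_v1 = model(inputs_embeds=model.get_input_embeddings(ids, mm))

    assert torch.allclose(logits_v0, logits_v1)

Because the branch lives inside the model's forward pass, a V0-style runner can keep feeding token IDs plus multimodal inputs as before, while a V1-style runner keeps its precomputed-embedding flow; neither path needs to know about the other's batching assumptions.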