[RFC]: Multi-modality Support on vLLM #4194

Open
44 of 77 tasks
ywang96 opened this issue Apr 19, 2024 · 91 comments

@ywang96
Member

ywang96 commented Apr 19, 2024

[Open issues - help wanted!]

Update [11/18] - In the upcoming months, we will focus on performance optimization for multimodal models as part of vLLM V1 engine re-arch effort

P0 (We will definitely work on them):

P1 (We should be aware of these and spend some time if possible):

P2 (We should work on these when they become more important/frequently requested):


Update [9/8] - We have finished the majority of the refactoring and made extensive progress on supporting multimodal models. See details here.

Roadmap for Q3 2024

In the upcoming months, we will focus on enabling multimodal models to be compatible with other performance-related features on vLLM as well as collaborating with model vendors to directly onboard new multimodal models.

P0 (We will definitely work on them):

P1 (We should be aware of these and spend some time if possible):

P2 (We should work on these when they become more important/frequently requested):


Update [7/3] - We have finished our 2nd refactoring milestone - see details here.

Roadmap for 3rd Milestone

In the upcoming months, we will focus on wrapping up the main goal of this refactoring RFC and supporting more models and modalities.

P0 (We will definitely work on these):

P1 (We should be aware of these and spend some time if possible):

P2 (We should work on these when they become more frequently requested) Help wanted!:


Update [6/11] - We have finished our 1st refactoring milestone - see details here.

Roadmap for 2nd Milestone

Some of the items @DarkLight1337, @xwjiang2010 and I are looking to work on as part of the next milestone are tentatively:

API Changes: A list of user-facing breaking changes can be found here

Performance related

Model support - Add more vision language models, and better developer facing documentation

Some of the ideas that we should work on in the future:

  • Make VLMs work with chunked prefill
  • Unify tokenizer & multi-modal processor (so that we can leverage AutoProcessor from transformers)
  • Prefix caching for images
  • Streaming inputs of multi-modal data

As always, please provide feedback and feature requests in this issue. Suggestions and contributions are very welcome!


Original RFC

Multi-modality support was brought to vLLM recently, thanks to https://github.com//pull/3042 from @xwjiang2010. Since then, we have seen increasing interest in such models (judging by the number of related pull requests and issues). However, there are a few issues we should address with the current design before we bring in more features around multi-modality.
  1. VisionLanguageConfig and MultiModalData

    • Currently the multimodal input can be either pixel_values or image_features for simplicity. While this works well with llava 1.5, where pixel_values is the only output of its CLIPImageProcessor, it does not work well for models with more complicated preprocessing that returns multiple outputs (e.g., llava 1.6, fuyu, etc.). Developers could add additional preprocessing inside the model implementation as a workaround, but this would become unmaintainable over time.

    • The overhead of requiring image_feature_size, image_token_id and image_input_shape is pushed to the user, when these can and should be inferred from the model & processor config and not be required at inference time.

  2. The current design assumes multi-modal inputs are already processed to be consumed by the model executable, but vLLM does not have a processor util. This blocks the vision model support on the OpenAI API server for end-to-end inference.

  3. The current prompt format "<Image>" * 576 + prompt makes the underlying implementation easier (especially when it comes to profiling), but complicates the user experience compared to the HuggingFace format "<Image>\n" + prompt, and this has caused some confusion about what is needed to make multi-modal models work on vLLM.

Proposal
Most items in the above issues have been discussed and addressed in the original Llava1.5 PR as well as #3978. We propose a few high-level design decisions for the refactoring and welcome any feedback!

  1. Adding a processor util - We can leverage the out-of-the-box AutoProcessor from transformers the same way we have been doing with the tokenizer, as an attribute of LLMEngine (e.g., self.multi_modal_processor = AutoProcessor.from_pretrained(model)). This allows us to support end-to-end inference with the API server as well as the LLM object.

  2. Frontend input format: Because of 1, we can keep the same format as HuggingFace, since that's how users usually discover new models and it makes end-to-end integration tests easier. Preprocessing should be hidden away from the interface and the user. For example, this preprocessing step can be done inside LLMEngine.add_request(), around the same place as

    if arrival_time is None:
        arrival_time = time.time()
    prompt_token_ids = self.encode_request(
        request_id=request_id,
        prompt=prompt,
        prompt_token_ids=prompt_token_ids,
        lora_request=lora_request)

    Here is some pseudocode:

if multi_modal_input is None:
    prompt_token_ids = self.encode_request(
        request_id=request_id,
        prompt=prompt,
        prompt_token_ids=prompt_token_ids,
        lora_request=lora_request)
else:
    # preprocessed_inputs is a dictionary of key (str) -> value (tensor),
    # i.e. the output of self.multi_modal_processor
    preprocessed_inputs = self.preprocess_request(
        request_id=request_id,
        prompt=prompt,
        prompt_token_ids=prompt_token_ids,
        lora_request=lora_request,
        multi_modal_input=multi_modal_input)
    # Split the token ids from the remaining model inputs
    prompt_token_ids = preprocessed_inputs.pop("input_ids")
    multi_modal_data = MultiModalData(data=preprocessed_inputs)
...

and thus at LLM level, only image tensors will be required.

  3. Refactor MultiModalData: This object now simply holds the multi-modal data dictionary that we need for the model_executable. At inference time, the data is unpacked in the forward pass. This approach is similar to the transformers implementation of multi-modal models.
  4. Refactor VisionLanguageConfig: This config is a lot simpler now. One caveat is that when the image feature size is dynamic, users may specify an optional max_feature_size to help the engine run profiling for the worst-case scenario as well as to potentially abort certain requests.
  5. Regarding the original image_feature as input type design: IMO LLaVA is a special case among multi-modal models since its vision encoder is detached from the language model and can be initialized separately. However, one could argue the same for the MultiModalProjector, and passing image_feature (outputs of CLIP) is perhaps a design decision that does not generalize to other models. Instead, passing multi-modal embeddings (outputs of CLIP -> Projector) at inference time is more flexible and should work nicely with other models. (One follow-up question: does it make sense to define a separate Llava-no-clip module, since this is so specific to LLaVA, to make our lives easier?)

With the above changes, as an end-user, ideally you then should be able to do something like the following

import requests
from PIL import Image
from vllm import LLM
from vllm.config import VisionLanguageConfig

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"
# Proposed interface: the input type can also be EMBEDDINGS
llm = LLM(model=model_id, multi_modal_input_type=VisionLanguageConfig.IMAGE_INPUT_TYPE.IMAGE)

prompt = "<image>\nUSER: What's the content of the image?\nASSISTANT:"

url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image = Image.open(requests.get(url, stream=True).raw)

llm.generate(prompt, ..., multi_modal_input=image)

Under the hood, the pipeline is

prompt, image
-> prompt_token_ids, MultiModalData(data=preprocessed_inputs) # through preprocess within engine.add_request() 
-> prompt_token_ids, pixel_values, image_sizes  # through unpacking in implementation of model's `forward`.
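
To make the pipeline above concrete, here is a minimal, self-contained sketch. It is not vLLM code: FakeProcessor stands in for a HuggingFace processor, the tokenization is a toy, and preprocess and forward are hypothetical names. It only shows the token ids being popped off the processor output and the remaining keys being unpacked as keyword arguments.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class MultiModalData:
    """Holds whatever the processor returned, minus the token ids."""
    data: Dict[str, Any]


class FakeProcessor:
    """Stand-in for a HuggingFace processor; returns a dict of model inputs."""

    def __call__(self, text: str, images: Any) -> Dict[str, Any]:
        return {
            "input_ids": [ord(c) for c in text],  # toy tokenization
            "pixel_values": images,
            "image_sizes": [(len(images), len(images[0]))],
        }


def preprocess(processor: FakeProcessor, prompt: str,
               image: Optional[Any]) -> Tuple[List[int], Optional[MultiModalData]]:
    """Roughly what the proposal places inside engine.add_request()."""
    if image is None:
        return [ord(c) for c in prompt], None
    outputs = processor(text=prompt, images=image)
    prompt_token_ids = outputs.pop("input_ids")
    return prompt_token_ids, MultiModalData(data=outputs)


def forward(input_ids: List[int], pixel_values=None, image_sizes=None) -> str:
    """Toy model entry point: the remaining keys arrive as keyword arguments."""
    return f"{len(input_ids)} tokens, image_sizes={image_sizes}"


token_ids, mm_data = preprocess(FakeProcessor(), "<image>\nhi", [[0, 0], [0, 0]])
result = forward(token_ids, **mm_data.data)
```

Because the dictionary keys match the parameter names of forward, supporting a model whose processor returns extra outputs would not require changing the container.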

I will follow up with a series of PRs for the refactoring, but please leave any feedback since this is a pretty significant interface change.

@ywang96 ywang96 added enhancement New feature or request RFC labels Apr 19, 2024
@ywang96
Member Author

ywang96 commented Apr 19, 2024

cc @DarkLight1337 @Isotr0py @alsichcan

@DarkLight1337
Member

DarkLight1337 commented Apr 19, 2024

Thank you for kickstarting this conversation!

Re: Issues

I fully agree with the issues which you have pointed out. I would like to add that the current prompt format is hardly extensible for multi-image input if we plan to pursue that further down the line. In #3978, I have proposed some ways of tackling the issue at the level of OpenAI-compatible server. I have thought about them more and have decided that they alone cannot provide the required flexibility, as explained below:

If there are only a small number of standard methods, we can provide a config option to choose which method to apply. I have added the image_openai attribute to VisionLanguageConfig to facilitate this.

I am not confident that this assumption would hold for very long, given the fast-changing pace of the field.

A more flexible option would be to pass the image(s) to the chat template (e.g. by setting the images attribute alongside role and content). This transfers the burden of implementation to the maintainers of the model on HuggingFace, making it more likely that vLLM users have to implement their own template. I have created ConversationMessage class to represent the dictionary for each message.

I feel that this should be limited to cases where we only have to pass a single <image> token. The requirement of duplicating image tokens according to feature size should not be a concern of the chat template.

This is not to mention that you still have to manually duplicate the <image> tokens when using vLLM engine directly.

Re: Proposals

Here are my own thoughts on each proposal:

1. Adding a processor util

I think that we should move this responsibility outside of the Engine class. This is because multi-modal input isn't necessarily limited to image data, so we should expect more data types to be added in the future. To avoid having to modify the core Engine logic each time, we can wrap the data with processor objects (with a common interface to process the data) before passing them into the Engine. This way, we can easily add new data types by simply defining a new processor class. For your reference, I have implemented this pattern in #4197.

2. Frontend input format

My comments on this are similar to those for Proposal 1. However, #4197 only refactors MultiModalData to define data processing logic. To avoid excessive duplication of the logic of encode_request, we should find a way to let MultiModalData control only parts of the process. Also, in my idea of MultiModalData, the processing logic should remain independent of the model architecture. I guess this is where Proposal 3 comes in: HuggingFace processors should output dictionaries with keys that match the parameter names of model.forward().

3. Refactor MultiModalData

I have refactored this class in #4197 according to this description, and it works well enough to support the image_size parameter of LLaVA-NeXT as shown in #4199.

4. Refactor VisionLanguageConfig

Currently in #4197, MultiModalData has to accept ModelConfig and VisionLanguageConfig separately. Perhaps we can make VisionLanguageConfig an attribute of ModelConfig so we do not have to pass in multiple parameters. Using this approach, we only have to add more attributes to ModelConfig instead of having to pass more config objects around in order to support additional multi-modal data types.

Regarding max_feature_size, refer to my comments on Proposal 5.

5. Regarding the original image_feature as input type design

Instead of indirectly specifying the input shapes through the config, we can have each model implement a method to return a dictionary (the required input shape for each keyword argument). For LLaVA, the feature size can be inferred from the HuggingFace config.json if we consider image size and patch size. To support profiling, we can slightly extend this to have the model define the maximum possible input shapes.

Is the unconventional prompt format "<image>" * image_feature_size + prompt mainly to support profiling? While implementing LLaVA-NeXT, I was under the impression that this is used to simplify the generation of the attention masks. Perhaps @xwjiang2010 would have a better idea.
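
As a rough illustration of inferring the feature size from the model config instead of asking the user for it, the sketch below assumes a ViT-style encoder that splits a square image into non-overlapping patches and keeps one feature per patch. The function names are hypothetical. The values 336 and 14 are the image and patch sizes of the CLIP ViT-L/14-336 vision tower used by LLaVA-1.5, which is where the 576 in the current prompt format comes from.

```python
from typing import Dict, Tuple


def image_feature_size(image_size: int, patch_size: int) -> int:
    """Number of patch features for a square image (assumes no CLS token is kept)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side


def max_input_shapes(image_size: int, num_channels: int = 3) -> Dict[str, Tuple[int, ...]]:
    """Hypothetical per-model hook: worst-case shape for each forward() kwarg."""
    return {"pixel_values": (num_channels, image_size, image_size)}


feature_size = image_feature_size(image_size=336, patch_size=14)
# The current vLLM prompt format repeats the placeholder once per feature.
legacy_prompt = "<image>" * feature_size + "\nUSER: What is in the image?\nASSISTANT:"
```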

@jeejeelee
Collaborator

@ywang96 Thanks for driving the integration of more MM models into vLLM. 😍

It seems that there is no plan to refactor vision encoder (todo in llava).

In my view, we should prioritize this, with performance being my main consideration.

By refactoring the vision encoder, we can establish an integration standard for MM models, similar to our LLM model integration. This will not only ensure inference performance but also provide integration guidelines for the community.

If I have misunderstood, please correct me. Thanks again for your work.

@Isotr0py
Collaborator

Isotr0py commented Apr 19, 2024

Generally, I agree with @DarkLight1337's opinion about moving the processing logic out of the Engine to avoid modifying core code frequently. However, I think it's difficult to keep the processing logic fully independent of the model architecture.

For example, FuyuProcessor and Idefics2Processor will pad input_ids with image_feature_size during preprocess, while LlavaProcessor won't (I guess this is also why "<image>" * image_feature_size + prompt is used for llava). This means that we need to pad input_ids for llava manually. (maybe there is a better way to handle this? 🤔)

@ywang96
Member Author

ywang96 commented Apr 19, 2024

cc @robertgshaw2-neuralmagic @mgoin (since NM's planned to work on whisper)

Thank you all for the feedback so far! I plan to address feedback altogether after meeting up with the core devs as well as getting more perspectives from other community members who are working/plan to work on multi-modal models.

Some quick ones that I can answer now:

It seems that there is no plan to refactor vision encoder (todo in llava).

@jeejeelee This will need to be done regardless since it's inside the model implementation, and this RFC is more around how we want to support multi-modal models in general, and thus focuses on the interface and component pattern.

However, I think it's difficult to keep the processing logics fully independent from the model architecture.

@DarkLight1337 @Isotr0py If this is just about where the processor should live, I'm indifferent between having it live inside LLMEngine or not. The tricky part IMO is that we then need to rework the interface of LLMEngine to consume the outputs of AutoProcessor as is.

I was under the impression that this is used to simplify the generation of the attention masks.

@DarkLight1337 That's correct too, but I'm worried that as the model gets more and more complicated, this approach might not be generalizable.

@imarcin-rbx

LLMEngine already supports an output processor interface, e.g. SequenceGroupOutputProcessor. Would it be reasonable to also add an InputProcessor interface within the engine?

This way the engine can check for the existence of an input processor, but the implementation (in this case, LLaVA's single-image processing) can live outside of the engine. Its implementation could be based on AutoProcessor, as suggested.

As for supporting inputs other than an image tag, or varying formats: the engine would only have a generic input processor executor. Within the model executor's code, it would be up to the model implementation to define an input processor and pass it on to the engine.
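
One way to picture this suggestion is the sketch below. It is only an illustration: InputProcessor, ToyEngine and SingleImageProcessor are hypothetical names and none of them exist in vLLM. The engine knows only the interface and checks whether the model supplied an implementation.

```python
from typing import Any, Dict, Optional, Protocol


class InputProcessor(Protocol):
    """Hypothetical interface that the engine would depend on."""

    def process(self, prompt: str, multi_modal_input: Any) -> Dict[str, Any]:
        ...


class ToyEngine:
    """Toy engine: generic executor, no modality-specific logic."""

    def __init__(self, input_processor: Optional[InputProcessor] = None) -> None:
        self.input_processor = input_processor

    def add_request(self, prompt: str, multi_modal_input: Any = None) -> Dict[str, Any]:
        if multi_modal_input is None:
            return {"prompt": prompt}
        if self.input_processor is None:
            raise ValueError("this model did not provide an input processor")
        return self.input_processor.process(prompt, multi_modal_input)


class SingleImageProcessor:
    """Lives with the model implementation, outside the engine."""

    def process(self, prompt: str, multi_modal_input: Any) -> Dict[str, Any]:
        return {"prompt": prompt, "pixel_values": multi_modal_input}


engine = ToyEngine(input_processor=SingleImageProcessor())
request = engine.add_request("<image>\nDescribe this.", multi_modal_input=[[0.0]])
```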

@DarkLight1337
Member

DarkLight1337 commented Apr 20, 2024

Generally, I agreed with @DarkLight1337's opinion about moving processing logics out from Engine to prevent modifying core code frequently. However, I think it's difficult to keep the processing logics fully independent from the model architecture.

For example, FuyuProcessor and Idefics2Processor will pad input_ids with image_feature_size during preprocess, while LlavaProcessor won't (I guess this is also why "<image>" * image_feature_size + prompt is used for llava). This means that we need to pad input_ids for llava manually. (maybe there is a better way to handle this? 🤔)

@Isotr0py Perhaps we could follow a registry pattern and have each model separately register how to preprocess the inputs? If the model does not do so, then the default implementation would be to pass the data to HuggingFace processors.

@Isotr0py
Collaborator

@Isotr0py Perhaps we could follow a registry pattern and have each model separately register how to preprocess the inputs? If the model does not do so, then the default implementation would be to pass the data to HuggingFace processors.

Yes, I agree that we can use processor registry to solve this. And it seems that transformers_utils/configs could be a good reference for this.

@hmellor hmellor added feature request and removed enhancement New feature or request labels Apr 20, 2024
@DarkLight1337
Member

DarkLight1337 commented Apr 22, 2024

@Isotr0py Perhaps we could follow a registry pattern and have each model separately register how to preprocess the inputs? If the model does not do so, then the default implementation would be to pass the data to HuggingFace processors.

Yes, I agree that we can use processor registry to solve this. And it seems that transformers_utils/configs could be a good reference for this.

I have added an implementation of the processor registry to #4197.

Edit: I have also moved the specification of dummy data (for profiling) to the top-level registry. Each model can define its own dummy data by registering a factory function.
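
The registry idea discussed above could look roughly like the following. This is a sketch under assumptions, not the implementation in #4197: MultiModalRegistry and its method names are made up for illustration, default_processor is a placeholder for handing the inputs to the HuggingFace processor, and the LLaVA numbers (576 features, 336x336 input) are those of LLaVA-1.5.

```python
from typing import Any, Callable, Dict

Processor = Callable[[str, Any], Dict[str, Any]]
DummyFactory = Callable[[], Dict[str, Any]]


def default_processor(prompt: str, data: Any) -> Dict[str, Any]:
    """Placeholder for 'pass the data to the HuggingFace processor'."""
    return {"prompt": prompt, "pixel_values": data}


class MultiModalRegistry:
    def __init__(self) -> None:
        self._processors: Dict[str, Processor] = {}
        self._dummy_factories: Dict[str, DummyFactory] = {}

    def register_processor(self, model_type: str) -> Callable[[Processor], Processor]:
        def decorator(fn: Processor) -> Processor:
            self._processors[model_type] = fn
            return fn
        return decorator

    def register_dummy_data(self, model_type: str) -> Callable[[DummyFactory], DummyFactory]:
        def decorator(fn: DummyFactory) -> DummyFactory:
            self._dummy_factories[model_type] = fn
            return fn
        return decorator

    def process(self, model_type: str, prompt: str, data: Any) -> Dict[str, Any]:
        # Fall back to the default when the model registered nothing.
        return self._processors.get(model_type, default_processor)(prompt, data)

    def dummy_data(self, model_type: str) -> Dict[str, Any]:
        return self._dummy_factories[model_type]()


REGISTRY = MultiModalRegistry()


@REGISTRY.register_processor("llava")
def llava_processor(prompt: str, data: Any) -> Dict[str, Any]:
    # Model-specific step: expand the single placeholder to one per feature.
    return {"prompt": prompt.replace("<image>", "<image>" * 576),
            "pixel_values": data}


@REGISTRY.register_dummy_data("llava")
def llava_dummy_data() -> Dict[str, Any]:
    # Worst-case inputs used only for memory profiling.
    return {"prompt": "<image>" * 576, "pixel_values_shape": (3, 336, 336)}
```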

@DarkLight1337
Member

2. Frontend input format

My comments on this are similar for Proposal 1. However, #4197 only refactors MultiModalData to define data processing logic. To avoid excessive duplication of the logic of encode_request, we should find a way to let MultiModalData control only parts of the process. Also, in my idea of MultiModalData, the processing logic should remain independent of the model architecture. I guess this is where Proposal 3 comes in: HuggingFace processors should output dictionaries with keys that match the parameter names of model.forward().

To solve the prompt format problem for LLaVA, I think we have to also deal with generating the attention masks in the processing framework. That would mean abstracting some of the logic of ModelRunner._prepare_prompt.

@DarkLight1337
Member

Just a heads up that #4228 will introduce another vision language model to vLLM, so our discussion should take that into account as well.

@ywang96
Member Author

ywang96 commented Apr 22, 2024

I discussed with @zhuohan123 offline about this - in particular regarding this comment

To avoid having to modify the core Engine logic each time, we can wrap the data with processor objects (with a common interface to process the data) before passing them into the Engine.

If vLLM is going to use the out-of-the-box AutoProcessor (which includes the tokenizer) anyway, then it's logical to make it an attribute of the engine (similar to what we did with the tokenizer). For now, for the sake of simplicity, we could add something like self.processor = AutoProcessor.from_pretrained(model_id) to this section if the model is an MM model.

if not self.model_config.skip_tokenizer_init:
    self.tokenizer: BaseTokenizerGroup
    self._init_tokenizer()
    self.detokenizer = Detokenizer(self.tokenizer)

then at inference time, depending on whether the request has multi-modal data or not, we process it with either self.tokenizer or self.processor.

(IMO eventually, there really shouldn't be a separation between how we preprocess text data and multi-modal data as they should all go through one InputProcessor class, but that is probably a bigger engineering refactoring that we can leave for later.)

We can also add an additional parameter on the engine level to indicate that we're feeding the engine an already processed dictionary of tensors, so the preprocessing step with self.processor will be skipped. (Very similar to prompt vs prompt_token_ids)

@DarkLight1337 @Isotr0py WDYT? Do you see any issue with this design?
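
A minimal sketch of the dispatch described in this comment, assuming three kinds of requests: text only, raw multi-modal input, and an already processed dictionary. The function signature and the processed_inputs parameter are hypothetical, and toy_tokenize and toy_process merely stand in for self.tokenizer and self.processor.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Tokenize = Callable[[str], List[int]]
Process = Callable[[str, Any], Dict[str, Any]]


def encode_request(
    tokenize: Tokenize,
    process: Process,
    prompt: Optional[str] = None,
    prompt_token_ids: Optional[List[int]] = None,
    multi_modal_input: Any = None,
    processed_inputs: Optional[Dict[str, Any]] = None,
) -> Tuple[List[int], Dict[str, Any]]:
    """Return (prompt_token_ids, extra keyword inputs for the model)."""
    if processed_inputs is not None:
        # Already processed by the caller: skip the processor entirely.
        inputs = dict(processed_inputs)
        return inputs.pop("input_ids"), inputs
    if multi_modal_input is not None:
        # Multi-modal request: the processor also performs tokenization.
        inputs = dict(process(prompt, multi_modal_input))
        return inputs.pop("input_ids"), inputs
    if prompt_token_ids is not None:
        return prompt_token_ids, {}
    return tokenize(prompt), {}


def toy_tokenize(text: str) -> List[int]:
    return [len(word) for word in text.split()]


def toy_process(text: str, image: Any) -> Dict[str, Any]:
    return {"input_ids": toy_tokenize(text), "pixel_values": image}


text_only = encode_request(toy_tokenize, toy_process, prompt="hello vllm")
with_image = encode_request(toy_tokenize, toy_process, prompt="a bc",
                            multi_modal_input="IMG")
```

The processed_inputs branch mirrors how prompt_token_ids lets callers bypass tokenization today.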

@DarkLight1337
Member

I discussed with @zhuohan123 offline about this - in particular regarding this comment

To avoid having to modify the core Engine logic each time, we can wrap the data with processor objects (with a common interface to process the data) before passing them into the Engine.

If vLLM's going to use out-of-box AutoProcessor (which includes tokenizer) anyways, then it's logical to make it an attribute of the engine (similar to what we did with tokenizer). As of now for the sake of simplicity, we could add something like self.processor = AutoProcessor(model_id) to this section if the model is an MM model.

if not self.model_config.skip_tokenizer_init:
    self.tokenizer: BaseTokenizerGroup
    self._init_tokenizer()
    self.detokenizer = Detokenizer(self.tokenizer)

then at inference time, depending on if the request has multi-modal data or not, we process with it with either self.tokenizer or self.processor.

(IMO eventually, there really shouldn't be a separation between how we preprocess text data and multi-modal data as they should all go through one InputProcessor class, but that is probably a bigger engineering refactoring that we can leave for later.)

We can also add an additional parameter on the engine level to indicate that we're feeding the engine an already processed dictionary of tensors, so the preprocessing step with self.processor will be skipped. (Very similar to prompt vs prompt_token_ids)

@DarkLight1337 @Isotr0py WDYT? Do you see any issue with this design?

This is somewhat similar to #4166 where I load the processing logic using AutoProcessor instead of AutoTokenizer for testing the HuggingFace implementation.

I think one potential issue of this design is that the direct dependency on HuggingFace (which we have no control over) would complicate efforts to apply additional preprocessing specific to certain HuggingFace processors (e.g. to adapt to our interface).

Since @Isotr0py 's comment, I have refactored the code in #4197 into using a registry pattern to apply the preprocessor, so that MultiModalData class itself no longer has any preprocessing logic.

@ywang96
Member Author

ywang96 commented Apr 23, 2024

@DarkLight1337 Thanks for sharing your thoughts! @zhuohan123 and I actually discussed the use of AutoProcessor.

I think the point is that today vLLM already relies on AutoTokenizer, and most of model implementations we have in vLLM today are based on the implementation of such models in transformers, so I don't really think having this dependency is a big issue. Using AutoProcessor also allows us to abstract away from image in particular so that the same interface will work for other modalities (e.g, whisper) as well.

The original design of the prompt interface isn't very clean, and is very specific to LLaVA-1.5. I would like to emphasize that not every MM model has a "vision tower + projector + LM" architecture, so IMO the input format should really be one of raw inputs (images), processed inputs (outputs of AutoProcessor) or embeddings (prompt embeddings + MM embeddings).

I will also be working on a PR so we can cross review each other's work.

@zhuohan123
Member

One thing to add is that we would like to keep vLLM's end-user API easy to use. Having AutoProcessor outside of vLLM requires the user to create and pick the correct Processor for the specific model they are using, which can be error-prone. So I lean towards having AutoProcessor in vLLM and an end user can directly feed in the raw image (e.g. like a jpg image) to vLLM.

@DarkLight1337
Member

@DarkLight1337 Thanks for sharing the thoughts! @zhuohan123 and I actually discussed about the use of AutoProcessor.

I think the point is that today vLLM already relies on AutoTokenizer, and most of model implementations we have in vLLM today are based on the implementation of such models in transformers, so I don't really think having this dependency is a big issue. Using AutoProcessor also allows us to abstract away from image in particular so that the same interface will work for other modalities (e.g, whisper) as well.

The original design of the prompt interface isn't very clean, and is very specific to LlaVa-1.5. I would like to emphasize that not every MM model has a "vision tower + projector + LM" architecture, so IMO the input format should really be one of raw inputs (images), processed inputs (outputs of autoprocessor) or embeddings (prompt embeddings + MM embeddings).

I will also be working on a PR so we can cross review each other's work.

In this case, we would have to refactor the computation of attention masks so that it can accept single <image> token for LLaVA, since that is what its HuggingFace processor expects. How can we integrate this into vLLM's computation of the attention masks?

@Isotr0py
Collaborator

Isotr0py commented Apr 23, 2024

Regarding #4228, I think there may be a situation that some MM models don't have a Processor implemented.

In this case, we would have to refactor the computation of attention masks so that it can accept single <image> token for LLaVA, since that is what its HuggingFace processor expects.

@DarkLight1337 IMO, one solution is to inherit from and modify the LLaVA processor so that it handles the num_features calculation, input_ids padding, etc. That way it can produce the right attention masks with the current attention mask computation code.
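
The padding step such a modified processor would need can be sketched in isolation. This shows only the input_ids expansion, not the subclass of the HuggingFace processor itself; the function name and the token id 32000 used in the example are illustrative assumptions.

```python
from typing import List


def expand_image_tokens(input_ids: List[int], image_token_id: int,
                        feature_size: int) -> List[int]:
    """Replace each image placeholder with feature_size copies of itself."""
    expanded: List[int] = []
    for token_id in input_ids:
        if token_id == image_token_id:
            expanded.extend([image_token_id] * feature_size)
        else:
            expanded.append(token_id)
    return expanded


# One placeholder (illustrative id 32000) becomes three placeholder positions.
padded = expand_image_tokens([1, 32000, 5, 6], image_token_id=32000, feature_size=3)
```

After this step the sequence length already accounts for the image features, so length-based mask computation would see the final size.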

@DarkLight1337
Member

Regarding #4228, I think there may be a situation that some MM models don't have a Processor implemented.

In this case, we would have to refactor the computation of attention masks so that it can accept single <image> token for LLaVA, since that is what its HuggingFace processor expects.

@DarkLight1337 IMO, there may be a solution that we can inherit and modify the LLaVA processor to handle num_features calculation and inputs_ids padding etc, so that it can create the right attention masks from current attention masks computation codes.

I like the idea of simply inheriting from the existing HuggingFace processor. How should we ensure that our implementation is loaded instead of the HuggingFace one?

@DarkLight1337
Member

DarkLight1337 commented Apr 23, 2024

Also, I think that we should wrap the input prompt to LLM.generate in order to better distinguish the kwargs to pass to the HF processor from the other arguments to LLM.generate. It is rather awkward right now that we have to pass a list of multi-modal data with length equal to the input prompts. If we use HF processor directly, the multi-modal inputs would become part of those kwargs instead of a separate MultiModalData instance.

Edit: Opened #4328
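
For illustration only, a wrapped prompt could be as simple as a typed dictionary that keeps each prompt together with its multi-modal data. The names PromptInput and split_inputs are hypothetical and are not taken from #4328.

```python
from typing import Any, Dict, List, Tuple, TypedDict


class PromptInput(TypedDict, total=False):
    """Hypothetical wrapper: one prompt plus the data that belongs to it."""
    prompt: str
    multi_modal_data: Dict[str, Any]


def split_inputs(inputs: List[PromptInput]) -> Tuple[List[str], List[Dict[str, Any]]]:
    """Separate prompt text from the kwargs destined for the processor."""
    prompts = [item["prompt"] for item in inputs]
    processor_kwargs = [item.get("multi_modal_data", {}) for item in inputs]
    return prompts, processor_kwargs


batch: List[PromptInput] = [
    {"prompt": "<image>\nWhat is this?", "multi_modal_data": {"images": "IMG"}},
    {"prompt": "Plain text question"},
]
prompts, processor_kwargs = split_inputs(batch)
```

With this shape there is no separate list whose length must match the prompts, and text-only prompts simply omit the extra key.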

@DarkLight1337
Member

DarkLight1337 commented Apr 23, 2024

I have noticed when using distributed inference on LLaVA-NeXT (#4199), there is a bug where the image tokens are not sent to the workers, resulting in an error when trying to merge the vision embeddings. This doesn't happen with LLaVA-1.5 because the model can be loaded inside a single GPU. Does anyone have a setup where LLaVA-1.5 is loaded across multiple GPUs to check whether this issue occurs in the existing vLLM code as well?

Edit: Never mind, it was just a typo in the chat template I passed to the command for running the OpenAI-compatible server. To avoid such confusion in the future, I have opened #4292 to detect whether the string looks like a file path.

@Isotr0py
Collaborator

Isotr0py commented Apr 23, 2024

How should we ensure that our implementation is loaded instead of the HuggingFace one?

I think we can refer to get_config() in transformers_utils/config.py, but search the registered processors first and then fall back to AutoProcessor, so that get_processor() could be:

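from typing import Dict, Optional, Type

from transformers import AutoProcessor, ProcessorMixin

# Proposed registry (does not exist yet): model_type -> custom processor class
_PROCESSOR_REGISTRY: Dict[str, Type[ProcessorMixin]] = {}


def get_processor(model: str,
                  model_type: str,
                  trust_remote_code: bool,
                  revision: Optional[str] = None,
                  code_revision: Optional[str] = None) -> ProcessorMixin:
    # Prefer a processor registered for this model type.
    if model_type in _PROCESSOR_REGISTRY:
        processor_class = _PROCESSOR_REGISTRY[model_type]
        return processor_class.from_pretrained(model,
                                               revision=revision,
                                               code_revision=code_revision)
    # Otherwise fall back to the HuggingFace processor.
    try:
        return AutoProcessor.from_pretrained(
            model,
            trust_remote_code=trust_remote_code,
            revision=revision,
            code_revision=code_revision)
    except ValueError:
        # do something else
        raise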

@DarkLight1337
Member

DarkLight1337 commented Apr 23, 2024

I think we can refer to get_config() in transformers_utils/config.py, but search the registered processors first and then fall back to AutoProcessor, so that get_processor() could be:

def get_processor(model: str,
                  model_type: str,
                  trust_remote_code: bool,
                  revision: Optional[str] = None,
                  code_revision: Optional[str] = None) -> ProcessorMixin:
    if model_type in _PROCESSOR_REGISTRY:
        processor_class = _PROCESSOR_REGISTRY[model_type]
        return processor_class.from_pretrained(model,
                                               revision=revision,
                                               code_revision=code_revision)
    try:
        return AutoProcessor.from_pretrained(
            model,
            trust_remote_code=trust_remote_code,
            revision=revision,
            code_revision=code_revision)
    except ValueError:
        # do something else
        raise

To be honest, I'm not a big fan of having to potentially add multiple files in different places* for each new model, but I guess that would work for now. Further down the line, we could consider adopting a more explicit interface for adding new models to vLLM.

*Currently, we have to add a new file in model_executor/models and possibly transformers_utils/configs. After adding multi-modal support, we also have to worry about transformers_utils/processors.

@TKONIY
Contributor

TKONIY commented Aug 23, 2024

Is anyone working on prefix caching for multimodal input? I just finished the video support and am planning to start working on prefix caching.

@ywang96
Member Author

ywang96 commented Aug 23, 2024

Is anyone working on prefix caching on multimodality input? I just finished the video support and planning to start working on prefix caching.

@TKONIY Not to my knowledge. Could you open an RFC issue describing the high-level design you have in mind, for discussion?

For multimodal inputs, there are actually two possible layers of caching:

  1. Cache embeddings for a previously seen image/audio/video, to avoid recomputation of the encoder & projector component
  2. Extend the current APC (automatic prefix caching) implementation to make it work with sequences that have multimodal data.

For 1, this is somewhat already addressed by #6613, since users then can implement their own embedding caching outside vLLM.
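
As a sketch of what such a user-side cache might look like, assuming embeddings can be handed to vLLM directly as the comment implies: the class below is hypothetical, keys entries by a hash of the raw media bytes, and uses fake_encoder as a stand-in for the real encoder and projector.

```python
import hashlib
from collections import OrderedDict
from typing import Callable, List


class EmbeddingCache:
    """LRU cache from raw media bytes to precomputed embeddings."""

    def __init__(self, encode: Callable[[bytes], List[float]],
                 max_items: int = 128) -> None:
        self._encode = encode
        self._max_items = max_items
        self._items: "OrderedDict[str, List[float]]" = OrderedDict()

    def get(self, raw: bytes) -> List[float]:
        key = hashlib.sha256(raw).hexdigest()
        if key in self._items:
            self._items.move_to_end(key)  # mark as most recently used
            return self._items[key]
        embedding = self._encode(raw)  # the expensive encoder + projector call
        self._items[key] = embedding
        if len(self._items) > self._max_items:
            self._items.popitem(last=False)  # evict the least recently used
        return embedding


calls: List[bytes] = []


def fake_encoder(raw: bytes) -> List[float]:
    calls.append(raw)
    return [float(len(raw))]


cache = EmbeddingCache(fake_encoder, max_items=2)
first = cache.get(b"image-a")
second = cache.get(b"image-a")  # served from the cache, encoder not called again
```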

For 2, you can read more about it here. The technical challenge is that each block of KV cache is currently identified uniquely by the token ids within the block. Multimodal data, however, is always represented by the same placeholder token id in the original sequence, so two different images produce identical token ids. I think if we're able to address this problem and make it work with embedding-based inputs, it would benefit vLLM in a broader scope if we eventually decide to support embeddings as input for LMs (i.e., #6869).
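
The identification problem can be illustrated with a toy block hasher. This is only a sketch under stated assumptions, not vLLM's actual APC implementation: `block_hashes`, the placeholder id `IMG`, and the per-position `mm_keys` (a content hash for each placeholder position, `None` for text) are all hypothetical.

```python
import hashlib
from typing import List, Optional, Sequence


def block_hashes(token_ids: Sequence[int], block_size: int,
                 mm_keys: Optional[Sequence[Optional[str]]] = None) -> List[str]:
    """Chained hash per full block; optionally mixes in per-position media keys."""
    hashes: List[str] = []
    prev = ""
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        end = start + block_size
        payload = [prev, list(token_ids[start:end])]
        if mm_keys is not None:
            # Extra identity for placeholder positions in this block.
            payload.append(list(mm_keys[start:end]))
        prev = hashlib.sha256(repr(payload).encode("utf-8")).hexdigest()
        hashes.append(prev)
    return hashes


IMG = 32000  # hypothetical placeholder token id
tokens = [1, 2, 3, 4, IMG, IMG, IMG, IMG]
keys_a = [None] * 4 + ["sha-of-image-a"] * 4
keys_b = [None] * 4 + ["sha-of-image-b"] * 4

token_only = block_hashes(tokens, 4)      # same for any image
with_a = block_hashes(tokens, 4, keys_a)  # image A
with_b = block_hashes(tokens, 4, keys_b)  # image B
```

With token ids alone, prompts containing different images hash to the same blocks. Mixing in a content key keeps the shared text prefix block identical while separating the blocks that cover the image.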

@TKONIY
Contributor

TKONIY commented Aug 23, 2024

Thanks for the introduction! As for the identifier, I will try to figure out a solution and open an RFC.

@ywang96
Member Author

ywang96 commented Sep 9, 2024

With the release of v0.6.0, it's a good time now to wrap up the recent work on multi-modality!

In the past two months, we have made tremendous progress in multi-modality! On behalf of the vLLM team, @DarkLight1337 and I would like to thank all the community members for their amazing contributions to this workstream! To summarize the update:

  • New modality - audio: vLLM now supports audio modality with Ultravox as its first supported audio LMM. Shoutout to @juberti @petersalas and the fixie.ai team for choosing vLLM to open source their model!
  • Tensor parallelism on ViT: Thanks to @ChristopherCho's contribution, vision encoders are now sharded when the VLM is deployed over multiple GPUs. This significantly improves space efficiency since ViT is no longer replicated on each GPU.
  • Multi-image inference: A highly requested feature from the community is inference with multi-image input. This is now supported for both offline and online inference with supported models. Kudos to @Isotr0py @zifeitong @petersalas for their help to enable multi-image/audio inference!
  • Image embeddings as input: For late-fusion, embedding-based LMMs, it sometimes makes sense to host the ViT on a separate host and have the LMM take image embeddings directly instead of PIL images. This is now supported for most multi-modal models on vLLM.
  • Model support expansion: As LMM development evolves rapidly, it's important to keep up with the pace, and vLLM now supports 11 LMMs! Props to @Isotr0py @HwwwwwwwH @jeejeelee @alex-jw-brooks for contributing InternVL2, MiniCPM-V and Qwen-VL!

We're also very excited about the upcoming video support with a dynamic number of frames (@TKONIY) and Qwen2-VL model support (@fyabc from the Qwen team), which will be available in the 0.6.1 release!

As usual, the roadmap for this workstream will be updated in the OP of this issue in the upcoming week. Feedback and contributions are always very welcome!

@PancakeAwesome

PancakeAwesome commented Sep 10, 2024

Multi-image/video support for Qwen2-VL and InternVL2, thank you!

@ywang96
Member Author

ywang96 commented Sep 15, 2024

A friendly bump that our roadmap for multimodality has been updated in the OP of this thread!

@xiezhipeng-git

xiezhipeng-git commented Oct 23, 2024

@ywang96
hiyouga/LLaMA-Factory#3645
The LLaMA-Factory web UI cannot open.
I also noticed that the last time I ran pip install vllm, it changed my torch build from GPU to CPU, so torch had to be reinstalled.

@DarkLight1337
Member

File "D:\my\env\python3.10.10\lib\site-packages\llmtuner\chat\vllm_engine.py", line 16, in <module>
from vllm.sequence import MultiModalData

Looks like an incompatibility in llmtuner, please report this issue on their repo instead.

@xiezhipeng-git

xiezhipeng-git commented Oct 24, 2024

  1. The llmtuner incompatibility is just one of the issues.
  2. When I run LLaMA-Factory directly from source, the llmtuner problem goes away, but a different issue appears. That one is a vLLM issue.
  3. The most serious issue is that pip install vllm (0.6.3) forces a reinstallation of the CPU version of torch, replacing the CUDA torch, on Windows.

Is it a vLLM error? 1: no, 2: yes, 3: yes.

@DarkLight1337

@DarkLight1337
Member

When I open it directly using the source code, there is no problem with llmtuner anymore, it becomes another issue. It's a VLLM issue.

Can you show the error in this case?

@xiezhipeng-git

cannot import name 'ImagePixelData' from 'vllm.multimodal.image' (d:\my\env\python3.10.10\lib\site-packages\vllm\multimodal\image.py)
File "D:\my\work\LLM\LLaMA-Factory\LLaMA-Factory\src\llamafactory\chat\vllm_engine.py", line 33, in <module>
from vllm.multimodal.image import ImagePixelData
File "D:\my\work\LLM\LLaMA-Factory\LLaMA-Factory\src\llamafactory\chat\chat_model.py", line 25, in <module>
from .vllm_engine import VllmEngine
File "D:\my\work\LLM\LLaMA-Factory\LLaMA-Factory\src\llamafactory\chat\__init__.py", line 16, in <module>
from .chat_model import ChatModel
File "D:\my\work\LLM\LLaMA-Factory\LLaMA-Factory\src\llamafactory\api\app.py", line 21, in <module>
from ..chat import ChatModel
File "D:\my\work\LLM\LLaMA-Factory\LLaMA-Factory\src\llamafactory\cli.py", line 22, in <module>
from .api.app import run_api
File "D:\my\work\LLM\LLaMA-Factory\LLaMA-Factory\src\llamafactory\__init__.py", line 17, in <module>
from .cli import VERSION
File "D:\my\work\LLM\LLaMA-Factory\LLaMA-Factory\src\train.py", line 15, in <module>
from llamafactory.train.tuner import run_exp
File "D:\my\env\python3.10.10\Lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
File "D:\my\env\python3.10.10\Lib\runpy.py", line 196, in _run_module_as_main (Current frame)
return _run_code(code, main_globals, None,
ImportError: cannot import name 'ImagePixelData' from 'vllm.multimodal.image' (d:\my\env\python3.10.10\lib\site-packages\vllm\multimodal\image.py)

@DarkLight1337
Member

DarkLight1337 commented Oct 24, 2024

I believe this is still some incompatibility issue, since ImagePixelData no longer exists in the current vLLM version. It was introduced in v0.5.0 and removed in v0.5.1, which was quite a while ago.

Instead, you should ask LLaMA-Factory to support newer versions of vLLM.
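
For downstream projects that must run against several library versions, one defensive option is to probe for a symbol before importing it. The helper below is a generic sketch using only the standard library; `has_symbol` is a hypothetical name, not something vLLM or LLaMA-Factory provides.

```python
import importlib


def has_symbol(module_name: str, symbol: str) -> bool:
    """Return True if module_name imports cleanly and exposes symbol."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # Covers both a missing package and a missing submodule.
        return False
    return hasattr(module, symbol)
```

A caller could check `has_symbol("vllm.multimodal.image", "ImagePixelData")` and select a legacy or current code path accordingly, rather than failing at import time as in the traceback above.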

@xiezhipeng-git

xiezhipeng-git commented Oct 24, 2024

That makes sense. LLaMA-Factory should be asked to stop using the ImagePixelData class. So the remaining problem is that pip install vllm (0.6.3) forces a reinstallation of the CPU version of torch, replacing the CUDA torch, on Windows only.

@DarkLight1337
Member

DarkLight1337 commented Oct 24, 2024

pip install vllm (0.6.3) will force a reinstallation of the CPU version torch and replace cuda torch on windows only.

I don't quite get what you mean, how can you have different versions of torch for CPU and GPU at the same time?

@xiezhipeng-git

xiezhipeng-git commented Oct 24, 2024

I only had the CUDA build of torch installed. This is the output:

 pip install vllm --no-deps
Collecting vllm
  Using cached vllm-0.6.3.post1.tar.gz (2.7 MB)
  Installing build dependencies ... error
  error: subprocess-exited-with-error

  × pip subprocess to install build dependencies did not run successfully.
  │ exit code: 2
  ╰─> [86 lines of output]
      Collecting cmake>=3.26
        Using cached cmake-3.30.5-py3-none-win_amd64.whl.metadata (6.4 kB)
      Collecting ninja
        Using cached ninja-1.11.1.1-py2.py3-none-win_amd64.whl.metadata (5.4 kB)

      Collecting packaging
        Using cached packaging-24.1-py3-none-any.whl.metadata (3.2 kB)
      Collecting setuptools>=61
        Using cached setuptools-75.2.0-py3-none-any.whl.metadata (6.9 kB)
      Collecting setuptools-scm>=8.0
        Using cached setuptools_scm-8.1.0-py3-none-any.whl.metadata (6.6 kB)
      Collecting torch==2.4.0
        Using cached torch-2.4.0-cp310-cp310-win_amd64.whl.metadata (27 kB)
      Collecting wheel
        Using cached wheel-0.44.0-py3-none-any.whl.metadata (2.3 kB)
      Collecting jinja2
        Using cached jinja2-3.1.4-py3-none-any.whl.metadata (2.6 kB)
      Collecting filelock (from torch==2.4.0)
        Using cached filelock-3.16.1-py3-none-any.whl.metadata (2.9 kB)
      Collecting typing-extensions>=4.8.0 (from torch==2.4.0)
        Using cached typing_extensions-4.12.2-py3-none-any.whl.metadata (3.0 kB)

      Collecting sympy (from torch==2.4.0)
        Using cached sympy-1.13.3-py3-none-any.whl.metadata (12 kB)
      Collecting networkx (from torch==2.4.0)
        Using cached networkx-3.4.2-py3-none-any.whl.metadata (6.3 kB)
      Collecting fsspec (from torch==2.4.0)
        Using cached fsspec-2024.10.0-py3-none-any.whl.metadata (11 kB)
      Collecting tomli>=1 (from setuptools-scm>=8.0)
        Using cached tomli-2.0.2-py3-none-any.whl.metadata (10.0 kB)
      Collecting MarkupSafe>=2.0 (from jinja2)
        Using cached MarkupSafe-3.0.2-cp310-cp310-win_amd64.whl.metadata (4.1 kB
)
      Collecting mpmath<1.4,>=1.1.0 (from sympy->torch==2.4.0)
        Using cached mpmath-1.3.0-py3-none-any.whl.metadata (8.6 kB)
      Downloading torch-2.4.0-cp310-cp310-win_amd64.whl (197.9 MB)
                                                  3.9/197.9 MB 21.3 kB/s eta 2:3
1:31
      ERROR: Exception:
      Traceback (most recent call last):
        File "D:\my\env\python3.10.10\Lib\site-packages\pip\_vendor\urllib3\resp
onse.py", line 438, in _error_catcher
          yield
        File "D:\my\env\python3.10.10\Lib\site-packages\pip\_vendor\urllib3\resp
onse.py", line 561, in read
          data = self._fp_read(amt) if not fp_closed else b""
        File "D:\my\env\python3.10.10\Lib\site-packages\pip\_vendor\urllib3\resp
onse.py", line 527, in _fp_read
          return self._fp.read(amt) if amt is not None else self._fp.read()
        File "D:\my\env\python3.10.10\Lib\site-packages\pip\_vendor\cachecontrol
\filewrapper.py", line 98, in read
          data: bytes = self.__fp.read(amt)
        File "D:\my\env\python3.10.10\lib\http\client.py", line 465, in read
          s = self.fp.read(amt)
        File "D:\my\env\python3.10.10\lib\socket.py", line 705, in readinto
          return self._sock.recv_into(b)
        File "D:\my\env\python3.10.10\lib\ssl.py", line 1274, in recv_into
          return self.read(nbytes, buffer)
        File "D:\my\env\python3.10.10\lib\ssl.py", line 1130, in read
          return self._sslobj.read(len, buffer)
      TimeoutError: The read operation timed out
     
      During handling of the above exception, another exception occurred:
     
      Traceback (most recent call last):
        File "D:\my\env\python3.10.10\Lib\site-packages\pip\_internal\cli\base_c
ommand.py", line 105, in _run_wrapper
          status = _inner_run()
        File "D:\my\env\python3.10.10\Lib\site-packages\pip\_internal\cli\base_c
ommand.py", line 96, in _inner_run
          return self.run(options, args)
        File "D:\my\env\python3.10.10\Lib\site-packages\pip\_internal\cli\req_co
mmand.py", line 67, in wrapper
          return func(self, options, args)
        File "D:\my\env\python3.10.10\Lib\site-packages\pip\_internal\commands\i
nstall.py", line 379, in run
          requirement_set = resolver.resolve(
        File "D:\my\env\python3.10.10\Lib\site-packages\pip\_internal\resolution
\resolvelib\resolver.py", line 179, in resolve
          self.factory.preparer.prepare_linked_requirements_more(reqs)
        File "D:\my\env\python3.10.10\Lib\site-packages\pip\_internal\operations
\prepare.py", line 554, in prepare_linked_requirements_more
          self._complete_partial_requirements(
        File "D:\my\env\python3.10.10\Lib\site-packages\pip\_internal\operations
\prepare.py", line 469, in _complete_partial_requirements
          for link, (filepath, _) in batch_download:
        File "D:\my\env\python3.10.10\Lib\site-packages\pip\_internal\network\do
wnload.py", line 184, in __call__
          for chunk in chunks:
        File "D:\my\env\python3.10.10\Lib\site-packages\pip\_internal\cli\progre
ss_bars.py", line 55, in _rich_progress_bar
          for chunk in iterable:
        File "D:\my\env\python3.10.10\Lib\site-packages\pip\_internal\network\ut
ils.py", line 65, in response_chunks
          for chunk in response.raw.stream(
        File "D:\my\env\python3.10.10\Lib\site-packages\pip\_vendor\urllib3\resp
onse.py", line 622, in stream
          data = self.read(amt=amt, decode_content=decode_content)
        File "D:\my\env\python3.10.10\Lib\site-packages\pip\_vendor\urllib3\resp
onse.py", line 560, in read
          with self._error_catcher():
        File "D:\my\env\python3.10.10\lib\contextlib.py", line 153, in __exit__
          self.gen.throw(typ, value, traceback)
        File "D:\my\env\python3.10.10\Lib\site-packages\pip\_vendor\urllib3\resp
onse.py", line 443, in _error_catcher
          raise ReadTimeoutError(self._pool, None, "Read timed out.")
      pip._vendor.urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host=
'files.pythonhosted.org', port=443): Read timed out.
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem wit
h pip.
error: subprocess-exited-with-error

× pip subprocess to install build dependencies did not run successfully.
│ exit code: 2
╰─> See above for output.

If your internet connection is poor, you are lucky: the install fails while it is forcibly replacing the CUDA torch with the CPU build. If your connection is good, things get much worse: your torch is replaced by a lower-version CPU build.
Both pip install vllm --no-deps and pip install vllm have the same issue.

@DarkLight1337
Member

What is your original version of pytorch?

@xiezhipeng-git

xiezhipeng-git commented Oct 24, 2024

@DarkLight1337 My torch version is 2.5.0+cu124.
I had the same problem with a torch 2.4.x version; I forgot the exact version.
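
One quick way to tell which build is installed is to look at the local version label, the part after `+` in a string such as 2.5.0+cu124. The sketch below assumes that `cu<digits>` marks a CUDA build and `cpu` marks a CPU-only build; `build_flavor` and `current_torch_flavor` are hypothetical helper names.

```python
import re
from typing import Optional


def build_flavor(version: str) -> str:
    """Classify a torch version string by its local label (text after '+')."""
    _, _, local = version.partition("+")
    if local == "cpu":
        return "cpu"
    if re.fullmatch(r"cu\d+", local):
        return "cuda"
    return "unknown"  # no label, or one this sketch does not recognize


def current_torch_flavor() -> Optional[str]:
    """Flavor of the installed torch, or None if torch is not importable."""
    try:
        import torch
    except ImportError:
        return None
    return build_flavor(torch.__version__)
```

Running `current_torch_flavor()` before and after an install would show whether the build was swapped.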

@DarkLight1337
Member

Can you raise this in a new issue (with installation tag) so we can better focus on this?

@Wiselnn570

Wiselnn570 commented Nov 4, 2024

@ywang96 @DarkLight1337 Great job! I'm a newcomer here, and I'm using this framework primarily to validate the extrapolation capabilities of large multimodal models, specifically when the context exceeds 32k tokens. However, I have run into an issue while modifying the positional encoding in the mrope_input_positions section of the Qwen2-VL code, and I have not been able to resolve it. In short, I'm aiming to explore the model's performance when extrapolating to a 60k context on the Qwen2-VL 7B model, using video data for testing. I tried replacing this section:

MRotaryEmbedding.get_input_positions(

with vanilla RoPE (that is, placing image, video, and text tokens all on the main diagonal of M-RoPE), which raised the maximum value of mrope_input_positions to approximately 59k, but it eventually led to an error:

../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [8320,0,0], thread: [64,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [8320,0,0], thread: [65,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [8320,0,0], thread: [66,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [8320,0,0], thread: [67,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [8320,0,0], thread: [68,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [8320,0,0], thread: [69,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
...
INFO 11-03 18:58:15 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241103-185815.pkl...
WARNING 11-03 18:58:15 model_runner_base.py:143] Failed to pickle inputs of failed execution: CUDA error: device-side assert triggered
WARNING 11-03 18:58:15 model_runner_base.py:143] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
WARNING 11-03 18:58:15 model_runner_base.py:143] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
WARNING 11-03 18:58:15 model_runner_base.py:143] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
WARNING 11-03 18:58:15 model_runner_base.py:143] 
RuntimeError: Error in model execution: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

I have already tested the original M-RoPE, which produces correct output with a 60k context; there, the maximum mrope_input_positions value is around 300. So I am wondering whether the position values are now too large and exceed the bounds of some index. How should I modify the code to support vanilla RoPE (or perhaps some other 3D positional encoding whose position values are quite large) for evaluation?

P.S. I noticed that this function:

def _compute_multi_modal_input(self, inter_data: InterDataForSeqGroup,

was called several times before running inference on my video test data, and I'm wondering if this might be related.

Thanks!
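
One possible explanation for the index-out-of-bounds assertion, not confirmed in this thread: rotary embedding implementations commonly precompute a cos/sin table with one row per position up to some maximum, then index that table with the position ids. The pure-Python sketch below shows that failure mode in miniature; `build_cos_sin_cache`, `lookup`, and the sizes used are illustrative assumptions, not vLLM code.

```python
import math
from typing import List, Sequence


def build_cos_sin_cache(max_positions: int, rotary_dim: int,
                        base: float = 10000.0) -> List[List[float]]:
    """Precompute one [cos..., sin...] row for each position below max_positions."""
    inv_freq = [base ** (-2 * i / rotary_dim) for i in range(rotary_dim // 2)]
    cache = []
    for pos in range(max_positions):
        angles = [pos * f for f in inv_freq]
        cache.append([math.cos(a) for a in angles] + [math.sin(a) for a in angles])
    return cache


def lookup(cache: List[List[float]], positions: Sequence[int]) -> List[List[float]]:
    """Gather rows by position id, failing loudly when a position has no row."""
    largest = max(positions)
    if largest >= len(cache):
        raise IndexError(f"position {largest} exceeds cache size {len(cache)}")
    return [cache[p] for p in positions]


cache = build_cos_sin_cache(max_positions=64, rotary_dim=8)
in_range = lookup(cache, [0, 10, 63])  # small position ids fit in the table
try:
    lookup(cache, [0, 10, 120])        # a large position id has no row
    overflowed = False
except IndexError:
    overflowed = True
```

If this is what happens here, position ids near 300 would fit in the table while ids near 59k would not, and the table would need to be built for a larger maximum position.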

@ywang96
Member Author

ywang96 commented Nov 4, 2024

Hi there! @Wiselnn570 Thanks for using vLLM and I would suggest you open a separate issue for your question as it seems related to Qwen2VL in particular. Thanks for your understanding!

@Wiselnn570

@ywang96 Thanks for your reply. The issue is here: #9965

@ywang96
Member Author

ywang96 commented Nov 19, 2024

Friendly bump to the thread: It's been a while since our last update, and we have just planned out the roadmap with items we will work on in the next few months. Please check it out in the OP of this issue!

As always, feedback and contributions are very welcome!

@ywang96 ywang96 changed the title from "[RFC]: Multi-modality Support Refactoring" to "[RFC]: Multi-modality Support on vLLM" on Dec 24, 2024