Inference hardware requirements #2
Comments
Overall, it's built on LLaVA with slight adaptations (mainly to the input image processing), so it's definitely possible to run on CPU (take Ollama as a reference). I remember 4-bit LLaVA ran very smoothly on my laptop.
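For reference, a minimal sketch of CPU inference through Ollama's Python client, assuming the stock `llava` model that ships with Ollama (the model name and prompt are placeholders, not this repo's release):

```python
# pip install ollama  -- requires a running Ollama server, which serves
# quantized GGUF weights on CPU out of the box
import ollama

# Ask the vision model a grounding-style question about a screenshot.
# "llava" is Ollama's stock LLaVA build, used here purely as an example.
response = ollama.chat(
    model="llava",
    messages=[{
        "role": "user",
        "content": "Where is the search button in this screenshot?",
        "images": ["screenshot.png"],  # local image path
    }],
)
print(response["message"]["content"])
```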
@abrichr I'm running this within a Docker container, and my memory footprint is the following: It's running on
Hi, I am running inference with "single_infer.py" on my Mac with CPU only. It is very slow. Are there any suggestions on configuration settings for "load_pretrained_model", "tokenizer_image_token", and "model.generate()"?
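Not an official recommendation, but a sketch of CPU-oriented settings, assuming the LLaVA-style builder this repo adapts; the model path and name below are placeholders, and your checkpoint's prompt format may differ:

```python
import torch
from PIL import Image
from llava.model.builder import load_pretrained_model
from llava.mm_utils import tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN

# Load on CPU explicitly. fp32 is the safe default there (the builder loads
# fp16 by default, which is slow/unsupported for many CPU ops), and the
# load_4bit/load_8bit paths rely on CUDA, so they stay off.
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path="osunlp/UGround",   # hypothetical path; point at your checkpoint
    model_base=None,
    model_name="llava_uground",    # hypothetical name
    device="cpu",
)
model = model.float().eval()

image = Image.open("screenshot.png").convert("RGB")
image_tensor = image_processor(image, return_tensors="pt")["pixel_values"].float()

prompt = DEFAULT_IMAGE_TOKEN + "\nWhere is the search button?"
input_ids = tokenizer_image_token(
    prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt"
).unsqueeze(0)

# Greedy decoding with a small token budget keeps CPU latency tolerable.
with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensor,
        do_sample=False,
        max_new_tokens=64,
    )
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```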
@GuoHaoren @abrichr @dprokhorov17 We have trained and released a stronger yet smaller 2B model: I'm still trying to figure out the best way to run it on CPU, maybe via quantization, like the GGUF one I tried, which runs pretty fast on my 16GB MacBook (with LM Studio). But so far I have no idea how to handle image inputs.
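For the image-input question with a GGUF checkpoint, one possible route (not something confirmed for this model) is llama-cpp-python's LLaVA chat handler, which pairs the language GGUF with a separate mmproj (CLIP projector) GGUF. Both file names below are placeholders:

```python
# pip install llama-cpp-python
import base64
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

def image_to_data_uri(path: str) -> str:
    """Encode a local image as a data URI, which the chat handler accepts."""
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()

# The vision tower lives in a separate "mmproj" GGUF; both paths are hypothetical.
chat_handler = Llava15ChatHandler(clip_model_path="mmproj-model-f16.gguf")
llm = Llama(
    model_path="uground-2b-q4_k_m.gguf",  # hypothetical quantized checkpoint
    chat_handler=chat_handler,
    n_ctx=4096,  # image tokens consume context, so leave headroom
)

response = llm.create_chat_completion(
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": image_to_data_uri("screenshot.png")}},
            {"type": "text", "text": "Where is the search button?"},
        ],
    }]
)
print(response["choices"][0]["message"]["content"])
```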
@boyugou And still, you didn't provide any finetuning/training script and, most importantly, the data needed to reproduce your results.
Let me clarify a little bit. Here are the main training script and code for the previous training: You will likely need to change the dataloader logic a little bit, as I assumed a parquet file streamed from S3 and mistakenly deleted the naive implementation built on top of the original LLaVA train.py.

The Qwen2-VL-based models were trained on Mosaic ML infrastructure, using an in-house codebase that I cannot share. But I will share the yaml files with the details of the hyper-parameters used (I think the critical ones are lr=1e-6, max_pixels=1344*1344, and ~1.5 epochs). The data used for the above two is exactly the same.
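Since the naive dataloader was deleted, here is a rough sketch of what streaming a parquet file from S3 could look like with Hugging Face datasets; the bucket path and record layout are hypothetical, and the original implementation may have differed:

```python
# pip install datasets s3fs
from datasets import load_dataset

# Stream the parquet shards straight from S3 instead of downloading them;
# the bucket path below is a placeholder.
dataset = load_dataset(
    "parquet",
    data_files="s3://my-bucket/uground-train/*.parquet",
    split="train",
    streaming=True,
)

for example in dataset.take(2):
    # Each record would carry the screenshot plus the grounding conversation.
    print(example.keys())
```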
Regarding the data, please give me a bit more time; I will release the code I used first. For big companies, it should be easy to collect better data than what I used (Web-Hybrid) with the same pipeline, by using:
I know I have been asked a lot about the data, especially in another issue. Sorry for the delay. For the raw data, fair use, copyright, and potentially harmful content are our main concerns. Hope you can understand. Overall, the Qwen-based UGround-V1 is not the only thing in our release plan; the rest is still planned and will be released soon (with a bunch of other stuff).
Do the above answers address your question? I hope to have everything ready for everyone soon. If you have any urgent projects or need any resources, feel free to contact me directly via email; I will do my best to assist you. For unknown reasons, GitHub does not seem to push every message to me via email.
Hello, and thank you for the excellent work!
In the paper it says:
Can you please clarify what the inference-time hardware requirements are? Any chance of running this on CPU?
Thanks again!