
Inference hardware requirements #2

Open
abrichr opened this issue Oct 10, 2024 · 8 comments

abrichr commented Oct 10, 2024

Hello, and thank you for the excellent work!

In the paper it says:

The first stage takes about 50 hours on a single 4x NVIDIA A100 machine (global batch size 128 with gradient
accumulation). And for the large scale GUI data training, we use 112 NVIDIA H100 GPUs and finish the
training in about 6 hours (global batch size 448).

Can you please clarify what the inference-time hardware requirements are? Any chance of running this on CPU?

Thanks again!

boyugou (Collaborator) commented Oct 10, 2024

Overall, it's built on LLaVA with slight adaptations (mainly to the input image processing), so it's definitely possible to run it on CPU (take Ollama as a reference). I remember 4-bit LLaVA running very smoothly on my laptop.
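
For example, if you create a local Ollama model from a quantized GGUF of the checkpoint (UGround itself is not in the Ollama library, so the model name below is only a placeholder), screenshots can be sent as base64 through Ollama's /api/generate endpoint. A rough Python sketch:

```python
# Rough sketch: CPU inference through a local Ollama server.
# Assumes a local Ollama model named "uground" was created from a quantized
# GGUF via a Modelfile -- the name and file paths here are placeholders.
import base64
import json
import urllib.request

with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

payload = {
    "model": "uground",                      # placeholder local model name
    "prompt": 'Where is the "Submit" button?',
    "images": [image_b64],                   # Ollama accepts base64-encoded images
    "stream": False,
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```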

@dprokhorov17

@abrichr I'm running this within a Docker container and my memory footprint is the following:

[screenshot of container memory usage]

It's running at bf16 precision with cached=True.
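
For reference, a minimal sketch of what that setup looks like in plain transformers (the llava-hf checkpoint below is only a stand-in for the actual weights, and use_cache=True is the generate() flag I mean by cached=True):

```python
# Minimal sketch: bf16 inference with KV caching, using a Hub LLaVA checkpoint
# as a stand-in for the actual model weights.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"        # stand-in checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,              # bf16 halves memory vs. fp32
    device_map="auto",
)

image = Image.open("screenshot.png")
prompt = "USER: <image>\nWhere is the search box? ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(
    model.device, torch.bfloat16
)
out = model.generate(**inputs, max_new_tokens=64, use_cache=True)  # KV cache on
print(processor.decode(out[0], skip_special_tokens=True))
```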

@GuoHaoren

Hi, I am running the inference script "single_infer.py" on my Mac with CPU only, and it is very slow. Are there any suggestions on configuration settings for "load_pretrained_model", "tokenizer_image_token", and "model.generate()"?
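
For context, here is a rough sketch of the CPU setup I am trying (the helper signatures are assumed from the upstream LLaVA codebase that this repo builds on, and may differ slightly here):

```python
# Rough CPU-oriented sketch using LLaVA-style helpers (signatures assumed from
# the upstream LLaVA codebase; they may differ slightly in this repo).
import torch
from PIL import Image
from llava.model.builder import load_pretrained_model
from llava.mm_utils import process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX

torch.set_num_threads(8)  # match physical cores; oversubscription slows CPU inference

tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path="osunlp/UGround",   # assumed Hub path of the LLaVA-based checkpoint
    model_base=None,
    model_name="UGround",
    load_4bit=False,               # bitsandbytes 4/8-bit generally requires a GPU
    device="cpu",
    device_map="cpu",
)
model = model.float()              # the builder defaults to fp16, which is slow on CPU

image = Image.open("screenshot.png").convert("RGB")
image_tensor = process_images([image], image_processor, model.config).float()

prompt = "USER: <image>\nWhere is the login button? ASSISTANT:"
input_ids = tokenizer_image_token(
    prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt"
).unsqueeze(0)

with torch.inference_mode():
    out = model.generate(
        input_ids,
        images=image_tensor,
        do_sample=False,           # greedy decoding, no sampling overhead
        max_new_tokens=32,         # grounding outputs are short; cap generation
        use_cache=True,
    )
print(tokenizer.decode(out[0], skip_special_tokens=True))
```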

boyugou (Collaborator) commented Jan 4, 2025

@GuoHaoren @abrichr @dprokhorov17

We have trained and released a stronger yet smaller 2B model:

osunlp/UGround-V1-2B

I'm still trying to work out the best way to run it on CPU.

Maybe via quantization, e.g. GGUF (https://huggingface.co/mradermacher/UGround-V1-2B-GGUF) or AWQ/GPTQ, which are suggested for Qwen2-VL.

I tried the GGUF one, which runs pretty fast on my 16GB MacBook (with LM Studio), but so far I have no idea how to handle image inputs.
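
In case it helps anyone else: image inputs with a GGUF normally go through llama.cpp's multimodal path, which needs a separate mmproj (vision projector) GGUF next to the language-model GGUF; if a quant ships without the mmproj file, the vision side is simply missing, which would explain image inputs not working in LM Studio. A rough llama-cpp-python sketch of what that looks like for the LLaVA-based 7B (file names are placeholders; the Qwen2-VL-based 2B needs newer llama.cpp vision support):

```python
# Rough sketch: multimodal GGUF inference on CPU via llama-cpp-python.
# Assumes both a language-model GGUF and a matching mmproj (vision projector)
# GGUF are available locally -- the file names below are placeholders.
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

chat_handler = Llava15ChatHandler(clip_model_path="uground-mmproj.gguf")
llm = Llama(
    model_path="uground-7b-q4_k_m.gguf",  # placeholder quantized LM file
    chat_handler=chat_handler,
    n_ctx=4096,        # leave room for image tokens
    n_gpu_layers=0,    # pure CPU
)

resp = llm.create_chat_completion(
    messages=[{
        "role": "user",
        "content": [
            # a base64 "data:" URI also works here
            {"type": "image_url", "image_url": {"url": "file:///abs/path/screenshot.png"}},
            {"type": "text", "text": 'Where is the "Sign in" button?'},
        ],
    }],
    max_tokens=64,
)
print(resp["choices"][0]["message"]["content"])
```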

dprokhorov17 commented Jan 6, 2025

@boyugou And still, you didn't provide any fine-tuning/training scripts, and most importantly the data needed to reproduce your results.

boyugou (Collaborator) commented Jan 6, 2025

> @boyugou And still, you didn't provide any fine-tuning/training scripts, and most importantly the data needed to reproduce your results.

Let me clarify a little bit. Here are the main training scripts and code from the previous training:

Pretrain:
https://github.com/boyugou/llava_uground/blob/90ff02d24c3f8c7a9fb5c90050fa003b0512910f/scripts/ui_v1/pretrain_7b.sh

SFT:
https://github.com/boyugou/llava_uground/blob/90ff02d24c3f8c7a9fb5c90050fa003b0512910f/scripts/ui_v1/finetune_task_lora.sh

You will likely need to change the dataloader logic a little bit, since I assumed streaming data from a Parquet file on S3 and mistakenly deleted the naive implementation built on top of the original LLaVA train.py:
https://github.com/boyugou/llava_uground/blob/90ff02d24c3f8c7a9fb5c90050fa003b0512910f/llava/train/train_s3.py

The Qwen2-VL-based models were trained on MosaicML's infrastructure using an in-house codebase that I cannot share with you, but I will share the YAML files with the details of the hyper-parameters used. (I think the critical ones are: lr=1e-6, max_pixels=1344*1344, and ~1.5 epochs.)

The data used for the above two is exactly the same.
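
For reference, here is a rough sketch of how those hyper-parameters map onto the open-source Qwen2-VL stack (our actual runs used MosaicML's internal trainer, so treat every name below as illustrative rather than the exact recipe):

```python
# Rough sketch of the hyper-parameters mentioned above, expressed against the
# open-source Qwen2-VL / transformers stack (illustrative only; the actual
# training used an internal MosaicML codebase).
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration, TrainingArguments

model_id = "Qwen/Qwen2-VL-2B-Instruct"  # assumed base model for UGround-V1-2B

# max_pixels caps the resized image area before patching, which bounds the
# number of visual tokens generated per screenshot.
processor = AutoProcessor.from_pretrained(model_id, max_pixels=1344 * 1344)
model = Qwen2VLForConditionalGeneration.from_pretrained(model_id)

args = TrainingArguments(
    output_dir="uground-v1-2b-sft",    # placeholder
    learning_rate=1e-6,                # lr mentioned above
    num_train_epochs=1.5,              # ~1.5 epochs
    per_device_train_batch_size=1,     # placeholder; choose to hit your global batch size
    gradient_accumulation_steps=8,     # placeholder
    bf16=True,
)
```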

boyugou (Collaborator) commented Jan 6, 2025

> @boyugou And still, you didn't provide any fine-tuning/training scripts, and most importantly the data needed to reproduce your results.

Regarding the data, please give me a bit more time; I will release the code I used first. For big companies, it should be easy to collect better data than what I used (Web-Hybrid) with the same pipeline, by using:

  • a better webpage URL list (I randomly sampled from Common Crawl)
  • a better captioning MLLM and rewriting LM
  • a larger scale

I know I have been asked a lot about the data, especially in another issue. Sorry for the delay. For the raw data, fair use, copyright, and potentially harmful content are our main concerns. Hope you can understand.

Overall, the Qwen-based UGround-V1 models are not the only things in our release plan. The data and training code are still planned and will be released soon (along with a bunch of other stuff).

boyugou (Collaborator) commented Jan 6, 2025

@dprokhorov17

Do the above answers address your questions? I hope they do, and I will have everything ready for everyone soon.

If you have any urgent projects or need any resources, feel free to contact me directly via email. I will do my best to assist you. For some reason, GitHub does not seem to push every message to me via email.
