
Hope Qwen2VL-based UGround-v1.1 opens early #6

Open
shuqingjinse opened this issue Nov 5, 2024 · 14 comments

@shuqingjinse

Good job!

@korbinian-hoermann

Yes, really nice work! Is there an approximate date for the release of UGround-v1.1?

@boyugou
Collaborator

boyugou commented Nov 25, 2024

Hi All,

Let me provide an update on the timeline for the release plan here.

Apologies for the delayed updates. I have been extremely busy and quite unwell since returning from EMNLP (I caught a severe cold, and it's still pretty bad). I know many of you are waiting for things like the data scripts, Qwen2-VL-based UGround, and the newer UGround. I will be working intensively on these this week, along with improving the SeeAct codebase, and I aim to gradually release all of them over this week and next.

Thank you for your patience.

@korbinian-hoermann

@korbinian-hoermann

Sounds awesome, ty for the quick response!

@boyugou
Collaborator

boyugou commented Jan 3, 2025

V1 uploaded.

https://huggingface.co/osunlp/UGround-V1-7B
https://huggingface.co/osunlp/UGround-V1-2B

Main Results:

ScreenSpot (Standard)

| Model | Arch | SFT data | Mobile-Text | Mobile-Icon | Desktop-Text | Desktop-Icon | Web-Text | Web-Icon | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen-VL | Qwen-VL |  | 9.5 | 4.8 | 5.7 | 5.0 | 3.5 | 2.4 | 5.2 |
| CogAgent | CogAgent | CogAgent | 67 | 24 | 74.2 | 20 | 70.4 | 28.6 | 47.4 |
| SeeClick | Qwen-VL | SeeClick | 78.0 | 52.0 | 72.2 | 30.0 | 55.7 | 32.5 | 53.4 |
| Qwen-GUI | Qwen-VL | GUICourse | 52.4 | 10.9 | 45.9 | 5.7 | 43.0 | 13.6 | 28.6 |
| UGround-V1 (Qwen-VL) | Qwen-VL | Web-Hybrid | 68.5 | 28.4 | 69.6 | 34.3 | 63.5 | 39.3 | 50.6 |
| UGround-V1 | LLaVA-UGround-V1 | UGround-V1 | 82.8 | 60.3 | 82.5 | 63.6 | 80.4 | 70.4 | 73.3 |
| Qwen2-VL | Qwen2-VL |  | 61.3 | 39.3 | 52.0 | 45.0 | 33.0 | 21.8 | 42.1 |
| Aguvis-G-7B | Qwen2-VL | Aguvis-Stage-1 | 88.3 | 78.2 | 88.1 | 70.7 | 85.7 | 74.8 | 81.0 |
| Aguvis-7B | Qwen2-VL | Aguvis-Stage-1&2 | 95.6 | 77.7 | 93.8 | 67.1 | 88.3 | 75.2 | 83.0 |
| OS-Atlas-Base-4B | InternVL | OS-Atlas | 85.7 | 58.5 | 72.2 | 45.7 | 82.6 | 63.1 | 68.0 |
| OS-Atlas-Base-7B | Qwen2-VL | OS-Atlas | 93.0 | 72.9 | 91.8 | 62.9 | 90.9 | 74.3 | 81.0 |
| ShowUI-G | ShowUI | ShowUI | 91.6 | 69.0 | 81.8 | 59.0 | 83.0 | 65.5 | 75.0 |
| ShowUI | ShowUI | ShowUI | 92.3 | 75.5 | 76.3 | 61.1 | 81.7 | 63.6 | 75.1 |
| Iris | Iris | SeeClick | 85.3 | 64.2 | 86.7 | 57.5 | 82.6 | 71.2 | 74.6 |
| Aria-UI | Aria | Aria-UI | 92.3 | 73.8 | 93.3 | 64.3 | 86.5 | 76.2 | 81.1 |
| UGround-V1-2B (Qwen2-VL) | Qwen2-VL | UGround-V1 | 89.4 | 72.0 | 88.7 | 65.7 | 81.3 | 68.9 | 77.7 |
| UGround-V1-7B (Qwen2-VL) | Qwen2-VL | UGround-V1 | 93.0 | 79.9 | 93.8 | 76.4 | 90.9 | 84.0 | 86.3 |

@boyugou
Collaborator

boyugou commented Jan 3, 2025

Version 1.1 still requires some time: there have been many excellent dataset works released recently, and we are in the process of integrating them and conducting some ablation studies.

We hope that Qwen2-VL-based UGround can make inference and training more convenient for everyone.
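
For anyone who wants to try the Qwen2-VL-based checkpoints right away, below is a minimal inference sketch, assuming the model loads through the standard Qwen2-VL classes in recent transformers; the prompt wording, the screenshot path, and the coordinate output format are placeholders/assumptions here, so please check the model cards for the officially recommended usage.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

MODEL_ID = "osunlp/UGround-V1-7B"  # or "osunlp/UGround-V1-2B"

# Assumes the checkpoint follows the standard Qwen2-VL interface in transformers.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Placeholder screenshot and grounding instruction.
image = Image.open("screenshot.png")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": 'Where is the "Sign in" button?'},
    ],
}]

prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=64)
# Decode only the newly generated tokens (drop the prompt).
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)  # should contain the predicted target coordinates
```

Note that how the predicted coordinates are scaled (absolute pixels vs. relative to a resized image) is model-specific, so verify the convention before wiring this into a click executor.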

@dprokhorov17

@boyugou What is the point of training VLMs with >7B parameters? Just to get a small gain on the benchmarks? I mean, grounding models basically have a single task to achieve, namely providing x, y coordinates, so I think it is far more interesting to train or develop sub-1B-parameter models; otherwise the price-to-performance ratio is quite poor. Check out the following model: https://huggingface.co/Samsung/TinyClick. I haven't evaluated it yet, but 271M parameters is the way to go!

@boyugou
Collaborator

boyugou commented Jan 6, 2025

@dprokhorov17 Great question! Thanks for raising it, and I'm glad to have some discussion here.

Let me also share another interesting work I saw in this field: https://openreview.net/forum?id=M9iky9Ruhx&nesting=2&sort=date-desc


@boyugou
Collaborator

boyugou commented Jan 6, 2025

@dprokhorov17

Firstly, since UGround there has been extensive discussion about modular designs (SeeAct-V) versus end-to-end (e2e) models for GUI agents, and I've been approached by both academic researchers and industry professionals about it. Many projects have demonstrated the success of SeeAct-V-like designs.

In general, building a generalist e2e GUI agent model likely requires a large model, roughly 100B parameters (consider the success of Aguvis-72B as an example). To support this, we aim to provide robust GUI grounding foundation models across various sizes (e.g., 2B, 7B, 72B), enabling the community, and ourselves, to build innovative applications on top of them. An e2e agent model is one representative example, akin to how Salesforce and Shanghai AI Lab trained Aguvis and OS-Atlas. By the way, if you have tried the new UGround-V1 models, you may find they have not substantially lost general capabilities, which is awesome!

On the other hand, I agree that smaller models hold significant value, particularly for on-device deployment and for modular systems like SeeAct-V. Intuitively, grounding is a highly specific task and appears to be less challenging than planning; this insight was a major motivation behind our UGround project. However, results from ShowUI (2B), UGround (2B), and UGround (7B) show that the 7B model still significantly outperforms the 2B model (also check the new results on ScreenSpot-Pro). The 7B model seems to strike an excellent balance between performance and size.

By the way, my feeling so far is that GUI grounding (essentially GUI understanding plus GUI REC/localization) is still challenging for very small models, because of the many hard, long-tail cases that are difficult to generalize to, even for strong base models like Qwen2-VL-7B. That said, I am excited to see the community's efforts toward developing more powerful small models!
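
To make the modular vs. end-to-end distinction above a bit more concrete, here is a rough, hypothetical sketch of a SeeAct-V-style loop where planning and grounding are separate components; the `planner`, `grounder`, and `controller` objects and their methods are illustrative placeholders, not part of any released API.

```python
# Hypothetical sketch of one step of a SeeAct-V-style modular agent.
def run_step(task, screenshot, history, planner, grounder, controller):
    # 1. Planning: a general-purpose (M)LLM decides what to do next and
    #    describes the target element in natural language,
    #    e.g. 'the "Submit" button in the top-right corner'.
    element_description = planner.next_action(task, screenshot, history)

    # 2. Grounding: a specialized model (e.g., UGround) maps that textual
    #    description to (x, y) coordinates on the current screenshot.
    x, y = grounder.ground(screenshot, element_description)

    # 3. Acting: the controller performs the click and captures the new state.
    new_screenshot = controller.click(x, y)

    history.append((element_description, (x, y)))
    return new_screenshot
```

An end-to-end agent model would instead fold steps 1 and 2 into a single model call that emits the action directly.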

@dprokhorov17

@boyugou Sure, building an end-to-end GUI agent is quite different from pure x, y-coordinate grounding. I completely agree that a model with 72B parameters or even more would likely be necessary to achieve reasonable performance. Let's see how this evolves in the future!

@abrichr

abrichr commented Jan 6, 2025

@boyugou thanks for the active discussion here. Any thoughts on grounding in action space, e.g. on trajectories?

@dprokhorov17

@abrichr Would you please clarify the term "grounding in action space, e.g. on trajectories"?

@abrichr

abrichr commented Jan 7, 2025

By grounding in action space, I mean instead of grounding in x/y coordinates, grounding in sequences of actions.

For example, from https://ariaui.github.io/:

> To handle dynamic contexts in task performing, Aria-UI incorporates textual and text-image interleaved action histories, enabling robust context-aware reasoning for grounding.

And from https://arxiv.org/abs/2412.09605:

> We demonstrate that training GUI agents with these synthesized trajectories significantly improves their grounding and planning performance over the current models.

@boyugou
Collaborator

boyugou commented Jan 7, 2025

> By grounding in action space, I mean instead of grounding in x/y coordinates, grounding in sequences of actions.
>
> For example, from https://ariaui.github.io/:
>
> > To handle dynamic contexts in task performing, Aria-UI incorporates textual and text-image interleaved action histories, enabling robust context-aware reasoning for grounding.
>
> And from https://arxiv.org/abs/2412.09605:
>
> > We demonstrate that training GUI agents with these synthesized trajectories significantly improves their grounding and planning performance over the current models.

Interleaved text-image action history makes sense to me; I actually wanted to try that. It intuitively brings several benefits.
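
For illustration only, an interleaved text-image action history might be packed into a chat-style multimodal message list along these lines; this structure is an assumption for discussion, not the format Aria-UI or UGround actually uses.

```python
# Illustrative sketch (assumed format): an interleaved text-image action
# history for a grounding/agent model that accepts chat-style multimodal input.
history_messages = [
    {"role": "user", "content": [
        {"type": "text", "text": "Task: book a one-way flight to Tokyo."},
        {"type": "image"},  # screenshot before step 1
    ]},
    {"role": "assistant", "content": [
        {"type": "text", "text": 'Clicked "Flights" at (412, 88).'},
    ]},
    {"role": "user", "content": [
        {"type": "image"},  # screenshot after step 1
        {"type": "text", "text": 'Next, ground the "One-way" toggle.'},
    ]},
]
```

The idea is that each past step contributes both a textual action record and the screenshot it acted on, so the model can reason about what has already changed before grounding the next target.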

@dprokhorov17

@abrichr @boyugou Do you guys have any idea why most open-source VLMs larger than 70B aren't trained on web, mobile, and desktop GUI data from the ground up? I mean, there are plenty of open-source datasets of reasonable size out there, yet we still need models specifically trained on GUI data... Why not incorporate this data into, say, stage 2 of the model's training?
