Hope for an early release of the Qwen2VL-based UGround-v1.1 #6
Comments
Yes, really nice work! Is there an approximate date for the release of UGround-v1.1?
Hi all, let me provide an update on the timeline for the release plan here. Apologies for the delayed updates. I have been extremely busy and quite unwell after returning from EMNLP (I caught a severe cold, and it's still quite bad). I know many are waiting for things like the data scripts, the Qwen2-VL-based UGround, and the newer UGround. I will be working intensively on these this week, along with improving the SeeAct codebase, and I aim to gradually release all of them this week and next. Thank you for your patience.
Sounds awesome, ty for the quick response!
V1 uploaded: https://huggingface.co/osunlp/UGround-V1-7B
Main results: [results figure omitted]
Version 1.1 still requires some time, because there have been many excellent dataset works released recently, and we are in the process of integrating them and conducting some ablation studies. We hope that the Qwen2-VL-based UGround will make inference and training more convenient for everyone.
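Since the released checkpoint is Qwen2-VL-based, inference should go through the standard Qwen2-VL pipeline in `transformers`. Below is a minimal sketch under that assumption; the checkpoint name comes from the link above, but the prompt wording and the coordinate output format are illustrative rather than the official UGround convention.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# Checkpoint from the link above; assumed to load through the standard Qwen2-VL classes.
MODEL_ID = "osunlp/UGround-V1-7B"

model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

screenshot = Image.open("screenshot.png")      # any GUI screenshot
instruction = "the 'Sign in' button"           # natural-language element description

# Illustrative prompt; the official UGround prompt/output convention may differ.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": f"Locate {instruction} and answer with its (x, y) coordinates."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[screenshot], return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)

# Drop the prompt tokens so only the generated answer is decoded.
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)  # e.g. "(1234, 87)"; the exact output format depends on the model's training
```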
@boyugou What is the point of training VLMs with >7B parameters? Just to get a small gain on the benchmarks? Grounding models basically have a single task to achieve, namely providing x, y coordinates, so I think it is far more interesting to train or develop sub-1B-parameter models; otherwise the price-to-performance ratio is quite poor. Check out the following model: https://huggingface.co/Samsung/TinyClick. I haven't evaluated it yet, but 271M parameters is the way to go!
@dprokhorov17 Great question! Thanks for raising it; I'm glad to have some discussion here. Let me also share another interesting work I saw in this field: https://openreview.net/forum?id=M9iky9Ruhx&nesting=2&sort=date-desc
Firstly, after UGround there has been extensive discussion about modular design (SeeAct-V) versus end-to-end (e2e) models for GUI agents, and I've been approached by both academic researchers and industry professionals about it. Many projects have demonstrated the success of SeeAct-V-like designs. In general, building a generalist e2e GUI agent model likely requires a large model, on the order of ~100B parameters (consider the success of Aguvis-72B as an example). To support this, we aim to provide robust GUI grounding foundation models across various sizes (e.g., 2B, 7B, 72B), enabling the community, and ourselves, to build innovative applications on top of them. An e2e agent model is one representative example, akin to how Salesforce and Shanghai AI Lab trained Aguvis and OS-Atlas. By the way, if you have tried the new UGround-V1 models, you may find they have not substantially lost general capabilities, which is awesome!

On the other hand, I agree that smaller models hold significant value, particularly for on-device deployment and modular systems like SeeAct-V. Intuitively, grounding is a highly specific task and appears to be less challenging than planning; this insight was a major motivation behind the UGround project. However, results such as ShowUI (2B), UGround (2B), and UGround (7B) show that the 7B model still significantly outperforms the 2B model (also check the new results on ScreenSpot-Pro). The 7B model seems to strike an excellent balance between performance and size.

So far, I feel that GUI grounding (essentially GUI understanding + GUI REC/localization) is still challenging for very small models, due to numerous hard, long-tail cases that are difficult to generalize to, even for strong base models like Qwen2-VL-7B. That said, I am excited to see the community's efforts toward developing more powerful small models!
@boyugou Sure, building an end-to-end GUI agent is quite different from pure x, y-coordinate grounding. I completely agree that a model with 72B parameters or even larger would likely be necessary to achieve reasonable performance. Let's see how this evolves in the future!
@boyugou Thanks for the active discussion here. Any thoughts on grounding in action space, e.g. on trajectories?
@abrichr Would you please clarify the term "grounding in action space, e.g. on trajectories"? |
By grounding in action space, I mean grounding in sequences of actions instead of in x/y coordinates. For example, see https://ariaui.github.io/ and https://arxiv.org/abs/2412.09605: [quoted excerpts omitted]
Interleaved text-image action history makes sense to me. I actually wanted to try that; it intuitively brings several benefits.
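For concreteness, here is a minimal sketch of how such an interleaved text-image action history could be laid out as chat-style turns for a Qwen2-VL-style model. The field names, action strings, and file paths are hypothetical and purely for illustration; neither UGround nor Aria-UI prescribes this exact format.

```python
# Hypothetical layout of an interleaved text-image action history as chat-style turns.
# Field names, action strings, and file paths are illustrative only.

task = "Change the display resolution to 1920x1080"
trajectory = [
    {"screenshot": "step_0.png", "action": "CLICK (412, 230)"},   # opened the settings menu
    {"screenshot": "step_1.png", "action": "TYPE 'display'"},     # searched for the setting
]
current_screenshot = "step_2.png"

messages = [{"role": "system", "content": [{"type": "text", "text": f"Task: {task}"}]}]

# Each past step contributes the screenshot it acted on (user turn)
# and the action taken on it (assistant turn).
for step in trajectory:
    messages.append({"role": "user", "content": [{"type": "image", "image": step["screenshot"]}]})
    messages.append({"role": "assistant", "content": [{"type": "text", "text": step["action"]}]})

# Current observation: the model should ground the next action on this frame,
# conditioned on the full visual trajectory rather than a single screenshot.
messages.append({
    "role": "user",
    "content": [
        {"type": "image", "image": current_screenshot},
        {"type": "text", "text": "What is the next action?"},
    ],
})
```

The intuition is that each past screenshot/action pair becomes a user/assistant turn, so the next grounded action is predicted from the full visual trajectory rather than a single frame.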
@abrichr @boyugou Do you have any idea why most open-source VLMs larger than 70B aren't trained on web, mobile, and desktop GUI data from the ground up? There are plenty of open-source datasets of reasonable size out there, yet we still need models specifically trained on GUI data... Why not incorporate this data into, e.g., stage 2 of the model's training?
good job!