Hope for an early release of the Qwen2VL-based UGround-v1.1 #6
Comments
Yes, really nice work! Is there an approximate date for the release of UGround-v1.1?
Hi all, let me provide an update on the timeline for the release plan here. Apologies for the delayed updates. I have been extremely busy and quite unwell after returning from EMNLP (I caught a severe cold, and it's still quite bad). I know many are waiting for things like the data scripts, the Qwen2-VL-based UGround, and the newer UGround. I will be working intensively on these this week, along with improving the SeeAct codebase, and I aim to gradually release all of them this week and next. Thank you for your patience.
Sounds awesome, ty for the quick response!
V1 uploaded: https://huggingface.co/osunlp/UGround-V1-7B
Main results: [results figure omitted]
Version 1.1 still requires some time, because there have been many excellent dataset works released recently, and we are in the process of integrating them and conducting some ablation studies. We hope that the Qwen2-VL-based UGround will make inference and training more convenient for everyone.
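Since the released checkpoint is Qwen2-VL-based, inference should go through the standard Qwen2-VL pipeline in `transformers`. Below is a minimal sketch under that assumption; the checkpoint name comes from the link above, but the prompt wording and the coordinate output format are illustrative rather than the official UGround convention.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# Checkpoint from the link above; assumed to load through the standard Qwen2-VL classes.
MODEL_ID = "osunlp/UGround-V1-7B"

model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

screenshot = Image.open("screenshot.png")      # any GUI screenshot
instruction = "the 'Sign in' button"           # natural-language element description

# Illustrative prompt; the official UGround prompt/output convention may differ.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": f"Locate {instruction} and answer with its (x, y) coordinates."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[screenshot], return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)

# Drop the prompt tokens so only the generated answer is decoded.
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)  # e.g. "(1234, 87)"; the exact output format depends on the model's training
```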
@boyugou What is the point of training VLMs with >7B parameters? Just to get a small gain on the benchmarks? Grounding models basically have a single task to achieve, namely providing x, y coordinates, so I think it is far more interesting to train or develop sub-1B-parameter models; otherwise the price-to-performance ratio is quite poor. Check out the following model: https://huggingface.co/Samsung/TinyClick. I haven't evaluated it yet, but 271M parameters is the way to go!
@dprokhorov17 Great question! Thanks for raising it; I'm glad to have some discussion here. Let me also share another interesting work I saw in this field: https://openreview.net/forum?id=M9iky9Ruhx&nesting=2&sort=date-desc
Firstly, after UGround there has been extensive discussion about modular design (SeeAct-V) versus end-to-end (e2e) models for GUI agents, and I've been approached by both academic researchers and industry professionals about it. Many projects have demonstrated the success of SeeAct-V-like designs. In general, building a generalist e2e GUI agent model likely requires a large model, on the order of ~100B parameters (consider the success of Aguvis-72B as an example). To support this, we aim to provide robust GUI grounding foundation models across various sizes (e.g., 2B, 7B, 72B), enabling the community, and ourselves, to build innovative applications on top of them. An e2e agent model is one representative example, akin to how Salesforce and Shanghai AI Lab trained Aguvis and OS-Atlas. By the way, if you have tried the new UGround-V1 models, you may find they have not substantially lost general capabilities, which is awesome!

On the other hand, I agree that smaller models hold significant value, particularly for on-device deployment and modular systems like SeeAct-V. Intuitively, grounding is a highly specific task and appears to be less challenging than planning; this insight was a major motivation behind the UGround project. However, results such as ShowUI (2B), UGround (2B), and UGround (7B) show that the 7B model still significantly outperforms the 2B model (also check the new results on ScreenSpot-Pro). The 7B model seems to strike an excellent balance between performance and size.

So far, I feel that GUI grounding (essentially GUI understanding + GUI REC/localization) is still challenging for very small models, due to numerous hard, long-tail cases that are difficult to generalize to, even for strong base models like Qwen2-VL-7B. That said, I am excited to see the community's efforts toward developing more powerful small models!
@boyugou Sure, building an end-to-end GUI agent is quite different from pure x, y-coordinate grounding. I completely agree that a model with 72B parameters or even larger would likely be necessary to achieve reasonable performance. Let's see how this evolves in the future!
@boyugou Thanks for the active discussion here. Any thoughts on grounding in action space, e.g. on trajectories?
@abrichr Would you please clarify the term "grounding in action space, e.g. on trajectories"? |
By grounding in action space, I mean grounding in sequences of actions instead of in x/y coordinates. For example, see https://ariaui.github.io/ and https://arxiv.org/abs/2412.09605: [quoted excerpts omitted]
Interleaved text-image action history makes sense to me. I actually wanted to try that; it intuitively brings several benefits.
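For concreteness, here is a minimal sketch of how such an interleaved text-image action history could be laid out as chat-style turns for a Qwen2-VL-style model. The field names, action strings, and file paths are hypothetical and purely for illustration; neither UGround nor Aria-UI prescribes this exact format.

```python
# Hypothetical layout of an interleaved text-image action history as chat-style turns.
# Field names, action strings, and file paths are illustrative only.

task = "Change the display resolution to 1920x1080"
trajectory = [
    {"screenshot": "step_0.png", "action": "CLICK (412, 230)"},   # opened the settings menu
    {"screenshot": "step_1.png", "action": "TYPE 'display'"},     # searched for the setting
]
current_screenshot = "step_2.png"

messages = [{"role": "system", "content": [{"type": "text", "text": f"Task: {task}"}]}]

# Each past step contributes the screenshot it acted on (user turn)
# and the action taken on it (assistant turn).
for step in trajectory:
    messages.append({"role": "user", "content": [{"type": "image", "image": step["screenshot"]}]})
    messages.append({"role": "assistant", "content": [{"type": "text", "text": step["action"]}]})

# Current observation: the model should ground the next action on this frame,
# conditioned on the full visual trajectory rather than a single screenshot.
messages.append({
    "role": "user",
    "content": [
        {"type": "image", "image": current_screenshot},
        {"type": "text", "text": "What is the next action?"},
    ],
})
```

The intuition is that each past screenshot/action pair becomes a user/assistant turn, so the next grounded action is predicted from the full visual trajectory rather than a single frame.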
@abrichr @boyugou Do you have any idea why most open-source VLMs larger than 70B aren't trained on web, mobile, and desktop GUI data from the ground up? There are plenty of open-source datasets of reasonable size out there, yet we still need models specifically trained on GUI data... Why not incorporate this data into, e.g., stage 2 of the model's training?
good job!