-
-
Notifications
You must be signed in to change notification settings - Fork 94
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Extract web text directly instead of OCR #51
Comments
Seconding the ask for a Motivations section that discusses when to use this in lieu of parsing the DOM. |
Thats a good question. Would be curious to see their approaches and performance is like. For us, it's very important to contain as much of the visual structure of the page as possible. This includes positions of the text on the 2D plane. Using just the HTML and skipping the actual rendering of the page, you lose a lot of this information. We need this because a) we want our agents to reason about and take actions on the page just as we would, and b) because visibility of elements on screen is required for automation frameworks to actually take actions (you cannot "click" on elements that don't actually appear on the page) For example, suppose you had a scrollable container element containing 10 child elements total, with 5 elements overflowing and requiring scrolling the parent container to view. I would imagine the other approaches would display the overflowed elements in the ultimate representation, while we want to avoid doing this (Because if an agent were to try and click on these elements, it would cause an element_not_found error) Hope this makes sense, happy to elaborate further @eshoyuan. (And apologies for the late response) If @will-holley or anyone wants to add this to the README, happy to take a PR! |
I agree with the motivation. One issue I see with the approach is when there are images embedded in a webpage that contain text but are not really actionable by the LLM. E.g., https://app.sequence-erp.com/login, the only thing that matters are the login elements on the left hand side, but the OCR algo fails at recognizing this:
|
@tvatter the thing is that there are some cases where the image information might actually be important. Perhaps those cases are far and few between though What's happening specifically is that the smaller text of the image hurts the rendering of the page. We could maybe pass in an option to ignore images in the OCR (turning them invisible before taking a screenshot) |
For this example, how does the agent know that this element is scrollable? For example if it needed to scroll down to find a specific element. Have you found that agents can handle this case? I'd imagine it necessary to add a scrollable tag for OCR. Separately what do you mean by |
hey @craigmulligan yeah a scrollable tag would be of interest. We should make a ticket! playwright typically will require the element be visible on screen or somewhat clickable (at least with defaults) |
I'm working on something pretty similar to what you guys are doing and had a thought. Why not grab text directly from the web instead of using OCR? Langchain and llamaindex both have such tools, and there are also some repos about converting html to markdown.
Just a thought. Would love to know what you think!
The text was updated successfully, but these errors were encountered: