I'm using LLMExtractionStrategy to extract relevant information about a specific topic, and while it generally performs well, I've noticed an issue when checking random URLs. In many cases, the extracted data has no connection to the actual webpage content.
I suspect this problem arises from websites with infinite scrolling. These sites dynamically load related threads or additional content as you scroll, which seems to be interfering with the extraction process.
Is there a way to address this issue? For example, can the extraction process be adjusted to avoid capturing dynamically loaded content?
@ergosumdre This is a very interesting use case. I would love to see your code snippet and test it myself. In general, you can avoid this kind of issue at the crawling stage: you can inject some JavaScript to prevent dynamically loaded content from appearing, or set hooks that block further network requests past a certain point. That said, I can't be sure until I see the code and the URL. As for the LLM extraction, it simply takes whatever markdown the crawl produces and passes it to the language model, so we should deal with the extra content before it reaches the model. There are ways to fix this on the language model side, but I'm confident we can address it with heuristic approaches. If you can share a few more examples, I can offer some ideas.
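As a rough illustration of the heuristic idea, here is a minimal sketch that trims the generated markdown before it is handed to the language model. It is plain Python and does not use any Crawl4AI API. The function name `trim_trailing_sections` and the heading phrases in `TRAILING_SECTION_PATTERNS` are hypothetical and would need tuning for the sites you actually crawl. It assumes that infinite-scroll content tends to appear under a heading such as "Related threads".

```python
import re

# Hypothetical heading phrases that often introduce lazily loaded sections.
# These are illustrative assumptions, not part of any library.
TRAILING_SECTION_PATTERNS = [
    r"related (threads|posts|articles|topics)",
    r"you (may|might) also like",
    r"recommended( for you)?",
]

# Matches an ATX-style markdown heading and captures its text.
_HEADING_RE = re.compile(r"^\s{0,3}#{1,6}\s+(.*)$")


def trim_trailing_sections(markdown: str, patterns=TRAILING_SECTION_PATTERNS) -> str:
    """Cut the markdown at the first heading whose text matches a pattern."""
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    kept = []
    for line in markdown.splitlines():
        match = _HEADING_RE.match(line)
        if match and any(p.search(match.group(1)) for p in compiled):
            # Everything from this heading onward is treated as unrelated.
            break
        kept.append(line)
    return "\n".join(kept).rstrip()


if __name__ == "__main__":
    page = (
        "# Main topic\n\nBody text.\n\n"
        "## Related Threads\n\n- Something unrelated\n"
    )
    print(trim_trailing_sections(page))
```

The trimmed string would then be what gets passed to the extraction step instead of the full page markdown. A heading-based cut is only one possible heuristic; pages that append content without any heading would need a different signal, such as a length cap or a CSS selector applied during the crawl.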