LLMExtractionStrategy Extracting Irrelevant Data from Infinite Scrolling Pages #386

Open
ergosumdre opened this issue Dec 29, 2024 · 1 comment
Assignees
Labels
question Further information is requested

Comments

@ergosumdre

I'm using LLMExtractionStrategy to extract relevant information about a specific topic, and while it generally performs well, I've noticed an issue when checking random URLs. In many cases, the extracted data has no connection to the actual webpage content.

I suspect this problem arises from websites with infinite scrolling. These sites dynamically load related threads or additional content as you scroll, which seems to be interfering with the extraction process.

Is there a way to address this issue? For example, can the extraction process be adjusted to avoid capturing dynamically loaded content?

@unclecode unclecode self-assigned this Dec 30, 2024
@unclecode unclecode added the question Further information is requested label Dec 30, 2024
@unclecode
Owner

@ergosumdre This is an interesting use case. I'd like to see your code snippet and test it myself. In general, you can avoid this at the crawling stage: you can inject JavaScript to stop the page from loading more content dynamically, or set hooks to block network requests past a certain point. I can't be sure until I see the code and the URL, though. As for the LLM extraction, it takes whatever markdown the crawl generates and passes it to the language model as-is, so the unrelated content should be filtered out before that step. There are ways to fix this on the language-model side, but I'm confident we can address it with heuristic approaches. If you share a few more examples, I can offer some ideas.
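
As a rough illustration of the heuristic idea, here is a minimal sketch that trims the generated markdown before it reaches the language model. It is not part of the library: the function name `trim_markdown`, the `STOP_HEADINGS` patterns, and the `max_chars` cap are all hypothetical and would need tuning for the sites being crawled. It assumes the dynamically loaded material shows up under a heading such as "Related threads".

```python
import re

# Hypothetical heading patterns that often introduce lazily loaded,
# unrelated sections. Adjust these for the sites you crawl.
STOP_HEADINGS = re.compile(
    r"^#{1,6}\s*(related|recommended|more from|you may also like|similar threads)\b",
    re.IGNORECASE,
)


def trim_markdown(markdown: str, max_chars: int = 20000) -> str:
    """Keep text up to the first 'related content' heading, capped at max_chars."""
    kept = []
    for line in markdown.splitlines():
        # Stop at the first heading that looks like a related-content section.
        if STOP_HEADINGS.match(line.strip()):
            break
        kept.append(line)
    # The length cap is a second guard for pages that keep growing
    # without ever emitting a recognizable heading.
    return "\n".join(kept)[:max_chars].rstrip()
```

The trimmed string, rather than the raw markdown, would then be handed to the extraction step. A heading-based cutoff is only one possible heuristic; it does nothing for pages that append content without any heading, which is why the length cap is there as a fallback.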
