LLMExtractionStrategy Extracting Irrelevant Data from Infinite Scrolling Pages #386

ergosumdre · 2024-12-29T01:36:24Z

I'm using LLMExtractionStrategy to extract relevant information about a specific topic, and while it generally performs well, I've noticed an issue when checking random URLs. In many cases, the extracted data has no connection to the actual webpage content.

I suspect this problem arises from websites with infinite scrolling. These sites dynamically load related threads or additional content as you scroll, which seems to be interfering with the extraction process.

Is there a way to address this issue? For example, can the extraction process be adjusted to avoid capturing dynamically loaded content?

unclecode · 2024-12-30T12:09:57Z

@ergosumdre This is an interesting use case. I'd love to see your code snippet and test it myself. In general, you can avoid this at the crawling stage: you can inject some JavaScript to prevent dynamically loaded content, or set hooks to disable network requests after a certain point. I can't be sure until I see the code and the URL, though.

Regarding the LLM extraction: it takes whatever markdown the crawl produces and passes that content to the language model as-is, so we should handle this before the content reaches the model. There are ways to fix it on the language model side, but I'm confident we can address it with heuristic approaches. If you give me a few more examples, I can offer some ideas.
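As a rough illustration of the "handle it before the language model" idea, here is a minimal sketch of one possible heuristic: cut the generated markdown at the first heading that looks like it introduces related or recommended content. This is not part of Crawl4AI; `trim_markdown`, `CUTOFF_PATTERNS`, and the phrases in it are hypothetical names and guesses that would need tuning per site.

```python
import re

# Hypothetical heading phrases that often introduce dynamically loaded extras.
# These are assumptions for illustration; adjust them for the sites you crawl.
CUTOFF_PATTERNS = [
    r"related (threads|posts|articles)",
    r"you may also like",
    r"more from",
    r"recommended",
]

# Matches ATX-style markdown headings such as "## Related threads".
_HEADING = re.compile(r"^\s{0,3}#{1,6}\s+(.*)$")


def trim_markdown(markdown, patterns=CUTOFF_PATTERNS, max_chars=None):
    """Keep only the markdown that precedes the first 'related content' heading.

    Optionally cap the result at max_chars as a crude second guard against
    pages that keep appending content while scrolling.
    """
    cutoff = re.compile("|".join(patterns), re.IGNORECASE)
    kept = []
    for line in markdown.splitlines():
        match = _HEADING.match(line)
        if match and cutoff.search(match.group(1)):
            break  # everything after this heading is treated as unrelated
        kept.append(line)
    trimmed = "\n".join(kept).rstrip()
    if max_chars is not None:
        trimmed = trimmed[:max_chars]
    return trimmed


if __name__ == "__main__":
    sample = "# Main thread\n\nActual post.\n\n## Related threads\n\nUnrelated stuff."
    print(trim_markdown(sample))
```

The trimmed string is what you would then hand to the extraction step in place of the full page markdown. A heading-based cutoff only helps when the extra content is introduced by a recognizable heading; pages that append items with no marker would need a different signal.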

unclecode self-assigned this Dec 30, 2024

unclecode added the question Further information is requested label Dec 30, 2024
