
How to use crawler.crawl to full-page scrolling? #406

Open
helenatthais opened this issue Jan 3, 2025 · 5 comments

Comments

@helenatthais commented Jan 3, 2025

Despite the simulated full-page scrolling feature released with version 0.4.1, I'm struggling to make it work because I'm still not sure where to call the crawler.crawl function. The docs (https://crawl4ai.com/mkdocs/blog/releases/0.4.1/) cite the following example:

await crawler.crawl(
    url="https://example.com",
    scan_full_page=True,  # Enables scrolling
    scroll_delay=0.2      # Waits 200ms between scrolls (optional)
)
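For context, the code later in this thread passes the same flags through CrawlerRunConfig to crawler.arun rather than calling crawler.crawl directly; a minimal sketch along those lines (the URL is a placeholder):

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    # scan_full_page and scroll_delay go on the run config, mirroring
    # the blog example's keyword arguments
    config = CrawlerRunConfig(
        scan_full_page=True,  # enables full-page scrolling
        scroll_delay=0.2,     # 200 ms between scrolls (optional)
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)
        print(result.markdown)

asyncio.run(main())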

@TheCutestCat (Contributor) commented Jan 3, 2025

@helenatthais I have fixed this problem with this PR, and here is a discussion about some screenshot parameters that were not mentioned: link

@helenatthais (Author) commented Jan 3, 2025

I tried to execute the code from the referenced PR, but the full-page scrolling feature still doesn't work:

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig

async def main():
    # Configure the browser settings
    browser_config = BrowserConfig(headless=False, verbose=True)

    # Set run configurations, including cache mode and the full-page flags
    crawl_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        screenshot=True,
        # Set these two flags
        scan_full_page=True,
        wait_for_images=True,
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url='https://www.nytimes.com/ca/',
            config=crawl_config,
        )

if __name__ == "__main__":
    asyncio.run(main())

@TheCutestCat (Contributor)

@helenatthais Hi, could you please provide more details about your setup and how you're running the code? I've tested it in my local environment and everything seems to work fine.

One possible cause of the issue might be that the original crawl4ai package is still installed. Could you check if that's the case?
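A quick way to confirm which build is actually being imported (a sketch, assuming the package exposes __version__ as recent releases do; the printed path and version will vary by environment):

import crawl4ai

print(crawl4ai.__version__)  # should report 0.4.1 or later
print(crawl4ai.__file__)     # shows which installed copy is on the path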

@helenatthais (Author) commented Jan 3, 2025

Sure, I installed crawl4ai with pip install crawl4ai and recently upgraded it with the --upgrade flag. I'm trying to run the following code to scrape Google Maps reviews:

import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode, BrowserConfig, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
import json

async def main():
    browser_config = BrowserConfig(headless=False, verbose=True)
    
    crawl_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        screenshot=False,
        scan_full_page=True,
        js_code="window.scrollTo(0, document.body.scrollHeight);",
        scroll_delay=2000,
        css_selector="div.GHT2ce.NsCY4, span.wiI7pd",
        exclude_external_links=True,
        exclude_social_media_links=True,
        exclude_external_images=True,
        simulate_user=True
    )
        
    async with AsyncWebCrawler(verbose=True, config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://www.google.com.br/maps/place/Dra+Regina+C%C3%A9lia+de+Aquino+Barbosa/@-22.8744795,-43.3429393,17z/data=!4m8!3m7!1s0x9962d7809bdfe3:0x9871497b1081f14e!8m2!3d-22.8744795!4d-43.3403644!9m1!1b1!16s%2Fg%2F1wf2320v?entry=ttu&g_ep=EgoyMDI0MTIxMS4wIKXMDSoASAFQAw%3D%3D",
            config=crawl_config
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
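One thing worth double-checking against the release-note example quoted at the top of this thread: there scroll_delay=0.2 means 200 ms, i.e. the unit appears to be seconds, so scroll_delay=2000 would pause roughly 33 minutes per scroll step. A sketch of what is likely intended:

from crawl4ai import CacheMode, CrawlerRunConfig

# scroll_delay is interpreted in seconds (0.2 == 200 ms per the release notes),
# so a 2-second pause between scrolls would be written as 2.0, not 2000
crawl_config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    scan_full_page=True,
    scroll_delay=2.0,
)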

@TheCutestCat (Contributor)

@helenatthais I understand. This is because my PR hasn't been merged into the main branch yet. You can either:

1. Wait for the new version of crawl4ai (which should be available soon), or
2. Use the modified original code (though this is a bit more complex) by implementing the changes shown in PR #403.
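If you want to try the unmerged changes without waiting for a release, one option is installing straight from the pull-request ref (a sketch, assuming the upstream repository is unclecode/crawl4ai and GitHub's standard refs/pull/<n>/head refs):

pip install "git+https://github.com/unclecode/crawl4ai.git@refs/pull/403/head"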
