Crawl4AI / docs /md_v2 /advanced /identity_based_crawling.md
amaye15
test
03c0888
|
raw
history blame
7 kB

Preserve Your Identity with Crawl4AI

Crawl4AI empowers you to navigate and interact with the web using your authentic digital identity, ensuring that you are recognized as a human and not mistaken for a bot. This document introduces Managed Browsers, the recommended approach for preserving your rights to access the web, and Magic Mode, a simplified solution for specific scenarios.


Managed Browsers: Your Digital Identity Solution

Managed Browsers enable developers to create and use persistent browser profiles. These profiles store local storage, cookies, and other session-related data, allowing you to interact with websites as a recognized user. By leveraging your unique identity, Managed Browsers ensure that your experience reflects your rights as a human browsing the web.

Why Use Managed Browsers?

  1. Authentic Browsing Experience: Managed Browsers retain session data and browser fingerprints, mirroring genuine user behavior.
  2. Effortless Configuration: Once you interact with the site using the browser (e.g., solving a CAPTCHA), the session data is saved and reused, providing seamless access.
  3. Empowered Data Access: By using your identity, Managed Browsers empower users to access data they can view on their own screens without artificial restrictions.

Steps to Use Managed Browsers

  1. Setup the Browser Configuration:

    from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
    from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
    
    browser_config = BrowserConfig(
        headless=False,  # Set to False for initial setup to view browser actions
        verbose=True,
        user_agent_mode="random",
        use_managed_browser=True,  # Enables persistent browser sessions
        browser_type="chromium",
        user_data_dir="/path/to/user_profile_data"  # Path to save session data
    )
    
  2. Perform an Initial Run:

    • Run the crawler with headless=False.
    • Manually interact with the site (e.g., solve CAPTCHA or log in).
    • The browser session saves cookies, local storage, and other required data.
  3. Subsequent Runs:

    • Switch to headless=True for automation.
    • The session data is reused, allowing seamless crawling.

Example: Extracting Data Using Managed Browsers

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def main():
    # Define schema for structured data extraction
    schema = {
        "name": "Example Data",
        "baseSelector": "div.example",
        "fields": [
            {"name": "title", "selector": "h1", "type": "text"},
            {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
        ]
    }

    # Configure crawler
    browser_config = BrowserConfig(
        headless=True,  # Automate subsequent runs
        verbose=True,
        use_managed_browser=True,
        user_data_dir="/path/to/user_profile_data"
    )

    crawl_config = CrawlerRunConfig(
        extraction_strategy=JsonCssExtractionStrategy(schema),
        wait_for="css:div.example"  # Wait for the targeted element to load
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=crawl_config
        )

        if result.success:
            print("Extracted Data:", result.extracted_content)

if __name__ == "__main__":
    asyncio.run(main())

Benefits of Managed Browsers Over Other Methods

Managed Browsers eliminate the need for manual detection workarounds by enabling developers to work directly with their identity and user profile data. This approach ensures maximum compatibility with websites and simplifies the crawling process while preserving your right to access data freely.


Magic Mode: Simplified Automation

While Managed Browsers are the preferred approach, Magic Mode provides an alternative for scenarios where persistent user profiles are unnecessary or infeasible. Magic Mode automates user-like behavior and simplifies configuration.

What Magic Mode Does:

  • Simulates human browsing by randomizing interaction patterns and timing.
  • Masks browser automation signals.
  • Handles cookie popups and modals.
  • Modifies navigator properties for enhanced compatibility.

Using Magic Mode

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url="https://example.com",
        magic=True  # Enables all automation features
    )

Magic Mode is particularly useful for:

  • Quick prototyping when a Managed Browser setup is not available.
  • Basic sites requiring minimal interaction or configuration.

Example: Combining Magic Mode with Additional Options

async def crawl_with_magic_mode(url: str):
    async with AsyncWebCrawler(headless=True) as crawler:
        result = await crawler.arun(
            url=url,
            magic=True,
            remove_overlay_elements=True,  # Remove popups/modals
            page_timeout=60000            # Increased timeout for complex pages
        )

        return result.markdown if result.success else None

Magic Mode vs. Managed Browsers

While Magic Mode simplifies many tasks, it cannot match the reliability and authenticity of Managed Browsers. By using your identity and persistent profiles, Managed Browsers render Magic Mode largely unnecessary. However, Magic Mode remains a viable fallback for specific situations where user identity is not a factor.


Key Comparison: Managed Browsers vs. Magic Mode

Feature Managed Browsers Magic Mode
Session Persistence Retains cookies and local storage. No session retention.
Human Interaction Uses real user profiles and data. Simulates human-like patterns.
Complex Sites Best suited for heavily configured sites. Works well with simpler challenges.
Setup Complexity Requires initial manual interaction. Fully automated, one-line setup.

Recommendation:

  • Use Managed Browsers for reliable, session-based crawling and data extraction.
  • Use Magic Mode for quick prototyping or when persistent profiles are not required.

Conclusion

  • Use Managed Browsers to preserve your digital identity and ensure reliable, identity-based crawling with persistent sessions. This approach works seamlessly for even the most complex websites.
  • Leverage Magic Mode for quick automation or in scenarios where persistent user profiles are not needed.

By combining these approaches, Crawl4AI provides unparalleled flexibility and capability for your crawling needs.