04-Scraping

Scraping Architecture

Overview

The scraping system extracts content from web pages using multiple providers with automatic fallback. It handles regular web pages and JavaScript-rendered content, can route requests through premium proxies, and covers special cases such as Twitter/X posts.
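
The "multiple providers with automatic fallback" idea can be illustrated with a minimal sketch. This is not the system's actual code: the `Provider` signature, the `scrape_with_fallback` name, and the 200-character threshold for "insufficient content" are all assumptions made for illustration.

```python
from typing import Callable, Sequence

# Hypothetical provider signature: takes a URL, returns extracted text or raises.
Provider = Callable[[str], str]

# Assumed threshold below which content counts as "insufficient".
MIN_CONTENT_CHARS = 200


def scrape_with_fallback(
    url: str,
    providers: Sequence[tuple[str, Provider]],
    min_chars: int = MIN_CONTENT_CHARS,
) -> tuple[str, str]:
    """Try each provider in order and return (provider_name, text) from the
    first one that yields enough content. Raise if every provider fails."""
    errors: list[str] = []
    for name, provider in providers:
        try:
            text = provider(url)
        except Exception as exc:  # one failing provider must not abort the chain
            errors.append(f"{name}: {exc}")
            continue
        stripped_len = len(text.strip())
        if stripped_len >= min_chars:
            return name, text
        errors.append(f"{name}: content too short ({stripped_len} chars)")
    raise RuntimeError(f"all providers failed for {url}: " + "; ".join(errors))
```

Passing the providers as an ordered list keeps the tier order a matter of configuration, so a caller could place a different provider first without touching the loop.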

Scraping Providers

URL Input
    ↓
┌─────────────────────────────────────────────────────────────────┐
│  TIER 1: ScrapingBee (Default)                                  │
│  ├── Standard proxy                                             │
│  ├── Optional JS rendering (first retry)                        │
│  └── Optional Premium proxy (second+ retry)                     │
│       ↓ If fails or content insufficient                        │
├─────────────────────────────────────────────────────────────────┤
│  TIER 2: Firecrawl (Deep Research)                              │
│  ├── AI-powered scraping                                        │
│  ├── Markdown conversion                                        │
│  └── PDF text extraction                                        │
│       ↓ If fails                                                │
├─────────────────────────────────────────────────────────────────┤
│  TIER 3: Direct HTTP Fetch (Fallback)                           │
│  └── Basic HTML fetch                                           │
└─────────────────────────────────────────────────────────────────┘
    ↓
Clean Text/Markdown
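
The Tier 1 escalation in the diagram (standard proxy first, JS rendering on the first retry, premium proxy on later retries) could be expressed as a small parameter builder. This is a sketch under stated assumptions: the endpoint and the `render_js` / `premium_proxy` parameter names reflect ScrapingBee's HTML API as commonly documented and should be checked against the current docs; the function names are hypothetical; and keeping JS rendering enabled once the premium proxy is switched on is a guess, since the diagram does not say.

```python
from urllib.parse import urlencode

# Assumed ScrapingBee HTML API endpoint; verify against the provider's docs.
SCRAPINGBEE_ENDPOINT = "https://app.scrapingbee.com/api/v1/"


def scrapingbee_params(url: str, api_key: str, attempt: int) -> dict[str, str]:
    """Build query parameters for a 0-based attempt number.

    attempt 0:  standard proxy, JS rendering off
    attempt 1:  first retry, JS rendering on
    attempt 2+: later retries, JS rendering on plus premium proxy (assumed)
    """
    params = {
        "api_key": api_key,
        "url": url,
        # Set explicitly so the first attempt does not depend on API defaults.
        "render_js": "false",
        "premium_proxy": "false",
    }
    if attempt >= 1:
        params["render_js"] = "true"
    if attempt >= 2:
        params["premium_proxy"] = "true"
    return params


def scrapingbee_request_url(url: str, api_key: str, attempt: int) -> str:
    """Return the full GET URL for the given attempt (no request is sent)."""
    return SCRAPINGBEE_ENDPOINT + "?" + urlencode(scrapingbee_params(url, api_key, attempt))
```

Escalating only on retry keeps the common case on the cheapest request type, with the more expensive options reserved for pages that actually need them.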

1. ScrapingBee Integration

Basic Configuration

Core Scraping Function

API Call Structure

2. Firecrawl Integration

3. Deep Research 4-Tier Scraping

4. Twitter/X Special Handling

5. Parallel URL Scraping

6. HTML to Markdown Conversion

7. Retry Architecture

8. Error Handling Strategy

Scraping Error Hierarchy

Error Handling in Code

9. Integration with Search

10. Performance Optimization

Scraping Best Practices

Summary