Scraping Architecture

Overview

The scraping system extracts content from web pages using multiple providers with automatic fallback. It handles regular web pages and JavaScript-rendered content, can escalate to premium proxies for blocked sites, and special-cases Twitter/X posts.


Scraping Providers

Plain text
URL Input
    ↓
┌─────────────────────────────────────────────────────────────────┐
│  TIER 1: ScrapingBee (Default)                                  │
│  ├── Standard proxy                                             │
│  ├── Optional JS rendering (first retry)                        │
│  └── Optional Premium proxy (second+ retry)                      │
│       ↓ If fails or content insufficient                        │
├─────────────────────────────────────────────────────────────────┤
│  TIER 2: Firecrawl (Deep Research)                              │
│  ├── AI-powered scraping                                        │
│  ├── Markdown conversion                                        │
│  └── PDF text extraction                                        │
│       ↓ If fails                                                │
├─────────────────────────────────────────────────────────────────┤
│  TIER 3: Direct HTTP Fetch (Fallback)                           │
│  └── Basic HTML fetch                                          │
└─────────────────────────────────────────────────────────────────┘
    ↓
Clean Text/Markdown

1. ScrapingBee Integration

File: src/server/services/scrapper.ts

Basic Configuration
TypeScript
const BASE_URL = `https://app.scrapingbee.com/api/v1`;
// Key redacted here; keep secrets out of source and docs
const SCRAPING_BEE_API_KEY = "<REDACTED>";
const SCRAPPER_BEE_MAX_TIMEOUT = 10000; // 10 seconds
const SCRAPPER_BEE_MAX_RETRIES = 1; // Normal mode: one attempt, no retry
const SCRAPPER_BEE_MAX_RETRIES_FOR_DEEP_RESEARCH = 3; // Deep research mode: up to three attempts

Core Scraping Function

File: src/server/services/scrapper.ts:40

TypeScript
export const scrap = async (
	url: string,
	isDeepResearch?: boolean,
	_signal?: AbortSignal,
) => {
	try {
		return await retryAsyncFunction(
			(retryContext) =>
				scrapperBeeAPICall({
					url,
					// Business decision: Only enable JS/premium for deep research
					// Contact siddsax before changing
					...(isDeepResearch && {
						// Premium proxy from second retry onward (better success rate)
						premium_proxy: retryContext.retryCount > 1,
						// JS rendering only on first retry (expensive, use sparingly)
						render_js: retryContext.retryCount === 1,
					}),
				}),
			isDeepResearch
				? SCRAPPER_BEE_MAX_RETRIES_FOR_DEEP_RESEARCH
				: SCRAPPER_BEE_MAX_RETRIES,
			(error) => {
				// Client errors (4xx) - don't retry, fail fast
				if (error.response?.status >= 400 && error.response?.status < 500) {
					throw error;
				}
			},
		);
	} catch (e: any) {
		logger.error({ url, error: e.response?.data }, `ERROR/SCRAP`);
		throw e;
	}
};

API Call Structure

File: src/server/services/scrapper.ts:22

TypeScript
export const scrapperBeeAPICall = async ({
	url,
	render_js = false, // JavaScript rendering
	premium_proxy = false, // Premium proxy (better for blocked sites)
}: TScrapperRequest): Promise<string> => {
	const params = {
		api_key: SCRAPING_BEE_API_KEY,
		url: url,
		render_js,
		premium_proxy,
	};

	const response = await axiosInstance.get(BASE_URL, {
		params,
		timeout: SCRAPPER_BEE_MAX_TIMEOUT,
	});

	return response.data; // Raw HTML string
};

ScrapingBee Parameters:

  • api_key: Authentication
  • url: Target URL to scrape
  • render_js: Execute JavaScript (slower, more expensive)
  • premium_proxy: Use residential proxies (better success rate, more expensive)

Retry Logic (Deep Research):

  • Attempt 1 (retryCount 0): Standard proxy, no JS
  • Attempt 2 (retryCount 1): Standard proxy, JS enabled
  • Attempt 3 (retryCount 2): Premium proxy, no JS (render_js is true only when retryCount === 1)
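
The flag selection can be written as a pure function of the zero-based retry count, which makes the schedule easy to check. This is a minimal sketch that mirrors the conditions inside scrap(); `optionsForAttempt` and `TAttemptOptions` are hypothetical names that do not appear in scrapper.ts.

```typescript
// Hypothetical helper: mirrors the flag logic inside scrap(), nothing more.
type TAttemptOptions = { render_js: boolean; premium_proxy: boolean };

function optionsForAttempt(
	retryCount: number, // zero-based, as passed by retryAsyncFunction
	isDeepResearch: boolean,
): TAttemptOptions {
	if (!isDeepResearch) {
		// Normal mode never enables the paid features
		return { render_js: false, premium_proxy: false };
	}
	return {
		render_js: retryCount === 1, // second attempt only
		premium_proxy: retryCount > 1, // third attempt onward
	};
}

// Print the deep-research schedule for the three allowed attempts
for (let retryCount = 0; retryCount < 3; retryCount++) {
	console.log(retryCount + 1, optionsForAttempt(retryCount, true));
}
```

Note that JS rendering and the premium proxy are never active on the same attempt under these conditions.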

2. Firecrawl Integration

File: src/server/services/firecrawl.ts

Deep research uses Firecrawl for AI-powered scraping:

TypeScript
import FirecrawlApp from "@mendable/firecrawl-js";

export const FIRECRAWL_API_KEY = "<REDACTED>"; // Key redacted here; keep secrets out of source and docs
export const firecrawl = new FirecrawlApp({ apiKey: FIRECRAWL_API_KEY });

Usage in Deep Research:

TypeScript
// Scrape and convert to markdown
const result = await firecrawl.scrapeUrl(url, {
	formats: ["markdown"],
});

// Search and scrape results
const searchResults = await firecrawl.search(query, {
	limit: 10,
	scrapeOptions: { formats: ["markdown"] },
});

Firecrawl Features:

  • Automatic content extraction
  • Markdown conversion
  • PDF text extraction
  • Rate limiting handled by library

3. Deep Research 4-Tier Scraping

File: src/server/endpoints/unified/features/deepResearch/firecrawlSerp.ts

Deep research uses a four-tier cascade, moving to the next tier only when the current one fails or returns nothing:

Plain text
URL from Search Result
    ↓
┌─────────────────────────────────────────────────────────────┐
│ TIER 1: Bing Search Direct                                  │
│ └── Fastest - uses Bing's cached content                    │
│     ↓ If fails                                              │
├─────────────────────────────────────────────────────────────┤
│ TIER 2: Direct Scraping (getCleanHTMLAsMarkdownFromUrl)     │
│ └── ScrapingBee + HTML-to-Markdown conversion               │
│     ↓ If fails or empty                                     │
├─────────────────────────────────────────────────────────────┤
│ TIER 3: Firecrawl Fallback                                  │
│ └── AI-powered scraping with retry logic                    │
│     ↓ If fails                                              │
├─────────────────────────────────────────────────────────────┤
│ TIER 4: Basic Fallback                                      │
│ └── Raw HTTP fetch with basic extraction                    │
└─────────────────────────────────────────────────────────────┘
    ↓
Markdown Text
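
The cascade in the diagram can be sketched as a loop over an ordered list of providers. This is an illustration under assumptions, not the code in firecrawlSerp.ts: `TProvider` and `scrapeWithCascade` are hypothetical names, and the 100-character threshold is borrowed from the markdown validation described later in this document.

```typescript
// A provider takes a URL and resolves to markdown, or undefined on a miss.
type TProvider = {
	label: string;
	scrape: (url: string) => Promise<string | undefined>;
};

// Hypothetical helper: try providers in order, return the first usable result.
async function scrapeWithCascade(
	url: string,
	providers: TProvider[],
	minLength = 100, // assumed threshold, same as the markdown length check
): Promise<{ provider: string; content: string } | undefined> {
	for (const provider of providers) {
		try {
			const content = await provider.scrape(url);
			if (content && content.length >= minLength) {
				return { provider: provider.label, content };
			}
			// Empty or too short: fall through to the next tier
		} catch {
			// A thrown error is treated like a miss; the next tier gets a turn
		}
	}
	return undefined; // every tier failed
}

// Stub providers standing in for the real tiers
const demoProviders: TProvider[] = [
	{ label: "tier-1", scrape: async () => { throw new Error("blocked"); } },
	{ label: "tier-2", scrape: async () => "too short" },
	{ label: "tier-3", scrape: async () => "x".repeat(150) },
];

scrapeWithCascade("https://example.com", demoProviders).then((hit) => {
	console.log(hit?.provider); // logs "tier-3"
});
```

Keeping the tiers in an array makes the order explicit and lets a tier be added or removed without touching the control flow.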

4. Twitter/X Special Handling

File: src/server/services/scrapper.ts:81

Twitter/X uses a special CDN API instead of scraping:

TypeScript
const TWITTER_CDN_URL = "https://cdn.syndication.twimg.com/tweet-result";
const TWITTER_CDN_TIMEOUT = 4000; // 4 seconds

export const getTweetFromTwitterUrl = async (
	url: string,
	controller?: AbortController,
) => {
	if (!controller) controller = new AbortController();

	try {
		const tweetId = (url.split("/").at(-1) ?? "").trim();
		if (isNaN(Number(tweetId))) {
			return { cleanHTML: "" };
		}

		// Generate token from tweet ID (Twitter's client-side algorithm)
		const getToken = (id: string) => {
			return ((Number(id) / 1e15) * Math.PI)
				.toString(6 ** 2)
				.replace(/(0+|\.)/g, "");
		};

		const params = {
			id: tweetId,
			lang: "en",
			token: getToken(tweetId),
		};

		const response = await axiosInstance.get(TWITTER_CDN_URL, {
			params,
			signal: controller.signal,
			timeout: TWITTER_CDN_TIMEOUT,
		});

		return { cleanHTML: response.data?.text ?? "" };
	} catch (_error) {
		logger.info("ERROR/TWITTER_CDN_API_ERROR");
		return { cleanHTML: "" };
	}
};

Why CDN API:

  • Twitter blocks most scrapers
  • CDN endpoint is public but requires token generation
  • Returns clean text (no HTML parsing needed)
  • 4s timeout (fast fail)

Token Algorithm:

JavaScript
// ((tweetId / 1e15) * Math.PI).toString(36).replace(/(0+|\.)/g, "")
// Result: a short base-36 string with zeros and the decimal point removed

This mimics Twitter's frontend token generation to access their CDN.
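
The token arithmetic can be tried in isolation. The sketch below reuses the same formula and pairs it with a stricter ID extractor, since `url.split("/").at(-1)` returns a non-numeric string when the URL carries a query string such as `?s=20`. `extractTweetId` and `tweetToken` are hypothetical names, and the extractor assumes the usual `/status/<digits>` path shape.

```typescript
// Extract the numeric tweet ID from a status URL, ignoring query strings
// and a trailing slash. Hypothetical helper, not part of scrapper.ts.
function extractTweetId(url: string): string | undefined {
	let pathname: string;
	try {
		pathname = new URL(url).pathname;
	} catch {
		return undefined; // not a parseable URL
	}
	const match = pathname.match(/\/status\/(\d+)\/?$/);
	return match ? match[1] : undefined;
}

// Same arithmetic as getToken in the document: scale the ID, multiply by pi,
// render in base 36, then strip zeros and the decimal point.
function tweetToken(id: string): string {
	return ((Number(id) / 1e15) * Math.PI)
		.toString(36)
		.replace(/(0+|\.)/g, "");
}

const tweetId = extractTweetId(
	"https://x.com/someuser/status/1234567890123456789?s=20",
);
if (tweetId) {
	console.log(tweetId, tweetToken(tweetId));
}
```

Because the ID is converted with `Number`, IDs above 2^53 lose their low digits before the token is computed; the formula in the document behaves the same way.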


5. Parallel URL Scraping

File: src/server/endpoints/unified/features/web-access/utils.ts

When multiple search results need scraping, fetch in parallel:

TypeScript
export async function getDetailedResults(
	results: TOrganicResult[],
	isDeepResearch?: boolean,
): Promise<Array<{ result: TOrganicResult; content: string | undefined }>> {
	const controller = new AbortController();

	// Create fetch promises for all URLs
	const fetchPromises = results.map(async (result) => {
		try {
			let content: string | undefined;

			// Special handling for Twitter/X
			if (isTwitterUrl(result.url)) {
				const tweet = await getTweetFromTwitterUrl(result.url, controller);
				content = tweet.cleanHTML || undefined;
			} else {
				// Standard scraping with timeout
				content = await Promise.race([
					getCleanHTMLAsMarkdownFromUrl(result.url, isDeepResearch),
					new Promise<undefined>((_, reject) =>
						setTimeout(() => reject(new Error("Timeout")), 8000),
					),
				]);
			}

			return { result, content };
		} catch (error) {
			logger.error(
				{ url: result.url, error },
				"ERROR/WEB_ACCESS/DETAILED_RESULTS",
			);
			return { result, content: undefined };
		}
	});

	// Wait for all to complete
	const allResults = await Promise.all(fetchPromises);

	// Drop failed scrapes; Promise.all keeps the rest in input order
	const successfulResults = allResults.filter(
		(r): r is { result: TOrganicResult; content: string } =>
			r.content !== undefined,
	);

	return successfulResults;
}

Key Features:

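  • Parallel fetching (all at once)
  • 8-second timeout per URL (the race stops waiting; the underlying request is not cancelled)
  • Results keep the original search order, not completion order
  • Graceful degradation (failed URLs are logged and dropped from the result)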
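
The Promise.race timeout in getDetailedResults leaves its timer running and never aborts the losing request. One way to tighten this is a small wrapper that clears the timer and signals an abort. This is a sketch, not the project's code: `withTimeout` is a hypothetical helper, and wiring it in would require the scraping function to accept an AbortSignal, which getCleanHTMLAsMarkdownFromUrl as shown does not.

```typescript
// Hypothetical helper: race a task against a timer, abort the task on timeout,
// and always clear the timer once the race is settled.
async function withTimeout<T>(
	task: (signal: AbortSignal) => Promise<T>,
	ms: number,
): Promise<T> {
	const controller = new AbortController();
	let timer: ReturnType<typeof setTimeout> | undefined;
	const timeout = new Promise<never>((_, reject) => {
		timer = setTimeout(() => {
			controller.abort(); // tell the task to stop
			reject(new Error(`Timeout after ${ms}ms`));
		}, ms);
	});
	try {
		return await Promise.race([task(controller.signal), timeout]);
	} finally {
		clearTimeout(timer);
	}
}

// Stub standing in for a scrape that takes `delay` ms to finish
const slowScrape = (delay: number) => (_signal: AbortSignal) =>
	new Promise<string>((resolve) => setTimeout(() => resolve("content"), delay));

withTimeout(slowScrape(5), 50).then((c) => console.log(c)); // logs "content"
withTimeout(slowScrape(200), 20).catch((e: Error) => console.log(e.message)); // logs "Timeout after 20ms"
```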

6. HTML to Markdown Conversion

File: src/server/endpoints/unified/features/web-access/utils.ts

TypeScript
export async function getCleanHTMLAsMarkdownFromUrl(
	url: string,
	isDeepResearch?: boolean,
): Promise<string | undefined> {
	try {
		let html: string;

		// Tier 1: ScrapingBee
		try {
			html = await scrap(url, isDeepResearch);
		} catch (error) {
			logger.error({ url, error }, "ERROR/WEB_ACCESS/SCRAPINGBEE_FAILED");
			return undefined;
		}

		if (!html) {
			return undefined;
		}

		// Convert HTML to Markdown
		const markdown = await convertHtmlToMarkdown(html, url);

		// Validate minimum content length
		if (markdown.length < 100) {
			logger.info(
				{ url, length: markdown.length },
				"WEB_ACCESS/MARKDOWN_TOO_SHORT",
			);
			return undefined;
		}

		return markdown;
	} catch (error) {
		logger.error({ url, error }, "ERROR/WEB_ACCESS/GET_CLEAN_HTML");
		return undefined;
	}
}

Conversion Process:

  1. Scrape HTML (via ScrapingBee)
  2. Parse with JSDOM or similar
  3. Extract main content (remove nav, ads, etc.)
  4. Convert to Markdown (Turndown or similar)
  5. Validate minimum length (100 chars)

7. Retry Architecture

File: src/server/utilities/retry.ts

The retry system used by scrapers:

TypeScript
export async function retryAsyncFunction<
	T extends (context: TRetryContext) => any,
>(
	asyncFunc: T,
	currentRetryCount = DEFAULT_RETRIES, // 3 default
	errorHandler?: (error: any) => void,
): Promise<Awaited<ReturnType<T>>> {
	const totalRetryCount = currentRetryCount;
	let result;

	while (currentRetryCount > 0) {
		try {
			// Pass retry context (allows progressive enhancement)
			result = await asyncFunc({
				retryCount: totalRetryCount - currentRetryCount,
			});
			break; // Success - exit loop
		} catch (error) {
			if (errorHandler) {
				errorHandler(error);
			}
			currentRetryCount--;
			if (currentRetryCount === 0) throw error; // Exhausted
		}
	}
	return result;
}

Usage in Scraping:

TypeScript
await retryAsyncFunction(
	(retryContext) =>
		scrapperBeeAPICall({
			url,
			premium_proxy: retryContext.retryCount > 1,
			render_js: retryContext.retryCount === 1,
		}),
	isDeepResearch ? 3 : 1,
	(error) => {
		// Don't retry 4xx client errors
		if (error.response?.status >= 400 && error.response?.status < 500) {
			throw error;
		}
	},
);

Progressive Enhancement:

  • Retry 0: Standard proxy, no JS
  • Retry 1: Standard proxy, JS enabled
  • Retry 2+: Premium proxy, no JS (render_js is true only when retryCount === 1)
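
retryAsyncFunction as shown retries immediately, with no delay between attempts. If a pause between attempts is wanted, one option is exponential backoff. The sketch below is a hypothetical variant, not the code in retry.ts: `retryWithBackoff`, `sleep`, the 250 ms base delay, and the `shouldRetry` predicate are all assumptions made for illustration.

```typescript
const sleep = (ms: number) =>
	new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical variant of retryAsyncFunction with exponential backoff.
async function retryWithBackoff<T>(
	fn: (context: { retryCount: number }) => Promise<T>,
	maxAttempts = 3,
	baseDelayMs = 250, // assumed starting delay; tune per provider
	shouldRetry: (error: unknown) => boolean = () => true,
): Promise<T> {
	let lastError: unknown;
	for (let retryCount = 0; retryCount < maxAttempts; retryCount++) {
		try {
			return await fn({ retryCount });
		} catch (error) {
			lastError = error;
			const isLast = retryCount === maxAttempts - 1;
			if (isLast || !shouldRetry(error)) break;
			// Delay doubles each time: base, 2x base, 4x base, ...
			await sleep(baseDelayMs * 2 ** retryCount);
		}
	}
	throw lastError;
}

// Fails twice, then succeeds on the third attempt
let calls = 0;
retryWithBackoff(
	async ({ retryCount }) => {
		calls++;
		if (retryCount < 2) throw new Error("transient");
		return "ok";
	},
	3,
	10,
).then((value) => console.log(value, calls)); // logs: ok 3
```

The `shouldRetry` predicate plays the role of the throwing error handler in the original: returning false for 4xx responses stops the loop without waiting.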

8. Error Handling Strategy

Scraping Error Hierarchy
Plain text
Level 1: Network Error (timeout, DNS, etc.)
  └── Logged, then retried immediately (retryAsyncFunction adds no delay)

Level 2: Client Error (4xx)
  └── Don't retry (page doesn't exist, blocked, etc.)

Level 3: Server Error (5xx)
  └── Retried immediately, up to the attempt limit

Level 4: Content Error (empty, too short, malformed)
  └── Try next provider in cascade

Level 5: All Providers Failed
  └── Return undefined (graceful degradation)

Error Handling in Code
TypeScript
// Illustrative flow; scrapWithPremiumProxy is not defined in this document
try {
	const html = await scrap(url, isDeepResearch);
	const markdown = await convertHtmlToMarkdown(html, url);

	if (markdown.length < 100) {
		// Content too short - try Firecrawl
		return await firecrawl.scrapeUrl(url, { formats: ["markdown"] });
	}

	return markdown;
} catch (error: any) {
	if (error.response?.status === 403) {
		// Blocked - try premium proxy
		return await scrapWithPremiumProxy(url);
	}

	if (error.code === "ECONNABORTED") {
		// Timeout - try Firecrawl
		return await firecrawl.scrapeUrl(url, { formats: ["markdown"] });
	}

	// Unknown error - log and return empty
	logger.error({ url, error }, "SCRAPING_FAILED_ALL_PROVIDERS");
	return undefined;
}
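
The error hierarchy in this section can also be captured as a small classifier that maps an error to the next action. This is a sketch under assumptions: `classifyScrapeError`, `TScrapeAction`, and `TScrapeError` are hypothetical names, and the error shape follows the axios-style fields (`code`, `response.status`) already used in the snippets above.

```typescript
type TScrapeAction = "fail-fast" | "retry" | "next-provider";

// Minimal shape of the errors this sketch inspects (axios-style fields).
type TScrapeError = { code?: string; response?: { status?: number } };

// Hypothetical classifier following the error hierarchy in this section.
function classifyScrapeError(error: TScrapeError): TScrapeAction {
	const httpStatus = error.response?.status;
	if (httpStatus !== undefined && httpStatus >= 400 && httpStatus < 500) {
		return "fail-fast"; // Level 2: client error, retrying will not help
	}
	if (httpStatus !== undefined && httpStatus >= 500) {
		return "retry"; // Level 3: server error
	}
	if (error.code === "ECONNABORTED" || error.code === "ETIMEDOUT") {
		return "retry"; // Level 1: network-level failure
	}
	return "next-provider"; // anything else: hand over to the next tier
}

console.log(classifyScrapeError({ response: { status: 404 } })); // "fail-fast"
console.log(classifyScrapeError({ response: { status: 503 } })); // "retry"
console.log(classifyScrapeError({ code: "ECONNABORTED" })); // "retry"
console.log(classifyScrapeError({})); // "next-provider"
```

Separating the decision from the side effects keeps the policy testable without making network calls.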

9. Integration with Search

Scraping is the second step after search:

Plain text
Search → Get URLs → Scrape Top 3 → Process Content
    ↓         ↓           ↓              ↓
 SerpAPI   Results   Parallel      Extract
            ↓        Scraping      Learnings
        [url1,
         url2,
         url3]

In Deep Research:

TypeScript
// After getting search results
const searchResults = await webSearch(query);

// Scrape top 3 in parallel
const detailedResults = await getDetailedResults(
	searchResults.slice(0, 3),
	true,
);

// Process each result
for (const { result, content } of detailedResults) {
	if (content) {
		const learnings = await extractLearnings(content, query);
		allLearnings.push(...learnings);
	}
}

10. Performance Optimization

Scraping Best Practices
  1. Parallel Execution

    TypeScript
    // Scrape only the top 3 results by search rank, not all of them
    const top3 = searchResults.slice(0, 3);
    const results = await Promise.all(top3.map((r) => scrap(r.url)));
  2. Timeout Management

    TypeScript
    // 8-second timeout per scrape
    const content = await Promise.race([
    	scrap(url),
    	new Promise<never>((_, reject) =>
    		setTimeout(() => reject(new Error("Timeout")), 8000),
    	),
    ]);
  3. Content Validation

    TypeScript
    // Reject if too short (probably error page)
    if (content.length < 100) return undefined;
  4. Smart Retries

    TypeScript
    // Don't retry 4xx errors
    if (error.response?.status >= 400 && error.response?.status < 500) {
    	throw error; // Fail fast
    }

Summary

The scraping architecture:

  1. Multi-Tier Fallback: ScrapingBee → Firecrawl → Direct
  2. Progressive Enhancement: Standard → JS → Premium proxy
  3. Parallel Execution: All URLs scraped simultaneously
  4. Smart Retries: Context-aware, with 4xx errors failing fast
  5. Special Cases: Twitter CDN API for X.com
  6. Content Validation: Minimum length checks
  7. Error Resilience: Graceful degradation to undefined

Key Principle: Never let a scraping failure crash the research. Always have a fallback, always return gracefully.