Scraping Architecture
Overview
The scraping system extracts content from web pages using multiple providers with automatic fallback. It handles regular web pages and JavaScript-rendered content, can escalate to premium proxies, and has a dedicated path for special cases such as Twitter/X posts.
Scraping Providers
Plain text
URL Input
↓
┌─────────────────────────────────────────────────────────────────┐
│ TIER 1: ScrapingBee (Default) │
│ ├── Standard proxy │
│ ├── Optional JS rendering (first retry) │
│ └── Optional Premium proxy (second+ retry) │
│ ↓ If fails or content insufficient │
├─────────────────────────────────────────────────────────────────┤
│ TIER 2: Firecrawl (Deep Research) │
│ ├── AI-powered scraping │
│ ├── Markdown conversion │
│ └── PDF text extraction │
│ ↓ If fails │
├─────────────────────────────────────────────────────────────────┤
│ TIER 3: Direct HTTP Fetch (Fallback) │
│ └── Basic HTML fetch │
└─────────────────────────────────────────────────────────────────┘
↓
Clean Text/Markdown
02-DeepResearch.md
1. ScrapingBee Integration
File: src/server/services/scrapper.ts
Basic Configuration
TypeScript
const BASE_URL = `https://app.scrapingbee.com/api/v1` ;
// Read the key from the environment (variable name is illustrative);
// credentials should not be committed to source or documentation.
const SCRAPING_BEE_API_KEY = process.env.SCRAPING_BEE_API_KEY ?? "";
const SCRAPPER_BEE_MAX_TIMEOUT = 10000 ;
const SCRAPPER_BEE_MAX_RETRIES = 1 ;
const SCRAPPER_BEE_MAX_RETRIES_FOR_DEEP_RESEARCH = 3 ;
Core Scraping Function
File: src/server/services/scrapper.ts:40
TypeScript
export const scrap = async (
url : string ,
isDeepResearch ?: boolean ,
_signal ?: AbortSignal ,
) => {
try {
return await retryAsyncFunction (
(retryContext ) =>
scrapperBeeAPICall ({
url,
...(isDeepResearch && {
premium_proxy : retryContext.retryCount > 1 ,
render_js : retryContext.retryCount === 1 ,
}),
}),
isDeepResearch
? SCRAPPER_BEE_MAX_RETRIES_FOR_DEEP_RESEARCH
: SCRAPPER_BEE_MAX_RETRIES ,
(error ) => {
if (error.response ?.status >= 400 && error.response ?.status < 500 ) {
throw error;
}
},
);
} catch (e : any ) {
logger.error ({ url, error : e.response ?.data }, `ERROR/SCRAP` );
throw e;
}
};
API Call Structure
File: src/server/services/scrapper.ts:22
TypeScript
export const scrapperBeeAPICall = async ({
url,
render_js = false ,
premium_proxy = false ,
}: TScrapperRequest ): Promise <string > => {
const params = {
api_key : SCRAPING_BEE_API_KEY ,
url : url,
render_js,
premium_proxy,
};
const response = await axiosInstance.get (BASE_URL , {
params,
timeout : SCRAPPER_BEE_MAX_TIMEOUT ,
});
return response.data ;
};
api_key: Authentication
url: Target URL to scrape
render_js: Execute JavaScript (slower, more expensive)
premium_proxy: Use residential proxies (better success rate, more expensive)
Retry Logic (Deep Research):
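Attempt 1 (retryCount 0): Standard proxy, no JS
Attempt 2 (retryCount 1): Standard proxy, JS enabled
Attempt 3 (retryCount 2): Premium proxy, no JS (render_js is true only when retryCount === 1)
Outside deep research, SCRAPPER_BEE_MAX_RETRIES = 1 means a single attempt with the default options and no retry.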
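As a minimal sketch of how the retry count translates into a request, the snippet below maps a zero-based retry count to the two option flags and then builds the query string. The helper names `optionsForAttempt` and `buildScrapingBeeUrl` are hypothetical; only the endpoint and the four parameter names come from the code above. Note that `render_js` is true only when the retry count is exactly 1, matching the expressions in `scrap`.

```typescript
// Hypothetical helper names; only the endpoint and parameter names are from the source.
interface ScrapingBeeOptions {
  render_js: boolean;
  premium_proxy: boolean;
}

// Maps the zero-based retry count to request options, mirroring scrap():
// JS rendering only on the second attempt, premium proxy on later attempts.
function optionsForAttempt(retryCount: number): ScrapingBeeOptions {
  return {
    render_js: retryCount === 1,
    premium_proxy: retryCount > 1,
  };
}

// Builds the GET URL an HTTP client would request. The API key is a
// parameter so it can come from configuration instead of being hard-coded.
function buildScrapingBeeUrl(
  apiKey: string,
  targetUrl: string,
  options: ScrapingBeeOptions,
): string {
  const endpoint = new URL("https://app.scrapingbee.com/api/v1");
  endpoint.searchParams.set("api_key", apiKey);
  endpoint.searchParams.set("url", targetUrl); // percent-encoded automatically
  endpoint.searchParams.set("render_js", String(options.render_js));
  endpoint.searchParams.set("premium_proxy", String(options.premium_proxy));
  return endpoint.toString();
}

const secondAttempt = optionsForAttempt(1);
const requestUrl = buildScrapingBeeUrl(
  "YOUR_API_KEY",
  "https://example.com/a?b=1",
  secondAttempt,
);
```

Keeping the mapping in a pure function makes the escalation order easy to unit-test without issuing any network calls.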
2. Firecrawl Integration
File: src/server/services/firecrawl.ts
Deep research uses Firecrawl for AI-powered scraping:
TypeScript
import FirecrawlApp from "@mendable/firecrawl-js" ;
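// Read the key from the environment (variable name is illustrative);
// credentials should not be committed to source or documentation.
export const FIRECRAWL_API_KEY = process.env.FIRECRAWL_API_KEY ?? "";
export const firecrawl = new FirecrawlApp({ apiKey: FIRECRAWL_API_KEY });
TypeScript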
const result = await firecrawl.scrapeUrl (url, {
formats : ["markdown" ],
});
const searchResults = await firecrawl.search (query, {
limit : 10 ,
scrapeOptions : { formats : ["markdown" ] },
});
Automatic content extraction
Markdown conversion
PDF text extraction
Rate limiting handled by library
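The calls above return a response object, not a bare string, so callers need to pull the markdown out before using it. The sketch below does that defensively. It is an assumption that the markdown sits either at the top level or under a `data` field, since the exact shape depends on the SDK version; `extractMarkdown` is a hypothetical helper, not part of the Firecrawl library.

```typescript
// Pulls a markdown string out of a scrape response without assuming one
// exact shape. Both shapes handled here are assumptions about the SDK's
// return value; check the installed SDK version before relying on either.
function extractMarkdown(response: unknown): string | undefined {
  if (typeof response !== "object" || response === null) return undefined;

  const top = response as {
    success?: unknown;
    markdown?: unknown;
    data?: unknown;
  };

  // An explicit failure flag means there is nothing to extract.
  if (top.success === false) return undefined;

  // Shape A (assumed): markdown directly on the response.
  if (typeof top.markdown === "string") return top.markdown;

  // Shape B (assumed): markdown nested under `data`.
  if (typeof top.data === "object" && top.data !== null) {
    const nested = (top.data as { markdown?: unknown }).markdown;
    if (typeof nested === "string") return nested;
  }
  return undefined;
}

const flat = extractMarkdown({ success: true, markdown: "# Title" });
const nested = extractMarkdown({ success: true, data: { markdown: "# Nested" } });
const failed = extractMarkdown({ success: false, error: "blocked" });
```

Returning `undefined` for every unusable response keeps this helper consistent with the rest of the pipeline, where `undefined` signals "try the next provider".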
3. Deep Research 4-Tier Scraping
File: src/server/endpoints/unified/features/deepResearch/firecrawlSerp.ts
Deep research uses a four-tier cascade, moving to the next tier only when the previous one fails or returns nothing usable:
Plain text
URL from Search Result
↓
┌─────────────────────────────────────────────────────────────┐
│ TIER 1: Bing Search Direct │
│ └── Fastest - uses Bing's cached content │
│ ↓ If fails │
├─────────────────────────────────────────────────────────────┤
│ TIER 2: Direct Scraping (getCleanHTMLAsMarkdownFromUrl) │
│ └── ScrapingBee + HTML-to-Markdown conversion │
│ ↓ If fails or empty │
├─────────────────────────────────────────────────────────────┤
│ TIER 3: Firecrawl Fallback │
│ └── AI-powered scraping with retry logic │
│ ↓ If fails │
├─────────────────────────────────────────────────────────────┤
│ TIER 4: Basic Fallback │
│ └── Raw HTTP fetch with basic extraction │
└─────────────────────────────────────────────────────────────┘
↓
Markdown Text
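The cascade in the diagram can be sketched as a loop over an ordered list of providers. This is an illustration under stated assumptions, not the code in firecrawlSerp.ts: the `Provider` type, `scrapeWithCascade`, and the stub tiers are all hypothetical, and the 100-character threshold is borrowed from the markdown length check used elsewhere in this document.

```typescript
// A provider takes a URL and resolves to markdown, or undefined when it has
// nothing usable. All names in this sketch are illustrative.
type Provider = {
  name: string;
  fetch: (url: string) => Promise<string | undefined>;
};

const MIN_CONTENT_LENGTH = 100; // assumed threshold, same as the markdown check

// Tries providers in order and returns the first result that is long enough.
// A thrown error or too-short content moves on to the next tier.
async function scrapeWithCascade(
  url: string,
  providers: Provider[],
): Promise<{ provider: string; content: string } | undefined> {
  for (const provider of providers) {
    try {
      const content = await provider.fetch(url);
      if (content !== undefined && content.length >= MIN_CONTENT_LENGTH) {
        return { provider: provider.name, content };
      }
    } catch {
      // Swallowed here for brevity; a real implementation would log the error.
    }
  }
  return undefined; // every tier failed: degrade gracefully
}

// Stub tiers standing in for the real integrations.
const demoProviders: Provider[] = [
  { name: "bing-cache", fetch: async () => { throw new Error("cache miss"); } },
  { name: "scrapingbee", fetch: async () => "too short" },
  { name: "firecrawl", fetch: async () => "x".repeat(250) },
  { name: "basic-fetch", fetch: async () => "y".repeat(250) },
];
```

With these stubs, the first tier throws, the second returns content below the threshold, and the third succeeds, so the fourth tier is never called.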
4. Twitter/X Special Handling
File: src/server/services/scrapper.ts:81
Twitter/X posts are fetched from Twitter's syndication CDN endpoint instead of being scraped:
TypeScript
const TWITTER_CDN_URL = "https://cdn.syndication.twimg.com/tweet-result" ;
const TWITTER_CDN_TIMEOUT = 4000 ;
export const getTweetFromTwitterUrl = async (
url : string ,
controller ?: AbortController ,
) => {
if (!controller) controller = new AbortController ();
try {
const tweetId = (url.split ("/" ).at (-1 ) ?? "" ).trim ();
// Require digits only: Number("") is 0, so an isNaN check alone lets an empty ID through.
if (!/^\d+$/.test(tweetId)) {
return { cleanHTML : "" };
}
const getToken = (id : string ) => {
return ((Number (id) / 1e15 ) * Math .PI )
.toString (6 ** 2 )
.replace (/(0+|\.)/g , "" );
};
const params = {
id : tweetId,
lang : "en" ,
token : getToken (tweetId),
};
const response = await axiosInstance.get (TWITTER_CDN_URL , {
params,
signal : controller.signal ,
timeout : TWITTER_CDN_TIMEOUT ,
});
return { cleanHTML : response.data ?.text ?? "" };
} catch (_error) {
logger.info ("ERROR/TWITTER_CDN_API_ERROR" );
return { cleanHTML : "" };
}
};
Twitter blocks most scrapers
CDN endpoint is public but requires token generation
Returns clean text (no HTML parsing needed)
4s timeout (fast fail)
This mimics Twitter's frontend token generation to access their CDN.
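The ID extraction and token arithmetic can be isolated into two pure functions, which makes them easy to test without calling the CDN. In this sketch, `extractTweetId` and `tweetToken` are hypothetical names, and the sketch assumes tweet links follow the usual `/status/<id>` path layout; the token formula itself is the one from `getToken` above.

```typescript
// Extracts the numeric status ID from a tweet URL, ignoring query strings
// and trailing slashes. Returns undefined for anything that is not a
// status link. Assumes the usual ".../status/<digits>" path layout.
function extractTweetId(rawUrl: string): string | undefined {
  let parsed: URL;
  try {
    parsed = new URL(rawUrl);
  } catch {
    return undefined; // not a valid absolute URL
  }
  const match = /\/status\/(\d+)/.exec(parsed.pathname);
  return match ? match[1] : undefined;
}

// Same arithmetic as getToken above: scale the ID, multiply by pi, render
// in base 36, then drop runs of zeros and the decimal point.
function tweetToken(id: string): string {
  return ((Number(id) / 1e15) * Math.PI).toString(36).replace(/(0+|\.)/g, "");
}

const demoTweetId = extractTweetId(
  "https://x.com/someuser/status/1629307668568633344?s=20",
);
const demoToken = demoTweetId ? tweetToken(demoTweetId) : "";
```

Parsing with `URL` and matching on the path avoids the failure mode of taking the last `/`-separated segment, which breaks when the link carries a query string or a trailing slash.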
5. Parallel URL Scraping
File: src/server/endpoints/unified/features/web-access/utils.ts
When multiple search results need scraping, fetch in parallel:
TypeScript
export async function getDetailedResults (
results : TOrganicResult [],
isDeepResearch ?: boolean ,
): Promise <Array <{ result : TOrganicResult ; content : string | undefined }>> {
const controller = new AbortController ();
const fetchPromises = results.map (async (result) => {
try {
let content : string | undefined ;
if (isTwitterUrl (result.url )) {
const tweet = await getTweetFromTwitterUrl (result.url , controller);
content = tweet.cleanHTML || undefined ;
} else {
content = await Promise .race ([
getCleanHTMLAsMarkdownFromUrl (result.url , isDeepResearch),
new Promise <undefined >((_, reject ) =>
setTimeout (() => reject (new Error ("Timeout" )), 8000 ),
),
]);
}
return { result, content };
} catch (error) {
logger.error (
{ url : result.url , error },
"ERROR/WEB_ACCESS/DETAILED_RESULTS" ,
);
return { result, content : undefined };
}
});
const allResults = await Promise .all (fetchPromises);
const successfulResults = allResults.filter (
(r): r is { result : TOrganicResult ; content : string } =>
r.content !== undefined ,
);
return successfulResults;
}
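Parallel fetching (all requests start at once)
8-second timeout per URL (tweets rely on the 4-second CDN timeout instead)
Results keep the input order, because Promise.all preserves order; they are not sorted by speed
Failed, timed-out, or empty URLs are logged and filtered out of the returned array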
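One detail worth noting in the function above is that the `setTimeout` inside `Promise.race` is never cleared, so the timer stays pending even after the scrape finishes. A minimal sketch of a reusable timeout wrapper that cleans up after itself follows; `withTimeout`, `settleInOrder`, and `sleep` are hypothetical helpers, not part of the codebase.

```typescript
// Rejects if `work` does not settle within `ms`, and always clears the
// timer so a finished request does not leave a pending timeout behind.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timeout after ${ms} ms`)), ms);
  });
  return Promise.race([work, timeout]).finally(() => {
    if (timer !== undefined) clearTimeout(timer);
  });
}

// Starts all tasks at once. Results come back in input order, because
// Promise.all preserves order regardless of which task finishes first.
// A task that fails or times out contributes undefined.
async function settleInOrder<T>(
  tasks: Array<() => Promise<T>>,
  ms: number,
): Promise<Array<T | undefined>> {
  return Promise.all(
    tasks.map((task) => withTimeout(task(), ms).catch(() => undefined)),
  );
}

// Small helper used to simulate slow and fast requests.
const sleep = <T>(ms: number, value: T): Promise<T> =>
  new Promise<T>((resolve) => setTimeout(() => resolve(value), ms));
```

The wrapper only stops waiting; it does not cancel the underlying work. Cancelling an in-flight HTTP request would additionally require passing an `AbortSignal` to the client.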
6. HTML to Markdown Conversion
File: src/server/endpoints/unified/features/web-access/utils.ts
TypeScript
export async function getCleanHTMLAsMarkdownFromUrl (
url : string ,
isDeepResearch ?: boolean ,
): Promise <string | undefined > {
try {
let html : string ;
try {
html = await scrap (url, isDeepResearch);
} catch (error) {
logger.error ({ url, error }, "ERROR/WEB_ACCESS/SCRAPINGBEE_FAILED" );
return undefined ;
}
if (!html) {
return undefined ;
}
const markdown = await convertHtmlToMarkdown (html, url);
if (markdown.length < 100 ) {
logger.info (
{ url, length : markdown.length },
"WEB_ACCESS/MARKDOWN_TOO_SHORT" ,
);
return undefined ;
}
return markdown;
} catch (error) {
logger.error ({ url, error }, "ERROR/WEB_ACCESS/GET_CLEAN_HTML" );
return undefined ;
}
}
Scrape HTML (via ScrapingBee)
Parse with JSDOM or similar
Extract main content (remove nav, ads, etc.)
Convert to Markdown (Turndown or similar)
Validate minimum length (100 chars)
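The extraction and validation steps can be illustrated with a deliberately naive, dependency-free stand-in. This is not how `convertHtmlToMarkdown` works: it swaps a DOM parser and a Markdown converter for regular expressions, which are fragile on real-world HTML, and it produces plain text, not Markdown. The names `htmlToPlainText` and `validateContent` and the list of noise tags are assumptions made for the example.

```typescript
// Elements treated as page chrome and removed with their contents.
// The list is an assumption chosen for illustration.
const NOISE_TAGS = ["script", "style", "nav", "header", "footer", "aside"];

// Naive regex-based cleanup: a stand-in for a real DOM parser, suitable
// only for demonstrating the order of the steps.
function htmlToPlainText(html: string): string {
  let text = html;
  for (const tag of NOISE_TAGS) {
    // Remove the element together with everything inside it.
    const pattern = new RegExp(`<${tag}\\b[^>]*>[\\s\\S]*?</${tag}>`, "gi");
    text = text.replace(pattern, " ");
  }
  return text
    .replace(/<[^>]+>/g, " ") // drop remaining tags, keep their text
    .replace(/\s+/g, " ") // collapse whitespace
    .trim();
}

// Mirrors the minimum-length check: content below the threshold is
// treated as unusable and reported as undefined.
function validateContent(text: string, minLength = 100): string | undefined {
  return text.length >= minLength ? text : undefined;
}

const sampleHtml =
  "<html><nav>Home | About</nav><script>var a = 1;</script>" +
  "<p>Hello <b>world</b></p></html>";
const sampleText = htmlToPlainText(sampleHtml); // "Hello world"
const sampleResult = validateContent(sampleText); // undefined: too short
```

Validating after cleanup matters: a page that is mostly navigation and scripts can be long as raw HTML yet nearly empty once the boilerplate is removed.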
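7. Retry Architecture
File: src/server/utilities/retry.ts
The retry helper used by the scrapers. Its count argument is the total number of attempts, so a value of 1 means a single call with no retry, and retryCount in the context starts at 0: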
TypeScript
export async function retryAsyncFunction<
T extends (context : TRetryContext ) => any ,
>(
asyncFunc : T,
currentRetryCount = DEFAULT_RETRIES ,
errorHandler ?: (error : any ) => void ,
): Promise <Awaited <ReturnType <T>>> {
const totalRetryCount = currentRetryCount;
let result;
while (currentRetryCount > 0 ) {
try {
result = await asyncFunc ({
retryCount : totalRetryCount - currentRetryCount,
});
break ;
} catch (error) {
if (errorHandler) {
errorHandler (error);
}
currentRetryCount--;
if (currentRetryCount === 0 ) throw error;
}
}
return result;
}
TypeScript
await retryAsyncFunction (
(retryContext ) =>
scrapperBeeAPICall ({
url,
premium_proxy : retryContext.retryCount > 1 ,
render_js : retryContext.retryCount === 1 ,
}),
isDeepResearch ? 3 : 1 ,
(error ) => {
if (error.response ?.status >= 400 && error.response ?.status < 500 ) {
throw error;
}
},
);
Retry 0: Standard proxy, no JS
Retry 1: Standard proxy, JS enabled
Retry 2: Premium proxy, no JS (render_js is true only when retryCount === 1; with 3 attempts there is no retry 3)
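To make the attempt counting concrete, the sketch below re-states the helper in condensed form and records which retry counts a failing function sees. It is a demonstration, not the project's code: `retrySketch` and `traceAttempts` are hypothetical names, and unlike the original, the sketch throws instead of returning undefined when called with zero attempts.

```typescript
type RetryContext = { retryCount: number };

// Condensed re-statement of the helper above, for demonstration only.
// `maxAttempts` is the total number of calls, and an errorHandler that
// throws aborts the loop immediately.
async function retrySketch<T>(
  fn: (context: RetryContext) => Promise<T>,
  maxAttempts: number,
  errorHandler?: (error: unknown) => void,
): Promise<T> {
  let lastError: unknown = new Error("maxAttempts must be at least 1");
  for (let retryCount = 0; retryCount < maxAttempts; retryCount++) {
    try {
      return await fn({ retryCount });
    } catch (error) {
      lastError = error;
      if (errorHandler) errorHandler(error); // may rethrow to stop retrying
    }
  }
  throw lastError;
}

// Records the retry counts seen when every attempt fails with `status`.
async function traceAttempts(
  maxAttempts: number,
  status: number,
): Promise<number[]> {
  const seen: number[] = [];
  try {
    await retrySketch(
      async ({ retryCount }) => {
        seen.push(retryCount);
        throw { response: { status } }; // shaped like an HTTP client error
      },
      maxAttempts,
      (error) => {
        const code =
          (error as { response?: { status?: number } }).response?.status ?? 0;
        if (code >= 400 && code < 500) throw error; // client errors are final
      },
    );
  } catch {
    // Expected: the demo function never succeeds.
  }
  return seen;
}
```

With three attempts and a 503 response, the function is called with retry counts 0, 1, and 2. With a 404, the error handler rethrows after the first call, so only retry count 0 is ever seen.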
8. Error Handling Strategy
Scraping Error Hierarchy
Plain text
Level 1: Network Error (timeout, DNS, etc.)
└── Logged, then retried (the retryAsyncFunction shown above retries immediately, with no delay)
Level 2: Client Error (4xx)
└── Don't retry (page doesn't exist, blocked, etc.)
Level 3: Server Error (5xx)
└── Retried (any backoff would have to be added; the helper shown above has none)
Level 4: Content Error (empty, too short, malformed)
└── Try next provider in cascade
Level 5: All Providers Failed
└── Return undefined (graceful degradation)
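The hierarchy can be expressed as a small decision function, paired with a delay calculation for cases where backoff between retries is wanted. Everything here is a sketch under assumptions: `classifyFailure`, `backoffDelayMs`, the `ScrapeFailure` shape, and the 500 ms base and 8000 ms cap are hypothetical and do not appear in the source.

```typescript
type ScrapeAction = "retry" | "abort" | "next-provider";

// Minimal description of a failed attempt. Field names are illustrative.
interface ScrapeFailure {
  status?: number; // HTTP status, if a response was received
  code?: string; // client error code such as "ECONNABORTED"
  contentLength?: number; // length of the extracted content, if any
}

// Maps a failure to the action suggested by the hierarchy above.
function classifyFailure(
  failure: ScrapeFailure,
  minLength = 100,
): ScrapeAction {
  if (failure.status !== undefined) {
    if (failure.status >= 400 && failure.status < 500) return "abort";
    if (failure.status >= 500) return "retry";
  }
  if (
    failure.contentLength !== undefined &&
    failure.contentLength < minLength
  ) {
    return "next-provider"; // response arrived but content is unusable
  }
  return "retry"; // no response at all: treat as a network-level failure
}

// Exponential backoff with a cap: base, 2x base, 4x base, and so on.
// Base and cap values are assumptions chosen for the example.
function backoffDelayMs(
  retryCount: number,
  baseMs = 500,
  capMs = 8000,
): number {
  return Math.min(capMs, baseMs * 2 ** retryCount);
}

const notFound = classifyFailure({ status: 404 });
const overloaded = classifyFailure({ status: 503 });
const timedOut = classifyFailure({ code: "ECONNABORTED" });
const emptyPage = classifyFailure({ status: 200, contentLength: 12 });
```

Separating the decision from the retry loop keeps the policy testable on its own and makes it straightforward to change one rule, for example treating 429 as retryable, without touching the loop.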
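Error Handling in Code
TypeScript
// Illustrative fragment; assumes it runs inside an async function.
try {
  const html = await scrap(url, isDeepResearch);
  // convertHtmlToMarkdown is async (see section 6), so it must be awaited.
  const markdown = await convertHtmlToMarkdown(html, url);
  if (markdown.length < 100) {
    // Content too short: fall through to the next provider.
    return await firecrawl.scrapeUrl(url);
  }
  return markdown;
} catch (error: any) {
  if (error.response?.status === 403) {
    // Blocked: scrapWithPremiumProxy is not defined elsewhere in this document.
    return await scrapWithPremiumProxy(url);
  }
  if (error.code === "ECONNABORTED") {
    // Client-side timeout: try Firecrawl instead.
    return await firecrawl.scrapeUrl(url);
  }
  logger.error({ url, error }, "SCRAPING_FAILED_ALL_PROVIDERS");
  return undefined;
}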
9. Integration with Search
Scraping is the second step after search:
Plain text
Search   →  Get URLs            →  Scrape Top 3       →  Process Content
   ↓            ↓                       ↓                      ↓
SerpAPI     Results                Parallel Scraping      Extract Learnings
                ↓
            [url1, url2, url3]
TypeScript
const searchResults = await webSearch (query);
const detailedResults = await getDetailedResults (
searchResults.slice (0 , 3 ),
true ,
);
for (const { result, content } of detailedResults) {
if (content) {
const learnings = await extractLearnings (content, query);
allLearnings.push (...learnings);
}
}
10. Performance Optimization
Scraping Best Practices
Parallel Execution
TypeScript
const top3 = searchResults.slice(0, 3);
// Each search result carries its URL; scrap is the function defined in section 1.
const results = await Promise.all(top3.map((result) => scrap(result.url)));
Timeout Management
TypeScript
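const content = await Promise.race([
  scrap(url),
  // Reject with an Error so the failure carries a message and a stack trace.
  new Promise<never>((_, reject) =>
    setTimeout(() => reject(new Error("Timeout")), 8000),
  ),
]);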
Content Validation
TypeScript
if (content.length < 100 ) return undefined ;
Smart Retries
TypeScript
if (error.response ?.status >= 400 && error.response ?.status < 500 ) {
throw error;
}
Summary
The scraping architecture:
Multi-Tier Fallback : ScrapingBee → Firecrawl → Direct
Progressive Enhancement : Standard → JS → Premium proxy
Parallel Execution : All URLs scraped simultaneously
Smart Retries : Context-aware; 4xx errors abort immediately, other failures are retried up to the attempt limit
Special Cases : Twitter CDN API for X.com
Content Validation : Minimum length checks
Error Resilience : Graceful degradation to undefined
Key Principle: Never let a scraping failure crash the research. Always have a fallback, always return gracefully.