Shared from "Study" on Inkdown

Multi-Provider LLM Architecture

Overview

The LLM provider system routes requests to multiple AI models through a unified abstraction layer called Rune. It handles model selection, token cost calculation, caching, and provider-specific transformations.


Architecture

Plain text
Request
    ↓
┌─────────────────────────────────────────────────────────────┐
│                      Rune API Layer                          │
│  ┌─────────────────────────────────────────────────────────┐│
│  │ Provider: OpenAI, Anthropic, Google, Mistral, etc.    ││
│  │ Transform: Convert to provider-specific format          ││
│  │ Route: Based on model parameter                        ││
│  └─────────────────────────────────────────────────────────┘│
│                           ↓                                 │
│  ┌─────────────────────────────────────────────────────────┐│
│  │                Token Cost Calculation                  ││
│  │ Base cost × Multipliers:                               ││
│  │ • Web search: ×2                                       ││
│  │ • Data analysis: +15                                   ││
│  │ • RAG: +250 (if Project + ProFinder)                  ││
│  │ • Large context: ×5 (disabled)                        ││
│  └─────────────────────────────────────────────────────────┘│
│                           ↓                                 │
│  ┌─────────────────────────────────────────────────────────┐│
│  │                  Caching Layer                         ││
│  │ Cache last message (prompt caching)                    ││
│  │ Track cached tokens for billing                        ││
│  └─────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────┘
    ↓
LLM Response Stream

Core Provider: chat()

File: src/server/repositories/provider/provider.ts:24

TypeScript
const chat = async ({
	messages,
	model,
	metaData = {},
	params = {},
}: {
	messages: TPromptMessage[];
	apiKey?: string;
	model: TLLMModels;
	metaData?: Record<string, unknown>;
	params?: {
		tools?: OpenAI.Chat.Completions.ChatCompletionTool[];
	} & Record<string, unknown>;
}) => {
	const { chatStateManager, user }: TUnifiedPostwareRequestContext =
		requestContext.get();

	// Get model configuration
	const modelConfig = getLLMModel(model);

	// Calculate tool token count
	const toolCalls = params.tools;
	const toolCallTokens = await Promise.all(
		toolCalls?.map((toolCall) => tokenizer.encode(JSON.stringify(toolCall))) ??
			[],
	);
	const toolCallTokenCount = toolCallTokens.reduce(
		(acc, token) => acc + (token?.length ?? 0),
		0,
	);

	// Calculate total input size
	const inputSize =
		messages.reduce((acc, message) => acc + (message.tokens ?? 0), 0) +
		((metaData?.systemPromptTokens as number) ?? 0) +
		toolCallTokenCount;

	// Serialize messages for provider
	// Cache last message only (for prompt caching)
	let serializedMessages = [];
	for (let i = 0; i < messages.length; i++) {
		if (i === messages.length - 1) {
			messages[i].cache = true;
			serializedMessages.push(messages[i].toJSON());
			messages[i].cache = false;
		} else {
			serializedMessages.push(messages[i].toJSON());
		}
	}
	serializedMessages = serializedMessages.flat();

	// Build provider configuration
	const rawPayload: TProviderConfig = {
		config: { messages: serializedMessages },
		mode: "CHAT",
		params: {
			model: modelConfig.id,
			...params,
		},
		metaData: {
			systemPrompt: modelConfig.prompts.system,
			apiKey: undefined,
			...metaData,
		},
	};

	// Apply provider-specific overrides
	const payload = getFinalProviderConfig({ config: rawPayload });

	return {
		payload,

		// STREAMING REQUEST
		stream: async () => {
			try {
				const config: AxiosRequestConfig = {
					method: "POST",
					maxBodyLength: Infinity,
					url: RUNE.DOMAIN,
					headers: {
						"Content-Type": "application/json",
						"x-rune-api-key": RUNE.AUTH_TOKEN,
						"x-rune-config": modelConfig.runeConfig,
					},
					data: transformPayload(payload, true), // true = streaming
					responseType: "stream",
				};

				const response = await axiosInstance.request(config);

				return {
					data: response.data, // Readable stream

					// USAGE CALCULATOR (called after stream ends)
					usage: async (
						chunks: string[],
						options: { usage?: CompletionUsage; tools: TToolName[] },
					) => {
						const { usage, tools } = options;
						if (!chunks) chunks = [""];

						// Start with base cost
						let effectiveCost = modelConfig.queryCost;

						// Apply multipliers based on tools used
						if (tools.includes("web_search_tool")) {
							const count = tools.filter((t) => t === "web_search_tool").length;
							effectiveCost *= count + 1; // base × (number of searches + 1)
						}

						if (
							tools.includes("rag_tool") &&
							chatStateManager.hasUsedMode("PROJECT") &&
							chatStateManager.hasUsedMode("PRO_FINDER")
						) {
							effectiveCost += 250; // High-cost RAG operation
						}

						if (tools.includes("data_analysis_tool")) {
							const count = tools.filter(
								(t) => t === "data_analysis_tool",
							).length;
							effectiveCost += 15 * count; // Per analysis
						}

						// Calculate output size
						const outputSize =
							usage?.completion_tokens ??
							(
								await tokenizer.encode(
									`<|start|>assistant\n${chunks.join("")}<|end|>\n`,
								)
							).length;

						// Build usage config
						return {
							queries: effectiveCost,
							tokens: {
								input: usage?.prompt_tokens ?? inputSize,
								output: outputSize,
								cached: usage?.prompt_tokens_details?.cached_tokens ?? 0,
								reasoning:
									usage?.completion_tokens_details?.reasoning_tokens ?? 0,
							},
							model,
						} as TUsageConfig;
					},
				};
			} catch (error) {
				// Log a normalized form of Axios failures, then rethrow the original error
				if (error instanceof AxiosError) {
					const extractedError = await handleAxiosError(error);
					logger.error(extractedError, "ERROR/RUNE/CHAT_FAILED");
				}
				throw error;
			}
		},

		// NON-STREAMING REQUEST
		normal: async () => {
			const config = {
				method: "POST",
				maxBodyLength: Infinity,
				url: RUNE.DOMAIN,
				headers: {
					"Content-Type": "application/json",
					"x-rune-api-key": RUNE.AUTH_TOKEN,
					"x-rune-config": modelConfig.runeConfig,
				},
				data: transformPayload(payload, false), // false = normal
			};

			const response =
				await axiosInstance.request<OpenAI.ChatCompletion>(config);

			return {
				data: {
					data: {
						content: response.data.choices[0].message.content ?? "",
						raw: response.data,
					},
				},
				usage: async (calcEffectiveCost: boolean = false) => {
					const tokensUsed = messages.reduce(
						(acc, msg) => (msg.tokens ?? 0) + acc,
						0,
					);
					let effectiveCost = modelConfig.queryCost;

					if (calcEffectiveCost) {
						if (chatStateManager.hasUsedMode("WEB_SEARCH")) {
							effectiveCost *= 2;
						}
						if (
							chatStateManager.hasUsedMode("PROJECT") &&
							chatStateManager.hasUsedMode("PRO_FINDER")
						) {
							effectiveCost += 250;
						}
						if (chatStateManager.hasUsedMode("DATA_ANALYSIS")) {
							effectiveCost += 15;
						}
						if (chatStateManager.hasUsedMode("MCP")) {
							effectiveCost += 30;
						}
					}

					return {
						queries: effectiveCost,
						tokens: {
							input: response.data.usage?.prompt_tokens ?? tokensUsed,
							output:
								response.data.usage?.completion_tokens ??
								(
									await tokenizer.encode(
										`<|start|>assistant\n${response.data.choices[0].message.content ?? ""}<|end|>\n`,
									)
								).length,
							cached:
								response.data.usage?.prompt_tokens_details?.cached_tokens ?? 0,
							reasoning:
								response.data.usage?.completion_tokens_details
									?.reasoning_tokens ?? 0,
						},
						model,
					} as TUsageConfig;
				},
			};
		},

		// FUNCTION CALLING (for structured output)
		fnCall: async ({
			name,
			functions,
		}: {
			name: string;
			functions: FunctionDef[];
		}) => {
			const newPayload = {
				...payload,
				params: {
					...payload.params,
					// Force smaller model for function calls. Note that the
					// x-rune-config header below still uses the requested model's runeConfig.
					model: "gpt-4o-mini",
					functions,
					function_call: { name },
				},
			};

			const config = {
				method: "POST",
				maxBodyLength: Infinity,
				url: RUNE.DOMAIN,
				headers: {
					"Content-Type": "application/json",
					"x-rune-api-key": RUNE.AUTH_TOKEN,
					"x-rune-config": modelConfig.runeConfig,
				},
				data: transformPayload(newPayload, false),
			};

			const response =
				await axiosInstance.request<OpenAI.ChatCompletion>(config);

			return {
				data: {
					data: {
						content: {
							arguments:
								response.data.choices[0].message.function_call?.arguments ?? "",
						},
					},
				},
			};
		},
	};
};

Rune Configuration

File: src/server/constantsSchemasAndTypes/endpointConstants.ts

TypeScript
export const RUNE = {
	DOMAIN: "https://api.runellm.com/v1",
	AUTH_TOKEN: process.env.RUNE_AUTH_TOKEN,
};

Rune acts as a unified gateway to multiple LLM providers.
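
The three request paths in chat() (stream, normal, fnCall) each build the same request config against this gateway. As a minimal sketch of how that shared shape could be factored out, assuming a hypothetical helper named buildRuneRequest and a placeholder token instead of the real environment variable:

```typescript
// Sketch only: RUNE_SKETCH stands in for the RUNE constants above, and the
// token is a placeholder rather than a value read from process.env.
const RUNE_SKETCH = {
	DOMAIN: "https://api.runellm.com/v1",
	AUTH_TOKEN: "placeholder-token",
};

type RuneRequestSketch = {
	method: "POST";
	maxBodyLength: number;
	url: string;
	headers: Record<string, string>;
	data: unknown;
	responseType?: "stream";
};

// Hypothetical helper: builds the config shared by stream(), normal(), and fnCall().
const buildRuneRequest = (
	runeConfig: string,
	data: unknown,
	streaming: boolean,
): RuneRequestSketch => ({
	method: "POST",
	maxBodyLength: Infinity,
	url: RUNE_SKETCH.DOMAIN,
	headers: {
		"Content-Type": "application/json",
		"x-rune-api-key": RUNE_SKETCH.AUTH_TOKEN,
		"x-rune-config": runeConfig,
	},
	data,
	// Only streaming requests ask the HTTP client for a readable stream.
	...(streaming ? { responseType: "stream" as const } : {}),
});

const streamingRequest = buildRuneRequest("openai/gpt-4o", { stream: true }, true);
const normalRequest = buildRuneRequest("openai/gpt-4o", { stream: false }, false);
console.log(streamingRequest.responseType); // "stream"
console.log(normalRequest.responseType); // undefined
```

The only difference between the two configs is the responseType field, which matches how the stream and normal paths differ in the source.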


Model Configuration

Models are defined with costs, prompts, and capabilities:

TypeScript
const getLLMModel = (modelId: TLLMModels): TModelConfig => {
	const models: Record<TLLMModels, TModelConfig> = {
		"gpt-4o": {
			id: "gpt-4o",
			queryCost: 100, // Base cost in query units
			runeConfig: "openai/gpt-4o",
			prompts: {
				system: "You are a helpful AI assistant...",
			},
			getMaxOutputTokens: (override: boolean) => (override ? 16384 : 4096),
		},
		"gpt-4o-mini": {
			id: "gpt-4o-mini",
			queryCost: 15,
			runeConfig: "openai/gpt-4o-mini",
			prompts: { system: "..." },
			getMaxOutputTokens: () => 16384,
		},
		"claude-3-5-sonnet": {
			id: "claude-3-5-sonnet-20241022",
			queryCost: 120,
			runeConfig: "anthropic/claude-3.5-sonnet",
			prompts: { system: "..." },
			getMaxOutputTokens: () => 8192,
		},
		// ... more models
	};

	return models[modelId];
};

Cost Multipliers Applied:

| Operation | Base Cost | Multiplier | Final Cost |
|---|---|---|---|
| Standard chat | 100 | 1× | 100 |
| With web search | 100 | 2× | 200 |
| With data analysis | 100 | 1× + 15 | 115 |
| RAG (Project + ProFinder) | 100 | 1× + 250 | 350 |
| Multiple searches (3×) | 100 | (3+1)× | 400 |
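
The figures above can be reproduced with a small pure function. This is a sketch that mirrors the order of operations in the streaming usage() calculator (web search multiplier first, then the RAG surcharge, then data analysis). The names effectiveQueryCost and ragEligible are hypothetical; ragEligible stands in for the two chatStateManager.hasUsedMode checks.

```typescript
type ToolNameSketch = "web_search_tool" | "rag_tool" | "data_analysis_tool";

const countOf = (tools: ToolNameSketch[], name: ToolNameSketch): number =>
	tools.filter((t) => t === name).length;

// Hypothetical helper mirroring the streaming cost logic.
const effectiveQueryCost = (
	baseCost: number,
	tools: ToolNameSketch[],
	ragEligible: boolean, // assumption: PROJECT and PRO_FINDER modes were both used
): number => {
	let cost = baseCost;

	// Web search scales the cost by (number of searches + 1).
	const searches = countOf(tools, "web_search_tool");
	if (searches > 0) cost *= searches + 1;

	// RAG adds a flat surcharge, but only when both modes were used.
	if (tools.includes("rag_tool") && ragEligible) cost += 250;

	// Each data analysis call adds a flat 15.
	cost += 15 * countOf(tools, "data_analysis_tool");

	return cost;
};

console.log(effectiveQueryCost(100, [], false)); // 100
console.log(effectiveQueryCost(100, ["web_search_tool"], false)); // 200
console.log(effectiveQueryCost(100, ["data_analysis_tool"], false)); // 115
console.log(effectiveQueryCost(100, ["rag_tool"], true)); // 350
console.log(
	effectiveQueryCost(100, ["web_search_tool", "web_search_tool", "web_search_tool"], false),
); // 400
```

Because the multiplier is applied before the flat surcharges, one web search combined with one data analysis call costs 100 × 2 + 15 = 215, not (100 + 15) × 2.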

Payload Transformation

File: src/server/repositories/provider/rune.ts

TypeScript
export const transformPayload = (
	payload: TProviderConfig,
	isStreaming: boolean,
): object => {
	const basePayload = {
		model: payload.params.model,
		messages: payload.config.messages,
		stream: isStreaming,
		max_tokens: payload.params.max_tokens,
		...(payload.params.tools && {
			tools: payload.params.tools,
			tool_choice: payload.params.tool_choice,
		}),
	};

	// Provider-specific transformations
	switch (getProviderFromModel(payload.params.model)) {
		case "anthropic":
			return {
				...basePayload,
				anthropic_version: "2023-06-01",
			};
		case "google":
			return {
				contents: convertMessagesToGoogleFormat(payload.config.messages),
				generationConfig: { maxOutputTokens: payload.params.max_tokens },
			};
		default:
			return basePayload; // OpenAI format (default)
	}
};

Why Transform:

  • Each provider has different API formats
  • Rune normalizes, but we still need provider-specific tweaks
  • Tools, function calling, system prompts vary
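
transformPayload relies on getProviderFromModel, which the source does not show. A minimal sketch of what such a lookup might look like, assuming (hypothetically) that the provider can be inferred from a model-id prefix:

```typescript
type ProviderSketch = "openai" | "anthropic" | "google";

// Assumed model-id prefixes; the real lookup in rune.ts is not shown in the source.
const PROVIDER_PREFIXES: ReadonlyArray<readonly [string, ProviderSketch]> = [
	["claude", "anthropic"],
	["gemini", "google"],
];

// Hypothetical stand-in for getProviderFromModel.
const getProviderFromModelSketch = (modelId: string): ProviderSketch => {
	const match = PROVIDER_PREFIXES.find(([prefix]) => modelId.startsWith(prefix));
	// Anything unrecognized falls back to the OpenAI-compatible format,
	// mirroring the default branch of transformPayload.
	return match ? match[1] : "openai";
};

console.log(getProviderFromModelSketch("claude-3-5-sonnet-20241022")); // "anthropic"
console.log(getProviderFromModelSketch("gpt-4o-mini")); // "openai"
```

A prefix table keeps the routing rule in one place; the real implementation may instead read the provider from the model configuration (for example the "anthropic/..." part of runeConfig).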

Usage Tracking

Token Counting:

TypeScript
// Input tokens
const inputSize =
	messages.reduce((acc, msg) => acc + (msg.tokens ?? 0), 0) +
	systemPromptTokens +
	toolCallTokenCount;

// Output tokens (if not provided by API)
// Parenthesize the await so .length is read from the encoded tokens, not the promise
const outputSize = (
	await tokenizer.encode(`<|start|>assistant\n${content}<|end|>\n`)
).length;

// Cached tokens (for prompt caching)
const cached = usage?.prompt_tokens_details?.cached_tokens ?? 0;

// Reasoning tokens (reported by reasoning/thinking models, 0 otherwise)
const reasoning = usage?.completion_tokens_details?.reasoning_tokens ?? 0;

Usage Config:

TypeScript
type TUsageConfig = {
	queries: number; // Cost in query units
	tokens: {
		input: number; // Input/prompt tokens
		output: number; // Completion tokens
		cached: number; // Cached prompt tokens
		reasoning: number; // Reasoning/thinking tokens
	};
	model: TLLMModels; // Which model was used
};
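
A request that loops through several tool calls produces one usage record per LLM call. As an illustration of how records of this shape could be combined, here is a sketch with a hypothetical sumUsage helper. Whether the real orchestrator aggregates usage this way is not stated in the source, and the model field is widened to string to keep the example self-contained.

```typescript
type UsageSketch = {
	queries: number;
	tokens: { input: number; output: number; cached: number; reasoning: number };
	model: string; // TLLMModels in the source; widened for this sketch
};

// Hypothetical helper: adds up query units and token counts field by field.
const sumUsage = (entries: UsageSketch[], model: string): UsageSketch =>
	entries.reduce<UsageSketch>(
		(acc, u) => ({
			queries: acc.queries + u.queries,
			tokens: {
				input: acc.tokens.input + u.tokens.input,
				output: acc.tokens.output + u.tokens.output,
				cached: acc.tokens.cached + u.tokens.cached,
				reasoning: acc.tokens.reasoning + u.tokens.reasoning,
			},
			model,
		}),
		{ queries: 0, tokens: { input: 0, output: 0, cached: 0, reasoning: 0 }, model },
	);

const totalUsage = sumUsage(
	[
		{ queries: 100, tokens: { input: 1200, output: 300, cached: 800, reasoning: 0 }, model: "gpt-4o" },
		{ queries: 100, tokens: { input: 400, output: 50, cached: 0, reasoning: 0 }, model: "gpt-4o" },
	],
	"gpt-4o",
);
console.log(totalUsage.queries); // 200
console.log(totalUsage.tokens.input); // 1600
```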

Query Function (Simplified)

File: src/server/repositories/provider/provider.ts:322

For simple queries without conversation context:

TypeScript
const query = async ({
	query,
	context,
	model = "gpt-4o-mini",
	metaData,
	params,
}: {
	query: string;
	context: string;
	model?: TLLMModels;
	metaData?: Record<string, unknown>;
	params?: Record<string, unknown>;
}) => {
	const modelConfig = getLLMModel(model);

	const payload: TProviderConfig = {
		config: {
			context,
			messages: UserMessage({
				content: [
					{
						type: "TEXT",
						text: query,
						tokens: (await tokenizer.encode(query)).length,
					},
				],
			}).toJSON(),
		},
		mode: "CHAT",
		params: { model: modelConfig.id, ...params },
		metaData: { systemPrompt: modelConfig.prompts.system, ...metaData },
	};

	return {
		stream: async () => {
			/* ... */
		},
		normal: async () => {
			/* ... */
		},
	};
};

Use Case:

  • One-off queries
  • No conversation history
  • Faster initialization
  • Used by background tasks

Error Handling

File: src/server/repositories/provider/functions/helpers.ts

TypeScript
export const handleAxiosError = async (error: AxiosError): Promise<Error> => {
	const response = error.response;

	if (!response) {
		return new Error("NETWORK_ERROR");
	}

	const errorData = response.data as {
		error?: { message?: string; code?: string };
	};

	switch (response.status) {
		case 400:
			if (errorData.error?.code === "content_filter") {
				return new ClientError(400, ErrorType.CONTENT_FILTERED);
			}
			return new ClientError(400, ErrorType.BAD_REQUEST);

		case 401:
			return new ClientError(401, ErrorType.UNAUTHORIZED);

		case 429:
			return new ClientError(429, ErrorType.RATE_LIMITED);

		case 500:
		case 502:
		case 503:
			return new ServerError(response.status, ErrorType.LLM_PROVIDER_ERROR);

		default:
			return new Error(`UNKNOWN_ERROR: ${response.status}`);
	}
};

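Error Types:

  • content_filter: Input/output blocked by safety filters (400, mapped to CONTENT_FILTERED)
  • rate_limited: Too many requests to provider (429, mapped to RATE_LIMITED)
  • context_length_exceeded: Input too long for model (no dedicated case in the handler above; a 400 carrying this code is returned as BAD_REQUEST)
  • server_error: Provider downtime (500, 502, 503, mapped to LLM_PROVIDER_ERROR)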
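
To make the mapping easier to check, the switch in handleAxiosError can be restated as a pure function over the status and error code. This is a sketch with hypothetical names (classifyProviderError, ErrorLabelSketch); it returns string labels instead of the ClientError and ServerError instances used in the source.

```typescript
type ErrorLabelSketch =
	| "NETWORK_ERROR"
	| "CONTENT_FILTERED"
	| "BAD_REQUEST"
	| "UNAUTHORIZED"
	| "RATE_LIMITED"
	| "LLM_PROVIDER_ERROR"
	| "UNKNOWN_ERROR";

// Hypothetical restatement of the status/code mapping in handleAxiosError.
const classifyProviderError = (status?: number, code?: string): ErrorLabelSketch => {
	// No response at all means the request never reached the gateway.
	if (status === undefined) return "NETWORK_ERROR";

	switch (status) {
		case 400:
			return code === "content_filter" ? "CONTENT_FILTERED" : "BAD_REQUEST";
		case 401:
			return "UNAUTHORIZED";
		case 429:
			return "RATE_LIMITED";
		case 500:
		case 502:
		case 503:
			return "LLM_PROVIDER_ERROR";
		default:
			return "UNKNOWN_ERROR";
	}
};

console.log(classifyProviderError(400, "content_filter")); // "CONTENT_FILTERED"
console.log(classifyProviderError(400, "context_length_exceeded")); // "BAD_REQUEST"
console.log(classifyProviderError(504)); // "UNKNOWN_ERROR"
```

Two gaps are visible in this form: a context-length error arriving as a 400 is indistinguishable from any other bad request, and a 504 falls through to the unknown case rather than being treated as a provider error.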

Caching Strategy

Prompt Caching:

TypeScript
// Mark only the last message for caching
for (let i = 0; i < messages.length; i++) {
	if (i === messages.length - 1) {
		messages[i].cache = true; // Mark for caching
		serializedMessages.push(messages[i].toJSON());
		messages[i].cache = false; // Reset
	} else {
		serializedMessages.push(messages[i].toJSON());
	}
}

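Why Only Last Message:

  • System prompt and earlier history rarely change between turns
  • Last message (user query) is the newest content, so marking it places the cache boundary at the end of the conversation so far
  • Anthropic (Claude) supports prompt caching, where a cache marker covers the prompt prefix up to and including the marked block
  • Reduces costs for long conversations, since the next request can reuse that prefix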
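
The loop above toggles the cache flag on the live message object and resets it afterwards. A non-mutating variant is sketched below, assuming a simplified message shape (SketchMessage) in place of the real TPromptMessage class, whose toJSON() output is not shown in the source.

```typescript
// Hypothetical, simplified message shape for illustration only.
type SketchMessage = {
	role: "system" | "user" | "assistant";
	content: string;
	cache?: boolean;
};

// Returns serialized copies in which only the final message is marked for caching.
const markLastForCaching = (messages: SketchMessage[]): SketchMessage[] =>
	messages.map((message, index) => ({
		...message,
		cache: index === messages.length - 1,
	}));

const history: SketchMessage[] = [
	{ role: "system", content: "You are a helpful AI assistant..." },
	{ role: "user", content: "What is sharding?" },
	{ role: "assistant", content: "Sharding splits data across nodes." },
	{ role: "user", content: "How does it differ from partitioning?" },
];

const serialized = markLastForCaching(history);
console.log(serialized.map((m) => m.cache).join(",")); // false,false,false,true
console.log(history.some((m) => m.cache !== undefined)); // false
```

Copying instead of toggling avoids leaving a message flagged if serialization throws between the set and the reset.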

Integration with Orchestrator

TypeScript
// In toolOrchestrator.ts
const llmInput = {
    model: modelId,
    messages: [...currentMessages, ...state.inLoopTrimmedMessages],
    params: {
        tools: openAiTools,
        tool_choice: state.toolChoice,
        parallel_tool_calls: shouldUseParallelToolCalls(...),
        max_tokens: modelConfig.getMaxOutputTokens?.(false),
    },
};

const chatRequest = await provider.chat(llmInput);
const chatResponse = await chatRequest.stream();

// Stream through streamer
const streamOutput = await streamer.streamV2(
    config.request,
    config.response,
    chatResponse.data,
    { contentIndex: state.currentContentIndex }
);

// Calculate usage after stream ends
const streamUsage = await chatResponse.usage(streamOutput.content, {
    usage: streamOutput.usage,
    tools: successfulToolNames,
});

Summary

The multi-provider LLM architecture:

  1. Rune Gateway: Unified API for multiple providers
  2. Cost Calculation: Base cost adjusted per tool used (web search multiplies; RAG and data analysis add flat amounts)
  3. Model Selection: Config per model (cost, limits, prompts)
  4. Dual Modes: Streaming (real-time) and Normal (blocking)
  5. Function Calling: Structured output via gpt-4o-mini
  6. Prompt Caching: Only last message cached
  7. Token Tracking: Input, output, cached, reasoning
  8. Error Mapping: Provider errors → Client/Server errors

Key Principle: One unified interface, multiple providers, accurate cost tracking.