Error Handling, Recovery & Resilience Architecture

Overview

Claude Code is a mission-critical tool for developers: it must not crash, lose work, or leave the system in a bad state. To that end, the error handling system is organized in four layers, from prevention through automatic recovery, and favors graceful degradation over hard failure.

Plain text
┌─────────────────────────────────────────────────────────────────────────────┐
│                    ERROR HANDLING LAYERS                                    │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  LAYER 4: RECOVERY                                                           │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  Automatic Recovery                                                  │   │
│  │  - Retry with backoff          - Degraded mode                      │   │
│  │  - Snapshot restore            - Safe state reset                   │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                              │                                              │
│  LAYER 3: ERROR BOUNDARIES                                                   │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  Component Isolation                                                  │   │
│  │  - Tool error doesn't crash UI   - UI error doesn't kill session   │   │
│  │  - Partial rendering on failure  - Graceful component death        │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                              │                                              │
│  LAYER 2: OPERATIONAL ERRORS                                                 │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  Expected Failures                                                   │   │
│  │  - API rate limits     - File not found     - Network timeout      │   │
│  │  - Permission denied   - Invalid input      - Git conflict         │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                              │                                              │
│  LAYER 1: PREVENTION                                                         │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  Defensive Programming                                               │   │
│  │  - Type safety (TypeScript)    - Schema validation (Zod)            │   │
│  │  - Null checks                 - Input sanitization                │   │
│  │  - Bounds checking             - Resource limits                   │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Core Files

| File | Purpose |
| --- | --- |
| components/ErrorBoundary.tsx | React error boundaries |
| utils/errors.ts | Error types and utilities |
| utils/gracefulShutdown.ts | Clean exit handling |
| utils/cleanupRegistry.ts | Resource cleanup |
| services/api/withRetry.ts | Retry logic |
| services/api/errors.ts | API error categorization |
| utils/warningHandler.ts | Warning management |
| utils/debug.ts | Debug utilities |

Error Classification

TypeScript
// utils/errors.ts
export type ErrorCategory =
  | 'user'           // User input error (can fix)
  | 'network'        // Connectivity issue (retry)
  | 'api'            // External API error (retry/degrade)
  | 'file_system'    // File operation (handle gracefully)
  | 'permission'     // Access denied (inform user)
  | 'resource'       // Out of memory/disk (cleanup)
  | 'internal'       // Bug in code (report)

export type ClassifiedError = {
  category: ErrorCategory
  retryable: boolean
  userMessage: string
  shouldReport: boolean
}

export function classifyError(error: unknown): ClassifiedError {
  if (error instanceof APIError) {
    if (error.status === 429) {
      return {
        category: 'api',
        retryable: true,
        userMessage: 'Rate limited. Retrying...',
        shouldReport: false,
      }
    }
    if (error.status >= 500) {
      return {
        category: 'api',
        retryable: true,
        userMessage: 'Service temporarily unavailable.',
        shouldReport: true,
      }
    }
  }

  if (isENOENT(error)) {
    return {
      category: 'file_system',
      retryable: false,
      userMessage: `File not found: ${error.path}`,
      shouldReport: false,
    }
  }

  if (error instanceof PermissionDeniedError) {
    return {
      category: 'permission',
      retryable: false,
      userMessage: 'Permission denied. Check file permissions.',
      shouldReport: false,
    }
  }

  if (error instanceof NetworkError) {
    return {
      category: 'network',
      retryable: true,
      userMessage: 'Network error. Check connection.',
      shouldReport: false,
    }
  }

  // Default: internal error
  return {
    category: 'internal',
    retryable: false,
    userMessage: 'Something went wrong. Please try again.',
    shouldReport: true,
  }
}
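
The classifier above relies on `APIError`, `NetworkError`, and `isENOENT`, none of which are defined in this section. A minimal sketch of how such helpers might look, assuming Node-style file system errors that carry `code` and `path`. The shapes below are illustrative assumptions, not the actual definitions:

```typescript
// Hypothetical shapes: the real definitions are not shown in this document.
class APIError extends Error {
  readonly status: number

  constructor(message: string, status: number) {
    super(message)
    this.name = 'APIError'
    this.status = status
  }
}

class NetworkError extends Error {
  constructor(message: string) {
    super(message)
    this.name = 'NetworkError'
  }
}

// Node-style file system errors expose `code` (e.g. 'ENOENT') and often `path`.
interface ErrnoLike extends Error {
  code?: string
  path?: string
}

// Type guard: narrows `unknown` so that `error.path` is safe to read afterwards.
function isENOENT(error: unknown): error is ErrnoLike {
  return error instanceof Error && (error as ErrnoLike).code === 'ENOENT'
}
```

Writing `isENOENT` as a type guard is what lets `classifyError` read `error.path` even though its parameter is typed `unknown`.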

Error Boundaries

React Error Boundaries
TypeScript
// components/ErrorBoundary.tsx
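import React, { Component, ReactNode } from 'react'
import { Box, Text } from 'ink'
// Button and logError are assumed to be project-local; their imports are omitted here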

interface Props {
  children: ReactNode
  fallback?: ReactNode
  onError?: (error: Error, errorInfo: React.ErrorInfo) => void
}

interface State {
  hasError: boolean
  error?: Error
}

export class ErrorBoundary extends Component<Props, State> {
  constructor(props: Props) {
    super(props)
    this.state = { hasError: false }
  }

  static getDerivedStateFromError(error: Error): State {
    return { hasError: true, error }
  }

  componentDidCatch(error: Error, errorInfo: React.ErrorInfo) {
    // Log to error tracking
    logError('React Error Boundary caught error', {
      error: error.message,
      stack: error.stack,
      componentStack: errorInfo.componentStack,
    })

    this.props.onError?.(error, errorInfo)
  }

  render() {
    if (this.state.hasError) {
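      // Use an explicit undefined check so that fallback={null} renders nothing
      return this.props.fallback !== undefined ? this.props.fallback : (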
        <Box flexDirection="column" padding={1}>
          <Text color="red" bold>Something went wrong</Text>
          <Text dimColor>{this.state.error?.message}</Text>
          <Button onPress={() => this.setState({ hasError: false, error: undefined })}>
            Try Again
          </Button>
        </Box>
      )
    }

    return this.props.children
  }
}

Nested Boundaries
TypeScript
// App.tsx - Multiple layers of protection
<ErrorBoundary fallback={<AppCrashFallback />}>
  <ErrorBoundary
    fallback={<SidebarError />}
    onError={logSidebarError}
  >
    <Sidebar />
  </ErrorBoundary>

  <ErrorBoundary
    fallback={<MainPanelError />}
    onError={logMainError}
  >
    <MainPanel />
  </ErrorBoundary>

  <ErrorBoundary fallback={null}>
    {/* Non-critical: notifications */}
    <NotificationArea />
  </ErrorBoundary>
</ErrorBoundary>
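
The layer diagram also lists "Tool error doesn't crash UI". React error boundaries only catch errors thrown while rendering, not errors in event handlers or asynchronous code, so non-UI work needs its own isolation. One way to get a similar effect is to convert thrown errors into values at the call site. A minimal sketch; `runIsolated` and `ToolResult` are hypothetical names, not part of the codebase described here:

```typescript
// Hypothetical result type: a failed tool call becomes data instead of an exception.
type ToolResult<T> =
  | { ok: true; value: T }
  | { ok: false; error: Error }

// Runs one tool call and guarantees that the returned promise never rejects.
async function runIsolated<T>(tool: () => Promise<T> | T): Promise<ToolResult<T>> {
  try {
    return { ok: true, value: await tool() }
  } catch (thrown) {
    // Normalize non-Error throws (strings, plain objects) into Error instances.
    const error = thrown instanceof Error ? thrown : new Error(String(thrown))
    return { ok: false, error }
  }
}
```

A caller can then branch on `result.ok` and render the failure inline, leaving the rest of the session untouched.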

Retry Logic

Exponential Backoff
TypeScript
// services/api/withRetry.ts
export async function withRetry<T>(
  operation: () => Promise<T>,
  options: RetryOptions = {}
): Promise<T> {
  const {
    maxAttempts = 3,
    baseDelay = 1000,
    maxDelay = 30000,
    shouldRetry,
    onRetry,
  } = options

  let lastError: Error | undefined

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation()
    } catch (error) {
      lastError = toError(error)

      const classified = classifyError(lastError)

      // Check if we should retry
      if (attempt >= maxAttempts) {
        throw lastError  // Exhausted retries
      }

      if (shouldRetry && !shouldRetry(lastError)) {
        throw lastError  // Custom check says no
      }

      if (!classified.retryable) {
        throw lastError  // Not retryable
      }

      // Calculate delay with exponential backoff + jitter
      const delay = Math.min(
        baseDelay * Math.pow(2, attempt - 1) + Math.random() * 1000,
        maxDelay
      )

      onRetry?.(attempt, delay, lastError)

      await sleep(delay)
    }
  }

  throw lastError!
}
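
`withRetry` depends on `RetryOptions`, `toError`, and `sleep`, which are not shown. A plausible sketch, inferred only from how the function uses them; the actual definitions may differ:

```typescript
// Inferred from the destructuring in withRetry; every field is optional.
interface RetryOptions {
  maxAttempts?: number   // total attempts, including the first call
  baseDelay?: number     // delay before the first retry, in ms
  maxDelay?: number      // upper bound for any single delay, in ms
  shouldRetry?: (error: Error) => boolean
  onRetry?: (attempt: number, delay: number, error: Error) => void
}

// Normalizes anything that can be thrown into an Error instance.
function toError(value: unknown): Error {
  return value instanceof Error ? value : new Error(String(value))
}

// Resolves after `ms` milliseconds.
function sleep(ms: number): Promise<void> {
  return new Promise(resolve => setTimeout(resolve, ms))
}
```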

API Call with Retry
TypeScript
// services/api/claude.ts
export async function callClaudeAPI(
  request: APIRequest
): Promise<APIResponse> {
  return withRetry(
    () => makeAPICall(request),
    {
      maxAttempts: 5,
      baseDelay: 1000,
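      shouldRetry: (error) => {
        // Error has no `status` field, so read it defensively
        const status = (error as { status?: number }).status

        // No HTTP status (e.g. a network failure): defer to classifyError
        if (status === undefined) return true

        // Don't retry auth errors
        if (status === 401) return false

        // Do retry rate limits and transient server errors
        return [429, 500, 502, 503].includes(status)
      },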
      onRetry: (attempt, delay, error) => {
        showNotification({
          type: 'warning',
          title: `Retrying (${attempt}/5)`,
          message: `Error: ${error.message}. Waiting ${Math.round(delay)}ms...`,
          timeout: delay,
        })
      },
    }
  )
}
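
With `maxAttempts: 5` and `baseDelay: 1000`, `withRetry` waits up to four times before giving up. The sketch below reproduces its delay formula with the random jitter passed in as a parameter, so the schedule is deterministic. `backoffDelay` is an illustrative helper, not a function from the source:

```typescript
function backoffDelay(
  attempt: number,     // 1-based number of the attempt that just failed
  baseDelay = 1000,
  maxDelay = 30000,
  jitter = 0,          // withRetry uses Math.random() * 1000 here
): number {
  return Math.min(baseDelay * Math.pow(2, attempt - 1) + jitter, maxDelay)
}

// Delays after failed attempts 1..4 (the fifth failure is rethrown, not delayed).
const schedule = [1, 2, 3, 4].map(attempt => backoffDelay(attempt))
// schedule is [1000, 2000, 4000, 8000]; jitter adds up to 1000 ms to each entry.
```

Summed, the waits before the fifth attempt come to 15 seconds plus up to 4 seconds of jitter. The 30000 ms cap only takes effect from the sixth attempt onward.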

Graceful Degradation

Degraded Mode Activation
TypeScript
// state/AppStateStore.ts
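// A flag set to false marks the feature as disabled while degraded;
// flags that are omitted or true remain available.
// Exported because utils/degradedMode.ts references this type.
export type DegradedFeatures = {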
  streaming: boolean
  tools: boolean
  compaction: boolean
  mcp: boolean
  bridge: boolean
}

export type AppState = {
  // ... normal state

  degradedMode: {
    active: boolean
    reason: string
    features: Partial<DegradedFeatures>
  }
}

// Activate degraded mode
export function activateDegradedMode(
  state: AppState,
  reason: string,
  features: Partial<DegradedFeatures>
): AppState {
  return {
    ...state,
    degradedMode: {
      active: true,
      reason,
      features,
    },
  }
}

Feature Detection
TypeScript
// utils/degradedMode.ts
export function isFeatureAvailable(
  state: Pick<AppState, 'degradedMode'>,  // only this slice is read, so callers need not pass the full state
  feature: keyof DegradedFeatures
): boolean {
  if (!state.degradedMode.active) return true
  return state.degradedMode.features[feature] !== false
}

// Usage in code
function StreamingComponent() {
  const degraded = useAppState(s => s.degradedMode)

  if (!isFeatureAvailable({ degradedMode: degraded }, 'streaming')) {
    return <FallbackNonStreamingUI />
  }

  return <StreamingUI />
}
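
A short walk-through of the semantics: a feature counts as unavailable only when degraded mode is active and its flag is explicitly `false`. The sketch restates the types so it runs on its own and, as a simplification, passes the `degradedMode` slice directly instead of the whole `AppState`. The reason string is made up:

```typescript
type DegradedFeatures = {
  streaming: boolean
  tools: boolean
  compaction: boolean
  mcp: boolean
  bridge: boolean
}

type DegradedMode = {
  active: boolean
  reason: string
  features: Partial<DegradedFeatures>
}

function isFeatureAvailable(mode: DegradedMode, feature: keyof DegradedFeatures): boolean {
  if (!mode.active) return true
  return mode.features[feature] !== false
}

const healthy: DegradedMode = { active: false, reason: '', features: {} }
const degraded: DegradedMode = {
  active: true,
  reason: 'Streaming endpoint unreachable',
  features: { streaming: false },
}

const checks = {
  streamingWhenHealthy: isFeatureAvailable(healthy, 'streaming'),   // true
  streamingWhenDegraded: isFeatureAvailable(degraded, 'streaming'), // false
  toolsWhenDegraded: isFeatureAvailable(degraded, 'tools'),         // true: not listed, so still available
}
```

Note that `activateDegradedMode` replaces the `features` object wholesale, so a second activation that lists only `tools` would make `streaming` available again unless the caller merges the previous flags.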

Cleanup & Shutdown

Cleanup Registry
TypeScript
// utils/cleanupRegistry.ts
const cleanupRegistry = new Set<() => void>()

export function registerCleanup(cleanupFn: () => void): () => void {
  cleanupRegistry.add(cleanupFn)

  // Return unregister
  return () => {
    cleanupRegistry.delete(cleanupFn)
  }
}

export function executeCleanup(): void {
  for (const cleanup of cleanupRegistry) {
    try {
      cleanup()
    } catch (error) {
      logError('Cleanup failed', error)
    }
  }
  cleanupRegistry.clear()
}
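
To make the behavior concrete, the sketch below is a self-contained copy of the registry with `logError` stubbed out, showing that one failing cleanup does not stop the others and that unregistered functions are skipped. The resource names are invented for illustration:

```typescript
// logError is stubbed to collect messages instead of writing to a real log.
const loggedErrors: string[] = []
const logError = (message: string, _error: unknown): void => {
  loggedErrors.push(message)
}

const cleanupRegistry = new Set<() => void>()

function registerCleanup(cleanupFn: () => void): () => void {
  cleanupRegistry.add(cleanupFn)
  return () => {
    cleanupRegistry.delete(cleanupFn)
  }
}

function executeCleanup(): void {
  for (const cleanup of cleanupRegistry) {
    try {
      cleanup()
    } catch (error) {
      logError('Cleanup failed', error)
    }
  }
  cleanupRegistry.clear()
}

const ran: string[] = []
registerCleanup(() => { ran.push('close socket') })
registerCleanup(() => { throw new Error('temp dir already gone') })
const unregisterWatcher = registerCleanup(() => { ran.push('stop watcher') })
registerCleanup(() => { ran.push('release lock') })

unregisterWatcher()   // resource was released early, so skip it at shutdown
executeCleanup()
// ran is ['close socket', 'release lock']; the throwing cleanup was logged, not fatal.
```

Because `executeCleanup` is synchronous, a cleanup function that returns a promise is started but not awaited, and a rejection from it is not caught by the `try`/`catch`.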

Graceful Shutdown
TypeScript
// utils/gracefulShutdown.ts
let isShuttingDown = false

export function setupGracefulShutdown(): void {
  // SIGINT (Ctrl+C)
  process.on('SIGINT', () => handleShutdown('SIGINT'))

  // SIGTERM (kill)
  process.on('SIGTERM', () => handleShutdown('SIGTERM'))

  // Uncaught exceptions
  process.on('uncaughtException', (error) => {
    logError('Uncaught exception', error)
    handleShutdown('uncaughtException')
  })

  // Unhandled rejections
  process.on('unhandledRejection', (reason) => {
    logError('Unhandled rejection', reason)
  })
}

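async function handleShutdown(signal: string): Promise<void> {
  if (isShuttingDown) {
    // Second signal while already shutting down: force exit
    process.exit(1)
  }

  isShuttingDown = true
  console.log(`\nReceived ${signal}. Shutting down gracefully...`)

  // Exit non-zero when shutdown was triggered by a crash, or if a step fails
  let exitCode = signal === 'uncaughtException' ? 1 : 0

  try {
    // 1. Stop accepting new input
    disableInput()

    // 2. Flush pending writes
    await flushSessionStorage()

    // 3. Save state
    await saveAppState()
  } catch (error) {
    // Without this catch, a failed flush or save would reject and the process would never exit
    logError('Shutdown step failed', error)
    exitCode = 1
  } finally {
    // 4. Execute registered cleanup (runs even if a step above failed)
    executeCleanup()
  }

  // 5. Exit
  process.exit(exitCode)
}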

// Synchronous version for emergency
export function gracefulShutdownSync(): void {
  try {
    flushSessionStorageSync()
    saveAppStateSync()
  } catch (e) {
    // Best effort
  }
  process.exit(1)
}
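
If `flushSessionStorage` or `saveAppState` hangs (for example on a stalled disk), the asynchronous handler never reaches `process.exit`. A common safeguard is to race each step against a deadline. A minimal sketch; `withTimeout` and the 3000 ms figure in the usage note are assumptions, not taken from the source:

```typescript
function withTimeout<T>(promise: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined

  // A promise that can only reject, once the deadline passes.
  const deadline = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms)
  })

  // Whichever settles first wins; always clear the timer so it cannot keep the process alive.
  return Promise.race([promise, deadline]).finally(() => {
    if (timer !== undefined) clearTimeout(timer)
  })
}
```

Inside the shutdown handler this would read `await withTimeout(flushSessionStorage(), 3000, 'flush')`, so a stuck write turns into an ordinary error that the handler can log before exiting.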

Crash Recovery

Session Recovery
TypeScript
// utils/conversationRecovery.ts
export async function attemptCrashRecovery(): Promise<boolean> {
  // Check for crash marker
  const crashInfo = await readCrashMarker()

  if (!crashInfo) return false  // Clean exit last time

  console.log('Detected previous crash. Attempting recovery...')

  try {
    // 1. Recover session file
    const messages = await recoverSessionFile(crashInfo.sessionId)

    // 2. Check for incomplete tool executions
    const incompleteTools = findIncompleteToolExecutions(messages)

    // 3. Repair message chain
    const repaired = await repairMessageChain(messages, incompleteTools)

    // 4. Offer to resume
    const shouldResume = await askUser(
      `Recovered ${repaired.length} messages from crashed session. Resume?`
    )

    if (shouldResume) {
      await loadSession(repaired)
      return true
    }
  } catch (error) {
    logError('Crash recovery failed', error)
  }

  return false
}
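
Step 2 above looks for tool calls that were started but never finished. One plausible way to detect them is to pair each tool request with its result by id. The message shape below is a simplified assumption for illustration; the real session format is not shown in this section:

```typescript
// Hypothetical, simplified message shape.
type Message =
  | { kind: 'text'; text: string }
  | { kind: 'tool_use'; id: string; name: string }
  | { kind: 'tool_result'; toolUseId: string }

// Returns the ids of tool calls that have no matching result, in request order.
function findIncompleteToolExecutions(messages: Message[]): string[] {
  // First pass: collect the ids of every tool call that did produce a result.
  const completed = new Set<string>()
  for (const message of messages) {
    if (message.kind === 'tool_result') completed.add(message.toolUseId)
  }

  // Second pass: any tool request without a recorded result was interrupted.
  const incomplete: string[] = []
  for (const message of messages) {
    if (message.kind === 'tool_use' && !completed.has(message.id)) {
      incomplete.push(message.id)
    }
  }
  return incomplete
}
```

A repair step could then append a synthetic "interrupted" result for each returned id, so the recovered history contains no dangling tool requests.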

Crash Markers
TypeScript
// Write on startup
export async function writeCrashMarker(sessionId: string): Promise<void> {
  const marker = {
    sessionId,
    pid: process.pid,
    startTime: Date.now(),
    version: MACRO.VERSION,
  }

  await writeFile(CRASH_MARKER_PATH, JSON.stringify(marker))
}

// Clear on clean exit
export async function clearCrashMarker(): Promise<void> {
  await unlink(CRASH_MARKER_PATH).catch(() => {})  // Ignore if missing
}
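
Because the process may die while the marker is being written, the reader should tolerate a truncated or malformed file. A sketch of a defensive parser that `readCrashMarker` could delegate to. `parseCrashMarker` is a hypothetical name, and the field set mirrors the marker written above:

```typescript
type CrashMarker = {
  sessionId: string
  pid: number
  startTime: number
  version: string
}

// Returns the parsed marker, or null when the content is not a complete, well-typed marker.
function parseCrashMarker(raw: string): CrashMarker | null {
  let value: unknown
  try {
    value = JSON.parse(raw)
  } catch {
    return null   // truncated or corrupt JSON
  }
  if (typeof value !== 'object' || value === null) return null

  const { sessionId, pid, startTime, version } = value as Record<string, unknown>
  if (
    typeof sessionId !== 'string' ||
    typeof pid !== 'number' ||
    typeof startTime !== 'number' ||
    typeof version !== 'string'
  ) {
    return null
  }
  return { sessionId, pid, startTime, version }
}
```

Treating an unreadable marker as "no crash information" keeps startup from failing on the very file meant to help recovery.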

User-Facing Errors

Error to User Message
TypeScript
// utils/errors.ts
export function getUserFriendlyMessage(error: unknown): string {
  const classified = classifyError(error)

  const messages: Record<ErrorCategory, string> = {
    user: classified.userMessage,
    network: `${classified.userMessage} Retrying automatically...`,
    api: `${classified.userMessage} Please try again in a moment.`,
    file_system: classified.userMessage,
    permission: `${classified.userMessage} You may need to run with elevated permissions.`,
    resource: `${classified.userMessage} Try closing other applications.`,
    internal: `An unexpected error occurred. ${classified.userMessage}`,
  }

  return messages[classified.category]
}

Error Display Component
TypeScript
// components/ErrorDisplay.tsx
export function ErrorDisplay({ error, onRetry, onDismiss }: ErrorDisplayProps) {
  const classified = classifyError(error)
  const message = getUserFriendlyMessage(error)

  return (
    <Box flexDirection="column" borderStyle="round" borderColor="red" padding={1}>
      <Text color="red" bold>
        {classified.category === 'internal' ? '⚠️ Unexpected Error' : '❌ Error'}
      </Text>

      <Text>{message}</Text>

      {classified.retryable && onRetry && (
        <Button onPress={onRetry}>Retry</Button>
      )}

      <Button onPress={onDismiss}>Dismiss</Button>

      {process.env.DEBUG && (
        <Box marginTop={1}>
          <Text dimColor>{error instanceof Error ? error.stack : String(error)}</Text>
        </Box>
      )}
    </Box>
  )
}

Key Reliability Patterns

  1. Fail Fast: Validate inputs early, throw descriptive errors
  2. Retry with Backoff: Network/API errors get automatic retries
  3. Error Boundaries: UI crashes don't kill the session
  4. Graceful Degradation: Lose features, not the whole app
  5. Cleanup Registry: Resources always get freed
  6. Crash Recovery: Sessions survive unexpected exits
  7. User Context: Every error explains what happened and what to do
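
Pattern 1 can be illustrated with a small validator that rejects bad input at the boundary, before any work starts. The source names Zod for schema validation; the sketch below uses hand-written checks only to stay dependency-free, and `ValidationError`, `ReadRequest`, and `parseReadRequest` are hypothetical names:

```typescript
class ValidationError extends Error {
  constructor(message: string) {
    super(message)
    this.name = 'ValidationError'
  }
}

type ReadRequest = { path: string; maxBytes: number }

// Validates untrusted input and returns a fully typed request, or throws immediately.
function parseReadRequest(input: unknown): ReadRequest {
  if (typeof input !== 'object' || input === null) {
    throw new ValidationError('Request must be an object')
  }
  const record = input as Record<string, unknown>

  const path = record.path
  if (typeof path !== 'string' || path.length === 0) {
    throw new ValidationError('"path" must be a non-empty string')
  }

  // Default applies only when maxBytes is absent; an explicit bad value is rejected.
  const maxBytes = record.maxBytes ?? 1000000
  if (typeof maxBytes !== 'number' || !Number.isInteger(maxBytes) || maxBytes <= 0) {
    throw new ValidationError(`"maxBytes" must be a positive integer, got ${String(maxBytes)}`)
  }

  return { path, maxBytes }
}
```

Each message names the offending field and the expected shape, which also serves pattern 7: the error says what happened and what to change.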

Debugging

Bash
# Verbose error logging
DEBUG=errors claude

# See retry attempts
DEBUG=retry claude

# Force degraded mode
claude --degraded-mode

# Remove a stale crash marker so the next start skips crash recovery
rm ~/.claude/.crash-marker && claude