InkdownInkdown
Start writing

Interview Questions

13 files·4 subfolders

Shared Workspace

Interview Questions
Agentic

interview-questions-part1

Shared from "Interview Questions" on Inkdown

Edward Project - Comprehensive Interview Questions - Part 1

Table of Contents

  • Architecture & System Design
  • Database & Transactions
  • Concurrency & Distributed Systems
  • Streaming & Parser

Architecture & System Design

Q: Why monorepo with pnpm workspaces instead of separate repositories?

Answer: The Edward project uses a monorepo architecture with pnpm workspaces for shared code reuse across packages. The @edward/auth, , , and packages are consumed by both (Next.js frontend) and (Express backend). This eliminates code duplication and ensures type safety across the entire stack. It allows atomic commits that span multiple packages, ensuring the frontend and backend stay in sync. Turbo provides a unified build pipeline with intelligent caching and task orchestration, significantly improving build times. A single at the root prevents version conflicts between packages.

interview-questions.md
Bonkers
BONKERS_END_TO_END_GUIDE.md
BONKERS_INTERVIEW_QUESTIONS.md
interview_questions.md
PROJECT_WALKTHROUGH_SCRIPT.md
Edward
pookie
questions
interview-questions-part1.md
interview-questions-part2.md
interview-questions-part3.md
interview-questions-part4.md
interview-questions-part5.md
interview-questions-part6.md
interview-questions-part7.md
@edward/shared
@edward/ui
@edward/octokit
apps/web
apps/api
pnpm-lock.yaml
Q: Why Express over NestJS/Fastify for the API?

Answer: Express was chosen for minimal overhead and full control over middleware composition. The team needed to compose custom middleware (Helmet, CORS, security telemetry, auth, rate limiting) in a specific order, which Express's unopinionated approach enables. NestJS would have introduced unnecessary abstraction layers and TypeScript decorators. Fastify would have required rewriting existing middleware patterns. The existing team had significant Express experience. The API's use case is primarily a delivery layer that delegates to services, so it doesn't need advanced framework features like dependency injection or modules.

Q: Why BullMQ over SQS/RabbitMQ?

Answer: BullMQ was selected because Redis was already required for pub/sub (used for build status updates and run event streaming). Adding another infrastructure service like SQS or RabbitMQ would have increased operational complexity. BullMQ provides built-in job scheduling, retry policies, and job priorities. It has excellent UI support through Bull Board for monitoring during development. The local development story is simpler with Redis compared to setting up local SQS or RabbitMQ. BullMQ's Redis-based approach provides natural horizontal scaling - multiple worker processes can consume from the same queue.

Q: Why SSE over WebSockets for streaming?

Answer: Server-Sent Events (SSE) was chosen because the use case is strictly server-to-client streaming - the client doesn't need to send messages back during the stream. SSE is simpler to implement, being built on top of standard HTTP, which works better through proxies and firewalls. SSE provides automatic reconnection through the Last-Event-ID header, critical for handling network interruptions. WebSockets would have required maintaining bidirectional connection state and custom reconnection logic. SSE's event format with named event types maps naturally to the parser's state machine.

Q: Why Drizzle ORM over Prisma?

Answer: Drizzle ORM was chosen for its SQL-first approach and zero runtime overhead. Prisma generates a complex query builder at runtime, adding performance overhead. Drizzle generates type-safe SQL queries compiled at build time, meaning the runtime is essentially just the queries. This is important for Edward's complex transaction patterns - the run admission logic uses Postgres advisory locks and raw SQL operations that would be awkward with Prisma. Drizzle's schema definition mirrors the actual database structure, making it easier to understand database-level operations.

Q: Why Docker for sandboxes instead of WebAssembly?

Answer: Docker containers were chosen because the project needs full Node.js ecosystem support. WebAssembly has limitations around native modules, file system access, and network operations that would make running real build commands difficult. Docker provides complete isolation while allowing access to the full Node.js runtime, npm/pnpm/yarn package managers, and native addons. Sandboxes need to execute arbitrary build commands which are designed for Linux environments. Docker also provides network isolation capabilities needed for security. The production parity argument is strong - the same Docker images can be used in development and production.


Database & Transactions

Q: How does run admission prevent race conditions?

Answer: Run admission uses Postgres advisory locks (pg_advisory_xact_lock) to serialize admission checks. The createRunWithUserLimit function wraps the entire admission logic in a database transaction. It first acquires a global lock on "run_admission_global" using pg_advisory_xact_lock(hashtext('run_admission_global')), ensuring only one transaction can check global limits at a time. Then it acquires a user-specific lock using the userId hash. Within this locked context, it counts active runs globally, per user, and per chat using inArray(run.status, ACTIVE_RUN_STATUSES). If any limit is exceeded, the transaction returns a rejection. Only if all checks pass does it insert the new run record. Advisory locks are automatically released when the transaction commits or rolls back.

Q: Why use onConflictDoUpdate for runToolCall?

Answer: The onConflictDoUpdate pattern provides idempotency for tool calls, which is critical because the agent loop can retry operations. The function uses a composite unique constraint on (runId, idempotencyKey). When a tool call is executed, it's inserted with a unique idempotencyKey. If the same tool call is retried, the insert will conflict, and onConflictDoUpdate updates the existing record with the latest status, output, and duration instead of creating a duplicate. This prevents the database from accumulating multiple entries for the same logical operation.

Q: How does the event sequence numbering work?

Answer: Each run record has a nextEventSeq column that starts at 0 and increments with each event. The appendRunEvent function uses a database transaction to atomically increment the sequence and insert the event. First, it updates the run record with sqlrun.nextEventSeq+1‘‘andreturnsthenewsequencevalue.Thisincrement−and−returnpatternguaranteesunique,sequentialnumbersevenunderconcurrentaccess.Thereturnedsequencenumberisusedasthe‘seq‘fieldinthe‘runEvent‘table,andtheeventIDisconstructedas‘{run.nextEventSeq} + 1`` and returns the new sequence value. This increment-and-return pattern guarantees unique, sequential numbers even under concurrent access. The returned sequence number is used as the `seq` field in the `runEvent` table, and the event ID is constructed as `run.nextEventSeq+1‘‘andreturnsthenewsequencevalue.Thisincreme{runId}:${seq}. When streaming events, getRunEventsAfterqueries for events withseq > afterSeq` in ascending order.

Q: What's the purpose of the unique index on (runId, seq) in runEvent?

Answer: The unique index on (runId, seq) serves three purposes: (1) prevents duplicate events from being inserted, (2) enforces ordering by ensuring sequence numbers are unique and monotonic within a run, and (3) enables efficient sequential queries when streaming events. The index supports the replay logic where clients request events after a specific lastEventId. Without this index, the system would need to scan all events for a run to find the correct starting point.


Concurrency & Distributed Systems

Q: How does the distributed lock implementation prevent deadlocks?

Answer: The distributed lock uses several strategies: (1) TTL auto-expires locks (typically 60 seconds) so if a process crashes, Redis automatically expires the lock, (2) lock release uses a Lua script for atomic compare-and-swap ensuring only the lock owner can release it, (3) lock acquisition includes optional retry with exponential backoff for cases where the lock is held, (4) lock keys include the resource identifier so different resources don't contend, (5) code uses try-finally blocks to ensure lock release even on errors.

Q: Why use Redis SET NX PX instead of Redlock?

Answer: The project uses a single Redis instance, making Redlock's multi-instance complexity unnecessary. SET NX PX is a single Redis command that sets a key only if it doesn't exist (NX) with an expiration (PX), which is exactly what's needed. Redlock was designed for multiple independent Redis nodes to prevent single-point failures, but Edward's deployment uses a single Redis instance (or Redis Cluster with its own failover). SET NX PX is simpler, faster (one network round-trip vs multiple), and sufficient for the use case.

Q: How does sandbox provisioning handle concurrent requests?

Answer: Sandbox provisioning uses a distributed lock with a double-check pattern. The provisionSandbox function first checks if an active sandbox exists or if another process is provisioning (by checking for a lock). If not, it attempts to acquire a distributed lock on edward:locking:provision:{chatId} with a 60-second TTL. Lock acquisition includes retry with jitter to avoid thundering herd. Once acquired, it performs a "double-check" - calling getActiveSandbox again to verify another process didn't complete provisioning while waiting. Only if the double-check fails does it proceed to create a new container.

Q: What's the flush scheduler's purpose?

Answer: The flush scheduler implements a debounce pattern for file write operations to Docker containers. Instead of immediately flushing to the container, the system schedules a flush marker in Redis with a timestamp. For immediate flushes (e.g., before a build), it executes synchronously. For normal writes, it sets the due timestamp to Date.now() + WRITE_DEBOUNCE_MS. The processScheduledFlushes function runs periodically, scans Redis for due keys, and flushes pending files. This batching reduces Docker exec calls from potentially hundreds per operation to one per debounce window.

Q: How does the build job claim mechanism work?

Answer: The build job claim uses optimistic locking. In processBuildJob, the function resolves the build record and calls claimBuildForExecution, which attempts to update the status from QUEUED to BUILDING using a conditional WHERE clause: where(and(eq(build.id, buildId), eq(build.status, BuildRecordStatus.QUEUED))). This ensures the update only succeeds if the build is still in QUEUED state. If 0 rows are affected, another worker already claimed it or it's terminal. If successful, the worker proceeds with execution.


Streaming & Parser

Q: How does the parser state machine handle malformed streams?

Answer: The parser has several safeguards: (1) buffer with max size to prevent memory issues, (2) iteration limit to prevent infinite loops - if buffer doesn't change after MAX_ITERATIONS, it emits an error and resets state, (3) flush function handles incomplete states by emitting appropriate end events, (4) ignores unknown/malformed content by staying in current state until a known pattern is recognized, (5) frontend has replay mechanism for stream disconnections.

Q: Why separate states (TEXT, THINKING, SANDBOX, FILE, INSTALL)?

Answer: Separate states are needed because each requires different handling and emits different event types. TEXT accumulates regular content. THINKING accumulates reasoning content. SANDBOX handles sandbox operations. FILE handles individual file content nested within SANDBOX. INSTALL handles package installation. Separating states prevents tag nesting issues - a file tag inside thinking is handled correctly. It enables proper flush behavior - if stream ends mid-file, flush emits FILE_END. Each state has its own buffer and transition logic.

Q: How does the frontend handle stream disconnection?

Answer: The frontend uses automatic replay with exponential backoff. After the initial stream completes without session completion and no fatal error, it waits a backoff period (500ms doubling up to 5s), then calls openRunEventsStream with the lastEventId to fetch missed events. The replay response is processed recursively with incremented attempt count. If replay succeeds, results are merged using mergeStreamResults. If replay fails, it emits a fatal error. This ensures transient network interruptions don't result in lost events.

Q: What's the purpose of the frame-batched action queue in stream processor?

Answer: The frame-batched action queue optimizes React rendering during high-frequency event streams. Instead of dispatching each event immediately, events are accumulated in a pendingActions array and scheduled for flush using requestAnimationFrame (or 16ms setTimeout). The flush function splices all pending actions and dispatches them in a single frame. This means even if the stream emits hundreds of events, React only re-renders once per frame (~60 times/second), dramatically reducing rendering overhead.

Q: How are web search events deduplicated and merged?

Answer: Web search events are deduplicated using mergeWebSearchEvent. It compares incoming events with the last event using isSameWebSearchEvent which checks query, maxResults, answer, error, and serialized results. If identical, returns unchanged. If queries match but existing has no payload and incoming has payload, replaces existing while preserving uiOrder. This handles case where a web search starts with a query event (no results) and later completes with results. If queries match and both have no payload, it's a no-op (duplicate start). Otherwise, appends the incoming event.


n
t
−
and−
returnpatternguaranteesunique,sequentialnumbersevenunderconcurrentaccess.Thereturnedsequencenumberisusedasthe‘seq‘fieldinthe‘runEvent‘table,andtheeventIDisconstructedas‘