How LLM Inference Works

Source: https://arpitbhayani.me/blogs/how-llm-inference-works Date: 2025-11-22

When you enter a prompt into an LLM, the model converts your text into numbers, processes them, and returns a response one token at a time. In this article, we go through the journey of LLM inference and see how it works.


What are Large Language Models?

LLMs are just neural networks built on the transformer architecture. Unlike earlier architectures that processed text sequentially, transformers can analyze entire sequences in parallel, making them more efficient to train and deploy.

The fundamental building block of these models is the transformer layer, which consists of two primary components:

  1. a self-attention mechanism, and
  2. a feed-forward neural network.

LLMs stack dozens of these layers, creating deep networks capable of capturing complex patterns in language.

Transformers rely on self-attention, which evaluates how each token relates to every other token in the sequence, not just its neighbours.

Model size refers to the number of parameters in the network. A 7-billion parameter model has 7 billion floating-point numbers that store the learned knowledge from training. These parameters are organized into weight matrices that perform transformations on the input data at each layer.

Models like GPT-4, Claude, and Llama are decoder-only transformers, meaning they use only the decoder part of the original transformer architecture. This makes them autoregressive: they generate one token at a time, conditioned on the prompt and all previously generated tokens, which suits text generation tasks well.

Tokenization

Before any computation happens, the model needs to convert your text input into numbers. This process, called tokenization, breaks text into smaller units called tokens.

The most common tokenization approach in modern LLMs is Byte Pair Encoding (BPE). BPE starts with a vocabulary of individual characters (or raw bytes, in byte-level variants) and iteratively merges the most frequent pair of adjacent tokens into a new token.
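
To make the merge loop concrete, here is a minimal character-level sketch. The function name `train_bpe` and the four-word corpus are made up for illustration; real tokenizers train on far larger corpora and usually operate on bytes.

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Learn BPE merge rules from a list of words (toy, character-level)."""
    # Each word starts as a tuple of single-character tokens.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent token pair, weighted by word frequency.
        pair_counts = Counter()
        for tokens, freq in corpus.items():
            for pair in zip(tokens, tokens[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # max() keeps the first pair seen when counts tie.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the best pair fused into one token.
        new_corpus = Counter()
        for tokens, freq in corpus.items():
            merged, i = [], 0
            while i < len(tokens):
                if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                    merged.append(tokens[i] + tokens[i + 1])
                    i += 2
                else:
                    merged.append(tokens[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus

merges, corpus = train_bpe(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
```

After two merges the frequent word "low" has collapsed into a single token, while "lower" and "lowest" are "low" plus leftover characters.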

Because of BPE, common words get represented as single tokens (efficient), while rare or unknown words get broken into familiar subword pieces (flexible).

The tokenization process works by encoding your input text into UTF-8 bytes, then applying the learned merge rules to compress the byte sequence into tokens. Each token maps to an integer ID that the model can work with.
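
A rough sketch of that encoding step follows. The merge table here is hypothetical (two hand-written rules); a trained tokenizer has tens of thousands of them. IDs 0 to 255 are the raw byte values, and the i-th merge rule creates ID 256 + i.

```python
def bpe_encode(text, merges):
    """Encode text to token IDs with byte-level BPE (toy sketch).

    `merges` is an ordered list of (left_id, right_id) pairs; the i-th
    merge creates token ID 256 + i. IDs 0..255 are the raw byte values.
    """
    ids = list(text.encode("utf-8"))
    for rank, pair in enumerate(merges):
        new_id = 256 + rank
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids

# Hypothetical merge table: "l"+"o" -> 256, then 256+"w" -> 257.
merges = [(ord("l"), ord("o")), (256, ord("w"))]
print(bpe_encode("low", merges))    # [257]
print(bpe_encode("lowé", merges))   # [257, 195, 169]
```

The second call shows why unfamiliar text costs more tokens: "é" has no merge rule, so it falls back to its two UTF-8 bytes.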

Tokenization directly impacts model performance and costs. More tokens mean more computation, higher API costs, and a greater chance of hitting context length limits. This is why non-English text often costs more to process: a tokenizer trained primarily on English data typically needs more tokens per word for other languages.

Token Embeddings

Once text becomes tokens, the next step transforms these discrete token IDs into continuous vector representations that neural networks can process. This happens through an embedding layer, essentially a lookup table that maps each token ID to a high-dimensional vector.

For a model with a vocabulary of 50,000 tokens and an embedding dimension of 4,096, the embedding matrix has shape [50000, 4096]. Each row represents one token, and the values in that row form the embedding vector for that token.
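
As a small sketch, the lookup is plain row indexing. The toy sizes and random values below are placeholders for trained weights; the last two lines work out the size of the [50000, 4096] matrix from the example above.

```python
import random

vocab_size, d_model = 8, 4   # toy sizes; the text's example is 50,000 x 4,096
random.seed(0)
# One row per token ID, one column per embedding dimension.
embedding_matrix = [[random.gauss(0.0, 0.02) for _ in range(d_model)]
                    for _ in range(vocab_size)]

def embed(token_ids):
    """Look up the embedding row for each token ID."""
    return [embedding_matrix[t] for t in token_ids]

vectors = embed([3, 1, 3])   # the same ID always yields the same row

# Parameter count and FP16 size for the shape quoted above.
params = 50_000 * 4_096          # 204,800,000 parameters
fp16_mib = params * 2 / 2**20    # 390.625 MiB at 2 bytes per parameter
```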

These embedding vectors capture semantic meaning learned during training. Words with similar meanings have embedding vectors that point in similar directions in this high-dimensional space.

Transformers do not inherently understand the order of tokens. To address this, the model injects positional information that tells it where each token sits in the sequence. Some models add learned positional embeddings to the token embeddings; others use relative schemes like Rotary Position Embeddings (RoPE), which rotates the query and key vectors inside attention instead of adding a vector to the embeddings.
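
For intuition, here is a minimal RoPE sketch. It assumes the interleaved pairing (dimensions 0 and 1, 2 and 3, and so on) and a base of 10000; real implementations differ in how they pair dimensions. The useful property is that the dot product of a rotated query and a rotated key depends only on the distance between their positions.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)   # lower dimensions rotate faster
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.25]

# Both pairs of positions are 2 apart, so the two scores match.
near = dot(rope(q, 5), rope(k, 3))
far = dot(rope(q, 12), rope(k, 10))
```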

The Transformer Architecture

The transformer processes embedding vectors through its layers. Each transformer layer applies two main operations: multi-head self-attention and feed-forward networks.

The self-attention mechanism computes three vectors for each token: a Query (Q), a Key (K), and a Value (V). Stacked across the sequence, they form the Q, K, and V matrices. These come from multiplying the layer's input by three learned weight matrices.
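
A tiny sketch of the three projections is shown below. The weights are hand-picked so the result is easy to read; in a real model they come from training and are far larger.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

# Toy input: 3 tokens, model dimension 2.
X = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
# Hypothetical weights (d_model x d_head); real ones are learned.
W_query = [[1.0, 0.0], [0.0, 1.0]]
W_key = [[0.0, 1.0], [1.0, 0.0]]
W_value = [[2.0, 0.0], [0.0, 2.0]]

Q = matmul(X, W_query)   # one query row per token
K = matmul(X, W_key)     # one key row per token
V = matmul(X, W_value)   # one value row per token
```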

The weight matrices W_query, W_key, and W_value are learned during training. They are randomly initialized and then adjusted through backpropagation to extract the most useful patterns from the embeddings.

The attention mechanism then computes how much each token should attend to every other token. This happens through scaled dot-product attention: Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, where d_k is the dimension of each key vector.
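
A plain-Python sketch of scaled dot-product attention for a single head follows. The `causal` flag is an addition for illustration: decoder-only models mask out future positions, so token i only attends to positions 0 through i.

```python
import math

def softmax(xs):
    m = max(xs)                      # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V, causal=True):
    """Scaled dot-product attention over lists of row vectors."""
    d_k = len(K[0])
    out = []
    for i, q in enumerate(Q):
        # With a causal mask, position i cannot see positions j > i.
        visible = range(i + 1) if causal else range(len(K))
        scores = [sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d_k)
                  for j in visible]
        weights = softmax(scores)
        out.append([sum(w * V[j][c] for w, j in zip(weights, visible))
                    for c in range(len(V[0]))])
    return out

Q = K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out = attention(Q, K, V)
# out[0] is exactly V[0]: the first token can only attend to itself.
```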

The scaling factor (the square root of the key dimension) prevents the dot products from becoming too large, which would cause the softmax function to saturate and produce extremely small gradients during training.

Multi-head attention runs this process multiple times in parallel with different learned projection matrices. A model might use 32 attention heads, each learning to focus on different aspects of the relationships between tokens. The outputs from all heads get concatenated and projected back to the model dimension.

After attention, the output passes through a feed-forward network, which consists of two linear transformations with a non-linear activation function in between. This typically expands the dimensionality by 4x before projecting back down.
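
The feed-forward block can be sketched as below. GELU is assumed as the activation (ReLU and SwiGLU are also common), and the random weights stand in for trained ones.

```python
import math
import random

def gelu(x):
    """tanh approximation of GELU."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def linear(x, W, b):
    """y = x @ W + b for one vector x; W has one row per input dimension."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) + b[j]
            for j in range(len(b))]

def feed_forward(x, W1, b1, W2, b2):
    hidden = [gelu(h) for h in linear(x, W1, b1)]   # d_model -> 4 * d_model
    return linear(hidden, W2, b2)                   # 4 * d_model -> d_model

d_model = 4
d_ff = 4 * d_model
random.seed(0)
W1 = [[random.gauss(0, 0.1) for _ in range(d_ff)] for _ in range(d_model)]
W2 = [[random.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_ff)]
b1, b2 = [0.0] * d_ff, [0.0] * d_model

x = [1.0, -0.5, 0.25, 2.0]
y = feed_forward(x, W1, b1, W2, b2)   # same length as the input vector
```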

Inference Phases - Prefill and Decode

The prefill phase happens when you first submit a prompt. The model processes all input tokens in parallel, computing the Query, Key, and Value matrices for each token simultaneously. This phase is compute-bound, meaning the GPU’s computational throughput determines performance.

During prefill, the attention mechanism performs matrix-matrix multiplication, which GPUs excel at. Under the causal mask, each prompt token attends to itself and to every earlier prompt token, and the model computes the attention scores for all of these position pairs in one batched operation.

The prefill phase produces the first output token and builds the KV cache, which we will discuss shortly. Time to First Token (TTFT) measures how long this phase takes, directly impacting user experience since this is the wait time before seeing any output.

The decode phase begins after the first token is generated. The model produces tokens one at a time, autoregressively. Each new token is computed from all previous tokens, but only the latest token needs fresh Q, K, V computations.

This phase is memory-bound, not compute-bound. The GPU spends most of its time loading data from memory rather than performing calculations. Each iteration involves a matrix-vector operation instead of matrix-matrix, providing far less computational work to saturate the GPU.
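
A back-of-the-envelope way to see the difference is arithmetic intensity: how many floating-point operations we get per byte of weights read. The sketch below is a simplification that counts only weight traffic for a single linear layer; the layer size and token counts are illustrative.

```python
def arithmetic_intensity(num_tokens, d_in, d_out, bytes_per_weight=2):
    """FLOPs per byte of weight traffic for one linear layer.

    Simplification: counts only the weight matrix as memory traffic and
    ignores activations, so it is an upper bound on the true ratio.
    """
    flops = 2 * num_tokens * d_in * d_out          # multiply + add per weight per token
    bytes_moved = d_in * d_out * bytes_per_weight  # each weight is read once
    return flops / bytes_moved

prefill = arithmetic_intensity(num_tokens=512, d_in=4096, d_out=4096)
decode = arithmetic_intensity(num_tokens=1, d_in=4096, d_out=4096)
print(prefill, decode)   # 512.0 1.0
```

With a 512-token prompt, every weight loaded is reused 512 times. During decode it is used once and thrown away, so the GPU mostly waits on memory.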

Inter-Token Latency (ITL) measures the time between consecutive token generations in the decode phase. This metric determines how fast text streams to the user after generation begins.

The KV Cache

The KV cache is one of the most important optimizations in transformer inference. Without it, generating 100 tokens would mean recomputing the keys and values of every previous token at each of the 100 steps, wasting enormous computational resources.

During autoregressive generation, the Key and Value vectors of previously processed tokens never change. Only the new token's Query, Key, and Value need computing. By caching the K and V from all previous tokens and appending the new token's K and V at each step, we avoid recomputing them.

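In pseudocode terms, each decode step computes Q, K, and V for the new token only, appends that K and V to the cache, and attends over everything cached so far.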
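
Here is one way to sketch that loop for a single layer and a single head. The `KVCache` class and the hard-coded (q, k, v) vectors are hypothetical stand-ins; in a real model they come from the learned projections.

```python
import math

def attend(q, keys, values):
    """Attention output for one query over all cached keys and values."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[c] for e, v in zip(exps, values))
            for c in range(len(values[0]))]

class KVCache:
    """Cache of key and value vectors for one layer and one head."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        # Only the newest token's k and v are computed; older ones are reused.
        self.keys.append(k)
        self.values.append(v)
        return attend(q, self.keys, self.values)

# Hypothetical per-token (q, k, v) vectors standing in for real projections.
tokens = [([1.0, 0.0], [1.0, 0.0], [1.0, 2.0]),
          ([0.0, 1.0], [0.0, 1.0], [3.0, 4.0]),
          ([1.0, 1.0], [1.0, 1.0], [5.0, 6.0])]

cache = KVCache()
cached_outputs = [cache.step(q, k, v) for q, k, v in tokens]
```

Each step does work proportional to the current sequence length, instead of rebuilding every earlier key and value from scratch.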

For each transformer layer and each attention head, the model maintains separate KV caches. When generating the nth token, the cache stores K and V matrices for all n-1 previous tokens.

The speedup from KV caching can be dramatic. Empirical tests show that generating 1000 tokens takes roughly 10 seconds with KV caching and roughly 50 seconds without it, about a 5x difference. The exact ratio depends on the model, hardware, and sequence length.

However, the KV cache comes with a memory cost. The cache grows linearly with sequence length. For a 13-billion parameter model like LLaMA-2 13B, each token in the sequence requires approximately 1 MB of cache storage at FP16. A 4,000-token context needs about 4 GB just for the cache, a sizeable addition to the roughly 26 GB the FP16 weights already occupy, and that cost is paid again for every sequence in a batch.
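
The per-token figure can be checked with a quick calculation. The sketch assumes standard multi-head attention with no KV sharing between heads, an FP16 cache, and the published LLaMA-2 13B shape (40 layers, hidden size 5120).

```python
def kv_cache_bytes_per_token(num_layers, hidden_size, bytes_per_value=2):
    """Bytes of K and V stored per token (standard multi-head attention)."""
    return 2 * num_layers * hidden_size * bytes_per_value  # 2 = one K + one V

# Assumed LLaMA-2 13B shape: 40 layers, hidden size 5120, FP16 cache.
per_token = kv_cache_bytes_per_token(num_layers=40, hidden_size=5120)
per_4k_context = per_token * 4000

print(per_token)               # 819200 bytes, about 0.8 MB
print(per_4k_context / 1e9)    # 3.2768 GB
```

That lands close to the rounded figures of 1 MB per token and 4 GB for a 4,000-token context.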

This memory pressure becomes severe with long contexts or large batch sizes. Modern systems employ several strategies to manage KV cache memory: quantizing the cache to lower precision (4-bit or 2-bit keys and values), using sliding window attention that only retains recent tokens, or implementing attention approximations that reduce cache requirements.

When I first experimented with running a model myself, I blamed the GPU for slow responses before noticing that the KV cache kept spilling out of GPU memory.

Every time a user typed a long prompt, latency skyrocketed. A single fix, reducing precision from FP16 to INT8, cut our response times by more than half.

Matrix Multiplication

Matrix multiplication forms the computational heart of transformer inference. Every layer performs multiple matrix multiplications: computing Q, K, V from inputs, applying attention, and running the feed-forward network.

On GPUs, efficient matrix multiplication employs a tiling strategy. The large matrix operation gets divided into smaller tiles that fit in shared memory, reducing expensive global memory accesses.

Each thread block computes one output tile, stepping through the shared inner dimension (conventionally called K, unrelated to the attention keys) in tiles. This maximizes data reuse: once data loads into shared memory, all threads in the block can access it without additional global memory traffic.
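
The loop structure of tiling can be illustrated in plain Python. This is only a sketch of the access pattern: it gains nothing on a CPU interpreter, and the tile size of 2 is arbitrary. On a GPU, each (i0, j0) output tile would belong to one thread block, and the sub-blocks of A and B would be staged in shared memory.

```python
def tiled_matmul(A, B, tile=2):
    """C = A @ B computed tile by tile, mirroring the GPU loop structure."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for i0 in range(0, M, tile):            # one "thread block" per output tile
        for j0 in range(0, N, tile):
            for k0 in range(0, K, tile):    # step through the inner dimension
                # On a GPU these sub-blocks of A and B would sit in shared memory.
                for i in range(i0, min(i0 + tile, M)):
                    for j in range(j0, min(j0 + tile, N)):
                        acc = C[i][j]
                        for k in range(k0, min(k0 + tile, K)):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C

A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]
print(tiled_matmul(A, B))   # [[58.0, 64.0], [139.0, 154.0]]
```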

Tensor Cores further accelerate this by performing entire small matrix multiplications in hardware. The programming model exposes 16x16x16 operations, but hardware executes them as multiple 4x4x4 operations automatically.

Precision and Quantization in Inference

LLM inference often operates at reduced precision compared to training. While training typically uses FP32 or BF16 precision, inference can use FP16, INT8, or even INT4 with minimal quality loss.

FP16 (16-bit floating point) cuts memory usage and bandwidth requirements in half compared to FP32. Tensor Cores deliver far higher throughput at FP16 than at FP32, making it the default precision for many inference deployments.

Quantization converts model weights and activations to lower precision formats. This requires careful calibration to maintain model quality. Post-training quantization analyzes activation distributions on representative data to determine optimal scaling factors.

A 7-billion parameter model at FP16 precision requires approximately 14 GB of memory (7B parameters × 2 bytes per parameter). Quantizing to INT4 reduces this to about 3.5 GB (0.5 bytes per parameter, plus a small overhead for scaling factors), enabling inference on consumer hardware.
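
The same arithmetic for a few common precisions, as a quick sketch. It counts weights only and ignores the KV cache, activations, and quantization metadata.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(name, weight_memory_gb(7e9, bits))
# FP32 28.0
# FP16 14.0
# INT8 7.0
# INT4 3.5
```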

Quantization techniques like GPTQ and AWQ apply different scaling factors per channel or per group, preserving more information from the original weights. Some methods quantize weights but keep activations at higher precision, balancing quality and performance.
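
To show what a per-group scaling factor buys us, here is a minimal symmetric absmax quantizer with one scale per group of weights. It is not GPTQ or AWQ, which choose their quantized values far more carefully; it only illustrates the grouping idea. The group size of 4 and the sample weights are made up.

```python
def quantize_groups(weights, group_size=4, bits=8):
    """Symmetric absmax quantization with one scale per group of weights."""
    qmax = 2 ** (bits - 1) - 1          # 127 for INT8, 7 for INT4
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        scale = max(abs(w) for w in group) / qmax or 1.0   # avoid a zero scale
        scales.append(scale)
        codes.extend(round(w / scale) for w in group)
    return codes, scales

def dequantize_groups(codes, scales, group_size=4):
    return [c * scales[i // group_size] for i, c in enumerate(codes)]

# The first group holds small weights, the second holds large ones.
weights = [0.02, -0.01, 0.03, 0.00, 1.50, -2.00, 0.75, 0.10]
codes, scales = quantize_groups(weights)
restored = dequantize_groups(codes, scales)
```

Because the first group gets its own small scale, its weights keep their resolution. With a single scale for the whole tensor, set by the 2.0 outlier, values like 0.01 and 0.02 would round to the same integer.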

End-to-end Inference Flow

Step 1: Tokenization. Your prompt “Explain how transformers work” gets converted to token IDs by the tokenizer. The BPE algorithm splits this into subword units, producing something like [Explain, how, transform, ers, work].

Step 2: Embedding lookup. Each token ID indexes into the embedding matrix, retrieving its corresponding embedding vector. If the model has 4096 dimensions, each token becomes a vector of 4096 floating-point numbers.

Step 3: Add positional encodings. The model adds positional information to the embeddings so the attention mechanism knows the order of tokens.

Step 4: Prefill phase. The input embeddings flow through each transformer layer. For a 32-layer model, this happens 32 times.

Step 5: Generate first token. After the final layer, the hidden states get projected to vocabulary size through a linear layer, then softmax converts these logits to probabilities over all possible next tokens.
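
A toy version of this step is sketched below. The 3-token vocabulary, the weights, and the hidden state are invented for illustration, and the greedy pick is just one possible decoding strategy.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def lm_head(hidden, W_vocab):
    """Project one hidden state to one logit per vocabulary entry."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in W_vocab]

# Hypothetical toy sizes: hidden dimension 2, vocabulary of 3 tokens.
W_vocab = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
last_hidden = [0.5, 2.0]     # hidden state at the last prompt position

logits = lm_head(last_hidden, W_vocab)       # [0.5, 2.0, 2.5]
probs = softmax(logits)
next_token = max(range(len(probs)), key=probs.__getitem__)  # greedy choice -> 2
```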

Step 6: Decode phase. Now we generate tokens one at a time. For each new token, we only compute fresh Q, K, V for that token, retrieving cached values for all previous tokens.

Step 7: Detokenization. Finally, the sequence of token IDs gets converted back to text using the tokenizer’s vocabulary.

The decode step repeats for every token generated, with the KV cache growing at each step. The decode phase continues until the model generates a stop token or reaches a maximum length limit.
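
The stopping logic can be sketched independently of any real model. Here `next_token_fn` and `toy_model` are hypothetical stand-ins for a forward pass plus token selection.

```python
def generate(prompt_ids, next_token_fn, eos_id, max_new_tokens):
    """Greedy autoregressive loop: stop on EOS or after max_new_tokens."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        token = next_token_fn(ids)   # a real model would run a decode step here
        if token == eos_id:
            break
        ids.append(token)
    return ids[len(prompt_ids):]

# Hypothetical stand-in for a model: counts upward and emits EOS (0) after 5.
def toy_model(ids):
    return ids[-1] + 1 if ids[-1] < 5 else 0

print(generate([1, 2], toy_model, eos_id=0, max_new_tokens=10))  # [3, 4, 5]
print(generate([1, 2], toy_model, eos_id=0, max_new_tokens=2))   # [3, 4]
```

The first call ends on the stop token, the second on the length limit.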

Inference Serving Frameworks

Production LLM inference relies on specialized serving frameworks that handle batching, memory management, and optimization automatically.

vLLM implements PagedAttention for efficient KV cache management and continuous batching for high throughput. Its authors report 2-4x higher throughput than earlier serving systems at comparable latency on the same hardware.

TensorRT-LLM from NVIDIA provides highly optimized kernels specific to NVIDIA GPUs, aiming for performance close to the hardware's theoretical peak. It includes techniques like in-flight batching and FP8 quantization support.

Text Generation Inference (TGI) from Hugging Face offers broad model support and features like continuous batching and token streaming. It provides a production-ready HTTP API for deploying models.

Each framework makes different tradeoffs between ease of use, performance, and model support. Choosing the right one depends on your specific requirements, hardware, and model architecture.

Performance Metrics and Monitoring

Understanding and monitoring inference performance requires tracking several key metrics.

Time to First Token (TTFT) measures prefill phase latency. This directly impacts user experience since users wait this long before seeing any output. Optimizing TTFT means efficient prompt processing, for example through batched or chunked prefill. Speculative decoding, by contrast, mainly targets latency in the decode phase.

Inter-Token Latency (ITL) measures the time between consecutive tokens during decode. Low ITL creates smooth streaming experiences. This metric depends heavily on memory bandwidth and KV cache efficiency.

Throughput, measured in tokens per second, indicates overall system capacity. High throughput means serving more users concurrently. Batching strategies significantly impact throughput.
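
All three metrics fall out of the arrival timestamps of a streamed response. In the sketch below, the function name and the timestamps are invented, and throughput is measured for a single request rather than for the whole server.

```python
def streaming_metrics(request_time, token_times):
    """TTFT, mean ITL, and tokens per second from token arrival times (seconds)."""
    ttft = token_times[0] - request_time
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else 0.0
    total = token_times[-1] - request_time
    throughput = len(token_times) / total   # tokens per second for this request
    return ttft, mean_itl, throughput

# Hypothetical timestamps: request at t=0, first token at 0.5 s, then every 0.25 s.
ttft, itl, tps = streaming_metrics(0.0, [0.5, 0.75, 1.0, 1.25, 1.5])
print(ttft, itl)   # 0.5 0.25
```

System-level throughput would sum tokens across all concurrent requests over the same time window.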

GPU utilization indicates how effectively the hardware is being used. Low utilization during decode suggests memory bottlenecks. Monitoring tools like nvidia-smi show GPU usage, memory consumption, and power draw in real time.

Memory pressure, especially KV cache size, affects maximum context length and batch size. Tracking cache memory helps prevent out-of-memory errors and guides quantization decisions.

Footnote

LLM inference transforms text prompts into responses through a process involving tokenization, transformer layers with self-attention mechanisms, and autoregressive token generation.

There are two stages in practice: the model first handles your full prompt in parallel, then switches to generating tokens one by one, which shifts the bottleneck from math to memory access.

Key optimizations include KV caching to avoid redundant computation, batching to improve GPU utilization, and quantization to reduce memory pressure.

Thanks for reading. I hope the breakdown made the inner workings of inference a little clearer.