How LLMs Really Work

Source: https://arpitbhayani.me/blogs/how-llms-work Date: 2026-05-12

If you have used ChatGPT, Gemini, or Claude, you have already formed an intuition about what these systems do. You type something in, and text comes back that feels coherent, knowledgeable, and sometimes eerily human. But the machinery underneath is simultaneously simpler and stranger than most people expect.



This article tears open that machinery and explains what a language model is doing at a mechanical level - why it produces the outputs it does, why identical inputs produce different outputs on different runs, and what “temperature” actually means beyond “a creativity dial.”

Next-token Prediction Machine

A large language model (LLM) is, at its most fundamental level, a function that takes a sequence of tokens as input and outputs a probability distribution over its entire vocabulary for what the next token should be. That is the complete description of the core operation. Everything else - the apparent reasoning, the conversational ability, the code generation - emerges from doing this one thing at enormous scale, across an enormous amount of training data.

Concretely, imagine you feed the model the tokens for “The quick brown fox”. The model does not produce the word “jumps”. It produces a table of probabilities: “jumps” might have a 42% chance, “sat” a 12% chance, “leaped” an 8% chance, and every other token in a 100,000-token vocabulary gets some non-zero slice of the remaining probability mass. The model then samples from that distribution to pick the next token. That token gets appended to the sequence, and the whole process repeats until a stop condition is reached.

This is called autoregressive generation. Each token generated becomes part of the input for the next prediction. The model is always asking the same question: “given everything I have seen so far, what token is most likely to come next?”
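The loop described above can be sketched in a few lines of Python. This is a toy under stated assumptions, not how any real model is implemented: `TOY_MODEL`, `next_token_distribution`, and the `<eos>` stop token are hypothetical names, and the lookup table only looks at the last token, whereas a real LLM conditions on the whole sequence.

```python
import random

# Hypothetical stand-in for a trained model: maps the last token to a
# next-token probability table. A real LLM conditions on the full sequence.
TOY_MODEL = {
    "The": {"quick": 1.0},
    "quick": {"brown": 1.0},
    "brown": {"fox": 1.0},
    "fox": {"jumps": 0.7, "sat": 0.2, "leaped": 0.1},
    "jumps": {"<eos>": 1.0},
    "sat": {"<eos>": 1.0},
    "leaped": {"<eos>": 1.0},
}

def next_token_distribution(tokens):
    """Return {token: probability} for the next position."""
    return TOY_MODEL[tokens[-1]]

def generate(prompt_tokens, max_new_tokens=10, seed=None):
    rng = random.Random(seed)
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        dist = next_token_distribution(tokens)
        candidates = list(dist.keys())
        weights = list(dist.values())
        # Sample one token from the distribution.
        token = rng.choices(candidates, weights=weights, k=1)[0]
        if token == "<eos>":  # stop condition
            break
        tokens.append(token)  # the new token becomes part of the next input
    return tokens

print(generate(["The", "quick", "brown", "fox"], seed=0))
```

Calling `generate` with different seeds can yield “jumps”, “sat”, or “leaped” as the fifth token, which is the same run-to-run variation the rest of the article explains.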

What Training Actually Does

The model learns to produce these probability distributions by training on a massive corpus of text - essentially a large fraction of the written internet, books, code, and academic papers. During training, the model sees a sequence of tokens and tries to predict the next one.

When it is wrong, the error signal flows backward through the network (via backpropagation), nudging billions of internal parameters - the model’s “weights” - very slightly in the direction that would have made the correct prediction more probable.

After an enormous number of these updates, the model’s weights encode something remarkable: a compressed statistical model of how language works. It learns that “The Eiffel Tower is located in” is very frequently followed by “Paris,” that Python function definitions start with “def,” and that a sentence starting “To be or not to” almost certainly continues with “be.”

Crucially, the model does not have a memory of individual training examples. It has internalized statistical patterns. This is why it can generalise to novel inputs - it is not retrieving stored sentences, it is sampling from learned distributions.
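To make the “nudging” concrete, here is a minimal sketch of a single training step. It assumes the standard cross-entropy loss on the correct next token, and, as a simplification, it applies the gradient directly to three logits rather than backpropagating into billions of weights. The function names and the learning rate are illustrative choices, not taken from the article.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def training_step(logits, target_index, learning_rate=0.5):
    """One gradient-descent step on the cross-entropy loss -log p[target].

    For softmax followed by cross-entropy, the gradient with respect to
    each logit is (p_i - 1) for the target and p_i for every other token.
    """
    probs = softmax(logits)
    loss = -math.log(probs[target_index])
    grads = [p - (1.0 if i == target_index else 0.0) for i, p in enumerate(probs)]
    new_logits = [x - learning_rate * g for x, g in zip(logits, grads)]
    return new_logits, loss

vocab = ["Paris", "London", "Rome"]
logits = [0.0, 0.0, 0.0]  # untrained: every token equally likely
for step in range(5):
    logits, loss = training_step(logits, target_index=0)
    print(step, round(loss, 3), [round(p, 3) for p in softmax(logits)])
```

Each step lowers the loss and shifts probability toward “Paris”, which is the small-scale version of making the correct prediction more probable.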

Logits, Softmax, and Why Probabilities Matter

Before the model produces those clean probabilities, it produces raw scores called logits - one real number per token in the vocabulary. These logits are the raw output of the final linear layer in the neural network.

To convert logits to a probability distribution, the model applies the softmax function:

$$P(\text{token}_i) = \frac{e^{\text{logit}_i}}{\sum_j e^{\text{logit}_j}}$$

Softmax does two things. First, it exponentiates each logit, which amplifies differences: a token whose logit is higher by $d$ ends up $e^d$ times more probable, so gaps between logits become ratios between probabilities. Second, it normalizes everything so that all probabilities sum to 1. The result is a valid probability distribution over the entire vocabulary.

To see this in action, imagine the model is predicting the next word after “The quick brown fox”. It generates raw logits for a tiny vocabulary of four words:

| Token | Logit ($x_i$) | Exponent ($e^{x_i}$) | Probability ($P_i$) |
|---|---|---|---|
| “jumps” | 8.3 | 4023.9 | 90.7% |
| “leaped” | 6.0 | 403.4 | 9.1% |
| “sat” | 2.1 | 8.2 | 0.18% |
| “sleeps” | -1.5 | 0.2 | 0.005% |
| Sum | | 4435.7 | 100% |
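The arithmetic in this table can be checked with a short script. It is a minimal sketch using only the standard library; subtracting the maximum logit before exponentiating is a common numerical-stability trick that does not change the result.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    # Subtracting the max leaves the result unchanged but avoids overflow.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

tokens = ["jumps", "leaped", "sat", "sleeps"]
logits = [8.3, 6.0, 2.1, -1.5]

for token, p in zip(tokens, softmax(logits)):
    print(f"{token:>7}: {p:.4%}")
```

The printed values should land close to the percentages in the table, with small differences due to rounding.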

This is the distribution the model actually hands you before sampling. The entire drama of temperature, top-k, and nucleus sampling happens here, in the manipulation of this distribution before a token is drawn from it.
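The article names top-k and nucleus (top-p) sampling without defining them. As a rough sketch of the commonly used definitions: top-k keeps only the k most probable tokens, and nucleus sampling keeps the smallest set of top tokens whose cumulative probability reaches p. Both then renormalize before sampling. The function names and the rounded probabilities below are illustrative assumptions.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    keep = sorted(probs, key=probs.get, reverse=True)[:k]
    total = sum(probs[t] for t in keep)
    return {t: probs[t] / total for t in keep}

def top_p_filter(probs, p):
    """Nucleus sampling: keep the smallest set of top tokens whose
    cumulative probability reaches p, then renormalize."""
    kept, cumulative = [], 0.0
    for token in sorted(probs, key=probs.get, reverse=True):
        kept.append(token)
        cumulative += probs[token]
        if cumulative >= p:
            break
    total = sum(probs[t] for t in kept)
    return {t: probs[t] / total for t in kept}

# Rounded probabilities, chosen for illustration only.
probs = {"jumps": 0.70, "leaped": 0.20, "sat": 0.08, "sleeps": 0.02}
print(top_k_filter(probs, k=2))
print(top_p_filter(probs, p=0.95))
```

With these numbers, `k=2` keeps “jumps” and “leaped”, while `p=0.95` also keeps “sat” because the first two tokens only cover 90% of the mass.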

Temperature

Temperature is the most misunderstood parameter in prompting. It is commonly described as “creativity” or “randomness,” which is technically correct but obscures exactly how it works. Understanding it precisely lets you use it deliberately.

Temperature is a scalar that divides the logits before the softmax is applied:

$$P(\text{token}_i) = \frac{e^{\text{logit}_i / T}}{\sum_j e^{\text{logit}_j / T}}$$

When $T = 1.0$, nothing changes. The probabilities are exactly what the raw softmax produced in our previous example.

By adjusting $T$, we can either “sharpen” or “flatten” the distribution:

| Token | Logit | Prob ($T = 1.0$) | Prob ($T = 0.5$) | Prob ($T = 2.0$) |
|---|---|---|---|---|
| “jumps” | 8.3 | 90.7% | ~99.0% | ~73.0% |
| “leaped” | 6.0 | 9.1% | ~1.0% | ~23.1% |
| “sat” | 2.1 | 0.18% | ~0.0% | ~3.3% |
| “sleeps” | -1.5 | 0.005% | ~0.0% | ~0.5% |

When $T < 1.0$ (e.g., $T = 0.5$), dividing by a number smaller than 1 magnifies the logits, so the already-large gap between the top tokens becomes enormous. Cooling the temperature makes the model nearly deterministic: it overwhelmingly picks the single most likely token.

When $T > 1.0$ (e.g., $T = 2.0$), dividing by a number larger than 1 shrinks the gaps between the logits. Turning up the heat spreads the probability mass more evenly. Previously unlikely tokens become plausible candidates, so the model samples more surprising continuations.

This has a practical implication that is easy to miss: temperature does not change what the model knows or how it reasons. It changes which region of the probability distribution you sample from. At low temperature you are exploiting the model’s most confident predictions. At high temperature you are exploring the tail of the distribution, which contains valid but unusual continuations - as well as incoherent ones.
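The temperature formula is small enough to try directly. The sketch below divides the example logits by $T$ and applies softmax; the function name is an illustrative choice, and the guard against non-positive $T$ is an assumption added to avoid division by zero.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by T, then apply softmax. Requires T > 0."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

tokens = ["jumps", "leaped", "sat", "sleeps"]
logits = [8.3, 6.0, 2.1, -1.5]

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    row = ", ".join(f"{tok}={p:.1%}" for tok, p in zip(tokens, probs))
    print(f"T={t}: {row}")
```

The logits never change between rows. Only the scaling applied before softmax does, which is why temperature affects what gets sampled and not what the model has learned.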

A sensible mental model:

  • $T = 0.0$ to $0.3$: near-deterministic output, good for code generation, factual Q&A, structured data extraction

  • $T = 0.7$ to $1.0$: balanced, good for chat, summarisation, general-purpose use

  • $T = 1.2$ to $2.0$: high diversity, good for brainstorming, creative writing, exploring unusual phrasings - but outputs become increasingly unreliable at the high end
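These ranges can be explored empirically by drawing many samples at each temperature and counting what comes out. In the sketch below, $T = 0$ is treated as greedy decoding (always pick the highest logit), because dividing by zero is undefined. That is a common convention and an assumption here, not a statement about any particular API.

```python
import math
import random
from collections import Counter

def sample_token(tokens, logits, temperature, rng):
    """Draw one token. T = 0 is treated as greedy decoding (argmax),
    an assumed convention, since dividing by zero is undefined."""
    if temperature == 0:
        return tokens[logits.index(max(logits))]
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    weights = [math.exp(x - m) for x in scaled]
    return rng.choices(tokens, weights=weights, k=1)[0]

tokens = ["jumps", "leaped", "sat", "sleeps"]
logits = [8.3, 6.0, 2.1, -1.5]
rng = random.Random(42)

for t in (0.0, 0.3, 1.0, 2.0):
    counts = Counter(sample_token(tokens, logits, t, rng) for _ in range(1000))
    print(f"T={t}: {dict(counts)}")
```

At the low end almost every draw is “jumps”. As $T$ rises, the counts for the other tokens grow, including the ones that make little sense in context.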

Why Outputs are Inherently Probabilistic

If you ask a language model “What is $2 + 2$?” you will get “4” back essentially every time, regardless of temperature, because the probability mass is so concentrated on that token that even high-temperature sampling almost never picks anything else. But for any prompt where multiple continuations are plausible, the model’s outputs are drawn from a probability distribution. Run the same prompt a hundred times and you will get a hundred slightly different outputs, sometimes substantively different.

This is not a bug. It is a direct consequence of how the model was trained. The training data contains enormous variation: different people express the same idea in thousands of different ways. The model has learned this variation. When you ask it to write an email or summarise a document, many different phrasings are reasonable, and the model reflects that.

The probabilistic nature of outputs has several practical consequences that experienced engineers learn the hard way:

  • You cannot assume the model will always produce the same structure in its output, even with the same prompt. Strict output parsing must handle variation.

  • The model can contradict itself across separate calls even with identical input. For anything requiring consistency, either use temperature 0 or implement validation logic.

  • “It gave me a wrong answer” and “it gives wrong answers reliably” are very different failure modes. Always test across multiple runs before concluding a prompt works.
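The points above translate into two habits: validate every reply before trusting it, and measure a prompt over many runs instead of one. The sketch below shows one possible shape for both. `call_model` is a hypothetical stand-in that fails about 20% of the time, and the JSON contract is invented for illustration; a real system would call an actual LLM API at that point.

```python
import json
import random

def call_model(prompt, rng):
    """Hypothetical stand-in for an LLM call. It returns valid JSON most
    of the time and a chatty, unparseable reply otherwise."""
    if rng.random() < 0.8:
        return '{"sentiment": "positive"}'
    return "Sure! The sentiment is positive."

def parse_reply(text):
    """Return the parsed dict, or None if the reply breaks the contract."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None
    if isinstance(data, dict) and data.get("sentiment") in {"positive", "negative"}:
        return data
    return None

def call_with_validation(prompt, rng, max_attempts=3):
    """Retry until a reply validates or the attempts run out."""
    for _ in range(max_attempts):
        parsed = parse_reply(call_model(prompt, rng))
        if parsed is not None:
            return parsed
    return None

def pass_rate(prompt, rng, runs=200):
    """Fraction of single calls whose reply validates."""
    ok = sum(parse_reply(call_model(prompt, rng)) is not None for _ in range(runs))
    return ok / runs

rng = random.Random(7)
print("single-call pass rate:", pass_rate("Classify: great product", rng))
print("with retries:", call_with_validation("Classify: great product", rng))
```

A single successful call says little. The pass rate over many runs is what separates an occasional wrong answer from a reliably wrong one.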

No Meta Knowledge

One thing that trips up newcomers: the model does not “decide” what to say in any cognitive sense. There is no inner monologue, no planning step where it outlines a response before writing it. Each token is generated one at a time, left to right, with no ability to revise earlier tokens once they are committed.

This is why “chain-of-thought prompting” - asking the model to reason step by step before giving a final answer - actually improves accuracy on complex tasks. By generating intermediate reasoning tokens, the model conditions later tokens on that reasoning. The scratch space is real and functional: writing “let me think step by step” into the output genuinely changes the distribution over subsequent tokens in a way that improves correctness. It is not theatrical.

It also explains why the model can “hallucinate” - generate confident-sounding but false text. Given a prompt that contextually expects a specific detail (an author name, a statistic, a URL), the model samples a plausible-sounding continuation from its learned distribution. That distribution was built on real text, but it was not indexed for factual accuracy. A plausible token is not the same as a true one.

What “the model knows” Actually Means

When engineers say a language model “knows” something, they mean the training corpus contained many examples where that piece of information appeared in context, causing the model’s weights to encode a strong prior toward continuations that express it. The model does not have a database of facts. It has a compressed, lossy encoding of co-occurrence statistics across hundreds of billions of tokens.
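A deliberately crude way to see “knowledge as co-occurrence statistics” is a bigram model, which predicts the next word purely from counts of what followed the previous word. This is a toy assumption for illustration: real LLMs are neural networks that condition on the whole context, not count tables, and the three-sentence corpus is made up.

```python
from collections import Counter, defaultdict

corpus = [
    "the eiffel tower is located in paris",
    "the louvre is located in paris",
    "the colosseum is located in rome",
]

# Count how often each word follows each other word.
follow_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        follow_counts[prev][nxt] += 1

def next_word_probs(prev):
    """Relative frequency of each word observed after `prev`."""
    counts = follow_counts[prev]
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

print(next_word_probs("in"))
```

The toy model “knows” that “paris” usually follows “in” only because that pairing was frequent in its corpus. Because it ignores everything before the last word, it would also favor “paris” after “the colosseum is located in”, a plausible continuation that happens to be false.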

This matters in practice. The model is confident and coherent about things it has seen many times in training. It is unreliable about things that appeared rarely or were expressed inconsistently. It will confidently make up details in domains that are underrepresented in its training data, because the token-prediction machinery does not distinguish between “I learned this” and “I am pattern-matching to something plausible.”

Understanding this helps you design prompts appropriately. For tasks grounded in common knowledge, the model is a powerful accelerant. For tasks requiring precise factual recall, especially of specific numbers, citations, or recent events, the model needs to be treated as a starting point that requires verification.

Footnote

A language model is a next-token prediction machine trained by minimizing its error at predicting the next token across a large corpus. It outputs a probability distribution over its vocabulary at each step, and temperature controls how sharply peaked that distribution is before sampling.

Outputs are probabilistic because the model has learned the natural variation in human language. Understanding this - rather than treating the model as a search engine or a knowledge base - is the foundation for using LLMs effectively in production systems.