InkdownInkdown
Start writing

Arpit Bhayani Blogs

336 files·168 subfolders

Shared Workspace

Arpit Bhayani Blogs
001 Ai Topological Sort

004-structure-of-llm-chat

Shared from "Arpit Bhayani Blogs" on Inkdown

Structure of Every LLM Chat

Source: https://arpitbhayani.me/blogs/structure-of-llm-chat Date: 2026-05-13

Role tagging is not cosmetic. It shapes how the model responds, how context is managed across multiple turns, and how application developers constrain and direct model behaviour at a structural level. Understanding this format is the difference between using an LLM and building reliably on top of one.


If you have only ever interacted with a language model through a chat interface, you have seen one layer of abstraction that hides a lot of engineering. Behind the friendly chat window, every interaction with a modern LLM is structured as a list of messages, each tagged with a role.

That role tagging is not cosmetic. It shapes how the model responds, how context is managed across multiple turns, and how application developers constrain and direct model behaviour at a structural level. Understanding this format is the difference between using an LLM and building reliably on top of one.

001-ai-topological-sort.md
tldr.md
002 Temporal Primer
002-temporal-primer.md
tldr.md
003 Rag Production
003-rag-production.md
tldr.md
004 Structure Of Llm Chat
004-structure-of-llm-chat.md
tldr.md
005 How Llms Work
005-how-llms-work.md
tldr.md
006 Monolith Is Distributed System
006-monolith-is-distributed-system.md
tldr.md
007 Defensive Databases
007-defensive-databases.md
tldr.md
008 Bm25
008-bm25.md
tldr.md
009 Join Algorithms
009-join-algorithms.md
tldr.md
010 Venting At Work
010-venting-at-work.md
tldr.md
011 Half Life
011-half-life.md
tldr.md
012 Multi Paxos
012-multi-paxos.md
tldr.md
013 Mysql Replication Internals
013-mysql-replication-internals.md
tldr.md
014 Bloom Filters
014-bloom-filters.md
tldr.md
015 Clock Sync Nightmare
015-clock-sync-nightmare.md
tldr.md
016 Kafka Partitions
016-kafka-partitions.md
tldr.md
017 Product Quantization
017-product-quantization.md
tldr.md
018 Qkv Matrices
018-qkv-matrices.md
tldr.md
019 Deleted Production
019-deleted-production.md
tldr.md
020 How Llm Inference Works
020-how-llm-inference-works.md
tldr.md
021 Blocking Queues
021-blocking-queues.md
tldr.md
022 Heartbeats In Distributed Systems
022-heartbeats-in-distributed-systems.md
tldr.md
023 Cassandra Writes
023-cassandra-writes.md
tldr.md
024 Redis Replication
024-redis-replication.md
tldr.md
025 Arrogant People At Work
025-arrogant-people-at-work.md
tldr.md
026 Cdn Content Replication
026-cdn-content-replication.md
tldr.md
027 Cant Fix Everything Day One
027-cant-fix-everything-day-one.md
tldr.md
028 Emotions At Work
028-emotions-at-work.md
tldr.md
029 Grpc Http2
029-grpc-http2.md
tldr.md
030 Meetings With No Agenda Are A Waste Of Time
030-meetings-with-no-agenda-are-a-waste-of-time.md
tldr.md
031 Growth Is Not About Doing Everything
031-growth-is-not-about-doing-everything.md
tldr.md
032 Career Longevity Vs Job Hopping
032-career-longevity-vs-job-hopping.md
tldr.md
033 Stay Relevant At Higher Salary Levels
033-stay-relevant-at-higher-salary-levels.md
tldr.md
034 Why Consensus
034-why-consensus.md
tldr.md
035 Database Deadlocks
035-database-deadlocks.md
tldr.md
036 Cpu Cache Locality
036-cpu-cache-locality.md
tldr.md
037 Eventual Consistency
037-eventual-consistency.md
tldr.md
038 Dns Udp Tcp
038-dns-udp-tcp.md
tldr.md
039 Masters
039-masters.md
tldr.md
040 Empathy Makes Great Engineers Unstoppable
040-empathy-makes-great-engineers-unstoppable.md
tldr.md
041 Good Mentors Build People
041-good-mentors-build-people.md
tldr.md
042 Always Have Back Burner Projects
042-always-have-back-burner-projects.md
tldr.md
043 Before You Push Back Know What Youre Standing On
043-before-you-push-back-know-what-youre-standing-on.md
tldr.md
044 Be The One They Can Count On
044-be-the-one-they-can-count-on.md
tldr.md
045 How Much People Bet On You
045-how-much-people-bet-on-you.md
tldr.md
046 How To Get Leadership To Say Yes To Your Project
046-how-to-get-leadership-to-say-yes-to-your-project.md
tldr.md
047 Dont Let Your Best Ideas Die In Silence
047-dont-let-your-best-ideas-die-in-silence.md
tldr.md
048 Be Someone Others Want To Work With
048-be-someone-others-want-to-work-with.md
tldr.md
049 Dont Fall For Xy Problem Ask Right Questions
049-dont-fall-for-xy-problem-ask-right-questions.md
tldr.md
050 Biggest Lie Startups Tell Engineers
050-biggest-lie-startups-tell-engineers.md
tldr.md
051 Promotions Are Proactive Not Reactive
051-promotions-are-proactive-not-reactive.md
tldr.md
052 Not Enough To Be Right Learn To Be Heard
052-not-enough-to-be-right-learn-to-be-heard.md
tldr.md
053 No One Ships Alone
053-no-one-ships-alone.md
tldr.md
054 Not Every Mistake Needs A Correction
054-not-every-mistake-needs-a-correction.md
tldr.md
055 Build Influence At Work
055-build-influence-at-work.md
tldr.md
056 Your Soft Skills Arent Soft At All
056-your-soft-skills-arent-soft-at-all.md
tldr.md
057 Experience Before Forming Opinion
057-experience-before-forming-opinion.md
tldr.md
058 Curiosity And High Bias For Action
058-curiosity-and-high-bias-for-action.md
tldr.md
059 Worklog
059-worklog.md
tldr.md
060 Mistakes And Growth
060-mistakes-and-growth.md
tldr.md
061 Own It Instead Of Sweeping It Aside
061-own-it-instead-of-sweeping-it-aside.md
tldr.md
062 Dont Wait Step Up
062-dont-wait-step-up.md
tldr.md
063 Temporary Fix Is Permanent
063-temporary-fix-is-permanent.md
tldr.md
064 Interview Bias And What Sets You Apart
064-interview-bias-and-what-sets-you-apart.md
tldr.md
065 Saying This Isnt My Problem Is A Problem
065-saying-this-isnt-my-problem-is-a-problem.md
tldr.md
066 Okr
066-okr.md
tldr.md
067 Miscommunication
067-miscommunication.md
tldr.md
068 When In Doubt Code It Out
068-when-in-doubt-code-it-out.md
tldr.md
069 Follow Up Without Annoying People
069-follow-up-without-annoying-people.md
tldr.md
070 Lead Projects That Land
070-lead-projects-that-land.md
tldr.md
071 Abstract Thinking Skill Next Decade
071-abstract-thinking-skill-next-decade.md
tldr.md
072 We Engineers Suck At Task Estimation
072-we-engineers-suck-at-task-estimation.md
tldr.md
073 Shiny Object Syndrome In Tech
073-shiny-object-syndrome-in-tech.md
tldr.md
074 3p
074-3p.md
tldr.md
075 Leverage The Equilibrium
075-leverage-the-equilibrium.md
tldr.md
076 On Demand Container Loading In Aws Lambda
076-on-demand-container-loading-in-aws-lambda.md
tldr.md
077 Sql Has Problems We Can Fix Them Pipe Syntax In Sql
077-sql-has-problems-we-can-fix-them-pipe-syntax-in-sql.md
tldr.md
078 Nanolog A Nanosecond Scale Logging System
078-nanolog-a-nanosecond-scale-logging-system.md
tldr.md
079 Best Resource Is Mythical
079-best-resource-is-mythical.md
tldr.md
080 Wtf The Who To Follow Service At Twitter
080-wtf-the-who-to-follow-service-at-twitter.md
tldr.md
081 Know A Lot
081-know-a-lot.md
tldr.md
082 Out Of Syllabus
082-out-of-syllabus.md
tldr.md
083 Negotiate The Offer
083-negotiate-the-offer.md
tldr.md
084 Never Bad Mouth Your Ex Exployer
084-never-bad-mouth-your-ex-exployer.md
tldr.md
085 Culture Fit
085-culture-fit.md
tldr.md
086 Quantification In Resume
086-quantification-in-resume.md
tldr.md
087 Hiring Is Unfair
087-hiring-is-unfair.md
tldr.md
088 Questions For Interviewers
088-questions-for-interviewers.md
tldr.md
089 Collaboration Communication
089-collaboration-communication.md
tldr.md
090 Out Of Vicious Interview Cycle
090-out-of-vicious-interview-cycle.md
tldr.md
091 Pitch Projects Not Ideas
091-pitch-projects-not-ideas.md
tldr.md
092 Read Design Docs
092-read-design-docs.md
tldr.md
093 Read Rca Docs
093-read-rca-docs.md
tldr.md
094 Start Generalist
094-start-generalist.md
tldr.md
095 Do Not Rely On Summaries
095-do-not-rely-on-summaries.md
tldr.md
096 Structure Your Design Interviews
096-structure-your-design-interviews.md
tldr.md
097 Title Inflation
097-title-inflation.md
tldr.md
098 Find Your Own Project
098-find-your-own-project.md
tldr.md
099 Six Pointers To Crack Coding And Design Interviews
099-six-pointers-to-crack-coding-and-design-interviews.md
tldr.md
100 Keep Yourself Unblocked
100-keep-yourself-unblocked.md
tldr.md
101 Genetic Knapsack
101-genetic-knapsack.md
tldr.md
102 Pseudorandom Number Generation Lfsr
102-pseudorandom-number-generation-lfsr.md
tldr.md
103 How Indexes Work On Partitioned And Sharded Data
103-how-indexes-work-on-partitioned-and-sharded-data.md
tldr.md
104 Some Data Partitioning Strategies For Distributed Data Stores
104-some-data-partitioning-strategies-for-distributed-data-stores.md
tldr.md
105 Data Partitioning
105-data-partitioning.md
tldr.md
106 Leaderless Replication
106-leaderless-replication.md
tldr.md
107 Conflict Resolution
107-conflict-resolution.md
tldr.md
108 Conflict Detection
108-conflict-detection.md
tldr.md
109 Multi Master Replication
109-multi-master-replication.md
tldr.md
110 Monotonic Reads
110-monotonic-reads.md
tldr.md
111 Read Your Write Consistency
111-read-your-write-consistency.md
tldr.md
112 Handling Outages Master Replica
112-handling-outages-master-replica.md
tldr.md
113 Replication Formats
113-replication-formats.md
tldr.md
114 Replication Strategies
114-replication-strategies.md
tldr.md
115 Master Replica Replication
115-master-replica-replication.md
tldr.md
116 Durability
116-durability.md
tldr.md
117 Isolation
117-isolation.md
tldr.md
118 Atomicity
118-atomicity.md
tldr.md
119 Consistency
119-consistency.md
tldr.md
120 Architectures In Distributed Systems
120-architectures-in-distributed-systems.md
tldr.md
121 Mistaken Beliefs Of Distributed Systems
121-mistaken-beliefs-of-distributed-systems.md
tldr.md
122 Fork Bomb
122-fork-bomb.md
tldr.md
123 Chained Operators Python
123-chained-operators-python.md
tldr.md
124 Taxonomy On Sql
124-taxonomy-on-sql.md
tldr.md
125 The Weird Walrus
125-the-weird-walrus.md
tldr.md
126 Fully Persistent Arrays
126-fully-persistent-arrays.md
tldr.md
127 Persistent Data Structures Introduction
127-persistent-data-structures-introduction.md
tldr.md
128 Constant Folding Python
128-constant-folding-python.md
tldr.md
129 String Interning Python
129-string-interning-python.md
tldr.md
130 Recursion Visualizer Python
130-recursion-visualizer-python.md
tldr.md
131 Flajolet Martin
131-flajolet-martin.md
tldr.md
132 2q Cache
132-2q-cache.md
tldr.md
133 Israeli Queues
133-israeli-queues.md
tldr.md
134 1d Terrain
134-1d-terrain.md
tldr.md
135 Jaccard Minhash
135-jaccard-minhash.md
tldr.md
136 Ts Smoothing
136-ts-smoothing.md
tldr.md
137 Lfu
137-lfu.md
tldr.md
138 Morris Counter
138-morris-counter.md
tldr.md
139 Slowsort
139-slowsort.md
tldr.md
140 Bitcask
140-bitcask.md
tldr.md
141 Phi Accrual
141-phi-accrual.md
tldr.md
142 10x Engineer
142-10x-engineer.md
tldr.md
143 Decipher Repeated Key Xor
143-decipher-repeated-key-xor.md
tldr.md
144 Decipher Single Xor
144-decipher-single-xor.md
tldr.md
145 Python Iterable Integers
145-python-iterable-integers.md
tldr.md
146 Inheritance C
146-inheritance-c.md
tldr.md
147 Rum
147-rum.md
tldr.md
148 Consistent Hashing
148-consistent-hashing.md
tldr.md
149 Python Caches Integers
149-python-caches-integers.md
tldr.md
150 Fractional Cascading
150-fractional-cascading.md
tldr.md
151 Copy On Write
151-copy-on-write.md
tldr.md
152 Midpoint Insertion Caching Strategy
152-midpoint-insertion-caching-strategy.md
tldr.md
153 Fsm Python
153-fsm-python.md
tldr.md
154 Bayesian Average
154-bayesian-average.md
tldr.md
155 Sliding Window Ratelimiter
155-sliding-window-ratelimiter.md
tldr.md
156 Idf
156-idf.md
tldr.md
157 Better Programmer
157-better-programmer.md
tldr.md
158 Python Prompts
158-python-prompts.md
tldr.md
159 Rule 30 Cellular Automata
159-rule-30-cellular-automata.md
tldr.md
160 Function Overloading
160-function-overloading.md
tldr.md
161 Isolation Forest
161-isolation-forest.md
tldr.md
162 Image Steganography
162-image-steganography.md
tldr.md
163 Long Integers Python
163-long-integers-python.md
tldr.md
164 I Changed My Python
164-i-changed-my-python.md
tldr.md
165 Benchmark And Compare Pagination Approach In Mongodb
165-benchmark-and-compare-pagination-approach-in-mongodb.md
tldr.md
166 Mongodb Cursor Skip Is Slow
166-mongodb-cursor-skip-is-slow.md
tldr.md
167 Fast And Efficient Pagination In Mongodb
167-fast-and-efficient-pagination-in-mongodb.md
tldr.md
168 Making Http Requests Using Netcat
168-making-http-requests-using-netcat.md
tldr.md

Why Roles Exist at All

Base language models - the kind trained purely on next-token prediction over raw text - do not have a natural concept of “conversation.” They continue text. If you feed a base model the string “What is the capital of France?”, it might continue with “What is the capital of Germany? What is the capital of Spain?” because that pattern appears frequently in quiz and FAQ content. The model is doing exactly what it was trained to do: predict plausible continuations.

Instruction-following models (the kind you interact with in production APIs) are fine-tuned on data formatted as conversations. During this fine-tuning, the model sees thousands of examples where a system context is followed by a user request and then a high-quality assistant response. The model learns to treat these structural cues as meaningful. It learns that text following a system prefix should be treated as persistent instructions, that text following a user prefix is a request to respond to, and that it is generating the text that follows the assistant prefix.

The three-role format is therefore not arbitrary. It emerged from how instruction tuning works, and every production-grade model from OpenAI, Google, Anthropic, and Meta has been trained to respect it.

The System Prompt

The system prompt is the foundational instruction layer of a conversation. It is written by the application developer, not the end user, and it executes before any user interaction takes place.

A well-crafted system prompt does several things:

  • Defines the model’s persona and role (“You are a senior data analyst…”).

  • Specifies output format constraints (“Always respond in valid JSON

    with the schema: …”).

  • Establishes scope boundaries (“Only answer questions about our product documentation. Politely decline off-topic requests.”).

  • Sets behavioural rules (“Never speculate. If you are uncertain, say so explicitly.”).

  • Injects background context the model needs (“The current date is… The user’s subscription tier is…”).

The system prompt is processed before the first user message and its content persists through the entire conversation in the model’s context window. It is the most reliable lever you have for controlling model behaviour consistently across all turns.

One critical insight: the system prompt does not have magic authority in the way a configuration file has authority over software. The model has learned to attend to system content heavily because of how it was trained, but it is ultimately still performing token prediction.

A sufficiently adversarial user prompt can sometimes cause the model to deviate from system instructions - this is the class of vulnerabilities known as prompt injection. Never trust that a system prompt alone is a security boundary. Validate and sanitize outputs programmatically when the stakes are high.

Here is a minimal but structurally sound system prompt for a customer support application:

Plain text

Notice that it defines role, scope, fallback behaviour, confidentiality constraints, and style. These four categories cover most of what a useful system prompt needs to specify.

The User Turn

The user turn is the input from the person or the system acting as a person. In a simple chatbot, this is what the human typed. In a programmatic pipeline, this is often constructed by application code - injecting a retrieved document, formatted data, or a templated instruction.

A common mistake is treating the user turn as a place to put everything. Developers sometimes cram persona, instructions, data, and the actual question into a single user message because they are not using the system prompt at all.

This works, to a point, but it conflates different layers of intent. The model is somewhat sensitive to where instructions come from, and instructions in the user turn carry less persistent authority than those in the system prompt. More importantly, when you start managing multi-turn conversations, conflation becomes a maintenance problem.

The user turn should contain:

  • The actual request or question.
  • Any data or documents that are specific to this request (e.g. “Here is the PDF text - summarise it.”).
  • Context that is specific to this turn (e.g. “Given the plan we discussed above…”).

It should not contain:

  • Persistent behavioural instructions. Those belong in the system prompt.
  • Security-sensitive constraints. A user can modify their own messages; they cannot modify the system prompt (in a properly built application).

The Assistant Turn

The assistant turn is the model’s previous response, injected back into the conversation for the next request. This is the mechanism that gives a language model what looks like memory in a multi-turn conversation.

Here is the part that surprises many developers: the model has no persistent state between API calls. Every call is stateless. The model does not remember the previous turn - you have to send it back. When you make a second API call in a conversation, your application must include the entire conversation history: system prompt, first user message, first assistant response, second user message, and so on. The model attends to all of it to generate the next response.

This has immediate engineering consequences:

  • Token costs grow linearly with conversation length. A 20-turn conversation sends approximately 20x more tokens per call than a single-turn call, because the entire history is in every request.

  • Context windows

    are finite budgets. Once the cumulative history exceeds the model’s context window (measured in tokens), something has to give. Some APIs silently truncate the oldest messages. Others return an error. Your application needs a strategy - sliding window, summarization, or selective pruning - before it needs one.

  • You control the history. Nothing forces you to inject the exact unmodified model response from the previous turn. Sophisticated applications summarize, compress, or filter history before injecting it. You can also inject synthetic assistant turns to steer the model’s subsequent behavior - a technique sometimes called “prefilling.”

Here is what the message list looks like at the API level for a two-turn conversation:

Plain text

The model receives all four messages as context. Its response to the final user message will be informed by everything above it - including the definition it already gave. This is why follow-up questions work at all.

How Format Maps to Raw Text

Models do not natively understand JSON or Python data structures. Before the model ever sees the message list, the API serializes it into a flat text sequence using a chat template. The format varies by model family. OpenAI’s ChatML format looks like this:

Plain text

The final <|im_start|>assistant header with no closing tag is the generation prompt - the cue that tells the model to start producing the assistant’s response. The model continues the text from this point.

Llama-based models use a different format with [INST] and [/INST] markers. Anthropic’s Claude uses \n\nHuman: and \n\nAssistant: delimiters internally. The principle is the same: structured markers that the model was trained to respect, serialized into the flat token sequence the model actually sees.

When you use a hosted API, all of this serialization happens invisibly. When you run models locally using tools like llama.cpp or Ollama, applying the correct chat template yourself is your responsibility. Getting it wrong does not produce an error - it produces subtly degraded output, because the model’s behavior was fine-tuned against a specific format.

Practical Patterns for Production

A few patterns that experienced practitioners use consistently:

Separate persona from constraints. A system prompt that mixes “you are a friendly assistant” with “never discuss competitor products” is harder to maintain and debug than one with explicit sections. Use clear structural separation, even in plain text.

Test system prompt changes in isolation. The system prompt is a shared dependency for every conversation in your application. Changes to it are breaking changes. Version-control your system prompts and evaluate them on a representative set of test prompts before deploying.

Treat the user turn as untrusted input. Everything in the user turn could, in principle, be an attempt to override system instructions. This is not paranoia - it is the correct security model. Never interpolate user input directly into your system prompt. If you need to include user-provided data in the system prompt (a document they uploaded, for example), validate and sanitize it first.

Keep context history manageable. A context window of 128,000 tokens sounds generous until you realize that 20 turns of a rich conversation, with a substantial system prompt and retrieved documents, can fill it. Build context management into your architecture from the start, not as a retrofit.

Use assistant prefilling deliberately. You can inject the beginning of the assistant response to constrain the model’s output format. For example, if you need the model to always start with a JSON object, begin the assistant turn with { in your API call. The model will continue from that starting point. This is a low-overhead way to enforce structure without relying entirely on instruction following.

Footnote

Every interaction with a production LLM is a structured list of messages with roles - system, user, and assistant. The system prompt is the developer’s persistent instruction layer. The user turn is the request. The assistant turn is previous model output re-injected as context, because the model is stateless between calls.

Understanding this format and its constraints - token costs, context limits, injection risks - is foundational to building reliable applications on top of language models.