Shared from "Arpit Bhayani Blogs" on Inkdown

What Matters in Production RAG

Source: https://arpitbhayani.me/blogs/rag-production Date: 2026-05-15

Most of us build RAG the same way: follow a tutorial that embeds a handful of PDFs, stores the vectors in a local Chroma instance, and chains everything together with LangChain (if that's still a thing). The demo works. The answer looks reasonable. Then you take it to production and it falls apart in quiet, hard-to-diagnose ways.



This article is about what comes after the demo. It covers the fundamentals of how RAG actually works under the hood, the engineering challenges of keeping an index fresh and correct over time, and how to build the observability layer that lets you answer “why did the system retrieve that?” when things go wrong. None of these topics are exotic. All of them are consistently underbuilt in practice.

RAG Basics

The core idea is simple: instead of asking an LLM to answer from memory, you retrieve relevant documents at query time and inject them into the prompt as context. The model’s role shifts from “know everything” to “reason over what you are given.” This architectural choice has made RAG the dominant pattern for grounding LLMs in specific, current, or proprietary knowledge.

A RAG system has two distinct pipelines that run at different times.

The indexing pipeline runs offline (or in the background). It ingests raw documents, splits them into chunks, converts each chunk into a dense vector embedding, and stores those vectors in a vector database alongside metadata and the original text. This pipeline populates the knowledge base the retriever will search at query time.

The query pipeline runs online, per user request. It takes the user’s question, embeds it using the same model used during indexing, searches the vector database for the nearest chunks, assembles those chunks into a context window, and sends the whole thing to the LLM as a prompt.

The math underlying the retrieval step is cosine similarity. Two vectors are considered close if the angle between them is small:

$$\text{similarity}(q, d) = \frac{q \cdot d}{\|q\| \cdot \|d\|}$$

Here $q$ is the query embedding and $d$ is a document chunk embedding. In practice, most vector databases use approximate nearest neighbor (ANN) search rather than exact exhaustive search, because scanning billions of vectors at query time is prohibitively slow. HNSW (Hierarchical Navigable Small World) is the dominant algorithm: it builds a layered proximity graph during indexing that allows retrieval in roughly $O(\log n)$ time at the cost of a small, tunable recall loss.
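As a minimal sketch of the retrieval math, the snippet below computes cosine similarity in plain Python and runs the exact, exhaustive search that ANN indexes approximate. The three-dimensional vectors and the chunk IDs are made up for illustration; real embeddings have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(q: list[float], d: list[float]) -> float:
    """Cosine of the angle between q and d; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(q, d))
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_d = math.sqrt(sum(b * b for b in d))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0
    return dot / (norm_q * norm_d)


def top_k(query: list[float], chunks: dict[str, list[float]], k: int = 3) -> list[tuple[str, float]]:
    """Exact search: score every chunk against the query and keep the k best."""
    scored = [(chunk_id, cosine_similarity(query, vec)) for chunk_id, vec in chunks.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy index: chunk ID -> embedding.
index = {
    "refund-policy#0": [0.9, 0.1, 0.0],
    "refund-policy#1": [0.7, 0.7, 0.0],
    "shipping-faq#0": [0.0, 0.2, 0.9],
}
results = top_k([1.0, 0.0, 0.0], index, k=2)
```

The exhaustive loop is linear in the number of chunks, which is exactly the cost an HNSW graph is built to avoid.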

Chunking

Chunking is where most RAG systems silently fail. The intuition is straightforward: chunks need to be small enough that retrieved text is specific and relevant, but large enough that they contain complete thoughts. In practice, getting this right requires understanding your document corpus.

The naive approach is fixed-size chunking at some character or token count, say 512 tokens with a 128-token overlap. It is simple and fast. It is also routinely wrong. Fixed-size chunking cuts sentences in half, separates questions from their answers in FAQ documents, and splits code across function boundaries.
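To make the failure concrete, here is a small sketch of fixed-size chunking with overlap. It treats whitespace-separated words as tokens, which is an assumption made to keep the example dependency-free; a real pipeline would count tokens with the embedding model's tokenizer.

```python
def fixed_size_chunks(tokens: list[str], size: int = 512, overlap: int = 128) -> list[list[str]]:
    """Slide a window of `size` tokens over the input, stepping by size - overlap."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the window already reached the end of the input
    return chunks


faq = "Q: How do I reset my password? A: Open settings and choose reset."
chunks = fixed_size_chunks(faq.split(), size=8, overlap=2)
```

With a window of 8 words, the first chunk ends at "A:" and the answer itself lands in the second chunk, so a query that matches the question retrieves a chunk with no answer in it.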

The approaches that actually work in production:

  • Recursive splitting: split on paragraphs first, then sentences, then characters as a fallback. This preserves semantic structure far better than character counting.

  • Semantic chunking: embed consecutive sentences and insert chunk boundaries where cosine similarity between adjacent sentences drops below a threshold. This identifies genuine topic shifts rather than arbitrary position boundaries.

  • Structure-aware splitting: for code, split at function or class boundaries using AST parsing. For legal documents, split at clause boundaries. For contracts, include the parent section heading with every child chunk.
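Recursive splitting, the first approach in the list above, can be sketched in a few lines. This is an illustrative version, not any particular library's implementation: the function name, the character budget, and the two regex separators are assumptions.

```python
import re

# (split pattern, joiner used when merging small pieces back together)
SPLITTERS = [
    (r"\n\s*\n", "\n\n"),      # paragraphs first
    (r"(?<=[.!?])\s+", " "),   # then sentences
]


def recursive_split(text: str, max_chars: int, level: int = 0) -> list[str]:
    text = text.strip()
    if not text:
        return []
    if len(text) <= max_chars:
        return [text]
    if level >= len(SPLITTERS):
        # Fallback: hard character cut when no separator is left to try.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    pattern, joiner = SPLITTERS[level]
    chunks: list[str] = []
    current = ""
    for piece in re.split(pattern, text):
        piece = piece.strip()
        if not piece:
            continue
        candidate = current + joiner + piece if current else piece
        if len(candidate) <= max_chars:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)
        current = ""
        if len(piece) <= max_chars:
            current = piece
        else:
            # The piece alone is too large: retry with the next, finer separator.
            chunks.extend(recursive_split(piece, max_chars, level + 1))
    if current:
        chunks.append(current)
    return chunks


doc = (
    "Refunds are issued within 14 days. Contact support to start one.\n\n"
    "Shipping takes 3 to 5 days."
)
chunks = recursive_split(doc, max_chars=40)
```

The first paragraph is too long for the budget, so it is split at the sentence boundary; the second paragraph fits and stays whole.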

Always store metadata with each chunk: the source document ID, section heading, page number, creation timestamp, and a content hash. You will need all of these later, both for filtering and for keeping the index current.
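One possible shape for that per-chunk record, as a sketch: the class and field names are illustrative, and SHA-256 is an assumed choice of hash.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    chunk_index: int
    text: str
    section_heading: str
    page_number: int
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    @property
    def content_hash(self) -> str:
        """SHA-256 of the chunk text, used later for change detection."""
        return hashlib.sha256(self.text.encode("utf-8")).hexdigest()

    def metadata(self) -> dict:
        """Payload stored next to the vector; every key is a filterable field."""
        return {
            "doc_id": self.doc_id,
            "section_heading": self.section_heading,
            "page_number": self.page_number,
            "created_at": self.created_at.isoformat(),
            "content_hash": self.content_hash,
        }


chunk = Chunk("refund-policy", 0, "Refunds are issued within 14 days.", "Refunds", 3)
payload = chunk.metadata()
```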

Embedding Models and the Model-Lock Problem

The embedding model you choose during indexing is a ‘long-term commitment’ (sorry, could not come up with a better wording here). Every vector in your index was produced by that model. If you switch models, every vector is now incommensurable with the new query embeddings, and you must re-embed the entire corpus.

Production-grade options as of mid-2026:

  • text-embedding-3-large (OpenAI): 3072-dimensional, best general-purpose recall, but API-dependent

  • embed-v3 (Cohere): strong multilingual performance, supports truncation modes

  • bge-large-en-v1.5 (BAAI): open-source, deployable locally, competitive with the above for English

  • e5-mistral-7b-instruct: instruction-tuned, excellent for asymmetric retrieval tasks

RAG Indexing Pipelines

Here is where most tutorials stop and most production problems begin. Your knowledge base is not static. Documents are updated, retracted, corrected, superseded, and deleted. If your indexing pipeline cannot handle these operations correctly, your RAG system will quietly serve stale, contradictory, or deleted information with full confidence.

Chunk Identity

A document that is split into 15 chunks produces 15 separate vectors, each stored with its own ID. When that document is updated, you cannot simply update a row as you would in a relational database. You need to:

  1. Identify all 15 chunk IDs that belong to the old version of the document
  2. Delete them from the vector store
  3. Re-chunk the updated document (which may now produce 17 chunks)
  4. Re-embed and insert the 17 new chunks

This requires a mapping layer that vector databases do not provide natively. The standard approach is a document registry, a simple relational table (Postgres works fine) that maps each doc_id to the list of chunk vector IDs currently in the index:

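A minimal sketch of such a registry follows. It uses SQLite from the standard library instead of Postgres so that it runs without a server, and the table and column names are assumptions chosen for illustration.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE document_registry (
        doc_id       TEXT PRIMARY KEY,
        content_hash TEXT NOT NULL,
        chunk_ids    TEXT NOT NULL,  -- JSON array of vector IDs in the index
        updated_at   TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
    )
    """
)


def register(doc_id: str, content_hash: str, chunk_ids: list[str]) -> None:
    """Insert or overwrite the registry row for one document."""
    conn.execute(
        "INSERT OR REPLACE INTO document_registry (doc_id, content_hash, chunk_ids) "
        "VALUES (?, ?, ?)",
        (doc_id, content_hash, json.dumps(chunk_ids)),
    )


def chunk_ids_for(doc_id: str) -> list[str]:
    """Return the vector IDs currently indexed for a document (empty if unknown)."""
    row = conn.execute(
        "SELECT chunk_ids FROM document_registry WHERE doc_id = ?", (doc_id,)
    ).fetchone()
    return json.loads(row[0]) if row else []


register("policy", "hash-v1", ["policy:v1:0", "policy:v1:1"])
```

In Postgres the `chunk_ids` column would more naturally be a `TEXT[]` or a separate child table; the JSON string here is only a workaround for SQLite.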

When a document update arrives, the flow is:

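The four steps can be sketched end to end with in-memory stand-ins. Everything here is hypothetical scaffolding: `vector_store` and `registry` are plain dicts, `fake_embed` replaces a real embedding call, and `split_paragraphs` replaces a real chunker.

```python
vector_store: dict[str, list[float]] = {}  # vector ID -> embedding
registry: dict[str, list[str]] = {}        # doc_id -> vector IDs currently indexed


def fake_embed(text: str) -> list[float]:
    """Placeholder for a real embedding call."""
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


def split_paragraphs(text: str) -> list[str]:
    """Placeholder chunker: one chunk per non-empty paragraph."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def replace_document(doc_id: str, new_text: str, version: int) -> list[str]:
    # 1. Identify the chunk IDs that belong to the old version.
    old_ids = registry.get(doc_id, [])
    # 2. Delete them from the vector store.
    for vector_id in old_ids:
        vector_store.pop(vector_id, None)
    # 3. Re-chunk the updated document (the chunk count may change).
    chunks = split_paragraphs(new_text)
    # 4. Re-embed and insert the new chunks, then record them in the registry.
    new_ids = []
    for i, chunk in enumerate(chunks):
        vector_id = f"{doc_id}:v{version}:{i}"
        vector_store[vector_id] = fake_embed(chunk)
        new_ids.append(vector_id)
    registry[doc_id] = new_ids
    return new_ids


replace_document("policy", "Refunds take 14 days.\n\nShipping takes 5 days.", version=1)
```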
Avoiding Unnecessary Re-Embedding

Re-embedding is expensive. A 100,000-document corpus with an average of 10 chunks per document means 1 million chunks to embed for a full rebuild (1 million API calls if you send one chunk per request). You want to re-embed only what changed.

Content hashing is the first gate. When a document arrives, compute a hash of its content. If the hash matches what is in the registry, skip it entirely. Most “updates” in practice are metadata changes (a title change, a timestamp update) that do not affect the text content and therefore do not require re-embedding.

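A sketch of that first gate, assuming SHA-256 and a dict in place of the registry table. Collapsing whitespace before hashing is an extra assumption, added so that reformatting alone does not trigger a re-embed.

```python
import hashlib

known_hashes: dict[str, str] = {}  # doc_id -> hash of the last indexed content


def content_hash(text: str) -> str:
    """Hash the text after collapsing runs of whitespace."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def needs_reembedding(doc_id: str, text: str) -> bool:
    """Return False when the incoming content matches what is already indexed."""
    new_hash = content_hash(text)
    if known_hashes.get(doc_id) == new_hash:
        return False
    # A real pipeline would record the hash only after embedding succeeds.
    known_hashes[doc_id] = new_hash
    return True
```

Only the body text goes into the hash, so a title or timestamp change that leaves the body untouched is skipped.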

For large documents, you can go further: hash at the chunk level, and re-embed only the chunks whose content changed. This is more complex to implement but pays off for long, mostly-stable documents like regulatory filings or technical manuals where only a few sections change per update cycle.
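Chunk-level change detection reduces to a set difference over hashes. The sketch below assumes the registry can return the set of chunk hashes from the previous version; the function names are illustrative.

```python
import hashlib


def chunk_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def diff_chunks(old_hashes: set[str], new_chunks: list[str]) -> tuple[list[str], set[str]]:
    """Return (chunks that need embedding, hashes of stale chunks to delete)."""
    new_by_hash = {chunk_hash(c): c for c in new_chunks}
    to_embed = [c for h, c in new_by_hash.items() if h not in old_hashes]
    to_delete = old_hashes - set(new_by_hash)
    return to_embed, to_delete


previous = {chunk_hash("Section 1"), chunk_hash("Section 2"), chunk_hash("Section 3")}
to_embed, to_delete = diff_chunks(previous, ["Section 1", "Section 2 (revised)", "Section 3"])
```

One caveat: this only pays off when chunk boundaries are stable across versions. With fixed-size chunking, an edit near the top of a document shifts every later boundary and changes every hash.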

Index Versioning and No-Downtime Updates

The most underappreciated failure mode in RAG is the partial update. You start reindexing 10,000 documents, the pipeline crashes at document 6,000, and now your index is in flux: some documents are at version N, some at version N+1, and the seam between them is invisible to the retrieval layer.

The safe pattern is alias-based deployment, borrowed directly from Elasticsearch operations:

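The pattern can be illustrated with a toy in-memory store. `IndexStore` and `deploy` are hypothetical names, not a real vector database API; the point is only that queries resolve an alias while the build happens under a separate concrete name.

```python
class IndexStore:
    """Toy stand-in for a vector database that supports aliases."""

    def __init__(self) -> None:
        self.indexes: dict[str, dict[str, list[float]]] = {}
        self.aliases: dict[str, str] = {}

    def resolve(self, alias: str) -> dict[str, list[float]]:
        """Queries always go through the alias, never a concrete index name."""
        return self.indexes[self.aliases[alias]]


def deploy(store: IndexStore, alias: str, new_name: str, vectors: dict, validate) -> bool:
    store.indexes[new_name] = vectors   # 1. build the new index in full
    if not validate(vectors):           # 2. check it against a benchmark
        del store.indexes[new_name]     #    failed: discard it, alias untouched
        return False
    store.aliases[alias] = new_name     # 3. a single assignment swaps the alias
    return True                         # 4. the old index stays for rollback


store = IndexStore()
store.indexes["docs_v1"] = {"a": [1.0]}
store.aliases["docs"] = "docs_v1"
deployed = deploy(store, "docs", "docs_v2", {"a": [0.5], "b": [0.2]}, validate=lambda v: len(v) >= 2)
```

Rollback is the same single assignment in reverse: point the alias back at the retained index.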

You build the new index completely, validate it against a benchmark query set, then atomically swap the alias. The old index stays around for a configurable retention period in case rollback is needed. No query ever sees a partial index.

For systems that cannot tolerate rebuild latency (the index is too large, or documents need to be available within seconds of ingestion), incremental upsert is the alternative. Upsert appends new vectors without touching existing ones. Manage concurrent visibility by including a valid_from timestamp (similar to Postgres MVCC) in metadata and filtering queries to only return chunks where valid_from <= NOW(). This lets you stage new content before it becomes live.

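A sketch of the visibility filter, applied here in Python after retrieval for clarity. In a real deployment the same condition would normally be pushed into the vector database as a metadata filter; the dict layout is an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def visible_chunks(chunks: list[dict], now: Optional[datetime] = None) -> list[dict]:
    """Keep only chunks whose valid_from timestamp has already passed."""
    now = now or datetime.now(timezone.utc)
    return [c for c in chunks if c["valid_from"] <= now]


now = datetime(2026, 5, 15, 12, 0, tzinfo=timezone.utc)
candidates = [
    {"chunk_id": "policy#0", "valid_from": now - timedelta(hours=1)},    # already live
    {"chunk_id": "policy#1", "valid_from": now + timedelta(minutes=5)},  # staged
]
live = visible_chunks(candidates, now=now)
```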
Embedding Model Upgrades

When a better embedding model is released, every vector in your index is now wrong in a specific sense: it was produced by a different model, so its geometric position in the vector space is incommensurable with query embeddings from the new model. You cannot query with model B and retrieve vectors from model A.

This means embedding model upgrades require full corpus re-embedding. In practice, the migration strategy is:

  1. Build a shadow index with the new model running in parallel
  2. Route a small percentage of queries to the shadow index and compare results
  3. Gradually shift traffic using the alias pattern above
  4. Keep the old index warm until you are confident in the new one
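For step 2, one simple way to compare the two indexes is top-k overlap on the sampled queries. This sketch assumes both indexes share chunk IDs (same chunking, different embeddings); the threshold and the function names are illustrative.

```python
def overlap_at_k(primary_ids: list[str], shadow_ids: list[str], k: int) -> float:
    """Fraction of the top-k chunk IDs that both indexes returned."""
    if k <= 0:
        raise ValueError("k must be positive")
    return len(set(primary_ids[:k]) & set(shadow_ids[:k])) / k


def compare_indexes(sampled: dict, k: int = 3, alert_below: float = 0.5):
    """`sampled` maps query text -> (primary result IDs, shadow result IDs)."""
    scores = {q: overlap_at_k(p, s, k) for q, (p, s) in sampled.items()}
    divergent = sorted(q for q, score in scores.items() if score < alert_below)
    return scores, divergent


sampled = {
    "refund window": (["a", "b", "c"], ["b", "a", "d"]),
    "shipping time": (["x", "y", "z"], ["p", "q", "x"]),
}
scores, divergent = compare_indexes(sampled)
```

Low overlap is not automatically bad, since the new model may simply be retrieving better chunks. It tells you which queries to inspect by hand before shifting traffic.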

The operational cost of this is why embedding model choice deserves more up-front thought than it typically gets. Treat it like a database schema migration: painful to undo, so choose carefully.

A practical safeguard: store the embedding model name and version in every chunk’s metadata. When querying, assert that the stored model matches the query model before returning results. This prevents the silent failure mode where model drift goes undetected.
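That check is only a few lines. In this sketch the metadata key `embedding_model` and the exception type are assumptions; the idea is to fail loudly instead of returning geometrically meaningless neighbors.

```python
class EmbeddingModelMismatch(RuntimeError):
    """Raised when a retrieved chunk was embedded with a different model."""


def assert_same_model(results: list[dict], query_model: str) -> list[dict]:
    for chunk in results:
        stored = chunk["metadata"].get("embedding_model")
        if stored != query_model:
            raise EmbeddingModelMismatch(
                f"chunk {chunk['chunk_id']} was embedded with {stored!r}, "
                f"but the query used {query_model!r}"
            )
    return results


results = [
    {"chunk_id": "policy#0", "metadata": {"embedding_model": "bge-large-en-v1.5"}},
    {"chunk_id": "policy#1", "metadata": {"embedding_model": "bge-large-en-v1.5"}},
]
checked = assert_same_model(results, "bge-large-en-v1.5")
```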

Observability and Retrieval Tracing

Production RAG systems fail in ways that look like LLM problems but are actually retrieval problems. The answer is confidently wrong not because the model hallucinated, but because it faithfully reasoned over the wrong context. Without end-to-end tracing, you cannot distinguish these two failure modes.

The standard observability stack for distributed systems (traces, metrics, logs via OpenTelemetry) applies here, but a RAG pipeline has primitives that OTel’s generic span model does not capture natively. You need to instrument these explicitly.

The Span Architecture

A complete RAG request should produce a trace with these spans, nested in a single root span:

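The shape of such a trace can be sketched without any tracing library. The span names below (`rag.request`, `rag.embed_query`, `rag.retrieve`, `rag.generate`) and the attribute keys are hypothetical; only the `chunk_retrieved` event name comes from this article. A real system would emit the same structure through the OpenTelemetry SDK instead of this hand-rolled `Span` class.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    trace_id: str
    attributes: dict = field(default_factory=dict)
    events: list = field(default_factory=list)
    children: list = field(default_factory=list)
    duration_ms: float = 0.0

    def add_event(self, event_name: str, **attrs) -> None:
        self.events.append({"name": event_name, **attrs})

    @contextmanager
    def child(self, name: str, **attributes):
        """Open a nested span and record its wall-clock duration on exit."""
        span = Span(name, self.trace_id, attributes)
        self.children.append(span)
        start = time.perf_counter()
        try:
            yield span
        finally:
            span.duration_ms = (time.perf_counter() - start) * 1000


root = Span("rag.request", trace_id=uuid.uuid4().hex, attributes={"query": "refund window?"})
with root.child("rag.embed_query", model="bge-large-en-v1.5"):
    pass  # the embedding call would go here
with root.child("rag.retrieve", top_k=2) as retrieve:
    hits = [("policy-v1#4", "policy-v1.pdf", 0.82), ("policy-v2#1", "policy-v2.pdf", 0.79)]
    for rank, (chunk_id, source_doc, score) in enumerate(hits, start=1):
        # One event per retrieved chunk: this is what makes a bad answer debuggable.
        retrieve.add_event("chunk_retrieved", chunk_id=chunk_id, source_doc=source_doc,
                           score=score, rank=rank)
with root.child("rag.generate", model="hypothetical-llm"):
    pass  # the LLM call would go here
```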

The chunk_retrieved events are what make a bad answer debuggable. When we investigate a support ticket about a wrong answer, we can open the trace, expand the retrieval span events, and immediately see which chunks scored highest and where they came from. “The system retrieved three chunks from the deprecated v1 policy document” is an actionable finding. “The system returned a bad answer” is not.

Logging the “Why”

A common question in production is not just “what was retrieved?” but “why did the system think this was relevant?” The similarity score alone does not answer this. A chunk with a score of 0.82 might be genuinely relevant, or it might be a false positive from an embedding space where the query and an unrelated chunk happen to land nearby.

To address this, we can add a lightweight rationale step:

After reranking, send the top-5 chunks and the query to the LLM with a short system prompt asking it to explain the relevance of each chunk before generating the final answer. The rationale is logged as a structured field on the trace. This is expensive if done per-request, but extremely valuable when run on a sampled basis (say, 1% of production traffic plus 100% of user-flagged responses).
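The sampling decision itself can be made deterministic by hashing the trace ID, so the same request always gets the same answer across services. This is a sketch under that assumption; the function name and the use of SHA-256 are illustrative.

```python
import hashlib


def should_log_rationale(trace_id: str, user_flagged: bool, sample_rate: float = 0.01) -> bool:
    """Always sample flagged responses; otherwise sample a fixed share of traces."""
    if user_flagged:
        return True
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes to an integer in [0, 2**64) and compare to the rate.
    return int.from_bytes(digest[:8], "big") < sample_rate * 2**64
```

Because the decision depends only on the trace ID, it can be recomputed later when joining rationale logs against traces.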

Retrieval Quality vs Answer Quality

The highest-value observability investment is closing the feedback loop: connecting what was retrieved to how good the final answer was. This requires an evaluation signal.

For many applications, you can compute answer quality automatically using a lightweight LLM-as-judge approach: after the main LLM generates an answer, send the answer, the retrieved context, and the original question to a smaller, cheaper model with a rubric asking it to score faithfulness (did the answer stay within what the context says?) and relevance (did the answer address the question?). Log these scores alongside the trace ID.

This gives you a queryable dataset: “show me all requests where faithfulness score was below 0.7 in the last 7 days.” Drilling into those traces, you will typically find one of three patterns:

  • Retrieved chunks are from the wrong document (index corruption or model drift)
  • Retrieved chunks are from the right document but the wrong section (chunking boundary problem)
  • Retrieved chunks are correct but the LLM ignored them (a generation problem, not a retrieval problem)

Only traces with chunk-level attribution let you distinguish these cases. Without them, every bad answer looks the same from the outside.
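Once judge scores are logged next to trace IDs, the "low faithfulness in the last 7 days" question is a plain filter. The record layout here is an assumption; in practice this would be a query against your trace store or warehouse.

```python
from datetime import datetime, timedelta, timezone


def low_faithfulness(records: list[dict], now: datetime,
                     threshold: float = 0.7, days: int = 7) -> list[str]:
    """Trace IDs scored below `threshold` within the last `days` days."""
    cutoff = now - timedelta(days=days)
    return [
        r["trace_id"]
        for r in records
        if r["faithfulness"] < threshold and r["timestamp"] >= cutoff
    ]


now = datetime(2026, 5, 15, tzinfo=timezone.utc)
records = [
    {"trace_id": "t1", "faithfulness": 0.95, "timestamp": now - timedelta(days=1)},
    {"trace_id": "t2", "faithfulness": 0.40, "timestamp": now - timedelta(days=2)},
    {"trace_id": "t3", "faithfulness": 0.30, "timestamp": now - timedelta(days=30)},
]
suspects = low_faithfulness(records, now)
```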

Index Version Attribution in Traces

One failure mode that deserves special mention: your index was updated, retrieval behavior changed, and answer quality dropped. Without index version attribution in your traces, you cannot correlate the quality drop to the update.

The fix is to include the index version (or the alias timestamp) in every retrieval span. When you investigate a spike in low-quality answers, you can immediately filter to traces where the index version is the new one, and compare them to traces from the old version.

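With the index version recorded on each retrieval span, the comparison becomes a group-by. A sketch, assuming each trace record carries an `index_version` attribute and a `faithfulness` score; both key names and the version labels are illustrative.

```python
from collections import defaultdict
from statistics import mean


def faithfulness_by_index_version(traces: list[dict]) -> dict[str, float]:
    """Average faithfulness score per index version."""
    grouped: dict[str, list[float]] = defaultdict(list)
    for trace in traces:
        grouped[trace["index_version"]].append(trace["faithfulness"])
    return {version: mean(scores) for version, scores in grouped.items()}


traces = [
    {"trace_id": "t1", "index_version": "docs_v41", "faithfulness": 0.9},
    {"trace_id": "t2", "index_version": "docs_v41", "faithfulness": 0.8},
    {"trace_id": "t3", "index_version": "docs_v42", "faithfulness": 0.5},
    {"trace_id": "t4", "index_version": "docs_v42", "faithfulness": 0.6},
]
by_version = faithfulness_by_index_version(traces)
```

A gap like the one between the two versions here points at the index update rather than the model or the prompt.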

This sounds obvious in retrospect. Almost nobody does it until they spend a painful post-incident trying to figure out why answer quality degraded on a Tuesday afternoon.

Footnote

RAG combines offline indexing (chunk, embed, store) with online retrieval (embed query, search, inject context). Getting the demo right is easy; getting production right requires three things. First, an indexing pipeline with a document registry, content-hash-based change detection, correct delete semantics, and alias-based zero-downtime deployment.

Second, a retrieval layer using hybrid search (vector + BM25) and cross-encoder reranking to achieve meaningful accuracy. Third, an observability layer that records chunk-level attribution per request, tracks retrieval quality metrics over time, and links index versions to answer quality regressions. Without all three, a RAG system that works in staging will silently serve stale, wrong, or deleted information in production.