Shared from "Arpit Bhayani Blogs" on Inkdown

BM25

Source: https://arpitbhayani.me/blogs/bm25 Date: 2026-03-04

There is a particular kind of respect reserved in engineering for algorithms that outlive their era. BM25 is one of them. It was born out of information retrieval research in the 1970s and 1980s, polished over decades, and eventually adopted as the default ranking function in Elasticsearch, Solr, and Lucene.


What makes BM25 worth understanding is not just that it works. It is that it works for knowable reasons.

Every part of the formula has a clear interpretation. When a result is surprising, you can trace why. When you need to tune for your domain, the parameters give you meaningful handles to turn. The interpretability is genuinely valuable.


In this write-up, I cover BM25 from first principles - where it came from, why TF-IDF was not enough, how the formula works mechanically, how the parameters behave in practice, what its real limitations are, and where it fits in a modern retrieval stack.

What BM25 Was Built To Solve

The simplest possible retrieval system is Boolean keyword matching: a document is relevant if it contains the query terms, and irrelevant if it does not. This works when the corpus is small and queries are exact, but it breaks down as soon as either condition fails.

For example, every document containing “database” matches equally for the query “fast database.” You have no ranking, no way to distinguish a paper about database internals from a blog post where “database” appears once in a sidebar.

The natural next step is TF-IDF, which most engineers encounter first. TF-IDF scores a document by multiplying two quantities:

  • Term Frequency (TF): how many times the query term appears in the document
  • Inverse Document Frequency (IDF): a measure of how rare the term is across the corpus

The intuition is sound. A document that mentions “photosynthesis” ten times is probably more about photosynthesis than one that mentions it once. And a term that appears in every document (like “the”) tells you nothing about relevance, so you discount it with IDF.

TF-IDF works surprisingly well for a heuristic, which is why it survived in production systems for decades. But it has two fundamental failure modes that compound badly in real corpora:

TF is Linear

A document mentioning “photosynthesis” 200 times scores exactly twice as high as one mentioning it 100 times. But is a document twice as relevant just because it repeats the term more? In most cases, no. After a term appears enough times to establish that the document is about that concept, additional occurrences contribute diminishing information about relevance. TF-IDF does not model this.

The second failure mode is that TF-IDF has no concept of document length. Under raw TF, a short, focused abstract mentioning “photosynthesis” three times competes on equal footing with a 10,000-word textbook chapter that mentions it fifteen times. The textbook chapter will almost always win on TF, but that may not reflect relevance. Long documents accumulate more term occurrences simply by being long, not because they are more relevant.
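Both failure modes are easy to reproduce. The sketch below uses raw term counts and the classic log(N / n) IDF. The corpus size and document frequency are made-up numbers, assumed only for illustration.

```python
import math

def tf_idf(term_count: int, total_docs: int, docs_with_term: int) -> float:
    """Raw TF-IDF: term count times log(N / n). No saturation, no length awareness."""
    return term_count * math.log(total_docs / docs_with_term)

# Hypothetical corpus statistics, assumed for illustration.
N, n = 10_000, 50

# Failure mode 1: the score is linear in the term count.
score_100 = tf_idf(100, N, n)
score_200 = tf_idf(200, N, n)
print(score_200 / score_100)  # doubling the count doubles the score

# Failure mode 2: document length never enters the formula, so the long
# chapter (15 mentions) beats the short abstract (3 mentions) on count alone.
abstract = tf_idf(3, N, n)
chapter = tf_idf(15, N, n)
print(chapter / abstract)
```

Nothing in `tf_idf` knows that the chapter is thousands of words longer than the abstract, which is exactly the gap BM25 closes.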

Okapi and the TREC years

BM25 emerged from work done on the Okapi system at City University London. The name Okapi BM25 reflects this lineage: Okapi was the system, and BM stands for “Best Match.” The 25 denotes a specific iteration in the development of Best Match functions, which had been evolving through a series of numbered variants.

The shift from theoretical standard to industry default happened more slowly. Lucene, the search library underlying both Elasticsearch and Solr, shipped with a modified TF-IDF implementation for years. Lucene 6 made BM25 the default similarity function in 2016, and Elasticsearch 5.0 followed suit. At that point, BM25 became the de facto relevance algorithm for most production search deployments in the world.

How BM25 Works

Rather than presenting the formula and explaining it, it is more useful to build it up from the two problems TF-IDF could not solve. That way, the formula reads as a series of deliberate design decisions rather than a pile of notation.

Saturating Term Frequency

The core insight is that the relationship between term frequency and relevance should not be linear. It should saturate. The first few occurrences of a term in a document are strong evidence of relevance. After that, each additional occurrence contributes less. Eventually, adding more occurrences should contribute almost nothing.

BM25 achieves this with the following transformation of raw term frequency f:

    f * (k1 + 1) / (f + k1)

Where k1 is a free parameter. When f is 0, the numerator is 0. As f grows, the expression approaches an asymptote of k1 + 1. The curve rises steeply at first and then flattens. This is the saturation function.

The parameter k1 controls how quickly the saturation occurs. With a low k1 (say, 0.5), the function saturates quickly, and the first occurrence of a term does most of the work. With a higher k1 (say, 2.0), the function saturates slowly, and multiple occurrences continue to add meaningful score. For most text collections, k1 values between 1.2 and 2.0 work well. Elasticsearch defaults to 1.2.

To see why this matters in practice: imagine searching for “search engine” across a corpus. A document that uses the phrase once in a focused technical definition is probably more relevant than a marketing page that repeats “search engine” forty times across boilerplate text. The saturation function gives the first document a fighting chance.
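To get a feel for the curve, here is a minimal sketch of the saturation function as described above. The function name `saturate` and the sample frequencies are my own choices for illustration.

```python
def saturate(f: float, k1: float = 1.2) -> float:
    """BM25 term-frequency saturation: f * (k1 + 1) / (f + k1)."""
    return f * (k1 + 1) / (f + k1)

# Compare how fast the curve flattens for different k1 values.
for k1 in (0.5, 1.2, 2.0):
    row = [round(saturate(f, k1), 3) for f in (1, 2, 5, 40)]
    print(f"k1={k1}: {row}  (ceiling {k1 + 1})")
```

In this form a single occurrence always scores 1.0, which is 1 / (k1 + 1) of the ceiling: roughly 67% of the maximum when k1 = 0.5, but only about 33% when k1 = 2.0. That is the "first occurrence does most of the work" behavior in numbers.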

Normalizing Document Length

The second fix is normalizing for document length. The idea is that a term occurring three times in a 300-word document is a stronger relevance signal than the same term occurring three times in a 10,000-word book chapter.

BM25 incorporates document length by adjusting the effective term frequency based on how long the current document is relative to the average document length in the corpus:

    f * (k1 + 1) / (f + k1 * (1 - b + b * |D| / avgdl))

Where |D| is the length of the current document in tokens, avgdl is the average document length across the corpus, and b is a second free parameter that controls how aggressively length normalization is applied.

When b = 0, the denominator reduces to f + k1, and length normalization is disabled entirely. The score depends only on term frequency. When b = 1, full length normalization is applied: k1 in the denominator is scaled by the ratio of the document’s length to the average, so a document twice as long as average needs proportionally more occurrences to reach the same score. The standard default of b = 0.75 applies partial normalization, which works well for most corpora.

The practical effect: if you have two documents that both mention your query term three times, the shorter one will score higher. This is usually what you want when documents vary significantly in length.
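The sketch below puts numbers on that effect. The function name `tf_component` and the average length of 1,000 tokens are assumptions for illustration; the 300-token and 10,000-token documents mirror the example at the start of this section.

```python
def tf_component(f: float, doc_len: int, avgdl: float,
                 k1: float = 1.2, b: float = 0.75) -> float:
    """BM25 term-frequency factor with document length normalization."""
    length_norm = 1 - b + b * (doc_len / avgdl)
    return f * (k1 + 1) / (f + k1 * length_norm)

avgdl = 1000  # hypothetical average document length in tokens

# Same term frequency (3), very different document lengths.
short_doc = tf_component(3, doc_len=300, avgdl=avgdl)
long_doc = tf_component(3, doc_len=10_000, avgdl=avgdl)
print(round(short_doc, 3), round(long_doc, 3))

# With b = 0, length is ignored and both documents score the same.
print(tf_component(3, 300, avgdl, b=0) == tf_component(3, 10_000, avgdl, b=0))
```

A document whose length equals avgdl gets a `length_norm` of exactly 1, so the expression falls back to the plain saturation function.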

The IDF Component

BM25 keeps the inverse document frequency concept from TF-IDF, but uses a specific formula derived from the probabilistic relevance framework:

    IDF(q) = log((N - n(q) + 0.5) / (n(q) + 0.5))

Here’s my write-up covering the intuition behind IDF and how it works.

Where N is the total number of documents in the corpus and n(q) is the number of documents containing the query term.

The smoothing constants (+0.5) prevent division by zero and handle edge cases. There is also a subtle problem: terms that appear in more than half the corpus produce a negative IDF under this formula. Lucene’s implementation adds 1 inside the log to prevent negative IDF values from inverting the scoring of common terms.

Compared to classic TF-IDF’s log(N / n(q)), BM25’s IDF is derived from a log-odds ratio with probabilistic justification. In practice, the curves are similar, but the BM25 formulation is theoretically grounded in the Binary Independence Model.
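A quick way to compare the three variants mentioned here is to evaluate them side by side. This is a sketch under assumed corpus numbers (N = 1000); the function names are mine, and `idf_lucene` follows the "add 1 inside the log" description above rather than any specific library source.

```python
import math

def idf_classic(N: int, n: int) -> float:
    """Classic TF-IDF form: log(N / n). Undefined when n is 0."""
    return math.log(N / n)

def idf_bm25(N: int, n: int) -> float:
    """Probabilistic BM25 form with 0.5 smoothing. Can go negative."""
    return math.log((N - n + 0.5) / (n + 0.5))

def idf_lucene(N: int, n: int) -> float:
    """Variant with 1 added inside the log, so the result stays positive."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))

N = 1000  # hypothetical corpus size
for n in (1, 10, 500, 900):
    print(n, round(idf_classic(N, n), 3),
          round(idf_bm25(N, n), 3), round(idf_lucene(N, n), 3))
```

For a term present in 900 of 1000 documents, `idf_bm25` is negative while `idf_lucene` stays slightly above zero, which is the inversion problem the +1 is there to prevent.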

The Complete Formula

Putting it together, the BM25 score for a document D given a query Q with terms q1, q2, ..., qn is:

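    score(D, Q) = sum over i = 1..n of
                  IDF(qi) * f(qi, D) * (k1 + 1) / (f(qi, D) + k1 * (1 - b + b * |D| / avgdl))

Where f(qi, D) is the number of times qi appears in D.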

BM25 is additive across query terms. Each query term contributes independently to the total score. This bag-of-words assumption means term ordering and proximity are ignored. “New York” and “York New” produce identical scores. This is a meaningful limitation that we will return to.
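Here is a minimal end-to-end sketch of the scorer, assuming whitespace tokenization and the IDF variant with 1 added inside the log (so scores stay non-negative on a tiny corpus). The function name, the toy documents, and the query are all illustrative, not taken from any library.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every tokenized document in `docs` against the tokenized `query`."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # n(q): number of documents that contain each term at least once.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        counts = Counter(d)
        norm = 1 - b + b * len(d) / avgdl
        score = 0.0
        for q in query:
            f = counts[q]
            if f == 0:
                continue  # a term absent from the document contributes nothing
            idf = math.log(1 + (N - doc_freq[q] + 0.5) / (doc_freq[q] + 0.5))
            score += idf * f * (k1 + 1) / (f + k1 * norm)
        scores.append(score)
    return scores

# Toy corpus, assumed for illustration.
docs = [
    "an inverted index maps terms to documents".split(),
    "the index of a book lists page numbers".split(),
    "btree and lsm trees are storage structures".split(),
]
scores = bm25_scores("inverted index".split(), docs)
print([round(s, 3) for s in scores])
```

Because the score is a plain sum over query terms, reversing the query to "index inverted" yields the same scores, which is the bag-of-words property in action.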

Worked Example

To make this concrete, consider a small corpus of three documents and the query “inverted index”:

Plain text

With default parameters k1 = 1.2, b = 0.75, and an avgdl of 312 tokens:

For the term “inverted”:

  • D1 has f = 2, moderate length: moderate TF contribution.
  • D2 has f = 1, much shorter than average: length normalization boosts it.
  • D3 has f = 1, much longer than average: length normalization penalizes it.

D2, despite only one occurrence of “inverted,” will likely score higher than D3 with its one occurrence buried in 800 tokens of noise. D1 with two occurrences in a focused document will likely come out on top overall.

This is the behavior you want. D2 is a definition. D3 is a tangential mention in a long document. The formula reflects that.
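To check the intuition with numbers, the sketch below computes the term-frequency factor for “inverted” in each document. The avgdl of 312 and D3's 800 tokens come from the text; the lengths for D1 (300 tokens) and D2 (120 tokens) are my assumptions, since the text only describes them as moderate and much shorter than average. IDF is left out because it is the same for all three documents for a given term.

```python
def tf_component(f, doc_len, avgdl=312, k1=1.2, b=0.75):
    """BM25 term-frequency factor with length normalization."""
    return f * (k1 + 1) / (f + k1 * (1 - b + b * doc_len / avgdl))

# (f for "inverted", length in tokens). D1 and D2 lengths are assumed.
corpus = {"D1": (2, 300), "D2": (1, 120), "D3": (1, 800)}

tf_scores = {name: tf_component(f, length) for name, (f, length) in corpus.items()}
for name, value in sorted(tf_scores.items(), key=lambda kv: -kv[1]):
    print(name, round(value, 3))
```

Under these assumed lengths the ordering is D1, then D2, then D3, with D3 well behind. The gap between D1 and D2 is small and depends on the lengths chosen: shrink D2 to 60 tokens and its single mention overtakes D1 on this term.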

Tuning k1 and b

The defaults work well out of the box for general text search. But your corpus is not a general text corpus, and tuning matters more than most engineers realize.

k1

Increase k1 (toward 2.0) when:

  • Documents are long, and term repetition is genuinely informative (legal documents, scientific papers, technical manuals)
  • Users search with domain-specific jargon that naturally recurs throughout relevant documents.

Decrease k1 (toward 0.5) when:

  • Documents are short (tweets, product titles, code identifiers)
  • A single mention is as informative as ten mentions (e.g., a product description that names the SKU once)
  • You are worried about adversarial keyword stuffing.

b

Set b closer to 1.0 when:

  • Your documents span a very wide range of lengths, and you want to prevent long documents from dominating.
  • Average document length is driven by padding or boilerplate rather than meaningful content.
  • You are indexing heterogeneous content (short FAQs mixed with long technical articles)

Set b closer to 0.0 when:

  • Document length is genuinely correlated with coverage and relevance (encyclopedic articles that are long because they are comprehensive)
  • Documents are all roughly the same length.
  • You are searching through code, where length carries semantic meaning.
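If the index lives in Elasticsearch, k1 and b are set per field through a custom similarity in the index settings. The sketch below only builds the request body as a Python dict; the similarity name `tuned_bm25`, the field name `title`, and the parameter values are hypothetical, and you would send the body with whatever client you already use.

```python
import json

def bm25_index_body(field: str, k1: float, b: float) -> dict:
    """Build a create-index body that attaches a custom BM25 similarity to one text field."""
    return {
        "settings": {
            "index": {
                "similarity": {
                    # "tuned_bm25" is an arbitrary name chosen for this example.
                    "tuned_bm25": {"type": "BM25", "k1": k1, "b": b}
                }
            }
        },
        "mappings": {
            "properties": {
                field: {"type": "text", "similarity": "tuned_bm25"}
            }
        },
    }

# Example: short product titles, so low k1 and weak length normalization.
body = bm25_index_body("title", k1=0.5, b=0.3)
print(json.dumps(body, indent=2))
```

Treat the values as starting points to evaluate against real queries, not as recommendations.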

What BM25 Cannot Do

BM25 is a bag-of-words model. That phrase contains its entire set of limitations.

  • It does not understand synonyms: to BM25, “heart attack” and “myocardial infarction” are unrelated, as are “automobile accident” and “car crash”.
  • It does not understand word order: “New York” and “York New” produce identical BM25 scores.
  • It does not understand context or intent. A user searching for “python” might want programming documentation, a natural history article, or a Monty Python sketch. BM25 cannot distinguish query intent; it can only rank by term statistics.
  • It struggles with rare or out-of-vocabulary terms.

So, BM25 is excellent for keyword-heavy, factual queries where exact term matching is meaningful, and it degrades on semantic queries.
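These limitations all follow from what the scorer actually sees: per-term counts and nothing else. The sketch below, with a hypothetical helper `matched_counts` and made-up example strings, shows the only information a bag-of-words scorer gets from a query and a document.

```python
from collections import Counter

def matched_counts(query: str, doc: str) -> dict:
    """Counts of query terms found in the document: all a bag-of-words scorer can see."""
    doc_counts = Counter(doc.lower().split())
    return {t: doc_counts[t] for t in query.lower().split() if doc_counts[t] > 0}

# Word order is invisible: both queries see the same term counts.
print(matched_counts("new york", "flights to new york"))
print(matched_counts("york new", "flights to new york"))

# Synonyms are invisible: no shared terms means nothing to score.
print(matched_counts("car crash", "automobile accident on the highway"))
```

The last call returns an empty dict, so any term-matching score for that pair is zero no matter how k1 and b are tuned.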

When To Use BM25 vs Alternatives

BM25 is the right default for lexical retrieval. The question is when lexical retrieval is the right choice at all.

Reach for BM25 when:

  • Exact keyword matching is the primary use case (legal document search, code search by function name, product search by SKU)
  • The corpus contains domain-specific jargon, product codes, or identifiers that embedding models cannot represent
  • You need interpretable, auditable results (compliance environments, debugging production search)
  • You need fast retrieval with no GPUs and minimal memory overhead

Augment BM25 with dense retrieval (semantic) when:

  • Users phrase queries in natural language with varying vocabulary (customer support, knowledge base search)
  • You have multilingual content where a user might search in one language for content in another.
  • Synonymy and paraphrasing are common in your domain (medical search, legal search, research literature)
  • You are building a RAG pipeline where recall matters more than precision.

BM25 is explicitly the wrong tool when the search task is primarily semantic, and your queries rarely use the same words as your documents. In those cases, BM25 provides recall for edge cases but should not be the primary retrieval mechanism.

BM25 in Elasticsearch

Elasticsearch computes IDF per-shard, not per-index. In a distributed cluster, each shard sees only its portion of the corpus when computing N and n(q). This means IDF values can vary across shards, producing slightly inconsistent scores for the same document depending on which shard it lives on.

For most use cases, this is fine, but if you need globally consistent IDF, use search_type=dfs_query_then_fetch, which forces a global term statistics collection step before scoring.
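To see how much per-shard statistics can diverge, here is a small sketch with two imaginary shards. The shard sizes and document frequencies are invented for illustration, and the IDF uses the variant with 1 added inside the log described earlier.

```python
import math

def idf(N: int, n: int) -> float:
    """IDF with 1 added inside the log."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))

# Hypothetical shards: (documents on shard, documents containing the term).
shards = {"shard_0": (1000, 10), "shard_1": (1000, 100)}

per_shard = {name: idf(N, n) for name, (N, n) in shards.items()}
for name, value in per_shard.items():
    print(name, round(value, 3))

# Global statistics, as a term-statistics collection step would produce.
N_total = sum(N for N, _ in shards.values())
n_total = sum(n for _, n in shards.values())
global_idf = idf(N_total, n_total)
print("global", round(global_idf, 3))
```

The same term is worth noticeably more on the shard where it happens to be rare, and the global value sits between the two. Skew like this is most visible on small indices or when documents are routed unevenly.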

Field lengths in Elasticsearch’s BM25 are measured in tokens. So, the choice of analyzer affects what counts as a token and therefore affects length normalization. A language analyzer that applies stemming and stopword removal will produce shorter effective documents than a standard tokenizer. Tune analyzer choice before you tune b.

In Elasticsearch, document lengths are encoded in a single byte using a logarithmic encoding scheme. This means the stored length is an approximation, not the exact token count. For most cases, this is a negligible error, but it is worth knowing when you are trying to understand why BM25 scores do not exactly match hand calculations.

You can set explain to true on a search request to see exactly how each score was computed:

Plain text

The _explanation field in the response shows the IDF, TF saturation, and length normalization components individually. This is genuinely useful for debugging unexpected rankings and one of BM25’s practical advantages over black-box scoring systems.
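As a rough sketch of how this looks from client code: the helper below builds a search body with explain enabled, and a second helper flattens the nested explanation tree (each node carries a value, a description, and a list of details) into indented lines. The field name, the query text, and the sample tree are made up for illustration; real descriptions are longer.

```python
def explain_search_body(field: str, text: str) -> dict:
    """Search request body with per-hit score explanations enabled."""
    return {"explain": True, "query": {"match": {field: text}}}

def flatten_explanation(node: dict, depth: int = 0) -> list:
    """Turn a nested explanation node into indented 'value  description' lines."""
    lines = [f"{'  ' * depth}{node['value']:.4f}  {node['description']}"]
    for child in node.get("details", []):
        lines.extend(flatten_explanation(child, depth + 1))
    return lines

# Hand-written sample in the shape of an _explanation entry (values are invented).
sample = {
    "value": 2.0,
    "description": "weight(body:inverted), result of:",
    "details": [{
        "value": 2.0,
        "description": "score, product of:",
        "details": [
            {"value": 1.0, "description": "idf", "details": []},
            {"value": 2.0, "description": "tf", "details": []},
        ],
    }],
}

print(explain_search_body("body", "inverted index"))
print("\n".join(flatten_explanation(sample)))
```

Flattening the tree this way makes it easy to diff the explanations of two hits and spot which component moved.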

Footnote

BM25 improves on TF-IDF through two mechanisms: a saturation function that prevents repeated terms from scoring linearly, and document length normalization that adjusts for corpus-wide length variance.

The two free parameters - k1 (saturation speed) and b (normalization strength) - offer meaningful tuning handles with sensible defaults.

BM25 is a bag-of-words model and cannot handle synonyms, word order, or semantic intent. In modern systems, it functions as the fast, interpretable, exact-matching leg of a hybrid retrieval pipeline, complementing dense vector search that handles semantic queries.

Its greatest practical advantage is debuggability: every score can be traced to specific term statistics, which matters more than most engineers expect when something goes wrong in production.