InkdownInkdown
Start writing

Arpit Bhayani Blogs

336 files·168 subfolders

Shared Workspace

Arpit Bhayani Blogs
001 Ai Topological Sort

009-join-algorithms

Shared from "Arpit Bhayani Blogs" on Inkdown

JOIN Algorithms

Source: https://arpitbhayani.me/blogs/join-algorithms Date: 2026-02-24

When you write a SQL query with a JOIN clause, you probably do not think much about what happens next. You just expect the database to return the right rows. But this simple keyword forces your database to make one of the most consequential decisions a query planner makes: which join algorithm should it use?


When you write a SQL query with a JOIN clause, you probably do not think much about what happens next. You just expect the database to return the right rows. But this simple keyword forces your database to make one of the most consequential decisions a query planner makes: which join algorithm should it use?

The choice matters a ton. A bad join algorithm on a large dataset can turn a millisecond query into a minutes-long. A good one can make joining two tables with billions of rows feel effortless.

001-ai-topological-sort.md
tldr.md
002 Temporal Primer
002-temporal-primer.md
tldr.md
003 Rag Production
003-rag-production.md
tldr.md
004 Structure Of Llm Chat
004-structure-of-llm-chat.md
tldr.md
005 How Llms Work
005-how-llms-work.md
tldr.md
006 Monolith Is Distributed System
006-monolith-is-distributed-system.md
tldr.md
007 Defensive Databases
007-defensive-databases.md
tldr.md
008 Bm25
008-bm25.md
tldr.md
009 Join Algorithms
009-join-algorithms.md
tldr.md
010 Venting At Work
010-venting-at-work.md
tldr.md
011 Half Life
011-half-life.md
tldr.md
012 Multi Paxos
012-multi-paxos.md
tldr.md
013 Mysql Replication Internals
013-mysql-replication-internals.md
tldr.md
014 Bloom Filters
014-bloom-filters.md
tldr.md
015 Clock Sync Nightmare
015-clock-sync-nightmare.md
tldr.md
016 Kafka Partitions
016-kafka-partitions.md
tldr.md
017 Product Quantization
017-product-quantization.md
tldr.md
018 Qkv Matrices
018-qkv-matrices.md
tldr.md
019 Deleted Production
019-deleted-production.md
tldr.md
020 How Llm Inference Works
020-how-llm-inference-works.md
tldr.md
021 Blocking Queues
021-blocking-queues.md
tldr.md
022 Heartbeats In Distributed Systems
022-heartbeats-in-distributed-systems.md
tldr.md
023 Cassandra Writes
023-cassandra-writes.md
tldr.md
024 Redis Replication
024-redis-replication.md
tldr.md
025 Arrogant People At Work
025-arrogant-people-at-work.md
tldr.md
026 Cdn Content Replication
026-cdn-content-replication.md
tldr.md
027 Cant Fix Everything Day One
027-cant-fix-everything-day-one.md
tldr.md
028 Emotions At Work
028-emotions-at-work.md
tldr.md
029 Grpc Http2
029-grpc-http2.md
tldr.md
030 Meetings With No Agenda Are A Waste Of Time
030-meetings-with-no-agenda-are-a-waste-of-time.md
tldr.md
031 Growth Is Not About Doing Everything
031-growth-is-not-about-doing-everything.md
tldr.md
032 Career Longevity Vs Job Hopping
032-career-longevity-vs-job-hopping.md
tldr.md
033 Stay Relevant At Higher Salary Levels
033-stay-relevant-at-higher-salary-levels.md
tldr.md
034 Why Consensus
034-why-consensus.md
tldr.md
035 Database Deadlocks
035-database-deadlocks.md
tldr.md
036 Cpu Cache Locality
036-cpu-cache-locality.md
tldr.md
037 Eventual Consistency
037-eventual-consistency.md
tldr.md
038 Dns Udp Tcp
038-dns-udp-tcp.md
tldr.md
039 Masters
039-masters.md
tldr.md
040 Empathy Makes Great Engineers Unstoppable
040-empathy-makes-great-engineers-unstoppable.md
tldr.md
041 Good Mentors Build People
041-good-mentors-build-people.md
tldr.md
042 Always Have Back Burner Projects
042-always-have-back-burner-projects.md
tldr.md
043 Before You Push Back Know What Youre Standing On
043-before-you-push-back-know-what-youre-standing-on.md
tldr.md
044 Be The One They Can Count On
044-be-the-one-they-can-count-on.md
tldr.md
045 How Much People Bet On You
045-how-much-people-bet-on-you.md
tldr.md
046 How To Get Leadership To Say Yes To Your Project
046-how-to-get-leadership-to-say-yes-to-your-project.md
tldr.md
047 Dont Let Your Best Ideas Die In Silence
047-dont-let-your-best-ideas-die-in-silence.md
tldr.md
048 Be Someone Others Want To Work With
048-be-someone-others-want-to-work-with.md
tldr.md
049 Dont Fall For Xy Problem Ask Right Questions
049-dont-fall-for-xy-problem-ask-right-questions.md
tldr.md
050 Biggest Lie Startups Tell Engineers
050-biggest-lie-startups-tell-engineers.md
tldr.md
051 Promotions Are Proactive Not Reactive
051-promotions-are-proactive-not-reactive.md
tldr.md
052 Not Enough To Be Right Learn To Be Heard
052-not-enough-to-be-right-learn-to-be-heard.md
tldr.md
053 No One Ships Alone
053-no-one-ships-alone.md
tldr.md
054 Not Every Mistake Needs A Correction
054-not-every-mistake-needs-a-correction.md
tldr.md
055 Build Influence At Work
055-build-influence-at-work.md
tldr.md
056 Your Soft Skills Arent Soft At All
056-your-soft-skills-arent-soft-at-all.md
tldr.md
057 Experience Before Forming Opinion
057-experience-before-forming-opinion.md
tldr.md
058 Curiosity And High Bias For Action
058-curiosity-and-high-bias-for-action.md
tldr.md
059 Worklog
059-worklog.md
tldr.md
060 Mistakes And Growth
060-mistakes-and-growth.md
tldr.md
061 Own It Instead Of Sweeping It Aside
061-own-it-instead-of-sweeping-it-aside.md
tldr.md
062 Dont Wait Step Up
062-dont-wait-step-up.md
tldr.md
063 Temporary Fix Is Permanent
063-temporary-fix-is-permanent.md
tldr.md
064 Interview Bias And What Sets You Apart
064-interview-bias-and-what-sets-you-apart.md
tldr.md
065 Saying This Isnt My Problem Is A Problem
065-saying-this-isnt-my-problem-is-a-problem.md
tldr.md
066 Okr
066-okr.md
tldr.md
067 Miscommunication
067-miscommunication.md
tldr.md
068 When In Doubt Code It Out
068-when-in-doubt-code-it-out.md
tldr.md
069 Follow Up Without Annoying People
069-follow-up-without-annoying-people.md
tldr.md
070 Lead Projects That Land
070-lead-projects-that-land.md
tldr.md
071 Abstract Thinking Skill Next Decade
071-abstract-thinking-skill-next-decade.md
tldr.md
072 We Engineers Suck At Task Estimation
072-we-engineers-suck-at-task-estimation.md
tldr.md
073 Shiny Object Syndrome In Tech
073-shiny-object-syndrome-in-tech.md
tldr.md
074 3p
074-3p.md
tldr.md
075 Leverage The Equilibrium
075-leverage-the-equilibrium.md
tldr.md
076 On Demand Container Loading In Aws Lambda
076-on-demand-container-loading-in-aws-lambda.md
tldr.md
077 Sql Has Problems We Can Fix Them Pipe Syntax In Sql
077-sql-has-problems-we-can-fix-them-pipe-syntax-in-sql.md
tldr.md
078 Nanolog A Nanosecond Scale Logging System
078-nanolog-a-nanosecond-scale-logging-system.md
tldr.md
079 Best Resource Is Mythical
079-best-resource-is-mythical.md
tldr.md
080 Wtf The Who To Follow Service At Twitter
080-wtf-the-who-to-follow-service-at-twitter.md
tldr.md
081 Know A Lot
081-know-a-lot.md
tldr.md
082 Out Of Syllabus
082-out-of-syllabus.md
tldr.md
083 Negotiate The Offer
083-negotiate-the-offer.md
tldr.md
084 Never Bad Mouth Your Ex Exployer
084-never-bad-mouth-your-ex-exployer.md
tldr.md
085 Culture Fit
085-culture-fit.md
tldr.md
086 Quantification In Resume
086-quantification-in-resume.md
tldr.md
087 Hiring Is Unfair
087-hiring-is-unfair.md
tldr.md
088 Questions For Interviewers
088-questions-for-interviewers.md
tldr.md
089 Collaboration Communication
089-collaboration-communication.md
tldr.md
090 Out Of Vicious Interview Cycle
090-out-of-vicious-interview-cycle.md
tldr.md
091 Pitch Projects Not Ideas
091-pitch-projects-not-ideas.md
tldr.md
092 Read Design Docs
092-read-design-docs.md
tldr.md
093 Read Rca Docs
093-read-rca-docs.md
tldr.md
094 Start Generalist
094-start-generalist.md
tldr.md
095 Do Not Rely On Summaries
095-do-not-rely-on-summaries.md
tldr.md
096 Structure Your Design Interviews
096-structure-your-design-interviews.md
tldr.md
097 Title Inflation
097-title-inflation.md
tldr.md
098 Find Your Own Project
098-find-your-own-project.md
tldr.md
099 Six Pointers To Crack Coding And Design Interviews
099-six-pointers-to-crack-coding-and-design-interviews.md
tldr.md
100 Keep Yourself Unblocked
100-keep-yourself-unblocked.md
tldr.md
101 Genetic Knapsack
101-genetic-knapsack.md
tldr.md
102 Pseudorandom Number Generation Lfsr
102-pseudorandom-number-generation-lfsr.md
tldr.md
103 How Indexes Work On Partitioned And Sharded Data
103-how-indexes-work-on-partitioned-and-sharded-data.md
tldr.md
104 Some Data Partitioning Strategies For Distributed Data Stores
104-some-data-partitioning-strategies-for-distributed-data-stores.md
tldr.md
105 Data Partitioning
105-data-partitioning.md
tldr.md
106 Leaderless Replication
106-leaderless-replication.md
tldr.md
107 Conflict Resolution
107-conflict-resolution.md
tldr.md
108 Conflict Detection
108-conflict-detection.md
tldr.md
109 Multi Master Replication
109-multi-master-replication.md
tldr.md
110 Monotonic Reads
110-monotonic-reads.md
tldr.md
111 Read Your Write Consistency
111-read-your-write-consistency.md
tldr.md
112 Handling Outages Master Replica
112-handling-outages-master-replica.md
tldr.md
113 Replication Formats
113-replication-formats.md
tldr.md
114 Replication Strategies
114-replication-strategies.md
tldr.md
115 Master Replica Replication
115-master-replica-replication.md
tldr.md
116 Durability
116-durability.md
tldr.md
117 Isolation
117-isolation.md
tldr.md
118 Atomicity
118-atomicity.md
tldr.md
119 Consistency
119-consistency.md
tldr.md
120 Architectures In Distributed Systems
120-architectures-in-distributed-systems.md
tldr.md
121 Mistaken Beliefs Of Distributed Systems
121-mistaken-beliefs-of-distributed-systems.md
tldr.md
122 Fork Bomb
122-fork-bomb.md
tldr.md
123 Chained Operators Python
123-chained-operators-python.md
tldr.md
124 Taxonomy On Sql
124-taxonomy-on-sql.md
tldr.md
125 The Weird Walrus
125-the-weird-walrus.md
tldr.md
126 Fully Persistent Arrays
126-fully-persistent-arrays.md
tldr.md
127 Persistent Data Structures Introduction
127-persistent-data-structures-introduction.md
tldr.md
128 Constant Folding Python
128-constant-folding-python.md
tldr.md
129 String Interning Python
129-string-interning-python.md
tldr.md
130 Recursion Visualizer Python
130-recursion-visualizer-python.md
tldr.md
131 Flajolet Martin
131-flajolet-martin.md
tldr.md
132 2q Cache
132-2q-cache.md
tldr.md
133 Israeli Queues
133-israeli-queues.md
tldr.md
134 1d Terrain
134-1d-terrain.md
tldr.md
135 Jaccard Minhash
135-jaccard-minhash.md
tldr.md
136 Ts Smoothing
136-ts-smoothing.md
tldr.md
137 Lfu
137-lfu.md
tldr.md
138 Morris Counter
138-morris-counter.md
tldr.md
139 Slowsort
139-slowsort.md
tldr.md
140 Bitcask
140-bitcask.md
tldr.md
141 Phi Accrual
141-phi-accrual.md
tldr.md
142 10x Engineer
142-10x-engineer.md
tldr.md
143 Decipher Repeated Key Xor
143-decipher-repeated-key-xor.md
tldr.md
144 Decipher Single Xor
144-decipher-single-xor.md
tldr.md
145 Python Iterable Integers
145-python-iterable-integers.md
tldr.md
146 Inheritance C
146-inheritance-c.md
tldr.md
147 Rum
147-rum.md
tldr.md
148 Consistent Hashing
148-consistent-hashing.md
tldr.md
149 Python Caches Integers
149-python-caches-integers.md
tldr.md
150 Fractional Cascading
150-fractional-cascading.md
tldr.md
151 Copy On Write
151-copy-on-write.md
tldr.md
152 Midpoint Insertion Caching Strategy
152-midpoint-insertion-caching-strategy.md
tldr.md
153 Fsm Python
153-fsm-python.md
tldr.md
154 Bayesian Average
154-bayesian-average.md
tldr.md
155 Sliding Window Ratelimiter
155-sliding-window-ratelimiter.md
tldr.md
156 Idf
156-idf.md
tldr.md
157 Better Programmer
157-better-programmer.md
tldr.md
158 Python Prompts
158-python-prompts.md
tldr.md
159 Rule 30 Cellular Automata
159-rule-30-cellular-automata.md
tldr.md
160 Function Overloading
160-function-overloading.md
tldr.md
161 Isolation Forest
161-isolation-forest.md
tldr.md
162 Image Steganography
162-image-steganography.md
tldr.md
163 Long Integers Python
163-long-integers-python.md
tldr.md
164 I Changed My Python
164-i-changed-my-python.md
tldr.md
165 Benchmark And Compare Pagination Approach In Mongodb
165-benchmark-and-compare-pagination-approach-in-mongodb.md
tldr.md
166 Mongodb Cursor Skip Is Slow
166-mongodb-cursor-skip-is-slow.md
tldr.md
167 Fast And Efficient Pagination In Mongodb
167-fast-and-efficient-pagination-in-mongodb.md
tldr.md
168 Making Http Requests Using Netcat
168-making-http-requests-using-netcat.md
tldr.md

This write-up covers the join algorithms that power every major relational database, why each one exists, when to use each, and what tradeoffs you are making when you pick one over another. The write-up will cover

  • Nested Loop Join
  • Hash Join
  • Merge Join (Sort-Merge Join)
  • Index Join (Indexed Nested Loop Join)
  • Grace Hash Join, and
  • Broadcast Join

Pre-requisites

Before diving in, you should be comfortable with:

  • Basic SQL and what a JOIN operation conceptually means

  • How tables are stored on disk

    and the difference between sequential and random I/O

  • What an index is and how a B-tree index works

    at a high level

Why Join Algorithms Exist

A join is conceptually simple. Given two tables R and S, find all pairs of rows (r, s) where r.key = s.key. The naive way to do this is to compare every row in R with every row in S. That is O(|R| * |S|) comparisons.

Join algorithms are strategies to avoid doing all those comparisons when a set of conditions is met. These algorithms, like any other, exploit structure and patterns. Join algorithms exploit - sorting, hashing, indexing, or partitioning to drastically reduce the search space.

Let us go through each one.

Nested Loop Join

This is the most intuitive join algorithm. For every row in the outer table, scan the entire inner table looking for matches.

Plain text

This is O(|R| * |S|) in time complexity. For two tables with a million rows each, that is a trillion comparisons. Not good.

By the way, nested loop join is not completely useless. It is actually the algorithm of choice in several real scenarios. Here’s where it shines

  • When the outer table has very few rows, the inner table gets scanned only that many times — so even if the inner table is huge, the total work stays small. (explained below)
  • When there is no useful index, sort order, or hash-friendly key available. So, nested loop join becomes the default fallback approach.
Why Smaller Outer and Larger Inner Is Preferred

In a nested loop join, the outer table controls how many times the inner table is scanned (as seen in the pseudocode). If the outer table has 5 rows, you scan the inner table exactly 5 times. If the outer table has 1 million rows, you scan the inner table 1 million times.

Now, the total number of comparisons is the same either way — 5 * 1M is the same as 1M * 5. So why does it matter which side is outer?

It comes down to I/O, not computation.

Every time the inner loop restarts, the database has to read the inner table again. If the inner table does not fit in the buffer pool (in-memory cache), those reads hit disk. Disk reads are orders of magnitude slower than memory reads. So the fewer times you restart the inner scan, the fewer expensive disk reads you pay for.

When the outer table is small, the inner table gets re-read only a handful of times. The first scan loads most of the inner table into the buffer pool, and subsequent scans are largely served from cache. When the outer table is large, the buffer pool gets thrashed — pages loaded for one outer row get evicted before the next outer row can reuse them.

So the rule of thumb is: put the smaller table on the outside to minimize the number of inner scans, and let the buffer pool do its job of caching the inner table across those few passes.

This is also why the block nested loop join optimization exists. Let’s dig in …

Block Nested Loop Join

The main optimization in practice is Block Nested Loop Join. Instead of reading one row at a time from the outer table, you load a chunk (block) of outer rows into memory, then scan the inner table once for that entire chunk.

Plain text

With B buffer pages available, you can load B-2 pages of the outer table at a time (leaving one page for the inner scan and one for output). This reduces the number of times you scan the inner table from |R| to ceil(|R| / (B-2)). If the outer table fits entirely in memory, you only scan the inner table once. That is a huge win.

By the way, PostgreSQL uses nested loop join heavily when one side of the join is small or when it can combine it with an index on the inner table (which leads us to Index Join later).

Hash Join

Hash join is actually one of the most widely used join algorithms. The idea is to build a hash table from one of the two relations, then probe it using the other (similar, not the same, to what we did in Block Nested Loop Join)

There are two phases:

Phase 1 (Build): Scan the smaller relation (the build side). For every row, compute a hash of the join key and insert the row into a hash table in memory.

Phase 2 (Probe): Scan the larger relation (the probe side). For every row, compute the same hash of the join key and look it up in the hash table. Emit matching pairs.

Plain text

Time complexity is O(|R| + |S|) — linear in the size of both tables. This is dramatically better than a nested loop join for large tables.

The catch? The build phase requires the hash table to fit in memory. If the smaller relation is 500MB and you only have 256MB of RAM, you are in trouble. This is where Grace Hash Join (covered later) comes in.

When does hash join shine?

  • Large equijoins with no useful sort order or index
  • When the build side (smaller table) fits in memory
  • When the query planner estimates high cardinality on both sides

When does it struggle?

  • Non-equijoins: you cannot hash on >

    or BETWEEN

    predicates

  • Skewed data: if many rows share the same key (think NULL values, or a status column with 99% of rows being “active”), your hash buckets become unbalanced, and performance degrades

  • Memory pressure: if the build side does not fit in memory, the database has to spill to disk or switch strategies

PostgreSQL, MySQL (since 8.0), SQL Server, and virtually every other major database support hash join. In PostgreSQL, you can see it in query plans as a Hash Join with a Hash node underneath.

Merge Join (Sort-Merge Join)

Merge join exploits sorted order. If both relations are sorted on the join key, you can merge them in a single linear pass — similar to the merge step in merge sort.

The core idea is to maintain a pointer into each sorted relation. At each step, compare the current rows. If they match, emit the result and advance. If one is smaller, advance that pointer. Repeat until one relation is exhausted.

Plain text

If the data is already sorted, merge join is O(|R| + |S|). If it is not sorted, you pay O(|R| log |R| + |S| log |S|) for the sort step upfront. That sorting cost is the key tradeoff.

When does merge join shine?

  • When both relations are already sorted on the join key (e.g., both have a clustered index on the join column)

  • When you need the output in sorted order anyway (the sort is “free” from the planner’s perspective)

  • Large equijoins where neither relation fits comfortably in memory for hashing

  • Range joins (e.g., r.date BETWEEN s.start AND s.end

    ) can sometimes use a variation of the merge join

When does it struggle?

  • When neither side is pre-sorted, and the sort cost is high
  • When join keys have many duplicates, the rewinding of the right pointer can cause quadratic behavior in degenerate cases
  • When data arrives in streaming fashion (though external sort-merge can help)

A key advantage of merge join over hash join is that it handles memory gracefully. Sorting can be done in chunks and merged externally; you do not need to hold the entire build side in memory at once.

PostgreSQL, Oracle, and SQL Server all support merge join. You will see it in PostgreSQL query plans as Merge Join.

Index Join (Indexed Nested Loop Join)

Index join is a specialized form of nested loop join where the inner table has an index on the join key. Instead of scanning the entire inner table for each outer row, you do an index lookup, which is typically O(log n) or even O(1) for hash indexes.

Plain text

The time complexity is O(|R| * log|S|) when using a B-tree index, or O(|R|) amortized for a hash index (ignoring collisions). Compared to the O(|R| * |S|) of plain nested loop join, this is a massive improvement.

Here, the index turned the inner scan into a seek. Instead of reading every page of the inner table, you jump directly to the relevant rows.

This algorithm is also I/O-friendly in a subtle way. If the outer table is accessed in order of the inner index key, you get sequential-ish access patterns on the inner table, which is cache-friendly.

When does a index join shine?

  • When the inner table has an index on the join key
  • When the outer table is small (fewer rows = fewer index lookups)
  • When the join is highly selective (few rows match per outer row)
  • Point lookups or equality joins

When does it struggle?

  • When the outer table is large, and the inner table index has poor selectivity, you end up doing many random I/Os, which can be slower than a full sequential scan
  • When there is no index on the join key, (then it degrades to a plain nested loop join)
  • For range predicates, B-tree indexes help, but performance depends heavily on the range width

Grace Hash Join

Grace hash join solves the problem that regular hash join has: what happens when the build side does not fit in memory? The algorithm has two phases, both using partitioning:

Phase 1 (Partitioning): Partition both R and S into k buckets using the same hash function h1. All rows with the same join key will end up in the same partition pair (R_i, S_i) on disk. This requires two sequential passes over both relations.

Phase 2 (Probing): For each partition pair (R_i, S_i), load R_i into memory, build a hash table using a second hash function h2, then probe it with S_i. Since partitions are smaller than the original tables, each one should now fit in memory.

Plain text

The I/O cost of Grace Hash Join is 3 * (|R| + |S|) page reads and writes: one pass to partition, and one pass to join each partition pair. This is competitive with sort-merge join and significantly better than naive nested loop join.

The number of partitions k needs to be chosen carefully. If you have B buffer pages available, you need at least sqrt(|R|) partitions so that each R partition fits in B pages during the probe phase. A common rule of thumb is k = ceil(sqrt(|R| / B)).

When does Grace Hash Join shine?

  • Large equijoins where neither side fits in memory
  • When data has no useful sort order
  • Parallel and distributed settings where partitioning can happen across nodes

When does it struggle?

  • Skewed data with many duplicate join keys (one partition becomes huge, causing recursive partitioning)
  • When I/O is expensive (it writes everything to disk and reads it back)
  • For very small tables, a simple hash join or a nested loop is cheaper due to lower overhead

Most production databases implement Grace Hash Join (or a variant of it) as their fallback when regular in-memory hash join runs out of memory. PostgreSQL calls this “hash join with batches” — you can see the number of batches in the EXPLAIN ANALYZE output. If batches > 1, it has spilled to disk.

Broadcast Join

Broadcast join exists primarily in distributed databases and query engines (like Apache Spark, Presto, Google BigQuery, Snowflake, etc.). It is not a new join algorithm in the computational sense — it combines with any of the above — but it addresses a specific problem in distributed systems: how do you join two tables that live across many nodes?

The idea is simple - If one of the two relations is small enough, send (broadcast) a complete copy of it to every single node in the cluster. Then each node independently joins its local partition of the large table with the full copy of the small table, using any local join algorithm (usually hash join).

Plain text

By broadcasting the small table, you avoid shuffling (repartitioning) the large table across the network. Network shuffles are often the most expensive operation — they involve serialization, network transfer, deserialization, and disk I/O. Broadcast eliminates all of that for the large table.

  • Broadcast cost: send the small table to all N nodes. If the small table is S

    bytes, that is O(N * S)

    network transfer.

  • Join cost: each node does a local hash join on its partition of the large table.

  • No shuffle of the large table.

Compare this to a shuffle join (also called a partitioned hash join or repartition join): both tables are shuffled so that rows with the same key end up on the same node. Cost is O(|R| + |S|) in network transfer, but this scales with the size of both tables.

Broadcast join wins when: N * |small| << |large|. In other words, when the small table is small enough that broadcasting it everywhere is cheaper than shuffling the large table.

When does broadcast join shine?

  • One table is significantly smaller than the other
  • The large table is already partitioned across nodes, and you want to avoid reshuffling it
  • Low-latency, high-throughput analytical queries where the small table fits in memory on each node

When does it struggle?

  • When the “small” table is not actually that small — broadcasting a 10GB table to 1000 nodes means 10TB of network transfer
  • When the cluster has limited network bandwidth
  • When the small table changes frequently (broadcasting is a snapshot; you need to re-broadcast on updates)

How Databases Choose Between Algorithms

The query planner does not pick join algorithms randomly. It uses cost-based optimization: it estimates the cost of each candidate plan and picks the cheapest one.

The key inputs to this decision are:

  • Table cardinality
  • Working memory limit for this query
  • Indexes on the join keys
  • Sort order of the join key
  • Join predicate type - equality, range, etc
  • Cluster topology - single-node or distributed execution

Conclusion

Join algorithms are a foundational piece of database internals, and understanding them is important when you are debugging slow queries, designing schemas, or evaluating distributed systems.

The key takeaways are

  • Nested loop join is simple and flexible, but expensive for large tables; it becomes powerful when combined with an index.
  • Hash join is the most common join, and it suits large equijoins in memory.
  • Merge join excels when data is pre-sorted and handles disk gracefully.
  • Grace Hash Join extends hash join to handle data that does not fit in memory by partitioning both relations to disk first.
  • Broadcast join is the go-to strategy in distributed systems when one table is small enough to replicate to every node.