How Writes Work in Apache Cassandra

Source: https://arpitbhayani.me/blogs/cassandra-writes Date: 2025-10-20

Apache Cassandra is a distributed database designed for high availability and horizontal scalability. This write-up explores the complete write path in Cassandra, from the moment a client sends a write request to how data gets replicated across nodes in the cluster.


When a client writes data to Cassandra, the data flows through multiple stages before being safely stored. The write path involves several key components working together to ensure durability, consistency, and performance. Let’s break this down…

Client Connection and Coordinator Selection

Every write request in Cassandra starts with a client connecting to any node in the cluster. The node that receives the client request becomes the coordinator for that operation. This is an important concept: there is no “master” node in Cassandra—any node can coordinate any request.

The coordinator’s role is to:

  • Accept the write request from the client
  • Determine which nodes should store replicas of the data
  • Forward the write to those replica nodes
  • Wait for acknowledgments based on the consistency level
  • Respond to the client

The coordinator doesn’t necessarily store the data itself, though it might be one of the replica nodes depending on the partition key and replication strategy.

Determining Replica Nodes

Before the coordinator can forward writes, it needs to determine which nodes should store the data. This decision is based on three key factors.

Partition Key and Token

Every piece of data in Cassandra has a partition key. This key is hashed using a partitioner (typically Murmur3Partitioner) to produce a token—a 64-bit integer that determines where the data lives in the cluster.
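
To make the key-to-token step concrete, here is a minimal sketch in Python. It is not Cassandra's actual code: Murmur3 is not in the Python standard library, so BLAKE2b is used as a stand-in hash, and the function name token_for is made up for illustration. The real token for a given key will be different.

```python
import hashlib

TOKEN_MIN = -2**63
TOKEN_MAX = 2**63 - 1

def token_for(partition_key: str) -> int:
    """Map a partition key to a signed 64-bit token.

    Assumption: BLAKE2b stands in for Murmur3 here, so the values
    differ from what a real cluster would compute.
    """
    digest = hashlib.blake2b(partition_key.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, byteorder="big", signed=True)

token = token_for("550e8400-e29b-41d4-a716-446655440000")
print(TOKEN_MIN <= token <= TOKEN_MAX)  # the token always fits in 64 bits
```

The property that matters is that the same key always produces the same token, so every coordinator agrees on where a partition lives.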

For example, suppose a table stores user data with user_id as its partition key.

When you insert a user with user_id = '550e8400-e29b-41d4-a716-446655440000', Cassandra hashes this UUID to produce a token. This token falls within a range owned by specific nodes in the cluster.

Token Ring and Consistent Hashing

Cassandra organizes nodes in a logical token ring. The entire token space (from -2^63 to 2^63-1 for Murmur3) is divided among the nodes in the cluster. Each node is assigned a token range and is responsible for storing data whose tokens fall within that range.

When you add or remove nodes, the token ranges are redistributed, but only a fraction of the data needs to move—this is the beauty of consistent hashing.
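
The ring lookup can be sketched with a sorted list and a binary search. This assumes a toy ring with one token per node (no virtual nodes); the node names and token values are invented for the example.

```python
import bisect

# Hypothetical 4-node ring: each node owns the range that ends at its token.
RING = [
    (-6_000_000_000_000_000_000, "node-a"),
    (-1_000_000_000_000_000_000, "node-b"),
    (3_000_000_000_000_000_000, "node-c"),
    (8_000_000_000_000_000_000, "node-d"),
]
TOKENS = [t for t, _ in RING]

def owner(token: int) -> str:
    """Return the node owning `token`: the first node whose token is >= it."""
    index = bisect.bisect_left(TOKENS, token)
    return RING[index % len(RING)][1]  # wrap around past the last token

print(owner(0))                          # node-c
print(owner(9_000_000_000_000_000_000))  # node-a (wraps around the ring)
```

Adding a node to this list only changes the owner for tokens between the new node and its predecessor, which is why only a fraction of the data has to move.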

Replication Strategy

The replication strategy determines how many copies of the data exist and where they’re placed. There are two main strategies:

SimpleStrategy: Used for single data center deployments. It places replicas on consecutive nodes in the ring. For example, with a replication factor of 3, the data is stored on the first node determined by the token, plus the next two nodes clockwise in the ring.
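
A rough sketch of that clockwise walk, again assuming one token per node and a made-up ring with small toy tokens:

```python
import bisect

def simple_strategy_replicas(ring, token, replication_factor):
    """Return the owner of `token` plus the next nodes clockwise.

    `ring` is a list of (token, node_name) pairs sorted by token.
    """
    tokens = [t for t, _ in ring]
    start = bisect.bisect_left(tokens, token) % len(ring)
    count = min(replication_factor, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]

ring = [(-600, "A"), (-100, "B"), (300, "C"), (800, "D")]  # toy tokens
print(simple_strategy_replicas(ring, 350, 3))  # ['D', 'A', 'B']
```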

NetworkTopologyStrategy: Used for multi-data center deployments. It allows you to specify how many replicas should exist in each data center. For example, a keyspace can ask for 3 replicas in datacenter1 and 2 in datacenter2. Within each data center, replicas are placed on different racks when possible to maximize availability.

The Write Path Within a Node

Once the coordinator determines which nodes should receive the write, it forwards the mutation to those nodes. Now let’s see what happens inside each replica node when it receives a write request.

CommitLog

The very first thing that happens when a node receives a write is that it appends the mutation to the CommitLog. This is a critical step for durability.

The CommitLog is an append-only log file on disk. It’s structured as a sequential write, which is extremely fast—modern SSDs can handle hundreds of thousands of sequential writes per second. The CommitLog entry contains:

  • The keyspace and table name
  • The partition key
  • The clustering keys (if any)
  • The column values
  • Timestamp and TTL information

By writing to the CommitLog first, Cassandra ensures that even if the node crashes immediately after, the write can be recovered when the node restarts. This is the Write-Ahead Log (WAL) pattern.
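
The write-ahead pattern can be sketched in a few lines. This is a toy model, not Cassandra's implementation: the "log" is a Python list standing in for a file on disk, and the class name Node is invented for the example.

```python
class Node:
    """Toy replica: an append-only log plus an in-memory table."""

    def __init__(self, commit_log=None):
        # The commit log survives a crash; the memtable does not.
        self.commit_log = commit_log if commit_log is not None else []
        self.memtable = {}
        for key, value, timestamp in self.commit_log:
            self._apply(key, value, timestamp)  # replay on startup

    def _apply(self, key, value, timestamp):
        current = self.memtable.get(key)
        if current is None or timestamp >= current[1]:
            self.memtable[key] = (value, timestamp)

    def write(self, key, value, timestamp):
        self.commit_log.append((key, value, timestamp))  # 1. append to the log
        self._apply(key, value, timestamp)               # 2. update memory

node = Node()
node.write("user:1", "alice", 1)
node.write("user:1", "alicia", 2)

# Simulate a crash: the memtable is gone, but the log is still there.
recovered = Node(commit_log=node.commit_log)
print(recovered.memtable["user:1"])  # ('alicia', 2)
```

Replaying the log in order rebuilds exactly the state the node had in memory before it went down.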

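CommitLog configuration considerations:

The commitlog_sync parameter has two modes:

  • periodic: Acknowledges writes immediately and syncs the CommitLog to disk every N milliseconds (default 10s). Faster but less durable, since writes from the last unsynced window can be lost if a node crashes.
  • batch: Holds the acknowledgment until the CommitLog has been synced to disk, grouping writes that arrive within a short window (default 2ms). More durable, at the cost of some write latency.

Periodic mode is the default in the stock cassandra.yaml and relies on replication to cover the unsynced window. Batch mode suits workloads that need every acknowledged write to be on disk.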

MemTable

Immediately after writing to the CommitLog, the mutation is written to an in-memory structure called a MemTable. Each table in Cassandra has its own MemTable.

The MemTable is essentially a sorted map structure (similar to a Red-Black tree or Skip List) that keeps data sorted by partition key and clustering columns. This sorting is crucial for efficient reads and for the later flush to disk.
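
Here is a small sketch of that organization. Python has no built-in sorted map, so this uses a sorted list of keys next to a dict. It follows the simplification above of sorting by partition key and clustering key directly, and the class and method names are made up.

```python
import bisect

class MemTable:
    """Toy memtable: rows kept sorted by (partition key, clustering key)."""

    def __init__(self):
        self._keys = []   # sorted list of (partition_key, clustering_key)
        self._rows = {}   # key -> (value, timestamp)

    def put(self, partition_key, clustering_key, value, timestamp):
        key = (partition_key, clustering_key)
        if key not in self._rows:
            bisect.insort(self._keys, key)  # keep keys in sorted order
        current = self._rows.get(key)
        if current is None or timestamp >= current[1]:
            self._rows[key] = (value, timestamp)

    def scan(self):
        """Yield rows in sorted order, as a flush to disk would."""
        for key in self._keys:
            yield key, self._rows[key]

memtable = MemTable()
memtable.put("user:2", "2025-01-02", "login", 1)
memtable.put("user:1", "2025-01-05", "logout", 2)
memtable.put("user:1", "2025-01-01", "login", 3)
print([key for key, _ in memtable.scan()])
# [('user:1', '2025-01-01'), ('user:1', '2025-01-05'), ('user:2', '2025-01-02')]
```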

Multiple writes to the same partition are merged into the MemTable as they arrive. Once data has been flushed, however, Cassandra never modifies it on disk; it writes new timestamped versions. When you “update” a column, you’re really adding a new entry with a newer timestamp, and the versions are reconciled at read time (and later during compaction).

Write Response

Once the write is in the CommitLog and MemTable, the node sends an acknowledgment back to the coordinator. This happens very quickly, typically in microseconds, because it only involves an append to the CommitLog and an in-memory update.

At this point, the write is considered successful from the replica node’s perspective, but the coordinator hasn’t responded to the client yet. That depends on the consistency level.

Flushing MemTables to SSTables

The MemTable is memory-bound, so it can’t grow indefinitely. When a MemTable reaches a certain size threshold or after a certain time period, it’s flushed to disk as an SSTable (Sorted String Table).

The flush process:

  1. The MemTable is marked as immutable (no new writes)
  2. A new MemTable is created for incoming writes
  3. The immutable MemTable, which is already sorted, is written to disk as an SSTable
  4. The corresponding CommitLog segments are marked for deletion

SSTables are immutable once written. This immutability provides several benefits:

  • No need for read-write locks
  • Simplified backup and recovery
  • Efficient sequential disk access
  • Easy to distribute across nodes

However, immutability means we accumulate multiple SSTables over time, which is why compaction is necessary.
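
The flush steps above can be sketched as follows. This is a toy model under loud assumptions: the threshold is a row count rather than a memory size, an SSTable is just an immutable tuple, and the class name Table is invented.

```python
class Table:
    """Toy table with one active memtable and a list of immutable SSTables."""

    def __init__(self, flush_threshold=3):
        self.memtable = {}
        self.sstables = []  # each SSTable is an immutable tuple of rows
        self.flush_threshold = flush_threshold  # hypothetical row-count limit

    def write(self, key, value, timestamp):
        self.memtable[key] = (value, timestamp)
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Steps 1 and 2: freeze the old memtable and swap in a new one.
        frozen, self.memtable = self.memtable, {}
        # Step 3: write the rows out in key order.
        self.sstables.append(tuple(sorted(frozen.items())))
        # Step 4 (not modelled): matching CommitLog segments can be recycled.

table = Table(flush_threshold=2)
table.write("b", "v1", 10)
table.write("a", "v2", 11)   # reaches the threshold and triggers a flush
table.write("a", "v3", 12)   # lands in the new memtable
print(table.sstables)  # [(('a', ('v2', 11)), ('b', ('v1', 10)))]
print(table.memtable)  # {'a': ('v3', 12)}
```

Note that key "a" now exists in two places with two timestamps, which is exactly the kind of duplication compaction cleans up later.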

Consistency Levels

The consistency level determines how many replica nodes must acknowledge a write before the coordinator responds to the client. This is configurable per query and represents a fundamental trade-off between consistency, availability, and latency.

ONE

The coordinator waits for acknowledgment from just one replica node. This provides the lowest latency and highest availability: the write succeeds as long as any single replica node is reachable. However, you might read stale data if you subsequently read from a node that hasn’t received the write yet.

QUORUM

The coordinator waits for acknowledgments from a majority of replica nodes. For a replication factor of 3, QUORUM means 2 nodes must acknowledge. This is the most commonly used level in production because it provides a good balance:

  • QUORUM = floor(replication_factor / 2) + 1
  • Guarantees that reads and writes overlap (if you read with QUORUM, you’ll see the most recent QUORUM write)
  • Can tolerate the failure of a minority of replicas

LOCAL_QUORUM

Similar to QUORUM, but only counts replicas in the local data center. This is crucial for multi-datacenter deployments because you don’t want to wait for cross-datacenter network latency on every write.

ALL

The coordinator waits for all replica nodes to acknowledge. This provides the strongest consistency but has the worst availability; if even one replica is down, writes fail. Generally not recommended for production.

ANY

This is an interesting case. The write is considered successful if it can be written to at least one node, or if it can be stored as a hint for a temporarily unavailable node. More on hints below.

Here’s an Example

Consider a cluster with a replication factor of 3, where you issue a write with consistency level QUORUM.

Timeline:

  1. Client sends write to Node A (coordinator)
  2. Node A determines replicas: Nodes B, C, and D
  3. Node A forwards write to Nodes B, C, and D
  4. Within 1-2ms, Nodes B and C respond (they’ve written to CommitLog and MemTable)
  5. Node A has 2 acknowledgments (QUORUM satisfied)
  6. Node A responds success to the client
  7. Node D responds later (but the coordinator already replied to the client)

The entire operation typically completes in 2-5ms, depending on network latency and load.
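
The quorum arithmetic behind this example is easy to check in code. The function names here are just illustrative.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that make up a majority."""
    return replication_factor // 2 + 1

def reads_see_latest_write(rf: int, write_acks: int, read_acks: int) -> bool:
    """True when every read set must overlap every write set."""
    return write_acks + read_acks > rf

for rf in (3, 5):
    # RF, quorum size, and how many replicas can be down
    print(rf, quorum(rf), rf - quorum(rf))

print(reads_see_latest_write(3, quorum(3), quorum(3)))  # True
print(reads_see_latest_write(3, 1, 1))                  # False
```

With RF=3, a QUORUM write touches 2 replicas and a QUORUM read touches 2 replicas, so at least one replica is in both sets. With ONE on both sides there is no such guarantee.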

Data Replication

Now let’s dive deeper into how data actually gets replicated across nodes. The replication happens synchronously as part of the write operation; there’s no separate background replication process for normal writes.

When the coordinator receives a write, it immediately forwards the mutation to all replica nodes (as determined by the replication strategy and replication factor), in parallel. The consistency level only controls how many acknowledgments the coordinator waits for before replying to the client; the mutation is still sent to every replica as part of the same write.

Each replica node receives the mutation and processes it independently through its own CommitLog → MemTable path.

Timestamps and Conflict Resolution

Cassandra uses last-write-wins (LWW) conflict resolution based on timestamps. Every mutation includes a timestamp (in microseconds), and when multiple versions of the same data exist, the one with the highest timestamp wins.

Timestamps can come from two sources:

  1. Client-provided timestamps: The client sets the timestamp explicitly with the USING TIMESTAMP clause.
  2. Coordinator-generated timestamps: If the client doesn’t provide one, the coordinator assigns a timestamp based on its local clock.

If your cluster’s clocks are not synchronized (using NTP), you can get unexpected results. A write with timestamp 1000 will win over a later write with timestamp 999 if that later write was stamped by a node whose clock is behind, so the newer value is silently discarded.

Hence, always use NTP to keep clocks synchronized across your cluster, with clock drift kept under 500ms.
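
Last-write-wins, and the clock-skew hazard that comes with it, fits in a few lines. The values and the reconcile function are made up for illustration.

```python
def reconcile(versions):
    """Last-write-wins: return the (value, timestamp) with the highest timestamp."""
    return max(versions, key=lambda version: version[1])

# Two coordinators stamp writes with their own clocks (microseconds).
first_write = ("alice@old.example", 1_000)  # coordinator with a correct clock
second_write = ("alice@new.example", 999)   # issued later, but this clock is behind

print(reconcile([first_write, second_write]))  # ('alice@old.example', 1000)
```

The second write happened later in real time, yet it loses because its timestamp is lower. Nothing fails and nothing is logged; the newer value just never shows up.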

Handling Network Partitions

In a distributed system, network partitions are inevitable. Cassandra handles these gracefully through its consistency model and hinted handoff mechanism.

Suppose we have RF=3 (replication factor), CL=QUORUM, and Node C goes down:

  1. Coordinator forwards write to Nodes A, B, and C
  2. Nodes A and B respond successfully
  3. Node C doesn’t respond (it’s down)
  4. Coordinator has 2 responses = QUORUM satisfied
  5. Write succeeds from the client’s perspective
  6. Coordinator stores a hint for Node C

Hinted Handoff

A hint is a special write that says, “when Node C comes back online, replay this mutation to it.” Hints are stored on the coordinator: in a system table in older versions, and in local hint files since Cassandra 3.0.

When Node C comes back online, other nodes detect it and start replaying hints to it. This brings Node C up to date with the writes it missed while down.
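
A minimal sketch of the store-and-replay idea, assuming a toy coordinator that keeps hints in a list. The class and method names are invented, and details such as hint expiry and timestamp checks on replay are left out.

```python
class Coordinator:
    """Toy coordinator that stores hints for replicas that are down."""

    def __init__(self, replicas):
        self.replicas = replicas  # name -> dict acting as that replica's storage
        self.down = set()         # names of replicas currently unreachable
        self.hints = []           # (target replica, key, value, timestamp)

    def write(self, key, value, timestamp):
        acks = 0
        for name, storage in self.replicas.items():
            if name in self.down:
                self.hints.append((name, key, value, timestamp))
            else:
                storage[key] = (value, timestamp)
                acks += 1
        return acks

    def node_up(self, name):
        """Replay stored hints once the replica is reachable again."""
        self.down.discard(name)
        remaining = []
        for target, key, value, timestamp in self.hints:
            if target == name:
                self.replicas[name][key] = (value, timestamp)
            else:
                remaining.append((target, key, value, timestamp))
        self.hints = remaining

coordinator = Coordinator({"A": {}, "B": {}, "C": {}})
coordinator.down.add("C")
acks = coordinator.write("user:1", "alice", 1)
print(acks, len(coordinator.hints))  # 2 1

coordinator.node_up("C")
print(coordinator.replicas["C"])     # {'user:1': ('alice', 1)}
```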

Important limitations of hints:

  • Hints are best-effort, not guaranteed
  • If a node is down for longer than max_hint_window_in_ms (default 3 hours), the coordinator stops storing new hints for it
  • Hints increase the load on the coordinator
  • For extended outages, we run a repair instead

Read Repair

Even with a hinted handoff, replicas can get out of sync. Cassandra uses read repair to detect and fix inconsistencies during read operations.

When you read with a consistency level higher than ONE, the coordinator sends read requests to multiple replicas and compares their responses. If it finds differences, it:

  1. Returns the most recent data to the client (highest timestamp wins)
  2. Sends the most recent data to replicas with stale data
  3. Those replicas update themselves in the background
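
The steps above can be sketched as a single function. This is a simplified model: replicas are plain dicts, the repair happens inline rather than in the background, and the function name is made up.

```python
def read_with_repair(replicas, key):
    """Read `key` from every replica, return the newest value, fix stale replicas.

    `replicas` maps a replica name to a dict of key -> (value, timestamp).
    Returns (value, names of the replicas that were repaired).
    """
    responses = {name: data.get(key) for name, data in replicas.items()}
    present = [r for r in responses.values() if r is not None]
    if not present:
        return None, []
    newest = max(present, key=lambda r: r[1])  # highest timestamp wins
    repaired = []
    for name, response in responses.items():
        if response != newest:
            replicas[name][key] = newest  # push the newest version to the stale replica
            repaired.append(name)
    return newest[0], repaired

replicas = {
    "A": {"user:1": ("new", 2)},
    "B": {"user:1": ("new", 2)},
    "C": {"user:1": ("old", 1)},
}
print(read_with_repair(replicas, "user:1"))  # ('new', ['C'])
print(replicas["C"])                         # {'user:1': ('new', 2)}
```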

This happens transparently and helps maintain eventual consistency. Older versions also offered a table option, read_repair_chance, for additional background repair. Setting it to 0.1, for example, tells Cassandra to perform read repair on 10% of reads, even if the consistency level is ONE. This option was often disabled from Cassandra 3.0 onward in favor of explicit repair operations, and it was removed in Cassandra 4.0.

Write Performance Characteristics

Write Throughput

Cassandra is write-optimized. A single node can handle 10,000-50,000 writes per second, depending on:

  • Hardware (SSD vs HDD, CPU cores, RAM)
  • Data model (wide vs narrow partitions)
  • Consistency level (ONE vs QUORUM vs ALL)
  • Batch size (single writes vs batched writes)

The write path is designed for speed:

  • Sequential CommitLog writes (no random I/O)
  • In-memory MemTable updates (microsecond-scale latency)
  • Parallel replication (no waiting for sequential replication)

Write Latency

Typical write latencies (p99):

  • Consistency Level ONE: 1-3ms
  • Consistency Level QUORUM: 2-5ms
  • Consistency Level LOCAL_QUORUM: 2-5ms (within a datacenter)
  • Consistency Level ALL: 5-20ms (depends on slowest node)

These are remarkably consistent because writes don’t involve disk reads—only sequential appends to the CommitLog.

Factors Affecting Write Performance

  1. Consistency Level

Higher consistency levels increase latency because the coordinator must wait for more acknowledgments.

  2. Batch Writes

Cassandra supports batches, but they’re not always a performance win.

Logged batches (default) use a batch log for atomicity, which adds overhead. They’re useful for ensuring multiple writes succeed or fail together, but they’re slower than individual writes.

Unlogged batches skip the batch log and are faster, but they don’t guarantee atomicity. They’re only useful for performance when writing to the same partition.

  3. Partition Size

Writing to the same partition repeatedly can create “hot spots.” Cassandra performs best when writes are distributed across many partitions. If you have a counter table that updates the same partition millions of times, you’ll experience performance degradation.

  4. Hardware

  • SSD vs HDD: SSDs provide much better write performance (CommitLog especially)
  • RAM: Larger MemTables reduce flush frequency
  • Network: Low-latency networks reduce replication overhead

Compaction

As SSTables accumulate, read performance degrades (you have to check more files) and disk space increases (duplicate and deleted data). Compaction solves this by merging SSTables and removing deleted or overwritten data.
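
At its core, compaction is a merge that keeps the newest version of each key. The sketch below is a simplification: it drops deleted rows right away, whereas a real cluster keeps deletion markers around for a grace period first. The TOMBSTONE marker and the compact function are made up for the example.

```python
TOMBSTONE = None  # marker for a deleted value in this sketch

def compact(sstables):
    """Merge SSTables into one, keeping only the newest version of each key.

    Each SSTable is a list of (key, value, timestamp) rows. Keys whose
    newest version is a TOMBSTONE are dropped from the output.
    """
    newest = {}
    for sstable in sstables:
        for key, value, timestamp in sstable:
            if key not in newest or timestamp > newest[key][1]:
                newest[key] = (value, timestamp)
    return [
        (key, value, timestamp)
        for key, (value, timestamp) in sorted(newest.items())
        if value is not TOMBSTONE
    ]

older = [("a", "1", 10), ("b", "2", 11)]
newer = [("a", "3", 20), ("b", TOMBSTONE, 21), ("c", "4", 22)]
print(compact([older, newer]))  # [('a', '3', 20), ('c', '4', 22)]
```

Five rows across two files become two rows in one file, so reads check fewer files and the overwritten and deleted data stops taking up disk space.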

Compaction Strategies

SizeTieredCompactionStrategy (STCS)

  • Default strategy
  • Groups SSTables of similar size and merges them
  • Good for write-heavy workloads
  • Can create large temporary disk spikes (requires 50% free space)

LeveledCompactionStrategy (LCS)

  • Organizes SSTables into levels of increasing size
  • Each level is 10x larger than the previous
  • Better read performance (fewer SSTables per read)
  • More I/O intensive (more frequent compactions)
  • Good for read-heavy workloads

TimeWindowCompactionStrategy (TWCS)

  • Designed for time-series data
  • Groups SSTables by time window (e.g., daily, hourly)
  • Old windows are never compacted with newer ones
  • Perfect for time-series data with TTL
  • Very efficient for workloads where old data expires

Impact on Write Performance

Compaction runs in the background but competes for I/O resources. Heavy compaction can temporarily impact write performance. You can limit this by tuning compaction throughput and concurrency in cassandra.yaml.

Multi-Datacenter Replication

Cassandra deployments often span multiple datacenters for disaster recovery and geographical distribution. This is how replication works across datacenters.

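NetworkTopologyStrategy Configuration

Consider a keyspace that uses NetworkTopologyStrategy with a separate replication factor for each of three datacenters. This creates: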

  • 3 replicas in us-east datacenter
  • 3 replicas in us-west datacenter
  • 2 replicas in eu-west datacenter
  • Total of 8 replicas globally

Cross-Datacenter Write Flow

When a write arrives at a coordinator in us-east with consistency level LOCAL_QUORUM:

  1. The coordinator determines all 8 replica nodes: 3 in us-east, 3 in us-west, and 2 in eu-west
  2. Coordinator immediately forwards the write to all 8 nodes in parallel
  3. Coordinator waits for 2 acknowledgments from us-east replicas (LOCAL_QUORUM for RF=3)
  4. Coordinator responds to client (~5ms total)
  5. Acknowledgments from us-west and eu-west arrive later (~50-200ms depending on distance)
  6. All replicas now have the data

Cross-datacenter replication is still synchronous (the write is sent immediately), but LOCAL_QUORUM doesn’t wait for remote datacenters. This provides consistency within a datacenter and eventual consistency across datacenters.
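
To see what the coordinator actually waits for, here is a small sketch that computes the required acknowledgments per datacenter. The function names are illustrative and only the two multi-DC levels discussed here are handled.

```python
def quorum(replication_factor):
    return replication_factor // 2 + 1

def required_acks(consistency_level, replicas_per_dc, local_dc):
    """Return {datacenter: acknowledgments the coordinator must wait for}."""
    if consistency_level == "LOCAL_QUORUM":
        return {local_dc: quorum(replicas_per_dc[local_dc])}
    if consistency_level == "EACH_QUORUM":
        return {dc: quorum(rf) for dc, rf in replicas_per_dc.items()}
    raise ValueError(f"unsupported consistency level: {consistency_level}")

replicas_per_dc = {"us-east": 3, "us-west": 3, "eu-west": 2}
print(required_acks("LOCAL_QUORUM", replicas_per_dc, "us-east"))
# {'us-east': 2}
print(required_acks("EACH_QUORUM", replicas_per_dc, "us-east"))
# {'us-east': 2, 'us-west': 2, 'eu-west': 2}
```

With LOCAL_QUORUM, the remote datacenters do not appear in the result at all, which is why the write latency stays at local-network speed.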

Consistency Level Choice in Multi-DC

LOCAL_QUORUM: Most common for production

  • Fast (doesn’t wait for remote DCs)
  • Consistent within the local DC
  • Eventually consistent across DCs
  • Can tolerate an entire DC failure

EACH_QUORUM: Stronger consistency

  • Waits for QUORUM in every DC
  • Much higher latency (wait for all DCs)
  • Guarantees strong consistency globally
  • Use when you need immediate global consistency

Failure Scenarios and Handling

Scenario 1: Single Node Failure (RF=3, CL=QUORUM)

Cluster with nodes A, B, C, D. Node C fails. RF=3, so data has replicas on A, B, and C.

Write behavior:

  • Coordinator sends write to A, B, C
  • A and B respond (QUORUM=2 satisfied)
  • C doesn’t respond
  • Write succeeds
  • Hint stored for C

No impact on write availability. Writes continue successfully.

Scenario 2: Multiple Node Failures (RF=3, CL=QUORUM)

Same cluster, nodes B and C both fail.

Write behavior:

  • Coordinator sends write to A, B, C
  • Only A responds
  • QUORUM not satisfied (need 2, have 1)
  • Write fails with Unavailable exception

Write availability is lost for data whose replica set includes both B and C. However, data whose replica set includes only one of the failed nodes (e.g., A, B, D) continues working.

Scenario 3: Network Partition

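Cluster split into two groups: {A, B} and {C, D}. RF=3, CL=QUORUM. Consider data with replicas on A, B, and C.

Write behavior:

  • Coordinator in group {A, B}: Reaches A and B (QUORUM=2 satisfied) → Writes succeed, and a hint is stored for C
  • Coordinator in group {C, D}: Reaches only C (need 2, have 1) → Writes fail
  • Result: For any given partition, at most one side of the split can reach QUORUM, so write availability is lost for clients on the other side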

When the network heals, read repair and explicit repair (nodetool repair) restore consistency.

Scenario 4: Datacenter Failure (Multi-DC)

Setup: Three datacenters: US (RF=3), EU (RF=3), ASIA (RF=2). US datacenter goes offline.

Write behavior with LOCAL_QUORUM in EU:

  • Coordinator in EU
  • Sends writes to EU replicas
  • Gets QUORUM from EU replicas
  • Write succeeds

No impact on EU writes. US writes fail. ASIA applications can continue if they write to EU or ASIA with LOCAL_QUORUM.

Common Pitfalls

Using ALL Consistency Level

Problem: Sacrifices availability for consistency. Any single node failure causes writes to fail.

Solution: Use QUORUM or LOCAL_QUORUM instead. They provide strong consistency while tolerating failures.

Large Batches

Problem: Batching unrelated writes hurts performance and can overwhelm coordinators.

Solution: Only batch writes to the same partition, or use unlogged batches when atomicity isn’t needed.

Using Client Timestamps Inconsistently

Problem: Mixing client-side and server-side timestamps leads to unpredictable conflict resolution.

Solution: Either always use USING TIMESTAMP or never use it. Be consistent across your application.

Ignoring NTP

Problem: Clock drift causes last-write-wins to produce unexpected results.

Solution: Keep NTP running on all nodes with drift under 500ms. Monitor clock sync regularly.

Writing to Hot Partitions

Problem: Repeatedly updating the same partition creates a bottleneck.

Solution: Distribute load across partitions.

Best Practices

Choose the Right Consistency Level

  • Write-heavy: Use ONE or LOCAL_ONE for maximum throughput
  • Balanced: Use LOCAL_QUORUM (most common)
  • Read-heavy requiring strong consistency: Use QUORUM or EACH_QUORUM

Design for Distribution

Ensure your partition keys distribute data evenly:
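
One common way to do this is to add a bucket component to a hot key. The sketch below is a toy model: the hash is a stand-in for the real partitioner, and the key names, bucket count, and helper functions are all hypothetical.

```python
from collections import Counter
import hashlib

def partition_for(partition_key: str, partitions: int) -> int:
    """Toy placement: hash the key and map it to one of `partitions` slots."""
    digest = hashlib.blake2b(partition_key.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big") % partitions

def bucketed_key(base_key: str, event_id: int, buckets: int) -> str:
    """Spread one hot logical key over several physical partition keys."""
    return f"{base_key}:{event_id % buckets}"

events = range(1000)
hot = Counter(partition_for("page_views", 8) for _ in events)
spread = Counter(partition_for(bucketed_key("page_views", e, 16), 8) for e in events)
print(len(hot), len(spread))  # distinct slots hit by each key design
```

A single key always lands in one slot, while the bucketed keys can land in many. The trade-off is on the read side: to get the full picture for "page_views", a reader has to query every bucket and combine the results.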

Use TTL for Time-Series Data

Instead of manually deleting old data, set a TTL when you write it so that it expires automatically. This is much more efficient than DELETE statements.

Monitor and Repair

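Run regular repairs (nodetool repair) to ensure replicas stay synchronized.

Schedule repairs weekly or monthly, depending on your workload, and make sure every node is repaired at least once within gc_grace_seconds (10 days by default) so that deleted data cannot reappear.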

Footnotes

Cassandra uses a Log-Structured Merge (LSM) tree to hold the data, where writes are first persisted to an append-only CommitLog before being written to an in-memory MemTable.

MemTables are periodically flushed to disk as immutable SSTables, and background compaction merges them to optimize read performance. This classic Write-Ahead Logging (WAL) mechanism ensures durability and high write throughput via sequential disk I/O.

The coordinator node also takes care of writing to all the replicas. If a write to one of the replicas fails or times out, hinted handoff is used to bring that replica up to date and maintain eventual consistency. Cassandra uses quorums with tunable consistency levels, allowing trade-offs between consistency and availability for both reads and writes.