InkdownInkdown
Start writing

Arpit Bhayani Blogs

336 filesยท168 subfolders

Shared Workspace

Arpit Bhayani Blogs
001 Ai Topological Sort

144-decipher-single-xor

Shared from "Arpit Bhayani Blogs" on Inkdown

ETAOIN SHRDLU - Using Letter Frequency to Decipher the Ciphered

Source: https://arpitbhayani.me/blogs/decipher-single-xor Date: 2020-06-21

Crack single-byte XOR ciphers! Learn how to decipher encrypted messages without the key using frequency analysis and Python.


Encryption is a process of encoding messages such that it can only be read and understood by the intended parties. The process of extracting the original message from an encrypted one is called Decryption. Encryption usually scrambles the original message using a key, called encryption key, that the involved parties agree on.

The strength of an encryption algorithm is determined by how hard it would be to extract the original message without knowing the encryption key. Usually, this depends on the number of bits in the key - bigger the key, the longer it takes to decrypt the enciphered data.

In this essay, we will work with a very simple cipher (encryption algorithm) that uses an encryption key with a size of one byte, and try to decipher the ciphered text and retrieve the original message without knowing the encryption key. The problem statement, defined above, is based on .

001-ai-topological-sort.md
tldr.md
002 Temporal Primer
002-temporal-primer.md
tldr.md
003 Rag Production
003-rag-production.md
tldr.md
004 Structure Of Llm Chat
004-structure-of-llm-chat.md
tldr.md
005 How Llms Work
005-how-llms-work.md
tldr.md
006 Monolith Is Distributed System
006-monolith-is-distributed-system.md
tldr.md
007 Defensive Databases
007-defensive-databases.md
tldr.md
008 Bm25
008-bm25.md
tldr.md
009 Join Algorithms
009-join-algorithms.md
tldr.md
010 Venting At Work
010-venting-at-work.md
tldr.md
011 Half Life
011-half-life.md
tldr.md
012 Multi Paxos
012-multi-paxos.md
tldr.md
013 Mysql Replication Internals
013-mysql-replication-internals.md
tldr.md
014 Bloom Filters
014-bloom-filters.md
tldr.md
015 Clock Sync Nightmare
015-clock-sync-nightmare.md
tldr.md
016 Kafka Partitions
016-kafka-partitions.md
tldr.md
017 Product Quantization
017-product-quantization.md
tldr.md
018 Qkv Matrices
018-qkv-matrices.md
tldr.md
019 Deleted Production
019-deleted-production.md
tldr.md
020 How Llm Inference Works
020-how-llm-inference-works.md
tldr.md
021 Blocking Queues
021-blocking-queues.md
tldr.md
022 Heartbeats In Distributed Systems
022-heartbeats-in-distributed-systems.md
tldr.md
023 Cassandra Writes
023-cassandra-writes.md
tldr.md
024 Redis Replication
024-redis-replication.md
tldr.md
025 Arrogant People At Work
025-arrogant-people-at-work.md
tldr.md
026 Cdn Content Replication
026-cdn-content-replication.md
tldr.md
027 Cant Fix Everything Day One
027-cant-fix-everything-day-one.md
tldr.md
028 Emotions At Work
028-emotions-at-work.md
tldr.md
029 Grpc Http2
029-grpc-http2.md
tldr.md
030 Meetings With No Agenda Are A Waste Of Time
030-meetings-with-no-agenda-are-a-waste-of-time.md
tldr.md
031 Growth Is Not About Doing Everything
031-growth-is-not-about-doing-everything.md
tldr.md
032 Career Longevity Vs Job Hopping
032-career-longevity-vs-job-hopping.md
tldr.md
033 Stay Relevant At Higher Salary Levels
033-stay-relevant-at-higher-salary-levels.md
tldr.md
034 Why Consensus
034-why-consensus.md
tldr.md
035 Database Deadlocks
035-database-deadlocks.md
tldr.md
036 Cpu Cache Locality
036-cpu-cache-locality.md
tldr.md
037 Eventual Consistency
037-eventual-consistency.md
tldr.md
038 Dns Udp Tcp
038-dns-udp-tcp.md
tldr.md
039 Masters
039-masters.md
tldr.md
040 Empathy Makes Great Engineers Unstoppable
040-empathy-makes-great-engineers-unstoppable.md
tldr.md
041 Good Mentors Build People
041-good-mentors-build-people.md
tldr.md
042 Always Have Back Burner Projects
042-always-have-back-burner-projects.md
tldr.md
043 Before You Push Back Know What Youre Standing On
043-before-you-push-back-know-what-youre-standing-on.md
tldr.md
044 Be The One They Can Count On
044-be-the-one-they-can-count-on.md
tldr.md
045 How Much People Bet On You
045-how-much-people-bet-on-you.md
tldr.md
046 How To Get Leadership To Say Yes To Your Project
046-how-to-get-leadership-to-say-yes-to-your-project.md
tldr.md
047 Dont Let Your Best Ideas Die In Silence
047-dont-let-your-best-ideas-die-in-silence.md
tldr.md
048 Be Someone Others Want To Work With
048-be-someone-others-want-to-work-with.md
tldr.md
049 Dont Fall For Xy Problem Ask Right Questions
049-dont-fall-for-xy-problem-ask-right-questions.md
tldr.md
050 Biggest Lie Startups Tell Engineers
050-biggest-lie-startups-tell-engineers.md
tldr.md
051 Promotions Are Proactive Not Reactive
051-promotions-are-proactive-not-reactive.md
tldr.md
052 Not Enough To Be Right Learn To Be Heard
052-not-enough-to-be-right-learn-to-be-heard.md
tldr.md
053 No One Ships Alone
053-no-one-ships-alone.md
tldr.md
054 Not Every Mistake Needs A Correction
054-not-every-mistake-needs-a-correction.md
tldr.md
055 Build Influence At Work
055-build-influence-at-work.md
tldr.md
056 Your Soft Skills Arent Soft At All
056-your-soft-skills-arent-soft-at-all.md
tldr.md
057 Experience Before Forming Opinion
057-experience-before-forming-opinion.md
tldr.md
058 Curiosity And High Bias For Action
058-curiosity-and-high-bias-for-action.md
tldr.md
059 Worklog
059-worklog.md
tldr.md
060 Mistakes And Growth
060-mistakes-and-growth.md
tldr.md
061 Own It Instead Of Sweeping It Aside
061-own-it-instead-of-sweeping-it-aside.md
tldr.md
062 Dont Wait Step Up
062-dont-wait-step-up.md
tldr.md
063 Temporary Fix Is Permanent
063-temporary-fix-is-permanent.md
tldr.md
064 Interview Bias And What Sets You Apart
064-interview-bias-and-what-sets-you-apart.md
tldr.md
065 Saying This Isnt My Problem Is A Problem
065-saying-this-isnt-my-problem-is-a-problem.md
tldr.md
066 Okr
066-okr.md
tldr.md
067 Miscommunication
067-miscommunication.md
tldr.md
068 When In Doubt Code It Out
068-when-in-doubt-code-it-out.md
tldr.md
069 Follow Up Without Annoying People
069-follow-up-without-annoying-people.md
tldr.md
070 Lead Projects That Land
070-lead-projects-that-land.md
tldr.md
071 Abstract Thinking Skill Next Decade
071-abstract-thinking-skill-next-decade.md
tldr.md
072 We Engineers Suck At Task Estimation
072-we-engineers-suck-at-task-estimation.md
tldr.md
073 Shiny Object Syndrome In Tech
073-shiny-object-syndrome-in-tech.md
tldr.md
074 3p
074-3p.md
tldr.md
075 Leverage The Equilibrium
075-leverage-the-equilibrium.md
tldr.md
076 On Demand Container Loading In Aws Lambda
076-on-demand-container-loading-in-aws-lambda.md
tldr.md
077 Sql Has Problems We Can Fix Them Pipe Syntax In Sql
077-sql-has-problems-we-can-fix-them-pipe-syntax-in-sql.md
tldr.md
078 Nanolog A Nanosecond Scale Logging System
078-nanolog-a-nanosecond-scale-logging-system.md
tldr.md
079 Best Resource Is Mythical
079-best-resource-is-mythical.md
tldr.md
080 Wtf The Who To Follow Service At Twitter
080-wtf-the-who-to-follow-service-at-twitter.md
tldr.md
081 Know A Lot
081-know-a-lot.md
tldr.md
082 Out Of Syllabus
082-out-of-syllabus.md
tldr.md
083 Negotiate The Offer
083-negotiate-the-offer.md
tldr.md
084 Never Bad Mouth Your Ex Exployer
084-never-bad-mouth-your-ex-exployer.md
tldr.md
085 Culture Fit
085-culture-fit.md
tldr.md
086 Quantification In Resume
086-quantification-in-resume.md
tldr.md
087 Hiring Is Unfair
087-hiring-is-unfair.md
tldr.md
088 Questions For Interviewers
088-questions-for-interviewers.md
tldr.md
089 Collaboration Communication
089-collaboration-communication.md
tldr.md
090 Out Of Vicious Interview Cycle
090-out-of-vicious-interview-cycle.md
tldr.md
091 Pitch Projects Not Ideas
091-pitch-projects-not-ideas.md
tldr.md
092 Read Design Docs
092-read-design-docs.md
tldr.md
093 Read Rca Docs
093-read-rca-docs.md
tldr.md
094 Start Generalist
094-start-generalist.md
tldr.md
095 Do Not Rely On Summaries
095-do-not-rely-on-summaries.md
tldr.md
096 Structure Your Design Interviews
096-structure-your-design-interviews.md
tldr.md
097 Title Inflation
097-title-inflation.md
tldr.md
098 Find Your Own Project
098-find-your-own-project.md
tldr.md
099 Six Pointers To Crack Coding And Design Interviews
099-six-pointers-to-crack-coding-and-design-interviews.md
tldr.md
100 Keep Yourself Unblocked
100-keep-yourself-unblocked.md
tldr.md
101 Genetic Knapsack
101-genetic-knapsack.md
tldr.md
102 Pseudorandom Number Generation Lfsr
102-pseudorandom-number-generation-lfsr.md
tldr.md
103 How Indexes Work On Partitioned And Sharded Data
103-how-indexes-work-on-partitioned-and-sharded-data.md
tldr.md
104 Some Data Partitioning Strategies For Distributed Data Stores
104-some-data-partitioning-strategies-for-distributed-data-stores.md
tldr.md
105 Data Partitioning
105-data-partitioning.md
tldr.md
106 Leaderless Replication
106-leaderless-replication.md
tldr.md
107 Conflict Resolution
107-conflict-resolution.md
tldr.md
108 Conflict Detection
108-conflict-detection.md
tldr.md
109 Multi Master Replication
109-multi-master-replication.md
tldr.md
110 Monotonic Reads
110-monotonic-reads.md
tldr.md
111 Read Your Write Consistency
111-read-your-write-consistency.md
tldr.md
112 Handling Outages Master Replica
112-handling-outages-master-replica.md
tldr.md
113 Replication Formats
113-replication-formats.md
tldr.md
114 Replication Strategies
114-replication-strategies.md
tldr.md
115 Master Replica Replication
115-master-replica-replication.md
tldr.md
116 Durability
116-durability.md
tldr.md
117 Isolation
117-isolation.md
tldr.md
118 Atomicity
118-atomicity.md
tldr.md
119 Consistency
119-consistency.md
tldr.md
120 Architectures In Distributed Systems
120-architectures-in-distributed-systems.md
tldr.md
121 Mistaken Beliefs Of Distributed Systems
121-mistaken-beliefs-of-distributed-systems.md
tldr.md
122 Fork Bomb
122-fork-bomb.md
tldr.md
123 Chained Operators Python
123-chained-operators-python.md
tldr.md
124 Taxonomy On Sql
124-taxonomy-on-sql.md
tldr.md
125 The Weird Walrus
125-the-weird-walrus.md
tldr.md
126 Fully Persistent Arrays
126-fully-persistent-arrays.md
tldr.md
127 Persistent Data Structures Introduction
127-persistent-data-structures-introduction.md
tldr.md
128 Constant Folding Python
128-constant-folding-python.md
tldr.md
129 String Interning Python
129-string-interning-python.md
tldr.md
130 Recursion Visualizer Python
130-recursion-visualizer-python.md
tldr.md
131 Flajolet Martin
131-flajolet-martin.md
tldr.md
132 2q Cache
132-2q-cache.md
tldr.md
133 Israeli Queues
133-israeli-queues.md
tldr.md
134 1d Terrain
134-1d-terrain.md
tldr.md
135 Jaccard Minhash
135-jaccard-minhash.md
tldr.md
136 Ts Smoothing
136-ts-smoothing.md
tldr.md
137 Lfu
137-lfu.md
tldr.md
138 Morris Counter
138-morris-counter.md
tldr.md
139 Slowsort
139-slowsort.md
tldr.md
140 Bitcask
140-bitcask.md
tldr.md
141 Phi Accrual
141-phi-accrual.md
tldr.md
142 10x Engineer
142-10x-engineer.md
tldr.md
143 Decipher Repeated Key Xor
143-decipher-repeated-key-xor.md
tldr.md
144 Decipher Single Xor
144-decipher-single-xor.md
tldr.md
145 Python Iterable Integers
145-python-iterable-integers.md
tldr.md
146 Inheritance C
146-inheritance-c.md
tldr.md
147 Rum
147-rum.md
tldr.md
148 Consistent Hashing
148-consistent-hashing.md
tldr.md
149 Python Caches Integers
149-python-caches-integers.md
tldr.md
150 Fractional Cascading
150-fractional-cascading.md
tldr.md
151 Copy On Write
151-copy-on-write.md
tldr.md
152 Midpoint Insertion Caching Strategy
152-midpoint-insertion-caching-strategy.md
tldr.md
153 Fsm Python
153-fsm-python.md
tldr.md
154 Bayesian Average
154-bayesian-average.md
tldr.md
155 Sliding Window Ratelimiter
155-sliding-window-ratelimiter.md
tldr.md
156 Idf
156-idf.md
tldr.md
157 Better Programmer
157-better-programmer.md
tldr.md
158 Python Prompts
158-python-prompts.md
tldr.md
159 Rule 30 Cellular Automata
159-rule-30-cellular-automata.md
tldr.md
160 Function Overloading
160-function-overloading.md
tldr.md
161 Isolation Forest
161-isolation-forest.md
tldr.md
162 Image Steganography
162-image-steganography.md
tldr.md
163 Long Integers Python
163-long-integers-python.md
tldr.md
164 I Changed My Python
164-i-changed-my-python.md
tldr.md
165 Benchmark And Compare Pagination Approach In Mongodb
165-benchmark-and-compare-pagination-approach-in-mongodb.md
tldr.md
166 Mongodb Cursor Skip Is Slow
166-mongodb-cursor-skip-is-slow.md
tldr.md
167 Fast And Efficient Pagination In Mongodb
167-fast-and-efficient-pagination-in-mongodb.md
tldr.md
168 Making Http Requests Using Netcat
168-making-http-requests-using-netcat.md
tldr.md
Cryptopals Set 1 Challenge 3

Single-byte XOR cipher

The Single-byte XOR cipher algorithm works with an encryption key of size 1 byte - which means the encryption key could be one of the possible 256 values of a byte. Now we take a detailed look at how the encryption and decryption processes look like for this cipher.

Encryption

As part of the encryption process, the original message is iterated bytewise and every single byte b is XORed with the encryption key key and the resultant stream of bytes is again translated back as characters and sent to the other party. These encrypted bytes need not be among the usual printable characters and should ideally be interpreted as a stream of bytes. Following is the python-based implementation of the encryption process.

Plain text

As an example, we can try to encrypt the plain text - abcd - with encryption key 69 and as per the algorithm, we perform XOR bytewise on the given plain text. For character a, the byte i.e. ASCII value is 97 which when XORed with 69 results in 36 whose character equivalent is $, similarly for b the encrypted byte is ', for c it is & and for d it is !. Hence when abcd is encrypted using single-byte XOR cipher and encryption key 69, the resultant ciphertext i.e. the encrypted message is $'&!.

https://user-images.githubusercontent.com/4745789/85209379-0b377f80-b355-11ea-8206-54ad558b4a6f.png

Decryption

Decryption is the process of extracting the original message from the encrypted ciphertext given the encryption key. XOR has a property - if a = b ^ c then b = a ^ c, hence the decryption process is exactly the same as the encryption i.e. we iterate through the encrypted message bytewise and XOR each byte with the encryption key - the resultant will be the original message.

Since encryption and decryption both have an exact same implementation - we pass the ciphertext to the function single_byte_xor, defined above, to get the original message back.

Plain text

Deciphering without the encryption key

Things become really interesting when we have to recover the original message given the ciphertext and having no knowledge of the encryption key; although we do know the encryption algorithm.

As a sample plain text, we take the last couple of messages, sent across on their German military radio network during World War II. These messages were intercepted and decrypted by the British troops. During wartime, the messages were encrypted using Enigma Machine and Alan Turing famously cracked the Enigma Code (similar to encryption key) that was used to encipher German messages.

https://user-images.githubusercontent.com/4745789/85209533-72096880-b356-11ea-8a84-97f2feb86b44.png

In this essay, instead of encrypting the message using the Enigma Code, we are going to use Single-byte XOR cipher and try to recover the original message back without any knowledge of the encryption key.

Here, we assume that the original message, to be encrypted, is a genuine English lowercased sentence. The ciphertext that we would try to decipher can be obtained as

Plain text

Bruteforce

There are a very limited number of possible encryption keys - 256 to be exact - we can, very conveniently, go for the Bruteforce approach and try to decrypt the ciphered text with every single one of it. So we start iterating over all keys in the range [0, 256) and decrypt the ciphertext and see which one resembles the original message the most.

https://user-images.githubusercontent.com/4745789/85209704-ad586700-b357-11ea-8b7c-4d4616af609a.png

In the illustration above, we see that the message decrypted through key 82 is, in fact, our original message, while the other retrieved plain texts look scrambled and garbage. Doing this visually is very easy; we, as humans, are able to comprehend familiarity but how will a computer recognize this?

We need a way to quantify the closeness of a text to a genuine English sentence. Closer the decrypted text is to be a genuine English sentence, the closer it would be to our original plain text.

We can do this only because of our assumption - that the original plain text is a genuine English sentence.

ETAOIN SHRDLU

Letter Frequency is the number of times letters of an alphabet appear on average in written language. In the English language the letter frequency of letter a is 8.239%, for b it is 1.505% which means out of 100 letters written in English, the letter a, on an average, will show up 8.239% of times while b shows up 1.505% of times. Letter frequency (in percentage) for other letters is as shown below.

Plain text

This Letter Frequency analysis is a rudimentary way for language identification in which we see if the current letter frequency distribution of a text matches the average letter frequency distribution of the English language. ETAOIN SHRDLU is the approximate order of frequency of the 12 most commonly used letters in the English language.

The following chart shows Letter Frequency analysis for decrypted plain texts with encryption keys from 79 to 84.

https://user-images.githubusercontent.com/4745789/85209804-5a32e400-b358-11ea-8e1b-2b6bb3e22868.png

In the illustration above, we could clearly see how well the Letter Frequency distribution for encryption key 82 fits the distribution of the English language. Now that our hypothesis holds true, we need a way to quantify this measure and we call if the Fitting Quotient.

Fitting Quotient

Fitting Quotient is the measure that suggests how well the two Letter Frequency Distributions match. Heuristically, we define the Fitting Quotient as the average of the absolute difference between the frequencies (in percentage) of letters in text and the corresponding letter in the English Language. Thus having a smaller value of Fitting Quotient implies the text is closer to the English Language.

https://user-images.githubusercontent.com/4745789/85219888-f2ff4900-b3c4-11ea-933a-96e26580a3fb.png

Python-based implementation of the, above defined, Fitting Quotient is as shown below. The function first computes the relative frequency for each letter in text and then takes an average of the absolute difference between the two distributions.

Plain text

Deciphering

Now that we have everything we require to directly get the plain text out of the given ciphertext we wrap it in a function that iterates over all possible encryption keys in the range [0, 256), decrypts the ciphertext, computes the fitting quotient for the plain text and returns the one that minimizes the quotient as the original message. Python-based implementation of this deciphering logic is as illustrated below.

Plain text

This approach was also tested against 100 random English sentences with random Encryption keys and it was found that this deciphering technique fared well for all the samples. The approach would fail if the sentence is very short or contains a lot of symbols. The source code for this entire deciphering process is available in a Jupyter notebook at arpitbhayani.me/decipher-single-byte-xor.

References

  • Etaoin shrdlu
  • English Letter Frequency
  • Single-byte XOR encryption
  • Cryptopals Challenge - Set 1 Challenge 3