InkdownInkdown
Start writing

Arpit Bhayani Blogs

336 files·168 subfolders

Shared Workspace

Arpit Bhayani Blogs
001 Ai Topological Sort

161-isolation-forest

Shared from "Arpit Bhayani Blogs" on Inkdown

Isolation Forest - Fast and Efficient Anomaly Detection

Source: https://arpitbhayani.me/blogs/isolation-forest Date: 2020-01-31

Uncover anomalies with Isolation Forest, an unsupervised algorithm. Learn its core principles, tree construction, and scoring for anomaly detection.


Anomaly detection is identifying something that could not be stated as “normal”; the definition of “normal” depends on the phenomenon that is being observed and the properties it bears. In this article, we dive deep into an unsupervised anomaly detection algorithm called Isolation Forest. This algorithm beautifully exploits the characteristics of anomalies, keeping it independent of data distributions making the approach novel.

Characteristics of anomalies

Since anomalies deviate from normal, they are few in numbers (minority) and/or have attribute values that are very different from those of normal. The paper nicely puts it as . These characteristics of anomalies make them more susceptible to isolation than normal points and form the guiding principle of the Isolation Forest algorithm.

001-ai-topological-sort.md
tldr.md
002 Temporal Primer
002-temporal-primer.md
tldr.md
003 Rag Production
003-rag-production.md
tldr.md
004 Structure Of Llm Chat
004-structure-of-llm-chat.md
tldr.md
005 How Llms Work
005-how-llms-work.md
tldr.md
006 Monolith Is Distributed System
006-monolith-is-distributed-system.md
tldr.md
007 Defensive Databases
007-defensive-databases.md
tldr.md
008 Bm25
008-bm25.md
tldr.md
009 Join Algorithms
009-join-algorithms.md
tldr.md
010 Venting At Work
010-venting-at-work.md
tldr.md
011 Half Life
011-half-life.md
tldr.md
012 Multi Paxos
012-multi-paxos.md
tldr.md
013 Mysql Replication Internals
013-mysql-replication-internals.md
tldr.md
014 Bloom Filters
014-bloom-filters.md
tldr.md
015 Clock Sync Nightmare
015-clock-sync-nightmare.md
tldr.md
016 Kafka Partitions
016-kafka-partitions.md
tldr.md
017 Product Quantization
017-product-quantization.md
tldr.md
018 Qkv Matrices
018-qkv-matrices.md
tldr.md
019 Deleted Production
019-deleted-production.md
tldr.md
020 How Llm Inference Works
020-how-llm-inference-works.md
tldr.md
021 Blocking Queues
021-blocking-queues.md
tldr.md
022 Heartbeats In Distributed Systems
022-heartbeats-in-distributed-systems.md
tldr.md
023 Cassandra Writes
023-cassandra-writes.md
tldr.md
024 Redis Replication
024-redis-replication.md
tldr.md
025 Arrogant People At Work
025-arrogant-people-at-work.md
tldr.md
026 Cdn Content Replication
026-cdn-content-replication.md
tldr.md
027 Cant Fix Everything Day One
027-cant-fix-everything-day-one.md
tldr.md
028 Emotions At Work
028-emotions-at-work.md
tldr.md
029 Grpc Http2
029-grpc-http2.md
tldr.md
030 Meetings With No Agenda Are A Waste Of Time
030-meetings-with-no-agenda-are-a-waste-of-time.md
tldr.md
031 Growth Is Not About Doing Everything
031-growth-is-not-about-doing-everything.md
tldr.md
032 Career Longevity Vs Job Hopping
032-career-longevity-vs-job-hopping.md
tldr.md
033 Stay Relevant At Higher Salary Levels
033-stay-relevant-at-higher-salary-levels.md
tldr.md
034 Why Consensus
034-why-consensus.md
tldr.md
035 Database Deadlocks
035-database-deadlocks.md
tldr.md
036 Cpu Cache Locality
036-cpu-cache-locality.md
tldr.md
037 Eventual Consistency
037-eventual-consistency.md
tldr.md
038 Dns Udp Tcp
038-dns-udp-tcp.md
tldr.md
039 Masters
039-masters.md
tldr.md
040 Empathy Makes Great Engineers Unstoppable
040-empathy-makes-great-engineers-unstoppable.md
tldr.md
041 Good Mentors Build People
041-good-mentors-build-people.md
tldr.md
042 Always Have Back Burner Projects
042-always-have-back-burner-projects.md
tldr.md
043 Before You Push Back Know What Youre Standing On
043-before-you-push-back-know-what-youre-standing-on.md
tldr.md
044 Be The One They Can Count On
044-be-the-one-they-can-count-on.md
tldr.md
045 How Much People Bet On You
045-how-much-people-bet-on-you.md
tldr.md
046 How To Get Leadership To Say Yes To Your Project
046-how-to-get-leadership-to-say-yes-to-your-project.md
tldr.md
047 Dont Let Your Best Ideas Die In Silence
047-dont-let-your-best-ideas-die-in-silence.md
tldr.md
048 Be Someone Others Want To Work With
048-be-someone-others-want-to-work-with.md
tldr.md
049 Dont Fall For Xy Problem Ask Right Questions
049-dont-fall-for-xy-problem-ask-right-questions.md
tldr.md
050 Biggest Lie Startups Tell Engineers
050-biggest-lie-startups-tell-engineers.md
tldr.md
051 Promotions Are Proactive Not Reactive
051-promotions-are-proactive-not-reactive.md
tldr.md
052 Not Enough To Be Right Learn To Be Heard
052-not-enough-to-be-right-learn-to-be-heard.md
tldr.md
053 No One Ships Alone
053-no-one-ships-alone.md
tldr.md
054 Not Every Mistake Needs A Correction
054-not-every-mistake-needs-a-correction.md
tldr.md
055 Build Influence At Work
055-build-influence-at-work.md
tldr.md
056 Your Soft Skills Arent Soft At All
056-your-soft-skills-arent-soft-at-all.md
tldr.md
057 Experience Before Forming Opinion
057-experience-before-forming-opinion.md
tldr.md
058 Curiosity And High Bias For Action
058-curiosity-and-high-bias-for-action.md
tldr.md
059 Worklog
059-worklog.md
tldr.md
060 Mistakes And Growth
060-mistakes-and-growth.md
tldr.md
061 Own It Instead Of Sweeping It Aside
061-own-it-instead-of-sweeping-it-aside.md
tldr.md
062 Dont Wait Step Up
062-dont-wait-step-up.md
tldr.md
063 Temporary Fix Is Permanent
063-temporary-fix-is-permanent.md
tldr.md
064 Interview Bias And What Sets You Apart
064-interview-bias-and-what-sets-you-apart.md
tldr.md
065 Saying This Isnt My Problem Is A Problem
065-saying-this-isnt-my-problem-is-a-problem.md
tldr.md
066 Okr
066-okr.md
tldr.md
067 Miscommunication
067-miscommunication.md
tldr.md
068 When In Doubt Code It Out
068-when-in-doubt-code-it-out.md
tldr.md
069 Follow Up Without Annoying People
069-follow-up-without-annoying-people.md
tldr.md
070 Lead Projects That Land
070-lead-projects-that-land.md
tldr.md
071 Abstract Thinking Skill Next Decade
071-abstract-thinking-skill-next-decade.md
tldr.md
072 We Engineers Suck At Task Estimation
072-we-engineers-suck-at-task-estimation.md
tldr.md
073 Shiny Object Syndrome In Tech
073-shiny-object-syndrome-in-tech.md
tldr.md
074 3p
074-3p.md
tldr.md
075 Leverage The Equilibrium
075-leverage-the-equilibrium.md
tldr.md
076 On Demand Container Loading In Aws Lambda
076-on-demand-container-loading-in-aws-lambda.md
tldr.md
077 Sql Has Problems We Can Fix Them Pipe Syntax In Sql
077-sql-has-problems-we-can-fix-them-pipe-syntax-in-sql.md
tldr.md
078 Nanolog A Nanosecond Scale Logging System
078-nanolog-a-nanosecond-scale-logging-system.md
tldr.md
079 Best Resource Is Mythical
079-best-resource-is-mythical.md
tldr.md
080 Wtf The Who To Follow Service At Twitter
080-wtf-the-who-to-follow-service-at-twitter.md
tldr.md
081 Know A Lot
081-know-a-lot.md
tldr.md
082 Out Of Syllabus
082-out-of-syllabus.md
tldr.md
083 Negotiate The Offer
083-negotiate-the-offer.md
tldr.md
084 Never Bad Mouth Your Ex Exployer
084-never-bad-mouth-your-ex-exployer.md
tldr.md
085 Culture Fit
085-culture-fit.md
tldr.md
086 Quantification In Resume
086-quantification-in-resume.md
tldr.md
087 Hiring Is Unfair
087-hiring-is-unfair.md
tldr.md
088 Questions For Interviewers
088-questions-for-interviewers.md
tldr.md
089 Collaboration Communication
089-collaboration-communication.md
tldr.md
090 Out Of Vicious Interview Cycle
090-out-of-vicious-interview-cycle.md
tldr.md
091 Pitch Projects Not Ideas
091-pitch-projects-not-ideas.md
tldr.md
092 Read Design Docs
092-read-design-docs.md
tldr.md
093 Read Rca Docs
093-read-rca-docs.md
tldr.md
094 Start Generalist
094-start-generalist.md
tldr.md
095 Do Not Rely On Summaries
095-do-not-rely-on-summaries.md
tldr.md
096 Structure Your Design Interviews
096-structure-your-design-interviews.md
tldr.md
097 Title Inflation
097-title-inflation.md
tldr.md
098 Find Your Own Project
098-find-your-own-project.md
tldr.md
099 Six Pointers To Crack Coding And Design Interviews
099-six-pointers-to-crack-coding-and-design-interviews.md
tldr.md
100 Keep Yourself Unblocked
100-keep-yourself-unblocked.md
tldr.md
101 Genetic Knapsack
101-genetic-knapsack.md
tldr.md
102 Pseudorandom Number Generation Lfsr
102-pseudorandom-number-generation-lfsr.md
tldr.md
103 How Indexes Work On Partitioned And Sharded Data
103-how-indexes-work-on-partitioned-and-sharded-data.md
tldr.md
104 Some Data Partitioning Strategies For Distributed Data Stores
104-some-data-partitioning-strategies-for-distributed-data-stores.md
tldr.md
105 Data Partitioning
105-data-partitioning.md
tldr.md
106 Leaderless Replication
106-leaderless-replication.md
tldr.md
107 Conflict Resolution
107-conflict-resolution.md
tldr.md
108 Conflict Detection
108-conflict-detection.md
tldr.md
109 Multi Master Replication
109-multi-master-replication.md
tldr.md
110 Monotonic Reads
110-monotonic-reads.md
tldr.md
111 Read Your Write Consistency
111-read-your-write-consistency.md
tldr.md
112 Handling Outages Master Replica
112-handling-outages-master-replica.md
tldr.md
113 Replication Formats
113-replication-formats.md
tldr.md
114 Replication Strategies
114-replication-strategies.md
tldr.md
115 Master Replica Replication
115-master-replica-replication.md
tldr.md
116 Durability
116-durability.md
tldr.md
117 Isolation
117-isolation.md
tldr.md
118 Atomicity
118-atomicity.md
tldr.md
119 Consistency
119-consistency.md
tldr.md
120 Architectures In Distributed Systems
120-architectures-in-distributed-systems.md
tldr.md
121 Mistaken Beliefs Of Distributed Systems
121-mistaken-beliefs-of-distributed-systems.md
tldr.md
122 Fork Bomb
122-fork-bomb.md
tldr.md
123 Chained Operators Python
123-chained-operators-python.md
tldr.md
124 Taxonomy On Sql
124-taxonomy-on-sql.md
tldr.md
125 The Weird Walrus
125-the-weird-walrus.md
tldr.md
126 Fully Persistent Arrays
126-fully-persistent-arrays.md
tldr.md
127 Persistent Data Structures Introduction
127-persistent-data-structures-introduction.md
tldr.md
128 Constant Folding Python
128-constant-folding-python.md
tldr.md
129 String Interning Python
129-string-interning-python.md
tldr.md
130 Recursion Visualizer Python
130-recursion-visualizer-python.md
tldr.md
131 Flajolet Martin
131-flajolet-martin.md
tldr.md
132 2q Cache
132-2q-cache.md
tldr.md
133 Israeli Queues
133-israeli-queues.md
tldr.md
134 1d Terrain
134-1d-terrain.md
tldr.md
135 Jaccard Minhash
135-jaccard-minhash.md
tldr.md
136 Ts Smoothing
136-ts-smoothing.md
tldr.md
137 Lfu
137-lfu.md
tldr.md
138 Morris Counter
138-morris-counter.md
tldr.md
139 Slowsort
139-slowsort.md
tldr.md
140 Bitcask
140-bitcask.md
tldr.md
141 Phi Accrual
141-phi-accrual.md
tldr.md
142 10x Engineer
142-10x-engineer.md
tldr.md
143 Decipher Repeated Key Xor
143-decipher-repeated-key-xor.md
tldr.md
144 Decipher Single Xor
144-decipher-single-xor.md
tldr.md
145 Python Iterable Integers
145-python-iterable-integers.md
tldr.md
146 Inheritance C
146-inheritance-c.md
tldr.md
147 Rum
147-rum.md
tldr.md
148 Consistent Hashing
148-consistent-hashing.md
tldr.md
149 Python Caches Integers
149-python-caches-integers.md
tldr.md
150 Fractional Cascading
150-fractional-cascading.md
tldr.md
151 Copy On Write
151-copy-on-write.md
tldr.md
152 Midpoint Insertion Caching Strategy
152-midpoint-insertion-caching-strategy.md
tldr.md
153 Fsm Python
153-fsm-python.md
tldr.md
154 Bayesian Average
154-bayesian-average.md
tldr.md
155 Sliding Window Ratelimiter
155-sliding-window-ratelimiter.md
tldr.md
156 Idf
156-idf.md
tldr.md
157 Better Programmer
157-better-programmer.md
tldr.md
158 Python Prompts
158-python-prompts.md
tldr.md
159 Rule 30 Cellular Automata
159-rule-30-cellular-automata.md
tldr.md
160 Function Overloading
160-function-overloading.md
tldr.md
161 Isolation Forest
161-isolation-forest.md
tldr.md
162 Image Steganography
162-image-steganography.md
tldr.md
163 Long Integers Python
163-long-integers-python.md
tldr.md
164 I Changed My Python
164-i-changed-my-python.md
tldr.md
165 Benchmark And Compare Pagination Approach In Mongodb
165-benchmark-and-compare-pagination-approach-in-mongodb.md
tldr.md
166 Mongodb Cursor Skip Is Slow
166-mongodb-cursor-skip-is-slow.md
tldr.md
167 Fast And Efficient Pagination In Mongodb
167-fast-and-efficient-pagination-in-mongodb.md
tldr.md
168 Making Http Requests Using Netcat
168-making-http-requests-using-netcat.md
tldr.md
few and different

The usual approach for detecting anomalies

The existing models train to see what constitutes “normal” and then label everything that does not conform to this definition as anomalies. Almost every single algorithm has its own way of defining a normal point/instance; some do it through statistical methods, some use classification or clustering but in the end, the process remains the same - define normal and filter out everything else.

The issue with the usual approach

The usual methods are not optimized to detect anomalies, instead, they are optimized to find normal instances, because of which the result of anomaly detection either contains too many false positives or might detect too few anomalies. Many of these methods are computationally complex and hence suit low dimensional and/or small-sized data.

Isolation Forest algorithm addresses both of the above concerns and provides an efficient and accurate way to detect anomalies.

The algorithm

Now we take a go through the algorithm, and dissect it stage by stage and in the process understand the math behind it. Fasten your seat belts, it’s going to be a bumpy ride.

The core principle

The core of the algorithm is to “isolate” anomalies by creating decision trees over random attributes. The random partitioning produces noticeable shorter paths for anomalies since

  • fewer instances (of anomalies) result in smaller partitions
  • distinguishable attribute values are more likely to be separated in early partitioning

Hence, when a forest of random trees collectively produces shorter path lengths for some particular points, then they are highly likely to be anomalies.

Decision tree splits for normal points and anomalies

The diagram above shows the number of splits required to isolate a normal point and an anomaly. Splits, represented through blue lines, happens at random on a random attribute and in the process building a decision tree. The number of splits determines the level at which the isolation happened and will be used to generate the anomaly score.

The process is repeated multiple times and we note the isolation level for each point/instance. Once the iterations are over, we generate an anomaly score for each point/instance, suggesting its likeliness to be an anomaly. The score is a function of the average level at which the point was isolated. The top m gathered on the basis of the score, are labeled as anomalies.

Construction of decision tree

The decision tree is constructed by splitting the sub-sample points/instances over a split value of a randomly selected attribute such that the instances whose corresponding attribute value is smaller than the split value goes left and the others go right, and the process is continued recursively until the tree is fully constructed. The split value is selected at random between the minimum and maximum values of the selected attribute.

There are two types of node in the decision tree

Internal Node

Internal nodes are non-leaf and contain the split value, split attribute and pointers to two child sub-trees. An internal node is always a parent to two child sub-trees making the entire decision tree a proper binary tree.

External Node

External nodes are leaf nodes that could not be split further and reside at the bottom of the tree. Each external node will hold the size of the un-built subtree which is used to calculate the anomaly score.

Decision tree with internal and external nodes

Why sub-sampling helps

The Isolation Forest algorithm works well when the trees are created, not from the entire dataset, but from a sub-sampled data set. This is very different from almost all other techniques where they thrive on data and demands more of it for greater accuracy. Sub-sampling works wonder in this algorithm because normal instances can interfere with the isolation process by being a little closer to the anomalies.

Importance of sub-sampling in Isolation Forest

The image above shows how sub-sampling actually makes a clear separation between normal points and anomalies. In the original dataset, we see that normal points and very close to anomalies making detection tougher and inaccurate (with a lot of false negatives). Because of sub-sampling, we could see a clear separation of anomalies and normal instances. This makes the entire process of anomaly detection efficient and accurate.

Optimizing decision tree construction

Since anomalies are susceptible to isolation and have a tendency to reside closer to the root of the decision tree, we construct the decision tree till it reaches a certain height max_height and not split points further. This height is the height post which we are (almost) sure that there could not be any anomalies.

Plain text

Constructing the forest

The process of tree construction is repeated multiple times and each time we pick a random sub-sample and construct the tree. There are no strict rules to determine the number of iterations, but in general, we could say the more the merrier. The sub-sampling count is also a parameter and could change depending on the data set.

The pseudocode for forest construction is as follows

Plain text

While constructing the tree we pass max_height as log2(nodes_count) as that is the average height of a proper binary tree that could be constructed from nodes_count number of nodes. Since anomalies reside closer to the root node it is highly unlikely that any anomaly will isolate after the tree has reached height max_height. This helps us save a lot of computation and tree construction making it computationally and memory efficient.

Scoring the anomalies

Every anomaly detection algorithm has to score its data points/instances and quantify the confidence the algorithm has on its potential anomalies. The generated anomaly score has to be bounded and comparable. In Isolation Forest, that fact that anomalies always stay closer to the root, becomes our guiding and defining insight that will help us build a scoring function. The anomaly score will a function of path length which is defined as

Path Length h(x) of a point x is the number of edges x traverses from the root node.

As the maximum possible height of the tree grows by order of n, the average height grows by log(n) - this makes normalizing of the scoring function a little tricky. To remedy this we use the insights from the structure of the decision tree. The decision tree has two types of nodes internal and external such that external has no child while internal is a parent to exactly two nodes - which means the decision tree is a proper binary tree and hence we conclude

The average path length h(x) for external node termination is the same as the average path length of unsuccessful search in BST.

In a BST, an unsuccessful search always terminates at a NULL pointer and if we treat external node of the decision tree as NULL of BST, then we could say that average path length of external node termination is same as average path length of unsuccessful search in BST (constructed only from internal nodes of the decision tree), and it is given by

BST unsuccessful search estimation

where H(i) is the harmonic number and it can be estimated by ln(i) + 0.5772156649 (Euler–Mascheroni constant). c(n) is the average of path length h(x) given n, we use it to normalize h(x).

To understand the derivation in detail refer to the references at the end of this article.

The anomaly score of an instance x is defined as

scoring function

where E(h(x)) is the average path length (average of h(x)) from a collection of isolation trees. From the scoring function defined above, we could deduce that if

  • the score is very close to 1, then they are definitely anomalies
  • the score is much smaller than 0.5, then they are quite safe to be regarded as normal instances, and
  • all the instances return around 0.5, then the entire sample does not really have any distinct anomaly

Evaluating anomalies

In the evaluation stage, an anomaly score is derived from the expected path length E(h(x)) for each test instance. Using get_path_length function (pseudocode below), a single path length h(x) is calculated by traversing through the decision tree.

If iteration terminates at an external node where size > 1 then the return value is e (number of edges traversed till current node) plus an adjustment c(size), estimated from the formula above. This adjustment is for the unbuilt decision tree (for efficiency) beyond the max height. When h(x) is obtained for each node of each tree, an anomaly score is produced by computing s(x, sample_size). Sorting instances by the score s in descending order and getting top m will yield us m anomalies.

Plain text

References for BST unsuccessful search estimation

  • IIT KGP, Algorithms, Lecture Notes - Page 7
  • What is real big-O of search in BST?
  • CMU CMSC 420: Lecture 5 - Slide 13
  • CISE UFL: Data Structures, Algorithms, & Applications - 1st Proof

Conclusion

The isolation forest algorithm thrives on sub-sampled data and does not need to build the tree from the entire data set; it works well with sub-sampled data. While constructing the tree, we need not build tree taller than max_height (very cheap to compute), making it low on memory footprint. Since the algorithm does not depend on computationally expensive operations like distance or density calculation, it executes really fast. The training stage has a linear time complexity with a low constant and hence could be used in a real-time online system.

I hope this article helped you to understand Isolation Forest, an unsupervised anomaly detection algorithm. I stumbled upon this through an engineering blog of Grofers. This algorithm was very interesting to me because of its novel approach and hence I dived deep into it. FYI: In 2018, Isolation Forest was extended by Sahand Hariri, Matias Carrasco Kind, Robert J. Brunner. I have not read the Extended Isolation Forest algorithm but have definitely added it to my reading list. I recommend that if you liked this algorithm you should definitely give the extended version a skim.