TLDR: Phi Accrual Failure Detection Algorithm
Date: 2020-07-12 Source: https://arpitbhayani.me/blogs/phi-accrual
Overview
Phi Accrual Failure Detection is an adaptive algorithm for detecting failures in distributed systems, and it improves on heartbeats with fixed timeouts. One of the most important virtues of any distributed system is its ability to detect failures in any of its subsystems before things go wrong.
Key Points
- Heartbeats with constant timeouts: Conventional failure detection algorithms use heartbeat messages with a fixed timeout to determine whether a system is alive.
- Detailing φ: φ is the suspicion level output by this failure detector. Because the algorithm is adaptive, the value is dynamic and reflects current network conditions and system behavior.
- With φ defined, we need a way to compute the probability of receiving another heartbeat, given the heartbeats we have already seen.
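
The last two points can be made concrete with a small sketch. It assumes the common formulation in which heartbeat inter-arrival times are modeled as roughly normally distributed and φ = -log10(P_later(t_now - t_last)), where P_later is the estimated probability that the next heartbeat arrives later than the time already elapsed. The class name `PhiAccrualDetector` and the parameters `window_size` and `min_stddev` are illustrative choices, not taken from the source.

```python
import math
import sys
from collections import deque
from statistics import mean, pstdev


class PhiAccrualDetector:
    """Illustrative phi accrual detector; names and defaults are assumptions."""

    def __init__(self, window_size=100, min_stddev=0.1):
        # Sliding window of recent heartbeat inter-arrival times (seconds).
        self.intervals = deque(maxlen=window_size)
        self.last_arrival = None
        # Floor on the standard deviation so perfectly regular heartbeats
        # do not produce a zero-width distribution.
        self.min_stddev = min_stddev

    def heartbeat(self, now):
        """Record a heartbeat received at time `now`."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now):
        """Suspicion level at time `now`; higher means more likely failed."""
        if not self.intervals:
            return 0.0  # not enough history to suspect anything yet
        mu = mean(self.intervals)
        sigma = max(pstdev(self.intervals), self.min_stddev)
        elapsed = now - self.last_arrival
        # P(next interval > elapsed) under Normal(mu, sigma), computed with
        # erfc so the far tail stays accurate.
        p_later = 0.5 * math.erfc((elapsed - mu) / (sigma * math.sqrt(2)))
        # Clamp to the smallest positive float to avoid log10(0).
        p_later = max(p_later, sys.float_info.min)
        return -math.log10(p_later)


detector = PhiAccrualDetector()
for t in range(11):              # heartbeats at t = 0, 1, ..., 10 seconds
    detector.heartbeat(float(t))

on_time = detector.phi(11.0)     # exactly one mean interval since last beat
overdue = detector.phi(12.0)     # a full extra interval overdue
print(on_time < overdue)         # suspicion grows as the silence lengthens
```

Instead of a yes/no answer after a fixed timeout, the caller compares φ against a threshold of its own choosing. Because the mean and standard deviation come from recent arrivals, the same threshold tolerates longer silences on a jittery network than on a steady one.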