摘要:
记录对redis企业版的高可用技术中看门狗的分析
Highly Available Redis | Redis
Auto failover
A Redis Enterprise cluster uses two watchdog processes to detect failures:
- Node watchdog: Monitors all processes running on a given node. For example, the node watchdog triggers a shard failover event if a specific shard is not responsive.
- Cluster watchdog: Responsible for the health of the cluster nodes and uses agossip protocol to manage the membership of the nodes in the cluster. For example, cluster watchdog triggers a node failure event or detects a network split incident.
These watchdog processes are part of the distributed cluster manager entity and reside on each node of the cluster. It is extremely important for failure detection to be managed by entities that run inside the cluster in order to avoid situations like that shown on the left side of the figure below. In this example, the watchdog entity is located in the wrong side of the network split and cannot trigger the failover process:
Once a failure event is detected, the Redis Enterprise cluster automatically and transparently runs a set of internal distributed processes that failover the relevant shard(s) and endpoint(s) (if needed) to healthy cluster nodes. If necessary, they also reroute user traffic through a different proxy or proxies.
The Redis Enterprise cluster has out-of-the-box HA profiles for noisy (public cloud) and quiet (virtual private cloud, on-premises) environments. We have found that triggering failovers too aggressively can create stability issues. On the other hand, in a quiet network environment, a Redis Enterprise cluster can be easily tuned to support a constant single-digit (<10 sec) failover time in all failure scenarios.