Linux Redis Operations: Redis Sentinel Mode for High Availability
Introduction
In a production environment, the high availability of Redis is crucial for business continuity. Master-slave replication provides data redundancy and read-write separation, but it cannot automatically handle master node failures. Redis Sentinel mode achieves a highly available architecture by monitoring master and slave nodes, performing automatic failover, and electing a new master node. Deploying and operating Sentinel mode in a Linux environment involves multi-node configuration, monitoring and alerting, and fault recovery, which place high demands on operations engineers.
1. Principles of Redis Sentinel Mode
1.1 Working Mechanism of Sentinel Mode
Redis Sentinel is a collection of independent processes responsible for monitoring Redis master-slave instances, detecting failures, executing automatic failover, and notifying clients. Its core functions include:
- Monitoring: Regularly checking the health status of master and slave nodes (via periodic heartbeat checks).
- Failure Detection: A single Sentinel first marks an unresponsive master as subjectively down; the Sentinels then vote to confirm it as objectively down.
- Automatic Failover:
- Electing a new master node (choosing from the slave nodes).
- Updating the slave node configuration to point to the new master node.
- Notifying clients to update their connections.
1.2 Components of Sentinel Mode
- Sentinel Nodes: Run the redis-sentinel process; deploying an odd number (e.g., 3 or 5) is recommended so that a clear voting majority can always be reached.
- Master Node: Handles write operations and synchronizes data to slave nodes.
- Slave Nodes: Handle read operations and serve as backups for the master node.
- Clients: Obtain current master node information through Sentinel and dynamically adjust connections.
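To make the client side concrete, below is a minimal redis-py sketch (assuming the example addresses and placeholder password used later in this article, the master name mymaster, and that the redis Python package is installed) showing how a client discovers the current master through Sentinel and routes reads and writes; Jedis and Lettuce provide equivalent facilities.
# Minimal client-discovery sketch using redis-py (pip install redis).
from redis.sentinel import Sentinel

SENTINELS = [('192.168.1.100', 26379),
             ('192.168.1.101', 26379),
             ('192.168.1.102', 26379)]

sentinel = Sentinel(SENTINELS, socket_timeout=0.5)

# Ask the Sentinels which node is currently the master of "mymaster".
print('current master:', sentinel.discover_master('mymaster'))

# master_for() returns a normal Redis connection that follows failovers.
master = sentinel.master_for('mymaster', password='your_secure_password',
                             socket_timeout=0.5)
master.set('ha:test', 'ok')

# slave_for() load-balances reads across the replicas.
replica = sentinel.slave_for('mymaster', password='your_secure_password',
                             socket_timeout=0.5)
print(replica.get('ha:test'))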
1.3 Key Concepts
- Subjective Down (SDOWN): A single Sentinel considers the master node unavailable (heartbeat timeout).
- Objective Down (ODOWN): At least quorum Sentinels agree that the master node is down, allowing a failover to be triggered.
- Failover: A slave node is upgraded to a master node, and the master-slave relationship is reconfigured.
- Priority (slave-priority): Affects the order in which slave nodes are elected as the new master node (lower values are preferred; 0 excludes a node from election).
1.4 Challenges from an Operations Perspective
- Deployment Complexity: Multiple Sentinel nodes need to coordinate, making configuration cumbersome.
- Failover Reliability: Ensuring quick switching while maintaining data consistency.
- Monitoring and Alerts: Real-time monitoring of Sentinel status and failover events.
- Client Adaptation: Clients need to support Sentinel mode (e.g., Jedis, Lettuce).
2. Deployment and Configuration of Sentinel Mode
2.1 Environment Preparation
Taking CentOS 8 as an example, deploy a one master, two slaves + three Sentinel architecture:
- Master Node: IP 192.168.1.100, port 6379.
- Slave Node 1: IP 192.168.1.101, port 6379.
- Slave Node 2: IP 192.168.1.102, port 6379.
- Sentinel Nodes: IP 192.168.1.100, 192.168.1.101, 192.168.1.102, port 26379.
- Prerequisites:
  - Redis 7.0.12 has been installed (refer to the first article in the series).
  - Master-slave replication has been configured (refer to the fourth article in the series).
  - The firewall allows ports 6379 and 26379:
    sudo firewall-cmd --permanent --add-port=6379/tcp
    sudo firewall-cmd --permanent --add-port=26379/tcp
    sudo firewall-cmd --reload
2.2 Master and Slave Node Configuration
Master node (192.168.1.100) configuration file /etc/redis/redis.conf:
bind 192.168.1.100
port 6379
requirepass your_secure_password
masterauth your_secure_password
dir /var/redis/data/
logfile /var/log/redis/redis.log
maxmemory 4gb
maxmemory-policy allkeys-lru
repl-backlog-size 10mb
Slave nodes (192.168.1.101 and 192.168.1.102) have similar configurations, adding the following (Redis 5+ also accepts the equivalent replicaof / replica-read-only directives):
slaveof 192.168.1.100 6379
slave-read-only yes
Start the master and slave nodes:
sudo systemctl start redis
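Before adding Sentinels, it is worth confirming that replication itself is healthy. A minimal redis-py check, assuming the addresses and placeholder password above:
# Quick replication sanity check with redis-py.
import redis

master = redis.Redis(host='192.168.1.100', port=6379,
                     password='your_secure_password', decode_responses=True)
repl = master.info('replication')
print('role:', repl['role'])                          # expected: master
print('connected_slaves:', repl['connected_slaves'])  # expected: 2

# Each slaveN entry contains ip, port, state, offset and lag.
for i in range(repl['connected_slaves']):
    print(f'slave{i}:', repl[f'slave{i}'])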
2.3 Sentinel Node Configuration
Create the configuration file /etc/redis/sentinel.conf on each Sentinel node (192.168.1.100, 192.168.1.101, 192.168.1.102). Sentinel rewrites this file at runtime to record discovered replicas and Sentinels, so it must be writable by the user running the process:
port 26379
dir /var/redis/sentinel/
logfile /var/log/redis/sentinel.log
sentinel monitor mymaster 192.168.1.100 6379 2
sentinel auth-pass mymaster your_secure_password
sentinel down-after-milliseconds mymaster 30000
sentinel parallel-syncs mymaster 1
sentinel failover-timeout mymaster 180000
- Key Parameters:
- sentinel monitor: Monitors the master node; quorum=2 means at least 2 Sentinels must agree before objective down is declared.
- sentinel auth-pass: Master node password.
- down-after-milliseconds: Master node heartbeat timeout (30000 ms here, i.e., the 30-second default).
- parallel-syncs: Number of slave nodes allowed to resynchronize with the new master simultaneously during failover; 1 is recommended to limit the load on the new master.
- failover-timeout: Failover timeout (180000 ms here, i.e., the 180-second default).
2.4 Starting Sentinel Nodes
Start Sentinel on each Sentinel node:
redis-sentinel /etc/redis/sentinel.conf
Or manage with systemd by creating /etc/systemd/system/redis-sentinel.service:
[Unit]
Description=Redis Sentinel
After=network.target
[Service]
User=redis
Group=redis
ExecStart=/usr/local/bin/redis-sentinel /etc/redis/sentinel.conf
ExecStop=/usr/local/bin/redis-cli -p 26379 shutdown
Restart=always
[Install]
WantedBy=multi-user.target
Enable and start:
sudo systemctl daemon-reload
sudo systemctl start redis-sentinel
sudo systemctl enable redis-sentinel
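Once the Sentinels are running, you can read back how they interpreted the monitoring parameters from section 2.3. A hedged redis-py sketch, assuming the addresses and the master name mymaster used above:
# Read the effective monitoring parameters from a running Sentinel.
import redis

sentinel = redis.Redis(host='192.168.1.100', port=26379, decode_responses=True)

# SENTINEL MASTER <name> returns the tracked master state as a flat
# field/value list, including quorum and the timeouts configured above.
state = sentinel.execute_command('SENTINEL', 'MASTER', 'mymaster')
info = dict(zip(state[0::2], state[1::2]))

for key in ('ip', 'port', 'quorum', 'down-after-milliseconds',
            'failover-timeout', 'parallel-syncs', 'num-other-sentinels'):
    print(key, '=', info.get(key))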
2.5 Verifying Sentinel Deployment
- Check Sentinel Status:
  redis-cli -h 192.168.1.100 -p 26379 INFO SENTINEL
  Example output:
  master0:name=mymaster,status=ok,address=192.168.1.100:6379,slaves=2,sentinels=3
- Verify Master-Slave Information:
  redis-cli -h 192.168.1.100 -p 26379 SENTINEL masters
  Outputs master node information and the list of slave nodes.
- Test Failover:
  - Stop the master node:
    sudo systemctl stop redis
  - Observe the Sentinel logs (/var/log/redis/sentinel.log) to confirm that a slave node (e.g., 192.168.1.101) is elected as the new master node.
  - Check client connections to verify the new master node is available (see the polling sketch below).
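For the failover test, the following redis-py sketch (assuming the addresses above) polls the master address advertised by a Sentinel that stays up during the test and reports when, and to where, it switches:
# Observe a failover from the client side while the master is stopped.
import time
import redis

sentinel = redis.Redis(host='192.168.1.101', port=26379, decode_responses=True)

def current_master():
    host, port = sentinel.execute_command('SENTINEL', 'get-master-addr-by-name', 'mymaster')
    return f'{host}:{port}'

initial = current_master()
print('master before failover:', initial)

start = time.time()
while current_master() == initial:
    time.sleep(1)

print(f'new master {current_master()} promoted after {time.time() - start:.0f}s')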
2.6 Configuration Optimization
- Number of Sentinel Nodes: At least 3 Sentinels, distributed across different physical machines, with quorum set to (n/2)+1.
- Heartbeat Optimization: Adjust down-after-milliseconds based on network latency (e.g., 10 seconds).
- Notification Script: Configure failover notification:
  sentinel notification-script mymaster /usr/local/bin/notify.sh
  Example script notify.sh:
  #!/bin/bash
  echo "Sentinel event: $1 $2 $3 $4" >> /var/log/redis/notify.log
  # Send alerts to email or corporate WeChat
3. Monitoring and Alerts
3.1 Monitoring Key Metrics
Use INFO SENTINEL and INFO REPLICATION to monitor Sentinel and master-slave status:
- Sentinel Nodes:
- sentinel_masters: Number of monitored master nodes.
- sentinel_sentinels: Number of active Sentinels.
- sentinel_slaves: Number of slave nodes.
- Master Node:
- connected_slaves: Number of connected slave nodes.
- Slave Nodes:
- master_link_status: Connection status to the master node.
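The following redis-py sketch (assuming the example addresses and placeholder password) collects these metrics in one pass; note that the per-master slaves/sentinels counters are reported on the master0 line of INFO SENTINEL rather than as top-level fields:
# Pull the key Sentinel and replication metrics listed above.
import redis

PASSWORD = 'your_secure_password'

sentinel = redis.Redis('192.168.1.100', 26379, decode_responses=True)
sent_info = sentinel.info('sentinel')
print('sentinel_masters:', sent_info['sentinel_masters'])
# Per-master counters live on the master0 line (parsed into a dict by redis-py).
master0 = sent_info['master0']
print('slaves:', master0['slaves'], 'sentinels:', master0['sentinels'])

master = redis.Redis('192.168.1.100', 6379, password=PASSWORD, decode_responses=True)
print('connected_slaves:', master.info('replication')['connected_slaves'])

replica = redis.Redis('192.168.1.101', 6379, password=PASSWORD, decode_responses=True)
print('master_link_status:', replica.info('replication')['master_link_status'])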
3.2 Prometheus + Redis Exporter
- Deploy Redis Exporter (including Sentinel):
  docker run -d --name redis-exporter -p 9121:9121 oliver006/redis_exporter \
    --redis.addr=redis://192.168.1.100:6379,redis://192.168.1.100:26379 \
    --redis.password=your_secure_password
- Prometheus Configuration:
  scrape_configs:
    - job_name: 'redis-sentinel'
      static_configs:
        - targets: ['192.168.1.100:9121', '192.168.1.101:9121', '192.168.1.102:9121']
- Key Metrics:
- redis_sentinel_masters: Number of master nodes monitored by Sentinel.
- redis_sentinel_sentinels: Number of active Sentinels.
- redis_sentinel_failover_state: Failover state.
3.3 Alert Rules
Example Prometheus alert rules:
groups:
  - name: redis-sentinel
    rules:
      - alert: SentinelDown
        expr: redis_sentinel_sentinels < 3
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Redis Sentinel {{ $labels.instance }} down"
      - alert: MasterFailover
        expr: redis_sentinel_failover_state != 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Redis master failover triggered on {{ $labels.instance }}"
3.4 Automated Monitoring Script
- Python Script Example: Monitor Sentinel status and send alerts (a minimal sketch; note that the Sentinel count is reported on the master0 line of INFO SENTINEL, which redis-py parses into a nested dict):
import redis
import smtplib
from email.mime.text import MIMEText

s = redis.Redis(host='192.168.1.100', port=26379, decode_responses=True)
info = s.info('sentinel')
# Per-master counters, e.g. master0:name=mymaster,...,slaves=2,sentinels=3
sentinels = info['master0']['sentinels']
if sentinels < 3:
    msg = MIMEText(f'Sentinel count dropped to {sentinels}')
    msg['Subject'] = 'Redis Sentinel Alert'
    msg['From'] = '[email protected]'
    msg['To'] = '[email protected]'
    with smtplib.SMTP('smtp.example.com') as smtp:
        smtp.send_message(msg)
- Operations Practice:
- Run the script regularly (e.g., cron every 5 minutes).
- Combine with Grafana to visualize Sentinel status and failover events.
4. Troubleshooting and Recovery
4.1 Common Failures and Troubleshooting
- Sentinel Cannot Connect to the Master Node:
  - Causes: Master node down, network interruption, incorrect password.
  - Troubleshooting (these first checks are automated in the sketch after this list):
    - Check the Sentinel logs: /var/log/redis/sentinel.log.
    - Verify the sentinel auth-pass configuration.
    - Test the network: telnet 192.168.1.100 6379.
  - Solution: Fix the network, correct the password, or restart the master node.
- Failover Is Not Triggered or Fails:
  - Causes: Insufficient number of Sentinels (fewer than quorum), unavailable slave nodes.
  - Troubleshooting: Check sentinel_sentinels and sentinel_slaves.
  - Solution:
    - Increase the number of Sentinel nodes.
    - Ensure slave nodes are healthy and adjust slave-priority.
- Clients Cannot Reach the New Master Node:
  - Causes: Clients not adapted to Sentinel mode, failing to update the master node after failover.
  - Troubleshooting: Check client logs to confirm whether the master node was obtained through Sentinel.
  - Solution: Use clients that support Sentinel (e.g., JedisSentinelPool).
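A hedged diagnostic sketch that automates the first round of checks with redis-py (assuming the example addresses, placeholder password, and master name mymaster): it verifies authentication to the master, reachability of each Sentinel, and quorum via SENTINEL CKQUORUM.
# Connectivity, authentication and quorum diagnostics.
import redis

MASTER = ('192.168.1.100', 6379)
SENTINELS = [('192.168.1.100', 26379),
             ('192.168.1.101', 26379),
             ('192.168.1.102', 26379)]
PASSWORD = 'your_secure_password'

try:
    redis.Redis(*MASTER, password=PASSWORD, socket_timeout=2).ping()
    print('master: reachable, password accepted')
except redis.AuthenticationError:
    print('master: wrong password (check requirepass / sentinel auth-pass)')
except redis.RedisError as exc:
    print(f'master: unreachable ({exc})')

for host, port in SENTINELS:
    try:
        s = redis.Redis(host, port, socket_timeout=2, decode_responses=True)
        s.ping()
        # CKQUORUM reports whether enough Sentinels are up for ODOWN and failover.
        print(f'sentinel {host}:', s.execute_command('SENTINEL', 'CKQUORUM', 'mymaster'))
    except redis.ResponseError as exc:
        print(f'sentinel {host}: quorum problem ({exc})')
    except redis.RedisError as exc:
        print(f'sentinel {host}: unreachable ({exc})')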
4.2 Fault Recovery
- Master Node Down:
  - Sentinel automatically elects a slave node (e.g., 192.168.1.101) as the new master node.
  - Verify the new master node:
    redis-cli -h 192.168.1.101 -a your_secure_password INFO REPLICATION
  - Repair the original master node and rejoin it as a slave node (Sentinel will normally reconfigure the returning node as a replica automatically; the manual command is a safeguard):
    redis-cli -h 192.168.1.100 -a your_secure_password SLAVEOF 192.168.1.101 6379
- Sentinel Node Down:
  - Restart the Sentinel node and check the logs to confirm it rejoins the cluster.
  - If the Sentinel node cannot be restored, deploy a new Sentinel node and update the configuration.
- Operations Practice:
  - Regularly practice failover to ensure the switch time stays under 30 seconds (a drill sketch follows).
  - Back up the Sentinel configuration files and record the master-slave topology.
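For failover drills, the following redis-py sketch (assuming the addresses and master name used in this article) forces a switchover with SENTINEL FAILOVER and measures how long the advertised master address takes to change; run it against a test environment first.
# Failover drill: force a switchover and time the promotion (target: under 30s).
import time
import redis

sentinel = redis.Redis('192.168.1.100', 26379, decode_responses=True)

def master_addr():
    host, port = sentinel.execute_command('SENTINEL', 'get-master-addr-by-name', 'mymaster')
    return f'{host}:{port}'

before = master_addr()
sentinel.execute_command('SENTINEL', 'FAILOVER', 'mymaster')  # force a failover now

start = time.time()
while master_addr() == before:
    time.sleep(0.5)

print(f'failover {before} -> {master_addr()} took {time.time() - start:.1f}s')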
5. Operations Case: Sentinel Mode Optimization and Fault Recovery
5.1 Scenario Description
A Redis Sentinel deployment (one master, two slaves, three Sentinels; 16GB memory; Redis 7.0) experienced a master node failure during peak hours; failover took 45 seconds and about 10% of client connections were interrupted.
5.2 Troubleshooting Steps
- Check the Sentinel Logs:
  - The logs show the master node (192.168.1.100) was marked subjectively down, then objectively down.
  - Failover took a long time; down-after-milliseconds was set to 30 seconds.
- Check Replication:
  - Slave node 192.168.1.101 was elected as master, but its replication lag was high (about 5MB).
  - repl-backlog-size was only 1MB, which forced a full resynchronization.
- Check Clients:
  - Some clients did not obtain the new master node through Sentinel, resulting in connection failures.
5.3 Optimization Measures
- Adjust Sentinel and Replication Parameters:
  - Reduce down-after-milliseconds:
    sentinel down-after-milliseconds mymaster 10000
  - Increase repl-backlog-size:
    repl-backlog-size 50mb
- Optimize Slave Nodes:
  - Tune the replica election priority (slave-priority; lower values are elected first, 100 is the default).
  - Enable diskless replication:
    repl-diskless-sync yes
- Client Optimization:
  - Update clients to JedisSentinelPool, which supports dynamic master node switching.
- Enhance Monitoring:
  - Configure Prometheus alerts to monitor redis_sentinel_failover_state.
  - Add Grafana dashboards to display failover times.
5.4 Optimization Results
- Failover time reduced to 15 seconds.
- Client connection interruption rate reduced to 1%.
- Replication delay reduced to 100KB, enhancing system stability.
6. Summary and Outlook
This article explained the principles of Redis Sentinel mode, deployment steps, monitoring and alerts, troubleshooting, and recovery practices. The Sentinel mode significantly enhances Redis’s high availability through automatic failover, and operations engineers need to master configuration optimization, monitoring deployment, and client adaptation techniques.