Requirement for High Availability

In the previous post, I talked about how I use Pi-hole for my DNS resolution. DNS is a core component of how the internet works, so if I were to patch or reboot the host running Pi-hole, no devices on my home network would be able to use the internet. A simple fix would be to run Pi-hole on two separate hosts, such as another Raspberry Pi Zero. But on both Windows and Linux based operating systems, I've seen the failover to the secondary DNS server take a long time when the primary goes down, leaving a window where the internet is effectively down for the end user.

Solution to the DNS problem: Keepalived

The fix to this problem is to achieve high availability at the network layer, well below where DNS operates. The Virtual Router Redundancy Protocol (VRRP) is widely used by enterprise routing devices to provide exactly this. Keepalived is routing software written in C that implements the VRRP finite state machine (FSM). Keepalived can do a lot more, but at its very core, it provides a virtual IP address (VIP, or floating IP) that is held by a master node defined in the configuration. When the master node goes down or becomes unreachable, one of the backup nodes takes over based on its priority. In other words, the same IP address moves from one node to another in case of failure.

Keepalived Overview

Applying this to our earlier problem, we get two nodes with Pi-hole and Keepalived installed. Both nodes share a single IP address, the VIP. With the VIP configured as the DNS server, whenever the primary Pi-hole instance goes down, the VIP transparently moves to the backup node, and the DNS clients see no difference at all. It is business as usual for them since the IP address does not change.
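To see this in action, you can check which node currently holds the VIP and confirm that Pi-hole answers queries on it. These are just standard iproute2 and dig commands; the interface name ens160 and the VIP 10.10.10.254 are taken from my sample config further down, so substitute your own values:

# On a Keepalived node: the VIP appears on the interface of whichever node is currently MASTER
ip -4 addr show dev ens160

# From any client: query the VIP directly to verify Pi-hole is answering on it
dig @10.10.10.254 example.com +short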

Solution for synchronization: Gravity Sync

The above setup works just fine as it is. But if you notice, the two Pi-hole instances do not communicate with each other. Let's say you want to add a new domain to the blocklist/allowlist or add a new DNS A record. You would need to do that on both devices separately, doubling the amount of work. If, like me, you have 3 nodes running it, keeping those in sync can be a nightmare. Enter Gravity Sync, a tool that keeps the Pi-hole instances in sync automatically on a schedule. Under the hood, it uses SSH and rsync to keep the gravity database and dnsmasq configs in sync. This means that any change you make on the primary instance is copied over to all the backup instances during the scheduled sync.
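To give a feel for the mechanism (this is not Gravity Sync's actual command set, just a rough sketch of what an SSH-plus-rsync approach looks like), something along these lines, run as root from a cron job on each backup node, would pull the standard Pi-hole files from a primary at the placeholder address 10.10.10.3 under a placeholder user:

# Sketch only: copy the gravity database and dnsmasq configs from the primary node
rsync -a pi@10.10.10.3:/etc/pihole/gravity.db /etc/pihole/gravity.db
rsync -a pi@10.10.10.3:/etc/dnsmasq.d/ /etc/dnsmasq.d/

# Restart the local DNS service so the new lists and records are picked up
pihole restartdns

Gravity Sync handles the scheduling, SSH keys, and conflict handling for you, which is exactly why I use it instead of rolling my own.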

Keepalived with Gravity Sync

My running setup showcase

Primary node statistics

These are the statistics from my primary node, which serves most of the traffic. I have close to a million entries in the blocklist and an impressive 58% block rate, most of which are Google ad domains and other tracking sites.

HA DNS server setup with 3 nodes

The above diagram represents how I have it set up in my home lab. The two instances running in virtual machines carry the bulk of the load, as they hold the two Keepalived VIPs. If one of them fails, the other node takes over both VIPs due to how the priorities are set. If both VM nodes fail, the two VIPs transfer to the Raspberry Pi Zero, which is physically separate from the ESXi server.

The reason the Pi does not get control of either VIP until both other nodes fail is that the Pi Zero is slow. Not just in terms of DNS requests handled per second (plenty for normal use, but servers tend to be chatty with DNS), but also in terms of available bandwidth, since it connects to the network over a 2.4GHz wireless link. The VMs have wired gigabit Ethernet, so this is not an issue for them.

Sample Keepalived config

Here is a sample config from the primary node (named nuc here) to give you an idea:

vrrp_instance nuc-stdby-pi {
    state MASTER                      # Initializes by taking ownership of this VIP
    interface ens160
    virtual_router_id 51              # Instance 1
    priority 150                      # Highest priority of the three
    advert_int 1
    unicast_src_ip 10.10.10.3         # IP of the node
    unicast_peer {
        10.10.10.8                    # IP of the 2nd node
        10.10.10.21                   # IP of the 3rd node
    }
    authentication {
        # Use the same password on all nodes to authenticate
        auth_type PASS
        auth_pass securepass123
    }
    virtual_ipaddress {
        # Virtual IP which will be used as the first DNS server entry
        10.10.10.254/24
    }
}

vrrp_instance stdby-nuc-pi {
    state BACKUP                      # Initializes as a standby node for this VIP
    interface ens160
    virtual_router_id 52              # Instance 2
    priority 100                      # Second highest priority of the three
    advert_int 1
    unicast_src_ip 10.10.10.3         # IP of the node
    unicast_peer {
        10.10.10.8                    # IP of the 2nd node
        10.10.10.21                   # IP of the 3rd node
    }
    authentication {
        # Use the same password on all nodes to authenticate; it need not be the same as the instance above
        auth_type PASS
        auth_pass securepass123
    }
    virtual_ipaddress {
        # Virtual IP which will be used as the second DNS server entry
        10.10.10.253/24
    }
}
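For completeness, here is a rough sketch (not copied from my actual second node, and assuming the same interface name) of what the corresponding config on the second VM could look like: the same two instances, but with the state and priority flipped so it acts as BACKUP for the first VIP and MASTER for the second, and with its own address as the unicast source:

vrrp_instance nuc-stdby-pi {
    state BACKUP                      # Standby for the first VIP
    interface ens160
    virtual_router_id 51              # Must match instance 1 on the other nodes
    priority 100                      # Second highest of the three (primary is 150)
    advert_int 1
    unicast_src_ip 10.10.10.8         # IP of this node
    unicast_peer {
        10.10.10.3                    # IP of the primary node
        10.10.10.21                   # IP of the 3rd node
    }
    authentication {
        auth_type PASS
        auth_pass securepass123       # Same password as on the other nodes
    }
    virtual_ipaddress {
        10.10.10.254/24               # First VIP / first DNS server entry
    }
}

vrrp_instance stdby-nuc-pi {
    state MASTER                      # Owns the second VIP by default
    interface ens160
    virtual_router_id 52              # Must match instance 2 on the other nodes
    priority 150                      # Highest priority for this VIP
    advert_int 1
    unicast_src_ip 10.10.10.8
    unicast_peer {
        10.10.10.3
        10.10.10.21
    }
    authentication {
        auth_type PASS
        auth_pass securepass123
    }
    virtual_ipaddress {
        10.10.10.253/24               # Second VIP / second DNS server entry
    }
}

The Pi Zero would then carry both instances as BACKUP with a lower priority than either VM, which is what keeps the VIPs off it until both VMs are down.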

Outcome of this adventure

With this, I have a highly available pair plus an additional redundant node for my DNS services, meaning any reboot or patching of the hypervisor or of individual VMs does not affect internet access for devices on my network.