How to fix leader re-election in Consul clusters
If a Consul cluster has lost its leader, running consul operator raft list-peers returns the following error:
➜ kubectl exec -it object-consul-server-0 -- consul operator raft list-peers \
    -tls-server-name=object-consul-server.test.com \
    -client-cert=/consul/tls/server/tls.crt \
    -client-key=/consul/tls/server/tls.key
Error getting peers: Failed to retrieve raft configuration: Unexpected response code: 500 (No cluster leader)
command terminated with exit code 1
To fix this, we need to manually specify which Consul server Pods should be part of the cluster by listing them in a peers.json file that the Consul server agent reads at startup.
Assuming you have the following Consul Pods:
➜ kubectl exec -it object-consul-server-0 -- consul members \
    -tls-server-name=object-consul-server.test.com \
    -client-cert=/consul/tls/server/tls.crt \
    -client-key=/consul/tls/server/tls.key
Node Address Status Type Build Protocol DC Partition Segment
object-consul-server-0 server-0-IP:8301 alive server 1.14.11 2 test default <all>
object-consul-server-1 server-1-IP:8301 alive server 1.14.11 2 test default <all>
object-consul-server-2 server-2-IP:8301 alive server 1.14.11 2 test default <all>
object-consul-server-3 server-3-IP:8301 alive server 1.14.11 2 test default <all>
object-consul-server-5 server-5-IP:8301 alive server 1.14.11 2 test default <all>
Currently, none of the Pods is the leader of the cluster. Let's create a peers.json file inside object-consul-server-0 and restart the Pod so that Consul picks up the details in the file, makes object-consul-server-0 the leader of the cluster, and joins all the other replicas as followers. To do this, you need the IP addresses and the node-ids of all the Consul Pods. The IPs are available in the consul members output above; note, however, that members shows the Serf LAN port (8301), while peers.json needs the server RPC port, which is 8300 by default. You can find the node-id of a Consul Pod like so:
➜ kubectl exec -it object-consul-server-2 -- cat /consul/data/node-id
274fff00-bbb2-150c-dc06-ee7293f18570
The node-id is stored in a /consul/data/node-id file inside each respective Pod.
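If you'd rather not run that command Pod by Pod, a small loop like the one below prints every node-id in one go. The Pod names here match the members output above and may differ in your cluster:

for pod in object-consul-server-0 object-consul-server-1 \
           object-consul-server-2 object-consul-server-3 \
           object-consul-server-5; do
  # the node-id file may lack a trailing newline, so print our own
  printf '%s: ' "$pod"
  kubectl exec "$pod" -- cat /consul/data/node-id
  echo
done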
With this information, create a peers.json file like so:
[
  {
    "id": "248b78d1-fc5e-b968-663b-04357fa8c288",
    "address": "server-0-IP:8300",
    "non_voter": false
  },
  {
    "id": "960601ff-db00-771b-e054-214570366719",
    "address": "server-1-IP:8300",
    "non_voter": false
  },
  {
    "id": "274fff00-bbb2-150c-dc06-ee7293f18570",
    "address": "server-2-IP:8300",
    "non_voter": false
  },
  {
    "id": "d4d0eb6f-cfdd-dc9a-e34b-41d8a8ca334f",
    "address": "server-3-IP:8300",
    "non_voter": false
  },
  {
    "id": "d7063fc0-bfb8-51b2-e62c-67a2d14b8303",
    "address": "server-5-IP:8300",
    "non_voter": false
  }
]
The peers.json file is an array of all the Pods in our Consul cluster, where id is the node-id and address is the IP address and server RPC port (8300) of each respective Pod.
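Collecting the ids and IPs by hand is error-prone across five Pods, so here is a rough sketch that generates the file automatically. It is only a sketch: it assumes the Pod names above, the default RPC port 8300, and that kubectl points at the namespace the Pods live in:

pods="object-consul-server-0 object-consul-server-1 object-consul-server-2 \
object-consul-server-3 object-consul-server-5"

{
  echo "["
  first=true
  for pod in $pods; do
    # node-id as stored inside the Pod; strip any stray whitespace
    id=$(kubectl exec "$pod" -- cat /consul/data/node-id | tr -d '[:space:]')
    # Pod IP as reported by Kubernetes
    ip=$(kubectl get pod "$pod" -o jsonpath='{.status.podIP}')
    $first || echo ","
    first=false
    printf '  {\n    "id": "%s",\n    "address": "%s:8300",\n    "non_voter": false\n  }' "$id" "$ip"
  done
  echo
  echo "]"
} > peers.json

Inspect the generated peers.json before using it; if any Pod is unreachable, its entry will come out empty.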
Once you have this file ready, save it as /consul/data/raft/peers.json in the Pod that you want to make the leader of the cluster and restart that specific Pod. Once the Pod gets recreated, Consul will pick up the peers.json file, elect that Pod as the leader, and join all the other Pods into the cluster as followers. Consul deletes the peers.json file once it has been used.
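In Kubernetes terms, those last steps look roughly like this; kubectl cp relies on tar being available in the container (the stock Consul image should have it), and deleting the Pod is safe because the StatefulSet recreates it with the same persistent data volume:

# Copy the file into the raft directory of the Pod that should lead
kubectl cp peers.json object-consul-server-0:/consul/data/raft/peers.json

# Delete the Pod; the StatefulSet recreates it, and Consul reads
# peers.json at startup
kubectl delete pod object-consul-server-0

# Once the Pod is Running again, confirm that a leader was elected
kubectl exec -it object-consul-server-0 -- consul operator raft list-peers \
    -tls-server-name=object-consul-server.test.com \
    -client-cert=/consul/tls/server/tls.crt \
    -client-key=/consul/tls/server/tls.key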
If you want more detail on this entire process, read the post on disaster recovery in the Consul docs or run cat /consul/data/raft/peers.info inside a Consul Pod. The peers.info file also has some details on how peers.json works.