During the lifecycle of an Operator, it is possible that there may be more than one instance running at any given time, for example when rolling out an upgrade for the Operator. In such a scenario, it is necessary to avoid contention between multiple Operator instances using leader election. This ensures only one leader instance handles the reconciliation while the other instances are inactive but ready to take over when the leader steps down.

There are two different leader election implementations to choose from, each with its own trade-off:

  • Leader-for-life: The leader Pod only gives up leadership (using garbage collection) when it is deleted. This implementation precludes the possibility of two instances mistakenly running as leaders (split brain). However, this method can be subject to a delay in electing a new leader. For example, when the leader Pod is on an unresponsive or partitioned node, the pod-eviction-timeout dictates how it takes for the leader Pod to be deleted from the node and step down (default 5m). See the Leader-for-life Go documentation for more.

  • Leader-with-lease: The leader Pod periodically renews the leader lease and gives up leadership when it cannot renew the lease. This implementation allows for a faster transition to a new leader when the existing leader is isolated, but there is a possibility of split brain in certain situations. See the Leader-with-lease Go documentation for more.

By default, the Operator SDK enables the Leader-for-life implementation. Consult the related Go documentation for both approaches to consider the trade-offs that make sense for your use case,

The following examples illustrate how to use the two options.

Using Leader-for-life election

With the Leader-for-life election implementation, a call to leader.Become() blocks the Operator as it retries until it can become the leader by creating the ConfigMap named memcached-operator-lock:

import (
  ...
  "github.com/operator-framework/operator-sdk/pkg/leader"
)

func main() {
  ...
  err = leader.Become(context.TODO(), "memcached-operator-lock")
  if err != nil {
    log.Error(err, "Failed to retry for leader lock")
    os.Exit(1)
  }
  ...
}

If the Operator is not running inside a cluster, leader.Become() simply returns without error to skip the leader election since it cannot detect the Operator’s namespace.

Using Leader-with-lease election

The Leader-with-lease implementation can be enabled using the Manager Options for leader election:

import (
  ...
  "sigs.k8s.io/controller-runtime/pkg/manager"
)

func main() {
  ...
  opts := manager.Options{
    ...
    LeaderElection: true,
    LeaderElectionID: "memcached-operator-lock"
  }
  mgr, err := manager.New(cfg, opts)
  ...
}

When the Operator is not running in a cluster, the Manager returns an error when starting since it cannot detect the Operator’s namespace in order to create the ConfigMap for leader election. You can override this namespace by setting the Manager’s LeaderElectionNamespace option.