Architecture Basics

pg_auto_failover is designed as a simple and robust way to manage automated Postgres failover in production. On top of robust operations, pg_auto_failover setup is flexible and allows either Business Continuity or High Availability configurations. The pg_auto_failover design also allows configuration changes in a live system without downtime.

pg_auto_failover is designed to handle a single PostgreSQL service using three nodes. In this setting, the system is resilient to losing any one of the three nodes.

Figure: pg_auto_failover Architecture for a standalone PostgreSQL service

It is important to understand that when using only two Postgres nodes, pg_auto_failover is optimized for Business Continuity. In the event of losing a single node, pg_auto_failover is capable of continuing the PostgreSQL service, and prevents any data loss when doing so, thanks to PostgreSQL Synchronous Replication.

That said, there is a trade-off involved in this architecture. The business continuity bias relaxes the replication guarantees to asynchronous replication in the event of a standby node failure. This allows the PostgreSQL service to accept writes when only a single server is available, and opens the service to potential data loss if the primary server were also to fail.

The pg_auto_failover Monitor

Each PostgreSQL node in pg_auto_failover runs a Keeper process which informs a central Monitor node about notable local changes. Some changes require the Monitor to orchestrate a correction across the cluster:

  • New nodes

    At initialization time, it’s necessary to prepare the configuration of each node for PostgreSQL streaming replication, and get the cluster to converge to the nominal state with both a primary and a secondary node in each group. The monitor determines each new node’s role.

  • Node failure

    The monitor orchestrates a failover when it detects an unhealthy node. The design of pg_auto_failover allows the monitor to shut down service to a previously designated primary node without causing a “split-brain” situation.

The monitor is the authoritative node that manages global state and makes changes in the cluster by issuing commands to the nodes’ keeper processes. A pg_auto_failover monitor node failure has limited impact on the system. While it prevents reacting to other nodes’ failures, it does not affect replication. The PostgreSQL streaming replication setup installed by pg_auto_failover does not depend on having the monitor up and running.

pg_auto_failover Glossary

pg_auto_failover handles a single PostgreSQL service with the following concepts:

Monitor

The pg_auto_failover monitor is a service that keeps track of one or several formations containing groups of nodes.

The monitor is implemented as a PostgreSQL extension, so when you run the command pg_autoctl create monitor, a PostgreSQL instance is initialized, configured with the extension, and started. The monitor service embeds a PostgreSQL instance.
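
For example, a monitor node can be created and started with a command along the following lines; the data directory, host name, and authentication choices here are placeholders to adapt, and option names follow recent pg_auto_failover releases:

# initialize and start a monitor node (paths and host names are examples)
$ pg_autoctl create monitor --pgdata ./monitor \
      --hostname monitor.example.com --auth trust --ssl-self-signed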

Formation

A formation is a logical set of PostgreSQL services that are managed together.

It is possible to operate many formations with a single monitor instance. Each formation contains one or more groups of Postgres nodes, and the FSM orchestration implemented by the monitor applies separately to each group.
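
As a sketch, an extra formation can be registered on the monitor with a command such as the following; the formation name and data directory are placeholders:

# register a new formation named "app2" on the monitor (names are examples)
$ pg_autoctl create formation --pgdata ./node_a --formation app2 --kind pgsql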

Group

A group of two PostgreSQL nodes works together to provide a single PostgreSQL service in a Highly Available fashion. A group consists of a PostgreSQL primary server and a secondary server set up with Hot Standby synchronous replication. Note that pg_auto_failover can orchestrate the whole setup of the replication for you.

In pg_auto_failover versions up to 1.3, a single Postgres group can contain only two Postgres nodes. Starting with pg_auto_failover 1.4, there’s no limit to the number of Postgres nodes in a single group. Note that each Postgres instance that belongs to the same group serves the same dataset in its data directory (PGDATA).
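
For instance, two Postgres nodes registered against the same monitor (and the same formation) end up in the same group, and pg_auto_failover sets up the streaming replication between them; host names, data directories, and the monitor URI below are placeholders:

# first node of the group, initialized as the primary
$ pg_autoctl create postgres --pgdata ./node_a --hostname node-a.example.com \
      --monitor 'postgres://autoctl_node@monitor.example.com/pg_auto_failover'

# second node of the same group, cloned from the primary and set up as a standby
$ pg_autoctl create postgres --pgdata ./node_b --hostname node-b.example.com \
      --monitor 'postgres://autoctl_node@monitor.example.com/pg_auto_failover'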

Note

The notion of a formation that contains multiple groups in pg_auto_failover is useful when setting up and managing a whole Citus formation, where the coordinator nodes belong to group zero of the formation, and each Citus worker node becomes its own group and may have Postgres standby nodes.

Keeper

The pg_auto_failover keeper is an agent that must be running on the same server where your PostgreSQL nodes are running. The keeper controls the local PostgreSQL instance (using both the pg_ctl command-line tool and SQL queries), and communicates with the monitor:

  • it sends updated data about the local node, such as the WAL delta between servers, measured via PostgreSQL statistics views.

  • it receives state assignments from the monitor.

The keeper also maintains local state that includes the most recent communication established with the monitor and with the other PostgreSQL node of its group, enabling it to detect Network Partitions.
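
The keeper is typically started as a long-running service with the pg_autoctl run command; the data directory below is a placeholder:

# start the keeper service for the local node; it also starts Postgres as needed
$ pg_autoctl run --pgdata ./node_a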

Note

In pg_auto_failover versions up to and including 1.3, the keeper process started with pg_autoctl run manages a separate Postgres instance, running as its own process tree.

Starting in pg_auto_failover version 1.4, the keeper process (started with pg_autoctl run) runs the Postgres instance as a sub-process of the main pg_autoctl process, allowing tighter control over the Postgres execution. Running the sub-process also makes the solution work better both in container environments (because it’s now a single process tree) and with systemd, because it uses a specific cgroup per service unit.
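
When targeting systemd, pg_autoctl can generate the service unit itself; the unit file name and data directory below are placeholders:

# generate, install, and enable a systemd unit for the local pg_autoctl service
$ pg_autoctl -q show systemd --pgdata ./node_a > pgautofailover.service
$ sudo mv pgautofailover.service /etc/systemd/system
$ sudo systemctl daemon-reload
$ sudo systemctl enable --now pgautofailover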

Node

A node is a server (virtual or physical) that runs PostgreSQL instances and a keeper service. At any given time, any node might be a primary or a secondary Postgres instance. The whole point of pg_auto_failover is to decide this state.

As a result, refrain from naming your nodes with the role you intend for them. Their roles can change. If they didn’t, your system wouldn’t need pg_auto_failover!

State

A state is the representation of the per-instance and per-group situation. The monitor and the keeper implement a Finite State Machine to drive operations in the PostgreSQL groups, allowing pg_auto_failover to implement High Availability with the goal of zero data loss.

The keeper main loop enforces the current expected state of the local PostgreSQL instance, and reports the current state and some more information to the monitor. The monitor uses this set of information and its own health-check information to drive the State Machine and assign a goal state to the keeper.

The keeper implements the transitions between a current state and a monitor-assigned goal state.
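
The current (reported) and goal (assigned) states of every node can be inspected at any time; for example, with a placeholder data directory:

# list every node with its reported state and its monitor-assigned goal state
$ pg_autoctl show state --pgdata ./node_a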

Client-side HA

Client-side High Availability support is included in PostgreSQL’s driver libpq from version 10 onward. Using this driver, it is possible to specify multiple host names or IP addresses in the same connection string:

$ psql -d "postgresql://host1,host2/dbname?target_session_attrs=read-write"
$ psql -d "postgresql://host1:port2,host2:port2/dbname?target_session_attrs=read-write"
$ psql -d "host=host1,host2 port=port1,port2 target_session_attrs=read-write"

When using any of the connection strings above, the psql application attempts to connect to host1 and, when successfully connected, checks the target_session_attrs as per the PostgreSQL documentation:

If this parameter is set to read-write, only a connection in which read-write transactions are accepted by default is considered acceptable. The query SHOW transaction_read_only will be sent upon any successful connection; if it returns on, the connection will be closed. If multiple hosts were specified in the connection string, any remaining servers will be tried just as if the connection attempt had failed. The default value of this parameter, any, regards all connections as acceptable.

When the connection attempt to host1 fails, or when the target_session_attrs can not be verified, then the psql application attempts to connect to host2.

The behavior is implemented in the connection library libpq, so any application using it can benefit from this implementation, not just psql.

When using pg_auto_failover, configure your application connection string to use the primary and the secondary server host names, and set target_session_attrs=read-write too, so that your application automatically connects to the current primary, even after a failover has occurred.
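
pg_autoctl can print such a connection string for a formation; for example, with a placeholder data directory, the following command outputs the monitor URI and a multi-host application URI that already includes target_session_attrs=read-write:

# print the monitor URI and the application URI for each formation
$ pg_autoctl show uri --pgdata ./node_a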

Monitoring protocol

The monitor interacts with the data nodes in two ways:

  • Data nodes periodically connect and run SELECT pgautofailover.node_active(…) to communicate their current state and obtain their goal state.

  • The monitor periodically connects to all the data nodes to see if they are healthy, doing the equivalent of pg_isready.

When a data node calls node_active, the state of the node is stored in the pgautofailover.node table and the state machines of both nodes are progressed. The state machines are described later in this document. The monitor typically only moves one state forward and waits for the node(s) to converge, except in failure states.

If a node stops communicating with the monitor, this either causes a failover (if the node is a primary), disables synchronous replication (if the node is a secondary), or causes the state machine to pause until the node comes back (in other cases). In most cases, the latter is harmless, though in some cases it may cause downtime to last longer, e.g. if a standby goes down during a failover.

To simplify operations, a node is only considered unhealthy if the monitor cannot connect and it hasn’t reported its state through node_active for a while. This allows, for example, PostgreSQL to be restarted without causing a health check failure.
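
The nodes registered on the monitor and the states it has recorded can also be inspected directly in the monitor’s Postgres instance; as a sketch, assuming the monitor database is reachable locally (the exact columns of pgautofailover.node depend on the pg_auto_failover version):

# on the monitor node, list the registered nodes and their recorded states
$ psql -d pg_auto_failover -c 'TABLE pgautofailover.node;'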

Synchronous vs. asynchronous replication

By default, pg_auto_failover uses synchronous replication, which means all writes block until at least one standby node has reported receiving them. To handle cases in which the standby fails, the primary switches between two states called wait_primary and primary, based on the health of its standby nodes and on the replication setting number_sync_standbys.

When in the wait_primary state, synchronous replication is disabled by automatically setting synchronous_standby_names = '' to allow writes to proceed. However, doing so also disables failover, since the standby might get arbitrarily far behind. If the standby is responding to health checks and is within 1 WAL segment of the primary (by default), synchronous replication is enabled again on the primary by setting synchronous_standby_names = '*', which may cause a short latency spike since writes will then block until the standby has caught up.

When using several standby nodes with replication quorum enabled, the actual setting for synchronous_standby_names is set to the list of those standby nodes that are set to participate in the replication quorum.
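
Both the formation-wide number-sync-standbys setting and each node’s participation in the replication quorum can be changed at runtime with pg_autoctl; for example, with placeholder data directories:

# require one synchronous standby for the whole formation
$ pg_autoctl set formation number-sync-standbys 1 --pgdata ./node_a

# remove the local node from the replication quorum (it replicates asynchronously)
$ pg_autoctl set node replication-quorum false --pgdata ./node_c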

If you wish to disable synchronous replication, you need to add the following to postgresql.conf:

synchronous_commit = 'local'

This ensures that writes return as soon as they are committed on the primary, under all circumstances. In that case, failover might lead to some data loss, but failover is not initiated if the secondary is more than 10 WAL segments (by default) behind the primary. During a manual failover, the standby will continue accepting writes from the old primary. The standby will stop accepting writes only if it’s fully caught up (most common), the primary fails, or it does not receive writes for 2 minutes.

A note about performance

In some cases, the write-latency impact of synchronous replication makes the application fail to deliver the expected performance. If testing or production feedback shows this to be the case, it is beneficial to switch to asynchronous replication.

The way to use asynchronous replication in pg_auto_failover is to change the synchronous_commit setting. This setting can be set per transaction, per session, or per user. It does not have to be set globally on your Postgres instance.

One way to benefit from that would be:

alter role fast_and_loose set synchronous_commit to local;

That way, performance-critical parts of the application don’t have to wait for the standby nodes. Only use this when you can accept lowering your data durability guarantees.
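
Because synchronous_commit is a regular Postgres setting, it can also be relaxed for a single session or a single transaction rather than for a whole role; for example (the audit_log table below is only an illustration):

begin;
-- only this transaction skips waiting for the synchronous standby
set local synchronous_commit to local;
insert into audit_log(message) values ('fast path, reduced durability');
commit;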

Node recovery

When bringing a node back after a failover, the keeper (pg_autoctl run) can simply be restarted. It will also restart Postgres if needed and obtain its goal state from the monitor. If the failed node was a primary and was demoted, it will learn this from the monitor. Once the node reports in, it is allowed to come back as a standby by running pg_rewind. If it is too far behind, the node performs a new pg_basebackup.
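
As a sketch, with a placeholder data directory, bringing the failed node back is just a matter of starting its keeper again; the pg_rewind or pg_basebackup step is then handled automatically:

# restart the keeper on the failed node; it learns from the monitor that it was
# demoted and rejoins its group as a standby (pg_rewind or a new pg_basebackup)
$ pg_autoctl run --pgdata ./node_a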