.. _citus_quickstart: Citus Tutorial ============== In this guide we’ll create a Citus cluster with a coordinator node and three workers. Every node will have a secondary for failover. We’ll simulate failure in the coordinator and worker nodes and see how the system continues to function. This tutorial uses `docker compose`__ in order to separate the architecture design from some of the implementation details. This allows reasoning at the architecture level within this tutorial, and better see which software component needs to be deployed and run on which node. __ https://docs.docker.com/compose/ The setup provided in this tutorial is good for replaying at home in the lab. It is not intended to be production ready though. In particular, no attention have been spent on volume management. After all, this is a tutorial: the goal is to walk through the first steps of using pg_auto_failover to provide HA to a Citus formation. Pre-requisites -------------- When using `docker compose` we describe a list of services, each service may run on one or more nodes, and each service just runs a single isolated process in a container. Within the context of a tutorial, or even a development environment, this matches very well to provisioning separate physical machines on-prem, or Virtual Machines either on-prem on in a Cloud service. The docker image used in this tutorial is named `pg_auto_failover:citus`. It can be built locally when using the attached :download:`Dockerfile ` found within the GitHub repository for pg_auto_failover. To build the image, either use the provided Makefile and run ``make build``, or run the docker build command directly: :: $ git clone https://github.com/hapostgres/pg_auto_failover $ cd pg_auto_failover/docs/cluster $ docker build -t pg_auto_failover:citus -f Dockerfile ../.. $ docker compose build Our first Citus Cluster ----------------------- To create a cluster we use the following docker compose definition: .. literalinclude:: citus/docker-compose-scale.yml :language: yaml :emphasize-lines: 5,15,27 :linenos: To run the full Citus cluster with HA from this definition, we can use the following command: :: $ docker compose up --scale coord=2 --scale worker=6 The command above starts the services up. The command also specifies a ``--scale`` option that is different for each service. We need: - one monitor node, and the default scale for a service is 1, - one primary Citus coordinator node and one secondary Cituscoordinator node, which is to say two coordinator nodes, - and three Citus worker nodes, each worker with both a primary Postgres node and a secondary Postgres node, so that's a scale of 6 here. The default policy for the pg_auto_failover monitor is to assign a primary and a secondary per auto failover :ref:`group`. In our case, every node being provisioned with the same command, we benefit from that default policy:: $ pg_autoctl create worker --ssl-self-signed --auth trust --pg-hba-lan --run When provisioning a production cluster, it is often required to have a better control over which node participates in which group, then using the ``--group N`` option in the ``pg_autoctl create worker`` command line. Within a given group, the first node that registers is a primary, and the other nodes are secondary nodes. The monitor takes care of that in a way that we don't have to. In a High Availability setup, every node should be ready to be promoted primary at any time, so knowing which node in a group is assigned primary first is not very interesting. While the cluster is being provisionned by docker compose, you can run the following command and have a dynamic dashboard to follow what's happening. The following command is like ``top`` for pg_auto_failover:: $ docker compose exec monitor pg_autoctl watch Because the ``pg_basebackup`` operation that is used to create the secondary nodes takes some time when using Citus, because of the first CHECKPOINT which is quite slow. So at first when inquiring about the cluster state you might see the following output: .. code-block:: bash $ docker compose exec monitor pg_autoctl show state Name | Node | Host:Port | TLI: LSN | Connection | Reported State | Assigned State ---------+-------+-------------------+----------------+--------------+---------------------+-------------------- coord0a | 0/1 | cd52db444544:5432 | 1: 0/200C4A0 | read-write | wait_primary | wait_primary coord0b | 0/2 | 66a31034f2e4:5432 | 1: 0/0 | none ! | wait_standby | catchingup worker1a | 1/3 | dae7c062e2c1:5432 | 1: 0/2003B18 | read-write | wait_primary | wait_primary worker1b | 1/4 | 397e6069b09b:5432 | 1: 0/0 | none ! | wait_standby | catchingup worker2a | 2/5 | 5bf86f9ef784:5432 | 1: 0/2006AB0 | read-write | wait_primary | wait_primary worker2b | 2/6 | 23498b801a61:5432 | 1: 0/0 | none ! | wait_standby | catchingup worker3a | 3/7 | c23610380024:5432 | 1: 0/2003B18 | read-write | wait_primary | wait_primary worker3b | 3/8 | 2ea8aac8a04a:5432 | 1: 0/0 | none ! | wait_standby | catchingup After a while though (typically around a minute or less), you can run that same command again for stable result: .. code-block:: bash $ docker compose exec monitor pg_autoctl show state Name | Node | Host:Port | TLI: LSN | Connection | Reported State | Assigned State ---------+-------+-------------------+----------------+--------------+---------------------+-------------------- coord0a | 0/1 | cd52db444544:5432 | 1: 0/3127AD0 | read-write | primary | primary coord0b | 0/2 | 66a31034f2e4:5432 | 1: 0/3127AD0 | read-only | secondary | secondary worker1a | 1/3 | dae7c062e2c1:5432 | 1: 0/311B610 | read-write | primary | primary worker1b | 1/4 | 397e6069b09b:5432 | 1: 0/311B610 | read-only | secondary | secondary worker2a | 2/5 | 5bf86f9ef784:5432 | 1: 0/311B610 | read-write | primary | primary worker2b | 2/6 | 23498b801a61:5432 | 1: 0/311B610 | read-only | secondary | secondary worker3a | 3/7 | c23610380024:5432 | 1: 0/311B648 | read-write | primary | primary worker3b | 3/8 | 2ea8aac8a04a:5432 | 1: 0/311B648 | read-only | secondary | secondary You can see from the above that the coordinator node has a primary and secondary instance for high availability. When connecting to the coordinator, clients should try connecting to whichever instance is running and supports reads and writes. We can review the available Postgres URIs with the :ref:`pg_autoctl_show_uri` command:: $ docker compose exec monitor pg_autoctl show uri Type | Name | Connection String -------------+---------+------------------------------- monitor | monitor | postgres://autoctl_node@552dd89d5d63:5432/pg_auto_failover?sslmode=require formation | default | postgres://66a31034f2e4:5432,cd52db444544:5432/citus?target_session_attrs=read-write&sslmode=require To check that Citus worker nodes have been registered to the coordinator, we can run a psql session right from the coordinator container: .. code-block:: bash $ docker compose exec coord psql -d citus -c 'select * from citus_get_active_worker_nodes();' node_name | node_port --------------+----------- dae7c062e2c1 | 5432 5bf86f9ef784 | 5432 c23610380024 | 5432 (3 rows) We are now reaching the limits of using a simplified docker compose setup. When using the ``--scale`` option, it is not possible to give a specific hostname to each running node, and then we get a randomly generated string instead or useful node names such as ``worker1a`` or ``worker3b``. Create a Citus Cluster, take two -------------------------------- In order to implement the following architecture, we need to introduce a more complex docker compose file than in the previous section. .. figure:: ./tikz/arch-citus.svg :alt: pg_auto_failover architecture with a Citus formation pg_auto_failover architecture with a Citus formation This time we create a cluster using the following docker compose definition: .. literalinclude:: citus/docker-compose.yml :language: yaml :emphasize-lines: 3,15,40,44,48,52,56,60,64,68 :linenos: This definition is a little more involved than the previous one. We take benefit from `YAML anchors and aliases`__ to define a *template* for our coordinator nodes and worker nodes, and then apply that template to the actual nodes. __ https://yaml101.com/anchors-and-aliases/ Also this time we provision an application service (named "app") that sits in the background and allow us to later connect to our current primary coordinator. See :download:`Dockerfile.app ` for the complete definition of this service. We start this cluster with a simplified command line this time: :: $ docker compose up And this time we get the following cluster as a result: :: $ docker compose exec monitor pg_autoctl show state Name | Node | Host:Port | TLI: LSN | Connection | Reported State | Assigned State ---------+-------+---------------+----------------+--------------+---------------------+-------------------- coord0a | 0/3 | coord0a:5432 | 1: 0/312B040 | read-write | primary | primary coord0b | 0/4 | coord0b:5432 | 1: 0/312B040 | read-only | secondary | secondary worker1a | 1/1 | worker1a:5432 | 1: 0/311C550 | read-write | primary | primary worker1b | 1/2 | worker1b:5432 | 1: 0/311C550 | read-only | secondary | secondary worker2b | 2/7 | worker2b:5432 | 2: 0/5032698 | read-write | primary | primary worker2a | 2/8 | worker2a:5432 | 2: 0/5032698 | read-only | secondary | secondary worker3a | 3/5 | worker3a:5432 | 1: 0/311C870 | read-write | primary | primary worker3b | 3/6 | worker3b:5432 | 1: 0/311C870 | read-only | secondary | secondary And then we have the following application connection string to use: :: $ docker compose exec monitor pg_autoctl show uri Type | Name | Connection String -------------+---------+------------------------------- monitor | monitor | postgres://autoctl_node@f0135b83edcd:5432/pg_auto_failover?sslmode=require formation | default | postgres://coord0b:5432,coord0a:5432/citus?target_session_attrs=read-write&sslmode=require And finally, the nodes being registered as Citus worker nodes also make more sense: :: $ docker compose exec coord0a psql -d citus -c 'select * from citus_get_active_worker_nodes()' node_name | node_port -----------+----------- worker1a | 5432 worker3a | 5432 worker2b | 5432 (3 rows) .. important:: At this point, it is important to note that the Citus coordinator only knows about the primary nodes in each group. The High Availability mechanisms are all implemented in pg_auto_failover, which mostly uses the Citus API ``citus_update_node`` during worker node failovers. Our first Citus worker failover ------------------------------- We see that in the ``citus_get_active_worker_nodes()`` output we have ``worker1a``, ``worker2b``, and ``worker3a``. As mentioned before, that should have no impact on the operations of the Citus cluster when nodes are all dimensionned the same. That said, some readers among you will prefer to have the *A* nodes as primaries to get started with. So let's implement our first worker failover then. With pg_auto_failover, this is as easy as doing: :: $ docker compose exec monitor pg_autoctl perform failover --group 2 15:40:03 9246 INFO Waiting 60 secs for a notification with state "primary" in formation "default" and group 2 15:40:03 9246 INFO Listening monitor notifications about state changes in formation "default" and group 2 15:40:03 9246 INFO Following table displays times when notifications are received Time | Name | Node | Host:Port | Current State | Assigned State ---------+----------+-------+---------------+---------------------+-------------------- 22:58:42 | worker2b | 2/7 | worker2b:5432 | primary | draining 22:58:42 | worker2a | 2/8 | worker2a:5432 | secondary | prepare_promotion 22:58:42 | worker2a | 2/8 | worker2a:5432 | prepare_promotion | prepare_promotion 22:58:42 | worker2a | 2/8 | worker2a:5432 | prepare_promotion | wait_primary 22:58:42 | worker2b | 2/7 | worker2b:5432 | primary | demoted 22:58:42 | worker2b | 2/7 | worker2b:5432 | draining | demoted 22:58:42 | worker2b | 2/7 | worker2b:5432 | demoted | demoted 22:58:43 | worker2a | 2/8 | worker2a:5432 | wait_primary | wait_primary 22:58:44 | worker2b | 2/7 | worker2b:5432 | demoted | catchingup 22:58:46 | worker2b | 2/7 | worker2b:5432 | catchingup | catchingup 22:58:46 | worker2b | 2/7 | worker2b:5432 | catchingup | secondary 22:58:46 | worker2b | 2/7 | worker2b:5432 | secondary | secondary 22:58:46 | worker2a | 2/8 | worker2a:5432 | wait_primary | primary 22:58:46 | worker2a | 2/8 | worker2a:5432 | primary | primary So it took around 5 seconds to do a full worker failover in worker group 2. Now we'll do the same on the group 1 to fix the other situation, and review the resulting cluster state. :: $ docker compose exec monitor pg_autoctl show state Name | Node | Host:Port | TLI: LSN | Connection | Reported State | Assigned State ---------+-------+---------------+----------------+--------------+---------------------+-------------------- coord0a | 0/3 | coord0a:5432 | 1: 0/312ADA8 | read-write | primary | primary coord0b | 0/4 | coord0b:5432 | 1: 0/312ADA8 | read-only | secondary | secondary worker1a | 1/1 | worker1a:5432 | 1: 0/311B610 | read-write | primary | primary worker1b | 1/2 | worker1b:5432 | 1: 0/311B610 | read-only | secondary | secondary worker2b | 2/7 | worker2b:5432 | 2: 0/50000D8 | read-only | secondary | secondary worker2a | 2/8 | worker2a:5432 | 2: 0/50000D8 | read-write | primary | primary worker3a | 3/5 | worker3a:5432 | 1: 0/311B648 | read-write | primary | primary worker3b | 3/6 | worker3b:5432 | 1: 0/311B648 | read-only | secondary | secondary Which seen from the Citus coordinator, looks like the following: :: $ docker compose exec coord0a psql -d citus -c 'select * from citus_get_active_worker_nodes()' node_name | node_port -----------+----------- worker1a | 5432 worker3a | 5432 worker2a | 5432 (3 rows) Distribute Data to Workers -------------------------- Let's create a database schema with a single distributed table. :: $ docker compose exec app psql .. code-block:: sql -- in psql CREATE TABLE companies ( id bigserial PRIMARY KEY, name text NOT NULL, image_url text, created_at timestamp without time zone NOT NULL, updated_at timestamp without time zone NOT NULL ); SELECT create_distributed_table('companies', 'id'); Next download and ingest some sample data, still from within our psql session: :: \copy companies from program 'curl -o- https://examples.citusdata.com/mt_ref_arch/companies.csv' with csv # ( COPY 75 ) Handle Worker Failure --------------------- Now we'll intentionally crash a worker's primary node and observe how the pg_auto_failover monitor unregisters that node in the coordinator and registers the secondary instead. :: # the pg_auto_failover keeper process will be unable to resurrect # the worker node if pg_control has been removed $ docker compose exec worker1a rm /tmp/pgaf/global/pg_control # shut it down $ docker compose exec worker1a /usr/lib/postgresql/14/bin/pg_ctl stop -D /tmp/pgaf The keeper will attempt to start worker 1a three times and then report the failure to the monitor, who promotes worker1b to replace worker1a. Citus worker worker1a is unregistered with the coordinator node, and worker1b is registered in its stead. Asking the coordinator for active worker nodes now shows worker1b, worker2a, and worker3a: :: $ docker compose exec app psql -c 'select * from master_get_active_worker_nodes();' node_name | node_port -----------+----------- worker3a | 5432 worker2a | 5432 worker1b | 5432 (3 rows) Finally, verify that all rows of data are still present: :: $ docker compose exec app psql -c 'select count(*) from companies;' count ------- 75 Meanwhile, the keeper on worker 1a heals the node. It runs ``pg_basebackup`` to fetch the current PGDATA from worker1a. This restores, among other things, a new copy of the file we removed. After streaming replication completes, worker1b becomes a full-fledged primary and worker1a its secondary. :: $ docker compose exec monitor pg_autoctl show state Name | Node | Host:Port | TLI: LSN | Connection | Reported State | Assigned State ---------+-------+---------------+----------------+--------------+---------------------+-------------------- coord0a | 0/3 | coord0a:5432 | 1: 0/3178B20 | read-write | primary | primary coord0b | 0/4 | coord0b:5432 | 1: 0/3178B20 | read-only | secondary | secondary worker1a | 1/1 | worker1a:5432 | 2: 0/504C400 | read-only | secondary | secondary worker1b | 1/2 | worker1b:5432 | 2: 0/504C400 | read-write | primary | primary worker2b | 2/7 | worker2b:5432 | 2: 0/50FF048 | read-only | secondary | secondary worker2a | 2/8 | worker2a:5432 | 2: 0/50FF048 | read-write | primary | primary worker3a | 3/5 | worker3a:5432 | 1: 0/31CD8C0 | read-write | primary | primary worker3b | 3/6 | worker3b:5432 | 1: 0/31CD8C0 | read-only | secondary | secondary Handle Coordinator Failure -------------------------- Because our application connection string includes both coordinator hosts with the option ``target_session_attrs=read-write``, the database client will connect to whichever of these servers supports both reads and writes. However if we use the same trick with the pg_control file to crash our primary coordinator, we can watch how the monitor promotes the secondary. :: $ docker compose exec coord0a rm /tmp/pgaf/global/pg_control $ docker compose exec coord0a /usr/lib/postgresql/14/bin/pg_ctl stop -D /tmp/pgaf After some time, coordinator A's keeper heals it, and the cluster converges in this state: :: $ docker compose exec monitor pg_autoctl show state Name | Node | Host:Port | TLI: LSN | Connection | Reported State | Assigned State ---------+-------+---------------+----------------+--------------+---------------------+-------------------- coord0a | 0/3 | coord0a:5432 | 2: 0/50000D8 | read-only | secondary | secondary coord0b | 0/4 | coord0b:5432 | 2: 0/50000D8 | read-write | primary | primary worker1a | 1/1 | worker1a:5432 | 2: 0/504C520 | read-only | secondary | secondary worker1b | 1/2 | worker1b:5432 | 2: 0/504C520 | read-write | primary | primary worker2b | 2/7 | worker2b:5432 | 2: 0/50FF130 | read-only | secondary | secondary worker2a | 2/8 | worker2a:5432 | 2: 0/50FF130 | read-write | primary | primary worker3a | 3/5 | worker3a:5432 | 1: 0/31CD8C0 | read-write | primary | primary worker3b | 3/6 | worker3b:5432 | 1: 0/31CD8C0 | read-only | secondary | secondary We can check that the data is still available through the new coordinator node too: :: $ docker compose exec app psql -c 'select count(*) from companies;' count ------- 75 Cleanup ------- To dispose of the entire tutorial environment, just use the following command: :: $ docker compose down Next steps ---------- As mentioned in the first section of this tutorial, the way we use docker compose here is not meant to be production ready. It's useful to understand and play with a distributed system such as Citus though, and makes it simple to introduce faults and see how the pg_auto_failover High Availability reacts to those faults. One obvious missing element to better test the system is the lack of persistent volumes in our docker compose based test rig. It is possible to create external volumes and use them for each node in the docker compose definition. This allows restarting nodes over the same data set. See the command :ref:`pg_autoctl_do_tmux_compose_session` for more details about how to run a docker compose test environment with docker compose, including external volumes for each node. Now is a good time to go read `Citus Documentation`__ too, so that you know how to use this cluster you just created! __ https://docs.citusdata.com/en/latest/