Having configured PostgreSQL 9.3 master/slave replication from bits and pieces of documentation on the Internet, I feel like a post is in order documenting my experience for others.
I am not going to get into too much detail here. Installation instructions are abundant and are slightly different by operating system. I am assuming you are already past this stage.
Before we can enable replication, we need to lay some groundwork. First, we need a replication user. On the “master” host, log on as the user running PostgreSQL (typically postgres) and do this:
psql -c "CREATE USER replication REPLICATION LOGIN CONNECTION LIMIT 1 ENCRYPTED PASSWORD 'replication';"
You may select a more secure password if you wish. Now edit pg_hba.conf and add the following lines:
host replication replication 184.108.40.206/32 md5 # This is your master CIDR/IP
host replication replication 220.127.116.11/32 md5 # This is your slave CIDR/IP
Next, in postgresql.conf, modify the following lines to enable replication:
wal_level = hot_standby
max_wal_senders = 5      # Recommended to allocate 5 per slave
hot_standby = on
wal_keep_segments = 1000 # Each segment is 16 megs.
Some explanation is in order. When I first configured replication, I set wal_keep_segments = 8. That was woefully inadequate, and needless to say the slave could not keep up with replication. After some investigation, I decided to raise it to the maximum I could afford in terms of storage and then monitor the backlog.
After you enable replication on the master, it needs to be restarted.
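How you restart depends on your platform; here are a couple of common forms (the service name and data directory below are assumptions, adjust them for your install):

```
# Debian/Ubuntu style:
sudo service postgresql restart

# or directly, as the postgres user, pointing at your data directory:
pg_ctl -D /var/lib/postgresql/9.3/main restart
```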
As per this Stack Overflow post, you may also want to configure the following:
When the slave can’t pull a WAL segment directly from the master, it will attempt to use the restore_command to load it. You can configure the slave to automatically remove old segments from the archive using the archive_cleanup_command setting.
# on master
archive_mode = on
archive_command = 'cp %p /path_to/archive/%f'

# on slave
restore_command = 'cp /path_to/archive/%f "%p"'
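The archive_cleanup_command mentioned above goes on the slave as well; a minimal sketch using the pg_archivecleanup utility that ships with PostgreSQL (the archive path is the same placeholder as above):

```
# on slave
archive_cleanup_command = 'pg_archivecleanup /path_to/archive %r'
```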
I only learned of this feature recently and I have not explored it in great detail. Fortunately, the size of my master data set is reasonable and the network connection between master and slave is fast enough that I can rebuild a failed master or slave in a matter of minutes. For now, I am leaving exploration of archiving for later and I will update this post.
Before you proceed, the slave host must be able to connect to the master. You may want to use the telnet command to make sure you can connect to the right port. I am leaving that as an exercise for the reader.
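For example, from the slave (the IP here is the master placeholder used below; substitute your own, and 5432 assumes the default PostgreSQL port):

```
telnet 18.104.22.168 5432
# or, if telnet is not installed:
nc -zv 18.104.22.168 5432
```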
Before the slave is operational, you need to replicate the initial database. Make sure the PostgreSQL process on the slave is not running before you proceed, as it may result in a corrupted initial backup. Log on to the slave machine as the postgres user, go into the PostgreSQL data directory, and execute the following:
pg_basebackup -h 18.104.22.168 -D . --username=replication --password
This may run for some time depending on the size of your data set. Remember how we agreed above that 18.104.22.168 is your master IP; replace it with the right numbers.
Now, let’s enable replication. On the slave, edit recovery.conf in the data directory. This file probably does not exist yet, so you will create a blank file that should look like the following:
standby_mode = 'on'
primary_conninfo = 'host=126.96.36.199 port=5432 user=replication password=replication'
trigger_file = '/tmp/postgresql.trigger.5432'
Remember to replace the host with the IP of your master and replication with the password you’ve actually chosen.
trigger_file is just a file the slave will watch for. Should your master fail and you need to activate the slave as a new master, you simply need to create that file. Immediately, the slave will assume the master is dead and activate itself. We will discuss that topic in a moment.
As the postgres user on the master, using psql, run:

CREATE TABLE TEST (test VARCHAR(40));
INSERT INTO test VALUES ('bar');
INSERT INTO test VALUES ('baz');
INSERT INTO test VALUES ('bat');
As the postgres user on the slave, using psql, run:

SELECT * FROM TEST;
You should see bar, baz, and bat, just like you inserted them on the master.
You can tell how much data is pending to be sent to the slave by running this query against the master:
SELECT application_name, client_addr, pg_size_pretty(pg_xlog_location_diff(pg_current_xlog_location(), sent_location)) FROM pg_stat_replication;
You can tell how far behind the slave is by running this query against the slave:
SELECT now() - pg_last_xact_replay_timestamp() AS replication_delay;
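You can also ask a node directly whether it is currently acting as a standby; pg_is_in_recovery() returns t on a slave and f on a master:

```
SELECT pg_is_in_recovery();
```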
How failover and recovery work
The way failover works is this:
- Master failure is detected.
- Trigger file on the slave is created. This can be automated, but there appears to be no standard mechanism for this.
- The slave is now the new master.
- Your application needs to be aware that the configuration has changed and must switch to using the new master. Unfortunately, the standard JDBC driver does not support this. I personally think that failover is a sensitive activity and there is no generic scenario for it.
- Your old failed master is now the new slave. You must rebuild it by performing pg_basebackup as if you were configuring a new slave.
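The steps above can be sketched as a minimal script. The trigger file path matches the recovery.conf example above; everything else about your environment (monitoring, connection switching) is assumed to happen outside this script:

```shell
#!/bin/sh
# Minimal manual-failover sketch; assumes the trigger_file path
# configured in recovery.conf above.
TRIGGER_FILE=/tmp/postgresql.trigger.5432

# 1. Master failure has been detected (by your monitoring, not this script).
# 2. Create the trigger file; the slave sees it and promotes itself.
touch "$TRIGGER_FILE"
echo "Trigger file created at $TRIGGER_FILE; slave will promote itself."

# 3. Repoint your application at the new master (connection strings, DNS, ...).
# 4. Rebuild the old failed master as a slave with pg_basebackup, as above.
```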