Keith's Ramblings…

Monitoring Streaming Replica Lag Effectively


One of the easiest ways to monitor replica lag when using streaming replication is to turn on hot standby on your replica and use pg_last_xact_replay_timestamp() and/or the other recovery information functions. A query run on the replica that subtracts that timestamp from now() gives the number of seconds the replica is behind.
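A minimal sketch of such a query, meant to be run on the replica itself. The column alias replica_lag_seconds is just an illustrative name:

```sql
-- Run on the replica (hot standby must be on so it accepts read-only queries).
-- pg_last_xact_replay_timestamp() is the commit time of the last transaction
-- replayed during recovery; it is NULL if nothing has been replayed yet.
SELECT extract(epoch FROM now() - pg_last_xact_replay_timestamp()) AS replica_lag_seconds;
```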

The issue with this query is that even when your replica is 100% caught up, the interval it returns keeps growing until new write activity occurs on the primary for the replica to replay. If your monitoring is set up to ensure your replicas are no more than a few minutes behind, this can produce false positives that a replica is falling behind. As a side effect, this monitoring query can also give you an indication that writes to your primary have stopped for some reason.

One of our clients has a smaller database that doesn’t get quite as much write traffic as our typical clients do. But it still has failover replicas, and they still need to be monitored just like those of our larger clients to ensure they don’t fall too far behind. So, my coworker introduced me to the pg_stat_replication view that was added in PostgreSQL 9.1. Querying this from the primary returns information about the streaming replication replicas connected to it.

He also provided a handy query to get back a simple, easy-to-understand numeric value to indicate replica lag. The issue I ran into using the query is that this view uses pg_stat_activity as one of its sources. If you’re not a superuser, you’re not going to get any statistics on sessions that aren’t your own (and hopefully you’re not using a superuser role as the role for your monitoring solution). So, instead I made a function with SECURITY DEFINER set, made a superuser role the owner, and gave my monitoring role EXECUTE privileges on the function.
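Here is a rough sketch of what such a function could look like on 9.1. This is not necessarily the exact function described above. The function name replica_byte_lag and the role name monitoring are placeholders, and the byte math assumes the pre-9.3 WAL layout, where each increment of the first half of a location is worth 0xFF000000 (255 * 16^6) bytes:

```sql
-- Create this as a superuser so that the superuser owns the function.
CREATE OR REPLACE FUNCTION replica_byte_lag()
RETURNS TABLE (client_hostname text, client_addr inet, byte_lag bigint)
LANGUAGE sql
SECURITY DEFINER
SET search_path = pg_catalog, pg_temp
AS $$
    SELECT s.client_hostname,
           s.client_addr,
           -- 4278190080 = 0xFF000000, bytes per xlog id before 9.3 (assumption)
           (s.sent_xlog - s.replay_xlog) * 4278190080
               + (s.sent_offset - s.replay_offset) AS byte_lag
    FROM (
        -- Locations look like '2/A1B2C3D4': two hex numbers separated by '/'.
        -- Left-pad each half to 8 hex digits and cast through bit(32) to bigint.
        SELECT r.client_hostname,
               r.client_addr,
               ('x' || lpad(split_part(r.sent_location,   '/', 1), 8, '0'))::bit(32)::bigint AS sent_xlog,
               ('x' || lpad(split_part(r.replay_location, '/', 1), 8, '0'))::bit(32)::bigint AS replay_xlog,
               ('x' || lpad(split_part(r.sent_location,   '/', 2), 8, '0'))::bit(32)::bigint AS sent_offset,
               ('x' || lpad(split_part(r.replay_location, '/', 2), 8, '0'))::bit(32)::bigint AS replay_offset
        FROM pg_stat_replication r
    ) AS s;
$$;

-- Only the monitoring role (placeholder name) should be able to call it.
REVOKE ALL ON FUNCTION replica_byte_lag() FROM PUBLIC;
GRANT EXECUTE ON FUNCTION replica_byte_lag() TO monitoring;

-- Then, connected to the primary as the monitoring role:
SELECT * FROM replica_byte_lag();
```

Setting search_path on a SECURITY DEFINER function is a general precaution so the function cannot be tricked into resolving objects from a schema the caller controls.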

Running this query gives back a few handy columns that should be good enough for most monitoring tools. You can easily add more columns from pg_stat_replication or any other tables you need to join against for more info.

UPDATE: If you’re running PostgreSQL 9.2+, there is a new, built-in function that avoids needing the above function altogether, so you can just query pg_stat_replication directly.
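The built-in function in question is pg_xlog_location_diff(), which returns the difference between two WAL locations in bytes. A minimal sketch of a direct query for 9.2 through 9.6, run on the primary (the byte_lag alias is only illustrative):

```sql
-- Run on the primary. One row comes back per connected streaming replica.
SELECT client_hostname,
       client_addr,
       pg_xlog_location_diff(sent_location, replay_location) AS byte_lag
FROM pg_stat_replication;
```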

Unfortunately, this function still requires superuser privileges to obtain all relevant data and most monitoring tools do not use a superuser role (I hope). So, in that case you do still need a SECURITY DEFINER function, but it can be a much simpler one.
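A sketch of what the simpler wrapper might look like on 9.2 through 9.6, assuming pg_xlog_location_diff() does the byte math. The function name streaming_replica_lag and the role name monitoring are placeholders:

```sql
-- Create this as a superuser so that the superuser owns the function.
CREATE OR REPLACE FUNCTION streaming_replica_lag()
RETURNS TABLE (client_hostname text, client_addr inet, byte_lag numeric)
LANGUAGE sql
SECURITY DEFINER
SET search_path = pg_catalog, pg_temp
AS $$
    SELECT r.client_hostname,
           r.client_addr,
           pg_xlog_location_diff(r.sent_location, r.replay_location) AS byte_lag
    FROM pg_stat_replication r;
$$;

-- Only the monitoring role (placeholder name) should be able to call it.
REVOKE ALL ON FUNCTION streaming_replica_lag() FROM PUBLIC;
GRANT EXECUTE ON FUNCTION streaming_replica_lag() TO monitoring;
```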

UPDATE: As of PostgreSQL 10, there is now a special pg_monitor system role that can be granted to any other role and will allow viewing of specific system metrics without requiring superuser. The following functions fall under that access, so neither superuser nor SECURITY DEFINER is required for replication monitoring anymore! See
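A minimal sketch for PostgreSQL 10 and later, where the xlog names were changed to wal and the location columns to lsn. The role name monitoring is a placeholder:

```sql
-- As a superuser, once:
GRANT pg_monitor TO monitoring;

-- Then, connected to the primary as the monitoring role:
SELECT client_hostname,
       client_addr,
       pg_wal_lsn_diff(sent_lsn, replay_lsn) AS byte_lag,
       replay_lag  -- time-based lag column, also added in version 10
FROM pg_stat_replication;
```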

This can also be useful to monitor replica lag when you don’t have hot standby turned on for your replicas to allow read-only queries.

Combining both of the replication monitoring solutions mentioned in this post should give you a much better overall picture of the status of your primary/replica systems.

Also published on Medium.

Written by Keith

October 30th, 2013 at 10:51 am

Posted in PostgreSQL
