Keith's Ramblings…

Archive for the ‘replication’ tag

Monitoring Streaming Replica Lag Effectively

with 3 comments

One of the easiest ways to monitor replica lag when using streaming replication is to turn hot standby on your replica and use pg_last_xact_replay_timestamp() and/or the other recovery information functions. Here’s an example query to run on the replica systems to get the number of seconds behind it is:

The issue with this query is that while your replica(s) may be 100% caught up, the time interval being returned is always increasing until new write activity occurs on the primary that the replica can replay. This can cause your monitoring to give false positives that your replica is falling behind if you have things set up to ensure your replicas are no more than a few minutes behind. A side affect of this monitoring query can also give you an indication that writes to your primary have stopped for some reason.

One of our clients has a smaller sized database that doesn’t get quite as much write traffic as our typical clients do. But it still has failover replicas and still needs to be monitored just like our other larger clients to ensure it doesn’t fall too far behind. So, my coworker introduced me to the pg_stat_replication view that was added in PostgreSQL 9.1. Querying this from the primary returns information about streaming replication replicas connected to it.

He also provided a handy query to get back a simple, easy to understand numeric value to indicate replica lag. The issue I ran into using the query is that this view uses pg_stat_activity as one of its sources. If you’re not a superuser, you’re not going to get any statistics on sessions that aren’t your own (and hopefully you’re not using a superuser role as the role for your monitoring solution). So, instead I made a function with SECURITY DEFINER set, made a superuser role the owner, and gave my monitoring role EXECUTE privileges on the function.

Running this query gives back a few handy columns that should be good enough for most monitoring tools. You can easily add more columns from pg_stat_replication or any other tables you need to join against for more info.

UPDATE: If you’re running PostgreSQL 9.2+, there is a new, built-in function that avoids needing the above function all together and can just query pg_stat_replication directly.

Unfortunately, this function still requires superuser privileges to obtain all relevant data and most monitoring tools do not use a superuser role (I hope). So, in that case you do still need a SECURITY DEFINER function, but it can be a much much simpler one.

UPDATE: As of PostgreSQL 10+, there is now a special pg_monitor system role that can be granted to any other role and will allow viewing of specific system metrics without requiring superuser. The following functions fall under that access, so neither superuser nor SECURITY DEFINER are required for replication monitoring anymore! See

This can also be useful to monitor replica lag when you don’t have hot standby turned on for your replicas to allow read-only queries.

Combining both of the replication monitoring solutions mentioned in this post should give you a much better overall picture of the status of your primary/replica systems.

Written by Keith

October 30th, 2013 at 10:51 am

Posted in PostgreSQL

Tagged with , ,

Mimeo – Repulling Incremental Replication Data

without comments

With other per-table replication methods, if the source and destination become out of sync, you typically have to repull the entire table. One of the nice things about using an incremental based replication method (based on incrementing time or serial number; see previous blog post) is that it can make repulling smaller batches of that data much easier.

One of our clients had been having some issues with their website hits tracking table. Some of the hits had been missed via the normal tracking method and the had to be re-obtained via other means and re-inserted into the hits tracking table on production. This table is also replicated to a data warehouse system for reporting. Since this table uses incremental replication based on time, the old data that was reinserted to the source with the old timestamp values would never make it to the reporting database on its own.

All of mimeo’s refresh functions have a p_repull boolean parameter that can be set to true and have it purge the destination table and repull all the data from the source. But the incremental refresh functions have two additional parameters: p_repull_start & p_repull_end. Right now I’m only supporting time-based incremental replication, so both of these values are timestamps. They let you set a starting and/or ending value for a block of data that you’d like purged on the destination and repulled from the source. If one or the other is left off, it just sets a boundary for the start or end and gets everything before or after the timestamps set. For very large tables (which most inserter/updater tables seem to be from my experience working on this tool) this can be a gigantic time-saver for getting the source and destination tables back in sync. If you do use this, just keep in mind that these start and end times are exclusive (< & >, not <= & >=).

Here is an example of it in use. I also set the p_debug option so I can follow, in real-time, the repull process. This information is also available via pg_jobmon, with the number of rows done kept up to date in the details log table as it runs. The data missing was between April 8th and 11th, so I set the start and end days a few minutes before and after each day just to make sure I got everything.


Written by Keith

April 16th, 2013 at 2:34 pm

Mimeo – DML Replication

without comments

For the last introductory post about mimeo, I’ll be talking about DML replication (previous posts here and here). This is the most common way replication is done on a per table basis (at least that I’ve seen). Typically, a trigger is placed on the source table that tracks all changes (INSERTS, UPDATES & DELETES) and then some mechanism is used to replay those statements on the destination.

For mimeo, this is done with a queue table that just contains the primary key columns to note that a change was done to that row. The trigger places the primary key values into a queue table (also located on the source system) and then mimeo reads the queue table values to replay them on the destination. Saying that the statements are just replayed on the destination is really simplifying things though. While that is technically a legitimate way to replicate table changes, it is far from the most efficient. What mimeo actually does is

  1. Grab all queue table values, using a SELECT DISTINCT to only get a single copy of each row changed (since multiple changes to the same row put the same value into the queue table multiple times).
  2. Grab the full row from the source using the primary key to get the most recent values.
  3. Perform a DELETE … USING … command, removing all rows from the destination table that have a matching primary key value in the queue table.
  4. INSERT full rows from step 2 into the destination table.
  5. Clear the processed rows from the queue table.

This method is much more efficient because

  • Even if a row is updated 100,000 times between refresh runs, only one update is ever run on the destination with the latest value of that row.
  • And since all rows that were changed are deleted from the destination, this avoids having to check if something was actually an update or a delete. If it was an update, it will be reinserted from the queue. If it was deleted, it won’t be in the queue for insert.

And since this is all done in a single transaction on the destination, the result appears exactly the same as if the statements had actually been replayed as they happened on the source.

Mimeo also has a specialized DML replication method that can be useful in a data warehousing environment. One common need is to preserve deleted rows, but not track every single update done to a row. Just the last value that row has needs to be kept for archive purposes. The log deletion (logdel) replication method can provide this. It basically uses the same method as the normal DML above, but the trigger & queue table on the source are a little different. The queue table has the same columns as the source table as well as an extra timestamp column that records when a row was deleted. For an insert or update, just the primary key values are stored in the queue table, but for deletes the entire row gets stored. The replication steps are pretty much the same as DML except there’s an extra one to insert the deleted rows. And the destination table has an extra timestamp column as well to record when that row was deleted on the source.

So that’s basically how mimeo dml replication works. The dml/logdel maker functions take care of setting up the source table triggers, trigger functions, and queue tables for you as long as you’ve got the permissions set properly. The rest of the replication methods also have maker & destroyer functions to make setup and tear down easier. I’ll have further blog posts with some tips and use cases of how we’ve put mimeo to use for our clients. If you’ve got any questions or suggestions, please feel free to post here, on github or poke me on freenode IRC in #postgresql.

Written by Keith

March 15th, 2013 at 1:25 pm

Mimeo – Incremental Replication

with 7 comments

Continuing from my introductory post to mimeo, I’ll now discuss one of the methods that’s more unique to this replication tool.

If a table happens to have a timestamp column that is set on every insert or update of a row, the incremental replication methods can be used. This simply uses that column to track where in the replication process it left off each time it is run. There’s one for insert-only tables and another that can handle if that column is also set on every row update. While the insert-only one requires no primary/unique key, the updater one does. This method does not replicate any row deletions.

This means that, just like the snapshot method, only select privileges and no triggers are required on the source table. This method is ideal for insert-only, high-transaction tables such as one tracking hits on a website. Adding a trigger to track the changes to such a table for replication could place a lot of extra load of your front-end production systems.

Since this method was introduced in our environment, we’ve had to deal with several edge cases. One of the first was when the rows were just grabbed by getting everything larger than the last recorded timestamp. This runs into issues when the source table hasn’t stopped inserting rows for its latest timestamp value. Since that value is used for the next batch’s lower boundary and it thinks it has all rows matching that timestamp, it may miss some the next batch. So a boundary interval was introduced. This sets the batch’s upper limit to be less than a given interval. For example, say the last recorded timestamp on the destination was 2013-02-18 12:30:00 and mimeo runs the same day at 13:30:00. With a 10 minute upper boundary (mimeo’s default), this would get all rows with values > 2013-02-18 12:30:00 and < 2013-02-18 13:20:00. If rows are constantly being inserted, this does mean the destination is always 10 minutes behind. But it also ensures that no rows are ever missed. The boundary interval is required to enforce data integrity, but it is also configurable on a per table basis.

Another issue along the same lines has to do with when a limit is set on how many rows are pulled each batch. If the maximum number of rows in a batch is pulled, the upper boundary could be cut off in the middle of any timestamp value, not just the latest values being inserted. This is handled by always removing the highest value from the batch when the maximum number is pulled, delaying it to being pulled the next run. Bigger issues occur when the batch contains timestamp values that are all the same. There is no way to ensure a consistent pull of data from the source in this case. So if this issue is encountered, mimeo just cancels that batch entirely. To fix it, you must either remove the batch limit or set it to a high enough value that it can pull data with at least two different timestamp values. The internal logging & monitoring system (pg_jobmon) sets off a critical alert if this occurs so you will know if it happens. Before v0.10.0 a batch limit was always used, so this was a bigger concern then. Since that version, they are no longer turned on by default, but I left the option available. So if you use that option, just be aware of these limitations.

And lastly, probably be biggest cause of issues with time-based replication is daylight savings. When time changes, you chance losing data on your destination table, especially when setting the clock back since it thinks it already got that data. Now is a good time to mention that if you can run your databases on a system that uses UTC/GMT time, you can make a lot of time-based issues with storing data go away. But if that’s not possible, and you want to use this replication method, there are some configuration options available. When you set up incremental replication using mimeo’s maker functions, it checks to see what timezone the database is running in. If you’re in UTC/GMT, you’ve got nothing to worry about. If you’re not, mimeo sets a flag to turn off replication around the time change. I haven’t found any better (ie, less outrageously complex) solutions to this other than just completely stopping replication for that time period. By default it turns off replication between 12:30am and 02:30am. If you need to narrow down or change that time period, the start & end times are configurable.

So at first, basing replication on a timestamp seems like it wouldn’t be too complex an issue. But as always, things are never as simple as they may seem to be.

Written by Keith

February 18th, 2013 at 12:41 pm

Mimeo – A per-table replication extension for PostgreSQL

with 3 comments

One of the biggest projects I’ve been working on the last few months is an extension that came about trying to organize a per-table replication process that has been in use with several of our clients, but never really formalized. After nearly 300 hours of time logged working on it, and mentioning it several times in other blog posts, I figured it’s about time I talk more about it.

I got the name mimeo after searching a thesaurus for words similar to “copy” & “replicate” and came across a mimeograph. The terms “low cost” and copying “small quantities” seemed to fit with the theme of what I was trying to accomplish, so the name stuck. There are some other great add-ons for PostgreSQL that allow per-table replication (Bucardo being the other one I’m more familiar with), but their setup and use can be a bit daunting. And if you just need a few tables copied, a bit overkill. The goal with mimeo was to keep the installation, maintenance and monitoring as simple as possible. Honestly, the hardest part of the extension I’ve found, and had others report the same to me, is just managing the permissions. I’ve got some plans to make some of it easier, but the extension doesn’t assume or require any superuser privileges, which I think is another plus.

The existing code I was working off of had several different replication methods that were used, and each had their own merits. I’ll be covering them over several blog posts, along with some general tips on usage, since I think a single blog post discussing the entire thing would be a bit much. I’ve already done some pretty extensive documentation and even written a howto, so these blog posts will mostly be informational and not really focused too much on the entire setup and maintenance process.

Before I get into too much detail, I just have to thank the PostgreSQL team for getting the extension system into place with 9.1. The only reason I was able to even come close to organizing the existing processes into something formal like this was because of that. The versioning control of a group of objects within the database allows a much smoother upgrading (and downgrading) process. And also a big thanks to David Wheeler for the pgTAP suite. If you look in the tests folder you can see I made extensive use of it and it’s saved me a tremendous amount of development time (especially with the latest v0.10.0 release where I rewrote much of the refresh process). When dealing with data replication you want to be really sure you’re getting every bit of data across 100% of the time and not breaking anything as development progresses.

Now onto the details. The most basic way to copy a table from one database to another is to just grab the entire thing. That’s covered in mimeo with the snapshot replication method. A table setup with this method will have the entire contents refreshed every time it is run. To help make this processes more transparent to a user of these tables, a view with two underlying tables is used. The view only ever points to one table at a time. When the refresh runs, the table it’s not pointing to is truncated and refreshed. A brief lock is then taken to swap the view to that new table. If you’ve got to refresh a rather larger table, the lock that a truncate takes would make the table unusable during that time. And doing a delete instead of a truncate could lead to some very heavy bloat in addition to the locking. The view swapping minimizes both of these issues.

Some additions to the snapshot process I was able to make were to allow the automatic propagation of column changes. This includes adding & dropping columns as well as type changes. Indexes are also copied over as well at creation time (this will be optional soon). You shouldn’t need to worry about constraints on the destination end since that should be being controlled on the source table. That would make the replication needlessly take longer. Since the table is re-created from scratch on a column change, there is an extra configuration option to run some additional commands such as setting permissions on the view/table. This can also be handled with the default privileges settings that were added in 9.0, but the extension option allows you to control them for each individually replicated table.

The snapshot method is ideal for smaller tables. It’s also much more efficient than the DML replication method (replaying every insert/update/delete, which I’ll be covering in a later post) if almost the entire table is updated in the span of time between table refreshes. One of the things I am currently working on is a way to just skip the refresh process if nothing has changed on the source. This would then make the snapshot process ideal for static tables or ones that rarely ever change.

That’s all for this post. We’re using this extension in some of our production environments already, so I’m confident in the code even though it’s not “1.0”. Would appreciate any feedback to speed up my decision to make such a stable release.

My next post will cover the incremental replication methods.

Written by Keith

February 7th, 2013 at 11:33 am