Mimeo - Incremental Replication

2013-02-18 796 words 4 minutes

Contents

Continuing from my introductory post to mimeo, I’ll now discuss one of the methods that’s more unique to this replication tool.

If a table happens to have a timestamp column that is set on every insert or update of a row, the incremental replication methods can be used. This simply uses that column to track where in the replication process it left off each time it is run. There’s one for insert-only tables and another that can handle if that column is also set on every row update. While the insert-only one requires no primary/unique key, the updater one does. This method does not replicate any row deletions.

This means that, just like the snapshot method, only select privileges and no triggers are required on the source table. This method is ideal for insert-only, high-transaction tables such as one tracking hits on a website. Adding a trigger to track the changes to such a table for replication could place a lot of extra load of your front-end production systems.

Since this method was introduced in our environment, we’ve had to deal with several edge cases. One of the first was when the rows were just grabbed by getting everything larger than the last recorded timestamp. This runs into issues when the source table hasn’t stopped inserting rows for its latest timestamp value. Since that value is used for the next batch’s lower boundary and it thinks it has all rows matching that timestamp, it may miss some the next batch. So a boundary interval was introduced. This sets the batch’s upper limit to be less than a given interval. For example, say the last recorded timestamp on the destination was 2013-02-18 12:30:00 and mimeo runs the same day at 13:30:00. With a 10 minute upper boundary (mimeo’s default), this would get all rows with values > 2013-02-18 12:30:00 and < 2013-02-18 13:20:00. If rows are constantly being inserted, this does mean the destination is always 10 minutes behind. But it also ensures that no rows are ever missed. The boundary interval is required to enforce data integrity, but it is also configurable on a per table basis.

Another issue along the same lines has to do with when a limit is set on how many rows are pulled each batch. If the maximum number of rows in a batch is pulled, the upper boundary could be cut off in the middle of any timestamp value, not just the latest values being inserted. This is handled by always removing the highest value from the batch when the maximum number is pulled, delaying it to being pulled the next run. Bigger issues occur when the batch contains timestamp values that are all the same. There is no way to ensure a consistent pull of data from the source in this case. So if this issue is encountered, mimeo just cancels that batch entirely. To fix it, you must either remove the batch limit or set it to a high enough value that it can pull data with at least two different timestamp values. The internal logging & monitoring system (pg_jobmon) sets off a critical alert if this occurs so you will know if it happens. Before v0.10.0 a batch limit was always used, so this was a bigger concern then. Since that version, they are no longer turned on by default, but I left the option available. So if you use that option, just be aware of these limitations.

And lastly, probably be biggest cause of issues with time-based replication is daylight savings. When time changes, you chance losing data on your destination table, especially when setting the clock back since it thinks it already got that data. Now is a good time to mention that if you can run your databases on a system that uses UTC/GMT time, you can make a lot of time-based issues with storing data go away. But if that’s not possible, and you want to use this replication method, there are some configuration options available. When you set up incremental replication using mimeo’s maker functions, it checks to see what timezone the database is running in. If you’re in UTC/GMT, you’ve got nothing to worry about. If you’re not, mimeo sets a flag to turn off replication around the time change. I haven’t found any better (ie, less outrageously complex) solutions to this other than just completely stopping replication for that time period. By default it turns off replication between 12:30am and 02:30am. If you need to narrow down or change that time period, the start & end times are configurable.

So at first, basing replication on a timestamp seems like it wouldn’t be too complex an issue. But as always, things are never as simple as they may seem to be.