Keith's Ramblings…

WARNING: If accidentally read, induce vomiting

Mimeo – Repulling Incremental Replication Data

without comments

With other per-table replication methods, if the source and destination become out of sync, you typically have to repull the entire table. One of the nice things about using an incremental based replication method (based on incrementing time or serial number; see previous blog post) is that it can make repulling smaller batches of that data much easier.

One of our clients had been having some issues with their website hits tracking table. Some of the hits had been missed via the normal tracking method and the had to be re-obtained via other means and re-inserted into the hits tracking table on production. This table is also replicated to a data warehouse system for reporting. Since this table uses incremental replication based on time, the old data that was reinserted to the source with the old timestamp values would never make it to the reporting database on its own.

All of mimeo’s refresh functions have a p_repull boolean parameter that can be set to true and have it purge the destination table and repull all the data from the source. But the incremental refresh functions have two additional parameters: p_repull_start & p_repull_end. Right now I’m only supporting time-based incremental replication, so both of these values are timestamps. They let you set a starting and/or ending value for a block of data that you’d like purged on the destination and repulled from the source. If one or the other is left off, it just sets a boundary for the start or end and gets everything before or after the timestamps set. For very large tables (which most inserter/updater tables seem to be from my experience working on this tool) this can be a gigantic time-saver for getting the source and destination tables back in sync. If you do use this, just keep in mind that these start and end times are exclusive (< & >, not <= & >=).

Here is an example of it in use. I also set the p_debug option so I can follow, in real-time, the repull process. This information is also available via pg_jobmon, with the number of rows done kept up to date in the details log table as it runs. The data missing was between April 8th and 11th, so I set the start and end days a few minutes before and after each day just to make sure I got everything.


Written by Keith

April 16th, 2013 at 2:34 pm