Mimeo – Repulling Incremental Replication Data

2013-04-16 801 words 4 minutes

Contents

With other per-table replication methods, if the source and destination become out of sync, you typically have to repull the entire table. One of the nice things about using an incremental based replication method (based on incrementing time or serial number; see previous blog post) is that it can make repulling smaller batches of that data much easier.

One of our clients had been having some issues with their website hits tracking table. Some of the hits had been missed via the normal tracking method and the had to be re-obtained via other means and re-inserted into the hits tracking table on production. This table is also replicated to a data warehouse system for reporting. Since this table uses incremental replication based on time, the old data that was reinserted to the source with the old timestamp values would never make it to the reporting database on its own.

All of mimeo’s refresh functions have a p_repull boolean parameter that can be set to true and have it purge the destination table and repull all the data from the source. But the incremental refresh functions have two additional parameters: p_repull_start & p_repull_end. Right now I’m only supporting time-based incremental replication, so both of these values are timestamps. They let you set a starting and/or ending value for a block of data that you’d like purged on the destination and repulled from the source. If one or the other is left off, it just sets a boundary for the start or end and gets everything before or after the timestamps set. For very large tables (which most inserter/updater tables seem to be from my experience working on this tool) this can be a gigantic time-saver for getting the source and destination tables back in sync. If you do use this, just keep in mind that these start and end times are exclusive (< & >, not <= & >=).

Here is an example of it in use. I also set the p_debug option so I can follow, in real-time, the repull process. This information is also available via pg_jobmon, with the number of rows done kept up to date in the details log table as it runs. The data missing was between April 8th and 11th, so I set the start and end days a few minutes before and after each day just to make sure I got everything.

somedb=# select mimeo.refresh_inserter('ods_tracking.hits', p_repull := true, p_repull_start := '2013-04-07 23:55:00', p_repull_end := '2013-04-11 00:05:00', p_debug := true);
NOTICE:  Job ID: 430449
CONTEXT:  SQL statement "SELECT gdb(p_debug,'Job ID: '||v_job_id::text)"
PL/pgSQL function mimeo.refresh_inserter(text,integer,boolean,text,text,boolean) line 70 at PERFORM
NOTICE:  Request to repull data from 2013-04-07 23:55:00 to 2013-04-11 00:05:00
CONTEXT:  SQL statement "SELECT gdb(p_debug,'Request to repull data from '||COALESCE(p_repull_start, '-infinity')||' to '||COALESCE(p_repull_end, 'infinity'))"
PL/pgSQL function mimeo.refresh_inserter(text,integer,boolean,text,text,boolean) line 149 at PERFORM
NOTICE:  Deleting current, local data: DELETE FROM ods_tracking.hits WHERE hitdate &gt; '2013-04-07 23:55:00' AND hitdate &lt; '2013-04-11 00:05:00'
CONTEXT:  SQL statement "SELECT gdb(p_debug,'Deleting current, local data: '||v_delete_sql)"
PL/pgSQL function mimeo.refresh_inserter(text,integer,boolean,text,text,boolean) line 162 at PERFORM
NOTICE:  SELECT partner,ipaddress,hitdate,userid,affiliate_id FROM tracking.hits WHERE hitdate &gt; '2013-04-07 23:55:00' AND hitdate &lt; '2013-04-11 00:05:00'
CONTEXT:  SQL statement "SELECT gdb(p_debug,v_remote_sql)"
PL/pgSQL function mimeo.refresh_inserter(text,integer,boolean,text,text,boolean) line 184 at PERFORM
NOTICE:  Fetching rows in batches: 50000 done so far. Last fetched: 2013-04-08 00:34:02-04
CONTEXT:  SQL statement "SELECT gdb(p_debug,'Fetching rows in batches: '||v_total||' done so far. Last fetched: '||v_last_fetched)"
PL/pgSQL function mimeo.refresh_inserter(text,integer,boolean,text,text,boolean) line 201 at PERFORM
NOTICE:  Fetching rows in batches: 100000 done so far. Last fetched: 2013-04-08 01:19:08-04
CONTEXT:  SQL statement "SELECT gdb(p_debug,'Fetching rows in batches: '||v_total||' done so far. Last fetched: '||v_last_fetched)"
PL/pgSQL function mimeo.refresh_inserter(text,integer,boolean,text,text,boolean) line 201 at PERFORM
NOTICE:  Fetching rows in batches: 150000 done so far. Last fetched: 2013-04-08 02:10:44-04
CONTEXT:  SQL statement "SELECT gdb(p_debug,'Fetching rows in batches: '||v_total||' done so far. Last fetched: '||v_last_fetched)"
PL/pgSQL function mimeo.refresh_inserter(text,integer,boolean,text,text,boolean) line 201 at PERFORM
NOTICE:  Fetching rows in batches: 200000 done so far. Last fetched: 2013-04-08 03:07:14-04
CONTEXT:  SQL statement "SELECT gdb(p_debug,'Fetching rows in batches: '||v_total||' done so far. Last fetched: '||v_last_fetched)"
PL/pgSQL function mimeo.refresh_inserter(text,integer,boolean,text,text,boolean) line 201 at PERFORM

[...]

NOTICE:  Fetching rows in batches: 5600000 done so far. Last fetched: 2013-04-10 23:59:59-04
CONTEXT:  SQL statement "SELECT gdb(p_debug,'Fetching rows in batches: '||v_total||' done so far. Last fetched: '||v_last_fetched)"
PL/pgSQL function mimeo.refresh_inserter(text,integer,boolean,text,text,boolean) line 201 at PERFORM
NOTICE:  Fetching rows in batches: 5650000 done so far. Last fetched: 2013-04-10 23:59:59-04
CONTEXT:  SQL statement "SELECT gdb(p_debug,'Fetching rows in batches: '||v_total||' done so far. Last fetched: '||v_last_fetched)"
PL/pgSQL function mimeo.refresh_inserter(text,integer,boolean,text,text,boolean) line 201 at PERFORM
NOTICE:  Fetching rows in batches: 5687611 done so far. Last fetched: 2013-04-11 00:04:59-04
CONTEXT:  SQL statement "SELECT gdb(p_debug,'Fetching rows in batches: '||v_total||' done so far. Last fetched: '||v_last_fetched)"
PL/pgSQL function mimeo.refresh_inserter(text,integer,boolean,text,text,boolean) line 201 at PERFORM
NOTICE:  Lower boundary value is: 2013-04-16 11:00:01-04
CONTEXT:  SQL statement "SELECT gdb(p_debug, 'Lower boundary value is: '||coalesce(v_last_value, CURRENT_TIMESTAMP))"
PL/pgSQL function mimeo.refresh_inserter(text,integer,boolean,text,text,boolean) line 250 at PERFORM
 refresh_inserter 
------------------

(1 row)