Keith's Ramblings…

WARNING: If accidentally read, induce vomiting

Mimeo – Incremental Replication

with 7 comments

Continuing from my introductory post to mimeo, I’ll now discuss one of the methods that’s more unique to this replication tool.

If a table happens to have a timestamp column that is set on every insert or update of a row, the incremental replication methods can be used. This simply uses that column to track where in the replication process it left off each time it is run. There’s one for insert-only tables and another that can handle if that column is also set on every row update. While the insert-only one requires no primary/unique key, the updater one does. This method does not replicate any row deletions.

This means that, just like the snapshot method, only select privileges and no triggers are required on the source table. This method is ideal for insert-only, high-transaction tables such as one tracking hits on a website. Adding a trigger to track the changes to such a table for replication could place a lot of extra load of your front-end production systems.

Since this method was introduced in our environment, we’ve had to deal with several edge cases. One of the first was when the rows were just grabbed by getting everything larger than the last recorded timestamp. This runs into issues when the source table hasn’t stopped inserting rows for its latest timestamp value. Since that value is used for the next batch’s lower boundary and it thinks it has all rows matching that timestamp, it may miss some the next batch. So a boundary interval was introduced. This sets the batch’s upper limit to be less than a given interval. For example, say the last recorded timestamp on the destination was 2013-02-18 12:30:00 and mimeo runs the same day at 13:30:00. With a 10 minute upper boundary (mimeo’s default), this would get all rows with values > 2013-02-18 12:30:00 and < 2013-02-18 13:20:00. If rows are constantly being inserted, this does mean the destination is always 10 minutes behind. But it also ensures that no rows are ever missed. The boundary interval is required to enforce data integrity, but it is also configurable on a per table basis.

Another issue along the same lines has to do with when a limit is set on how many rows are pulled each batch. If the maximum number of rows in a batch is pulled, the upper boundary could be cut off in the middle of any timestamp value, not just the latest values being inserted. This is handled by always removing the highest value from the batch when the maximum number is pulled, delaying it to being pulled the next run. Bigger issues occur when the batch contains timestamp values that are all the same. There is no way to ensure a consistent pull of data from the source in this case. So if this issue is encountered, mimeo just cancels that batch entirely. To fix it, you must either remove the batch limit or set it to a high enough value that it can pull data with at least two different timestamp values. The internal logging & monitoring system (pg_jobmon) sets off a critical alert if this occurs so you will know if it happens. Before v0.10.0 a batch limit was always used, so this was a bigger concern then. Since that version, they are no longer turned on by default, but I left the option available. So if you use that option, just be aware of these limitations.

And lastly, probably be biggest cause of issues with time-based replication is daylight savings. When time changes, you chance losing data on your destination table, especially when setting the clock back since it thinks it already got that data. Now is a good time to mention that if you can run your databases on a system that uses UTC/GMT time, you can make a lot of time-based issues with storing data go away. But if that’s not possible, and you want to use this replication method, there are some configuration options available. When you set up incremental replication using mimeo’s maker functions, it checks to see what timezone the database is running in. If you’re in UTC/GMT, you’ve got nothing to worry about. If you’re not, mimeo sets a flag to turn off replication around the time change. I haven’t found any better (ie, less outrageously complex) solutions to this other than just completely stopping replication for that time period. By default it turns off replication between 12:30am and 02:30am. If you need to narrow down or change that time period, the start & end times are configurable.

So at first, basing replication on a timestamp seems like it wouldn’t be too complex an issue. But as always, things are never as simple as they may seem to be.

Written by Keith

February 18th, 2013 at 12:41 pm

  • intgr

    > And lastly, probably be biggest cause of issues with time-based
    > replication is daylight savings. When time changes, you chance losing
    > data on your destination table, especially when setting the clock back
    > since it thinks it already got that data.

    I’m surprised that there are Postgres people who don’t know about this awesome Postgres feature – timestamp with time zone. While a timestamp alone is ambiguous, a timestamp and its timezone offset at that instant clears up the ambiguity. But it’s probably easier just to forget about timezones entirely and issue “SET timezone=’UTC’” in your session, Postgres will convert all “timestamp with time zone” values to UTC (actually, it’s stored in UTC and usually gets converted to your local zone). No need to ask the administrator to configure anything.

    In fact, almost always when you have a timestamp column, you actually want timestamp with time zone.

    • keithf4

      I am aware of the timestamptz data type. However, there is still the other timestamp without time zone data type and I have to account for that. In fact, the system this replication tool was developed on has been using timestamp without time zone for a very long time and is not in UTC, which is how I came to realize this needed to be accounted for. While it would be nice to get them converted to timestamptz, that is a large endeavor that they have no interest in fixing any time soon.

      You are correct, though, that it is much preferred to use timestamp with time zone and I would highly recommend it as well.

  • disqus_sOiA2USSjq

    How are NTP updates and clock differences between replica handled? It
    would seem your solution would overlook data between the gaps.

    • keithf4

      As it currently is, major time differences between the source and destination could cause problems, yes. So far I’ve only run this on systems in the same timezone and the interval between the refreshes running has been wide enough to account for very slight time differences. The lower boundary value of a refresh run is always determined by looking at the actual max value on the destination table. The upper boundary is determined by the CURRENT_TIMESTAMP – boundary interval. So if your clocks are different by an interval greater than the boundary interval, then yes you could possibly see missing data in those gaps. If you’re running the two systems in the same timezone and you have time differences that great, I’m having trouble thinking of a use-case where that’s normal. Definitely need to get NTP going there and get the times in sync properly.

      I’ve added an interval offset to my feature TODO list, but I have no need of it currently myself, so I don’t have any plans in the near future for adding it. If someone adds a feature request to the repo that they need it, I could see about adding it sooner.

  • Pingback: Mimeo – Repulling Incremental Replication Data at Keith's Ramblings…

  • Dariusz D.

    Hey,

    after execute:

    SELECT mimeo.inserter_maker(‘public.stopvoip’, ‘serial’, ‘radacctid’, 1);

    I get:

    ERROR: could not establish connection
    CONTEXT: SQL statement “SELECT p_table_exists FROM mimeo.manage_dest_table(p_dest_table, NULL, p_debug)”
    PL/pgSQL function mimeo.inserter_maker(text,text,text,integer,text,text,boolean,text[],text,boolean,boolean) line 117 at SQL statement

    I have PostgreSQL 9.2.4 and working dblink between two databeses (located on the same server)

    • Dariusz D.

      Nevermind, I’ve changed datasource in table mimeo.dblink_mapping_mimeo

      from
      ‘host=10.205.9.170 port=5432 dbname=radius’

      to
      ‘dbname=radius’

      and it works,