Keith's Ramblings…

Archive for the ‘mimeo’ tag

Mimeo – Repulling Incremental Replication Data

without comments

With other per-table replication methods, if the source and destination become out of sync, you typically have to repull the entire table. One of the nice things about using an incremental based replication method (based on incrementing time or serial number; see previous blog post) is that it can make repulling smaller batches of that data much easier.

One of our clients had been having some issues with their website hits tracking table. Some of the hits had been missed via the normal tracking method and the had to be re-obtained via other means and re-inserted into the hits tracking table on production. This table is also replicated to a data warehouse system for reporting. Since this table uses incremental replication based on time, the old data that was reinserted to the source with the old timestamp values would never make it to the reporting database on its own.

All of mimeo’s refresh functions have a p_repull boolean parameter that can be set to true and have it purge the destination table and repull all the data from the source. But the incremental refresh functions have two additional parameters: p_repull_start & p_repull_end. Right now I’m only supporting time-based incremental replication, so both of these values are timestamps. They let you set a starting and/or ending value for a block of data that you’d like purged on the destination and repulled from the source. If one or the other is left off, it just sets a boundary for the start or end and gets everything before or after the timestamps set. For very large tables (which most inserter/updater tables seem to be from my experience working on this tool) this can be a gigantic time-saver for getting the source and destination tables back in sync. If you do use this, just keep in mind that these start and end times are exclusive (< & >, not <= & >=).

Here is an example of it in use. I also set the p_debug option so I can follow, in real-time, the repull process. This information is also available via pg_jobmon, with the number of rows done kept up to date in the details log table as it runs. The data missing was between April 8th and 11th, so I set the start and end days a few minutes before and after each day just to make sure I got everything.


Written by Keith

April 16th, 2013 at 2:34 pm

PostgreSQL Extension Updates & Preserving Privileges

with 5 comments

My latest update of Mimeo required me to do something that I knew I would eventually need to do: drop and recreate a function in an extension without breaking the original privileges it had. If the code within a function changes, a CREATE OR REPLACE wouldn’t affect privileges. But when you need to change the parameters (not overload it) or return type of a function, it must be dropped and recreated.

Since extension updates are plain SQL, this is a little trickier than it would be if I could use variables or plpgsql to do this (like I did in the refresh_snap() function to preserve privileges when there are column changes that force a destination table recreation). I’d had an idea of how to do it, but until I actually tried it I wasn’t sure if it would work in the extension update process. This is some of the code from the beginning and end of the 0.11.1 to 0.12.0 update of mimeo:

I’m really glad this works because it’s much less complicated than I was thinking it was going to be. I just create a temp table to hold the grant commands with the original privileges, run some SQL to generate them, have a temp function that goes back and replays them at the end of the update and then drops the unneeded objects. If you look closely, the generated GRANT statement has the signature of the new function while the DROP statement has the old function signature.

The refresh functions (refresh_logdel() here) are functions that we’ve typically given permissions to other users to execute so they can refresh tables on demand as needed. You could check those permissions before the update, make note of them, and reapply them afterwards. But I think it’s much more preferable for the extension update to handle it itself if it can. This same method can be used to preserve permissions on any object just by looking it up in the relevant information_schema view.

Don’t know if this is the best way to do this, but so far it works. I’d appreciate any feedback if anyone has a better approach.

UPDATE 2014/02/14: At the time I wrote this post I wasn’t really familiar with the anonymous DO block. Since then I’ve redone this process and completely eliminated the need for the function. Redoing the code block above, you can see it simplifies things a lot.


Written by Keith

March 28th, 2013 at 12:51 pm

Posted in PostgreSQL

Tagged with , , ,

Mimeo – DML Replication

without comments

For the last introductory post about mimeo, I’ll be talking about DML replication (previous posts here and here). This is the most common way replication is done on a per table basis (at least that I’ve seen). Typically, a trigger is placed on the source table that tracks all changes (INSERTS, UPDATES & DELETES) and then some mechanism is used to replay those statements on the destination.

For mimeo, this is done with a queue table that just contains the primary key columns to note that a change was done to that row. The trigger places the primary key values into a queue table (also located on the source system) and then mimeo reads the queue table values to replay them on the destination. Saying that the statements are just replayed on the destination is really simplifying things though. While that is technically a legitimate way to replicate table changes, it is far from the most efficient. What mimeo actually does is

  1. Grab all queue table values, using a SELECT DISTINCT to only get a single copy of each row changed (since multiple changes to the same row put the same value into the queue table multiple times).
  2. Grab the full row from the source using the primary key to get the most recent values.
  3. Perform a DELETE … USING … command, removing all rows from the destination table that have a matching primary key value in the queue table.
  4. INSERT full rows from step 2 into the destination table.
  5. Clear the processed rows from the queue table.

This method is much more efficient because

  • Even if a row is updated 100,000 times between refresh runs, only one update is ever run on the destination with the latest value of that row.
  • And since all rows that were changed are deleted from the destination, this avoids having to check if something was actually an update or a delete. If it was an update, it will be reinserted from the queue. If it was deleted, it won’t be in the queue for insert.

And since this is all done in a single transaction on the destination, the result appears exactly the same as if the statements had actually been replayed as they happened on the source.

Mimeo also has a specialized DML replication method that can be useful in a data warehousing environment. One common need is to preserve deleted rows, but not track every single update done to a row. Just the last value that row has needs to be kept for archive purposes. The log deletion (logdel) replication method can provide this. It basically uses the same method as the normal DML above, but the trigger & queue table on the source are a little different. The queue table has the same columns as the source table as well as an extra timestamp column that records when a row was deleted. For an insert or update, just the primary key values are stored in the queue table, but for deletes the entire row gets stored. The replication steps are pretty much the same as DML except there’s an extra one to insert the deleted rows. And the destination table has an extra timestamp column as well to record when that row was deleted on the source.

So that’s basically how mimeo dml replication works. The dml/logdel maker functions take care of setting up the source table triggers, trigger functions, and queue tables for you as long as you’ve got the permissions set properly. The rest of the replication methods also have maker & destroyer functions to make setup and tear down easier. I’ll have further blog posts with some tips and use cases of how we’ve put mimeo to use for our clients. If you’ve got any questions or suggestions, please feel free to post here, on github or poke me on freenode IRC in #postgresql.

Written by Keith

March 15th, 2013 at 1:25 pm

Mimeo – Incremental Replication

with 7 comments

Continuing from my introductory post to mimeo, I’ll now discuss one of the methods that’s more unique to this replication tool.

If a table happens to have a timestamp column that is set on every insert or update of a row, the incremental replication methods can be used. This simply uses that column to track where in the replication process it left off each time it is run. There’s one for insert-only tables and another that can handle if that column is also set on every row update. While the insert-only one requires no primary/unique key, the updater one does. This method does not replicate any row deletions.

This means that, just like the snapshot method, only select privileges and no triggers are required on the source table. This method is ideal for insert-only, high-transaction tables such as one tracking hits on a website. Adding a trigger to track the changes to such a table for replication could place a lot of extra load of your front-end production systems.

Since this method was introduced in our environment, we’ve had to deal with several edge cases. One of the first was when the rows were just grabbed by getting everything larger than the last recorded timestamp. This runs into issues when the source table hasn’t stopped inserting rows for its latest timestamp value. Since that value is used for the next batch’s lower boundary and it thinks it has all rows matching that timestamp, it may miss some the next batch. So a boundary interval was introduced. This sets the batch’s upper limit to be less than a given interval. For example, say the last recorded timestamp on the destination was 2013-02-18 12:30:00 and mimeo runs the same day at 13:30:00. With a 10 minute upper boundary (mimeo’s default), this would get all rows with values > 2013-02-18 12:30:00 and < 2013-02-18 13:20:00. If rows are constantly being inserted, this does mean the destination is always 10 minutes behind. But it also ensures that no rows are ever missed. The boundary interval is required to enforce data integrity, but it is also configurable on a per table basis.

Another issue along the same lines has to do with when a limit is set on how many rows are pulled each batch. If the maximum number of rows in a batch is pulled, the upper boundary could be cut off in the middle of any timestamp value, not just the latest values being inserted. This is handled by always removing the highest value from the batch when the maximum number is pulled, delaying it to being pulled the next run. Bigger issues occur when the batch contains timestamp values that are all the same. There is no way to ensure a consistent pull of data from the source in this case. So if this issue is encountered, mimeo just cancels that batch entirely. To fix it, you must either remove the batch limit or set it to a high enough value that it can pull data with at least two different timestamp values. The internal logging & monitoring system (pg_jobmon) sets off a critical alert if this occurs so you will know if it happens. Before v0.10.0 a batch limit was always used, so this was a bigger concern then. Since that version, they are no longer turned on by default, but I left the option available. So if you use that option, just be aware of these limitations.

And lastly, probably be biggest cause of issues with time-based replication is daylight savings. When time changes, you chance losing data on your destination table, especially when setting the clock back since it thinks it already got that data. Now is a good time to mention that if you can run your databases on a system that uses UTC/GMT time, you can make a lot of time-based issues with storing data go away. But if that’s not possible, and you want to use this replication method, there are some configuration options available. When you set up incremental replication using mimeo’s maker functions, it checks to see what timezone the database is running in. If you’re in UTC/GMT, you’ve got nothing to worry about. If you’re not, mimeo sets a flag to turn off replication around the time change. I haven’t found any better (ie, less outrageously complex) solutions to this other than just completely stopping replication for that time period. By default it turns off replication between 12:30am and 02:30am. If you need to narrow down or change that time period, the start & end times are configurable.

So at first, basing replication on a timestamp seems like it wouldn’t be too complex an issue. But as always, things are never as simple as they may seem to be.

Written by Keith

February 18th, 2013 at 12:41 pm

PostgreSQL Extension Developer Tips – Part 4

without comments

So my latest release of mimeo (v0.10.0) taught me a lesson to look up a new feature to make sure of when it was added and that it would be compatible for the versions I’m trying to support. That feature was the new GET STACKED DIAGNOSTICS command that allows the capture and display of more details in the exception block to help make debugging problems easier. However that was introduced in 9.2 and I’m trying to stay compatible with 9.1. I tried wrapping the call in an IF statement that checked the current postgres version, but apparently that command gets evaluated when the function is installed, not on execution. So I’ve had to remove what I thought was a nice additional feature for now.

The reason this post is an extension developer tip was the way I handled releasing a fix for this. If you’re upgrading an extension and you are several released versions behind, when you run the ALTER EXTENSION … UPDATE TO … command, all of the updates between your version and the target version are run in sequence. However, due to the issue described above, the 0.9.3 -> 0.10.0 update script would fail if you’re not running postgres 9.2. This means anyone updating from 0.9.3 and earlier to 0.10.0 and later would never be able to run the intermediary update script, and hence never upgrade. The key to allowing a bypass of this version update is in the way the update scripts are named.

The name you give the file must be the version you’re upgrading from and the version it is upgrading you to. The trick to bypass the 0.10.0 update is to just create an update script like this

Seeing this file, the extension update planner will see that it is a shorter update path and choose it instead of running the intermediate update step from 0.9.3 to 0.10.0. The important thing you have to remember is to include ALL updates that occurred between the given versions in the bypass script.

There’s a handy function that can show you the update path that an extension update (or downgrade) can take so you can be sure things will work as expected

This will show you every single update and downgrade path possible, and if a valid path exists, all the steps along that path. This can be quite spammy for extensions with a long update history. Luckily this function returns a record set, so you can filter with a WHERE condition. Here’s an example with mimeo 0.9.2 installed and the path that would be taken without the bypass script.

With the bypass script available, the update paths turn into this.

So now when I give the update command

it will skip right over the 0.10.0 update script. This will allow PostgreSQL versions older than 9.2 to update without any issues. So, my apologies to anyone that grabbed my extension right away after my blog post and ran into issues. At least this gave me an opportunity for another tip!

Written by Keith

February 9th, 2013 at 11:42 am

Posted in PostgreSQL

Tagged with , , ,