Keith's Ramblings…

Archive for the ‘pg_jobmon’ tag

Mimeo – A per-table replication extension for PostgreSQL

One of the biggest projects I’ve been working on over the last few months is an extension that came about from trying to organize a per-table replication process that has been in use with several of our clients, but was never really formalized. After nearly 300 hours of time logged working on it, and mentioning it several times in other blog posts, I figured it’s about time I talked more about it.

https://github.com/omniti-labs/mimeo

I got the name mimeo after searching a thesaurus for words similar to “copy” & “replicate” and coming across the mimeograph. The terms “low cost” and copying “small quantities” seemed to fit with the theme of what I was trying to accomplish, so the name stuck. There are some other great add-ons for PostgreSQL that allow per-table replication (Bucardo being the one I’m more familiar with), but their setup and use can be a bit daunting, and if you just need a few tables copied, a bit overkill. The goal with mimeo was to keep the installation, maintenance and monitoring as simple as possible. Honestly, the hardest part of the extension that I’ve found, and that others have reported to me as well, is just managing the permissions. I’ve got some plans to make some of that easier, but the extension doesn’t assume or require any superuser privileges, which I think is another plus.

The existing code I was working off of had several different replication methods that were used, and each had their own merits. I’ll be covering them over several blog posts, along with some general tips on usage, since I think a single blog post discussing the entire thing would be a bit much. I’ve already done some pretty extensive documentation and even written a howto, so these blog posts will mostly be informational and not really focused too much on the entire setup and maintenance process.

Before I get into too much detail, I just have to thank the PostgreSQL team for getting the extension system into place with 9.1. The only reason I was able to even come close to organizing the existing processes into something formal like this was because of that. The version control of a group of objects within the database allows a much smoother upgrading (and downgrading) process. Also, a big thanks to David Wheeler for the pgTAP suite. If you look in the tests folder you can see I made extensive use of it, and it’s saved me a tremendous amount of development time (especially with the latest v0.10.0 release, where I rewrote much of the refresh process). When dealing with data replication you want to be really sure you’re getting every bit of data across 100% of the time and not breaking anything as development progresses.

Now onto the details. The most basic way to copy a table from one database to another is to just grab the entire thing. That’s covered in mimeo with the snapshot replication method. A table set up with this method will have its entire contents refreshed every time the refresh runs. To help make this process more transparent to a user of these tables, a view with two underlying tables is used. The view only ever points to one table at a time. When the refresh runs, the table it’s not pointing to is truncated and repopulated. A brief lock is then taken to swap the view over to that newly loaded table. If you’ve got to refresh a rather large table, the lock that a truncate takes would make the table unusable during that time, and doing a delete instead of a truncate could lead to some very heavy bloat in addition to the locking. The view swapping minimizes both of these issues.
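
To give a rough idea of the mechanism (the object names and columns below are purely illustrative, not mimeo’s actual internals, and it assumes the dblink extension is available on the destination), the swap looks something like this:

    -- Two underlying tables; the view only ever points at one of them
    CREATE TABLE destschema.mytable_snap1 (id int, data text);
    CREATE TABLE destschema.mytable_snap2 (id int, data text);
    CREATE VIEW destschema.mytable AS SELECT * FROM destschema.mytable_snap1;

    -- A refresh reloads whichever table the view is NOT currently using...
    TRUNCATE destschema.mytable_snap2;
    INSERT INTO destschema.mytable_snap2
        SELECT id, data
        FROM dblink('host=sourcehost dbname=sourcedb',
                    'SELECT id, data FROM public.mytable')
             AS t (id int, data text);

    -- ...then a brief lock repoints the view at the freshly loaded table
    CREATE OR REPLACE VIEW destschema.mytable AS SELECT * FROM destschema.mytable_snap2;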

Some additions I was able to make to the snapshot process were to allow the automatic propagation of column changes. This includes adding & dropping columns as well as type changes. Indexes are also copied over at creation time (this will be optional soon). You shouldn’t need to worry about constraints on the destination end, since those should be controlled on the source table, and enforcing them again on the destination would just make the replication needlessly take longer. Since the table is re-created from scratch on a column change, there is an extra configuration option to run some additional commands afterwards, such as setting permissions on the view/table. This can also be handled with the default privileges settings that were added in 9.0, but the extension option allows you to control them for each individually replicated table.

The snapshot method is ideal for smaller tables. It’s also much more efficient than the DML replication method (replaying every insert/update/delete, which I’ll be covering in a later post) if almost the entire table is updated in the span of time between table refreshes. One of the things I am currently working on is a way to just skip the refresh process if nothing has changed on the source. This would then make the snapshot process ideal for static tables or ones that rarely ever change.

That’s all for this post. We’re using this extension in some of our production environments already, so I’m confident in the code even though it’s not “1.0”. I’d appreciate any feedback to help speed up my decision to make such a stable release.

My next post will cover the incremental replication methods.

Written by Keith

February 7th, 2013 at 11:33 am

PG Jobmon Exception Handling

As a quick review for those unfamiliar with what PG Jobmon is, it’s an extension to allow autonomously logging steps within a function so that if the function fails, the individually logged steps are not rolled back and lost. This provides an audit trail and monitoring capabilities for functions critical to your database infrastructure. See my Projects page for the code and other posts tagged with “pg_jobmon” for more info.

One of the trickier issues I came across when making an extension out of the existing code that PG Jobmon was based on was getting useful errors back, both on the console and in the log tables. If an error happened before you logged the first step, or before job logging even started, trying to handle logging the errors in the exception block would cause some rather useless feedback, often hiding the real error. The below has become sort of a template for any function’s exception block where I use jobmon.
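
The actual template has evolved a bit over time, but a rough sketch of its shape, inside a function’s EXCEPTION block, looks something like the following (the variable names, job name and messages are illustrative, not copied from any particular function):

    EXCEPTION
        WHEN OTHERS THEN
            -- If the error happened before add_job() ever ran, open a job entry
            -- now so there is something to attach the failure to
            IF v_job_id IS NULL THEN
                v_job_id := jobmon.add_job('MY JOB NAME');
                v_step_id := jobmon.add_step(v_job_id, 'EXCEPTION before job logging started');
            -- If the job was started but no step was ever logged, create one
            ELSIF v_step_id IS NULL THEN
                v_step_id := jobmon.add_step(v_job_id, 'EXCEPTION before first step logged');
            END IF;
            PERFORM jobmon.update_step(v_step_id, 'CRITICAL', 'ERROR: '||SQLERRM);
            PERFORM jobmon.fail_job(v_job_id);
            -- Re-raise so the original error still comes back normally
            RAISE EXCEPTION '%', SQLERRM;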

This accounts for when an error occurs before job logging started (call to add_job) or if it occurs between job logging starting and the first step being logged (between add_job and first call to add_step). Another call to RAISE EXCEPTION with the original SQL error is made after all that to ensure the real error is still reported back normally.

UPDATE (2013-02-07): I’ve added some additional error output to the actual raising of the exception to better show where the error is coming from. If you’d like that in your jobmon log as well, just add the variable into the update_step() call. The only downside to this is that GET STACKED DIAGNOSTICS only works on 9.2 and later.

Written by Keith

January 11th, 2013 at 11:46 am

PG Jobmon Reaches 1.0.0

We’ve been running PG Jobmon in our production databases at OmniTI for a while now and it’s been working very well. I found an issue with the monitoring function, polished up some other features, and figured it was finally time to call it 1.0.0 stable once that was all fixed.

Since my last post, I did get the simple job logging functions added that I mentioned (back in 0.4.0). To simply log the number of rows affected by a single query, you can use sql_step(job_id, action, sql). It returns a boolean so you can test whether the step was successful or not.

The result then shows up in the job_detail table as a single step entry, with the given action as the step description and the number of rows affected in the message.

If you just want to log a single query as a complete job, you can use sql_job(job_name, sql).

The result is similar to the simple step function, but makes an entire job log with the given name and a single step entry with how many rows were affected.
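
A minimal, hypothetical example of both helpers (the job names, table and queries are made up) might look like:

    -- Log an entire single-query job in one call
    SELECT jobmon.sql_job('NIGHTLY PRUNE',
        'DELETE FROM public.session_log WHERE created_at < now() - ''30 days''::interval');

    -- Or log one query as a single step inside a job you have already opened
    DO $$
    DECLARE
        v_job_id  bigint;
        v_ok      boolean;
    BEGIN
        v_job_id := jobmon.add_job('NIGHTLY MAINTENANCE');
        v_ok := jobmon.sql_step(v_job_id, 'Prune old sessions',
            'DELETE FROM public.session_log WHERE created_at < now() - ''30 days''::interval');
        IF v_ok THEN
            PERFORM jobmon.close_job(v_job_id);
        ELSE
            PERFORM jobmon.fail_job(v_job_id);
        END IF;
    END
    $$;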

The other big change that came with 1.0.0 is to the check_job_status() monitoring function. It no longer requires an interval argument, and it’s recommended not to pass one unless you really need to. The thing that brought this change about was that if you passed an interval smaller than the highest threshold value set in the job_check_config table, it could return some confusing results. I could’ve made it return a clearer result, but honestly it makes no sense at all to pass a smaller interval than what you’ve configured to be monitored for in the configuration table, so now it throws an error if you do. And if you use the version that takes no argument, it automatically finds the largest threshold interval you’ve configured and uses that. So if you get a new, longer-running job to monitor in the future, you no longer have to update your monitoring processes to account for it, as long as you use the no-argument version.
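
So with 1.0.0, the typical monitoring call is now simply:

    SELECT * FROM jobmon.check_job_status();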

The process of monitoring for job problems has also been modified so that check_job_status() can raise notices when a job produces three level 2 (WARNING) alerts in a row as well. Previously this only happened with critical failures. The fail_job() function can now take an optional alert level argument to allow this to happen. This is useful for non-critical issues that, for example, shouldn’t page your on-call, but should still be looked into. For an example, see the next version of my Mimeo extension (>=0.9.0), which can send you a notification when a single replication batch has hit the configured row limit and may possibly cause replication to fall behind if it continues for too long. It’s an easily fixed problem and shouldn’t cause anyone to have to wake up in the middle of the night.
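
Inside a function, that looks something like the following sketch (assuming the alert level is simply passed as a second argument to fail_job(); check the docs for the exact signature, and the messages here are illustrative):

    -- Flag a non-critical problem: log the reason, then fail the job at level 2
    PERFORM jobmon.update_step(v_step_id, 'WARNING', 'Row limit reached; batch may be falling behind');
    PERFORM jobmon.fail_job(v_job_id, 2);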

I’ve been using this extension extensively with several other extensions I’ve been working on that really need a good monitoring process available to make sure they are working properly. The first, mentioned above, is Mimeo, which does specialized, per-table replication between PostgreSQL databases. The other is PG Partman, a table partition manager for time & serial based partitioning. Both of these processes would be pretty useless without some way to easily let you know they are working right and provide an audit trail when something goes wrong and you’re not around. The autonomous step logging that PG Jobmon provides, without rolling back all the logged steps when something goes wrong, does this wonderfully. Yes, that kind of monitoring could be built right into those extensions, but I’ve always liked the part of the Unix philosophy about doing one thing and doing it well, and allowing those smaller, powerful tools to work together to greater effect. And we’ve found having PG Jobmon available for any other functions we want to monitor to be very helpful, since it’s not tied directly into the other extensions.

PG Jobmon – https://github.com/omniti-labs/pg_jobmon
Mimeo – https://github.com/keithf4/mimeo
PG Partman – https://github.com/keithf4/pg_partman

Written by Keith

December 31st, 2012 at 11:05 am

PostgreSQL Job Logging & Monitor Extension – The Monitor

Following up on my post last month about pg_jobmon’s logging capability, this post will discuss its ability to let you monitor your jobs’ statuses. Like last month, I will again assume you’ve installed pg_jobmon to the jobmon schema.

The Monitoring Function

The main function used to monitor logged jobs is check_job_status(interval). It returns row output, so it can easily be used as a query for monitoring software (more on that later). The interval argument tells the function how far back to go when checking the job logs. To help the function run faster, this should be not much greater than the longest threshold interval of any job listed in job_check_config (see below). For example, if you’re monitoring a special job that runs weekly, and nothing you’re monitoring runs with any greater interval of time (bi-weekly, monthly, etc), then you should call
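something along these lines (the exact buffer interval is up to you, and the all-clear wording in the comment is illustrative and may differ slightly from the real output):

    SELECT * FROM jobmon.check_job_status('9 days');

    -- Roughly what an all-clear result looks like:
    --  alert_code | alert_text
    -- ------------+----------------------------
    --           1 | All jobs run successfully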

I didn’t give it a strict ‘1 week’ interval just to give it a little buffer time. The above is what should be returned if everything is running fine. If not, the alert_text column will contain a list of all the jobs that returned either alert code 2 or 3. The alert_code field is the primary means of communicating the status of running jobs in the database. Use it in combination with the job_status_text table (joining on alert_code) to get a plaintext meaning for the code. As I noted last month, pg_jobmon uses three alert codes to express the associated alert levels. The logger functions use the alert_text values to update the status columns in job_log when a job is closed out, either successfully or as failed.

Setup

By default, any job that has failed to run 3 consecutive times will cause the above function to return alert_code 3. There’s no additional setup to do if that sort of monitoring will work for your situation. If you need any sort of alerting other than that, you can use the job_check_config table to set that up.

Only one line needs to be added to this table for each job that needs different alerting criteria.

  • job_name – This is the EXACT job name as it appears in the job_log table. It is case sensitive, so jobs entered here should be in all caps. Easiest thing to do is just copy the job name right from the job_log table.
  • warn_threshold – This is the interval of time that can pass without the job running before alert_code 2 is returned by check_job_status()
  • error_threshold – This is the interval of time that can pass without the job running before alert_code 3 is returned by check_job_status()
  • active – Set this to TRUE if you want check_job_status() to actively monitor this job. Set to FALSE to disable checking without removing the data from the config table
  • sensitivity – This is the number of times the job can fail (status column in job_log is the text value of alert_code 3, CRITICAL by default) before alert_code 3 is returned by check_job_status(). Note that if you want this to return immediately on the first failure, set it to zero, not one.

The threshold times use the end_time column value from the job_log table to determine when that job last ran successfully. That means a job will have to have run successfully at least once for check_job_status() to return an all clear. This is why the default value for active is false, to avoid accidentally setting off an alert for a new job you just added to the monitor.

Here’s an example of one that runs once a day and should return alert 2 if not finished within 2 hours of its last run time and alert 3 if not run within 3 hours. Also a single failure of this job will cause the next run of check_job_status() to return alert_code 3 along with the failing job’s name.
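
Such an entry might be inserted along these lines (the job name and exact interval values here are illustrative; pick ones that match your own job):

    INSERT INTO jobmon.job_check_config
        (job_name, warn_threshold, error_threshold, active, sensitivity)
    VALUES
        ('DAILY INVENTORY SYNC', '26 hours', '27 hours', true, 0);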

Here’s an example of what a call to check_job_status() returns if a job hasn’t run in its expected timeframe
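
Roughly, the shape of the result is as follows (the job name and message text here are illustrative; the actual alert_text wording will differ):

    SELECT * FROM jobmon.check_job_status('3 days');

    --  alert_code | alert_text
    -- ------------+------------------------------------------
    --           3 | DAILY INVENTORY SYNC: MISSING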

Third Party Monitoring Software

If you’ve used Nagios as your monitoring solution, the alert_text values probably look familiar. This extension was initially designed for use with Nagios and kept its alert_text values from then. They seem to fit appropriately as defaults, and still work in our current monitoring software, so they were left like that. You’re free to update the job_status_text.alert_text values with whatever you wish, though, if your monitoring software needs something else.

The easiest way to get pg_jobmon working with Nagios is to use the check_postgres monitoring script. check_postgres has a custom_query option that works perfectly and already provides the output that Nagios needs. I’m not going to get into the finer details of configuring Nagios, but what follows are example command and service definitions that I’ve set up for my own system where Nagios and postgres run on the same machine. Adjust the paths and server values as needed for your systems.
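
A rough sketch of what that can look like follows; the check_postgres options in particular may need adjusting for your version, so treat this as a starting point rather than a drop-in config:

    # Nagios command definition using check_postgres' custom_query action
    define command {
        command_name    check_pg_jobmon
        command_line    /usr/bin/check_postgres.pl --action=custom_query --dbname=mydb \
                        --query="SELECT alert_text AS result, alert_code FROM jobmon.check_job_status('3 days')" \
                        --valtype=string --warning="WARNING" --critical="CRITICAL"
    }

    # Service that runs the above check
    define service {
        use                     generic-service
        host_name               localhost
        service_description     PG Jobmon job status
        check_command           check_pg_jobmon
    }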

The other monitoring service that we use, and that this monitor is even easier to set up with, is Circonus. All you need to do is define a Check with a query similar to the one Nagios used (note the extra text value as the first select field)
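
The query for the Circonus check can be something like this (the leading text value is just an arbitrary label for the check to key on):

    SELECT 'pg_jobmon' AS check_name, alert_code, alert_text
    FROM jobmon.check_job_status('3 days');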

And then create a Rule using the regular expression matching options.

The last returned value in Circonus shows the full text result of running the query. So if there are several jobs that have failed, they will all show up in the Circonus alert that is sent out. For us, Severity 3 just sends an email out to the DBA team, so we can try and fix things before actually setting off a page. Sev 1 alerts both our on-call DBA and the on-call Sys Admin.

As I said before, the logging portion of this extension has existed for a while in several forms across different systems, so it’s pretty stable and probably won’t change too much in the near future. This extension is an attempt to bring some consistency and ease of maintenance across the systems we manage. The only thing I can see making a big impact on its design is if PostgreSQL actually gets built-in autonomous functions. The monitoring portion is fairly new, and I’m hoping to find ways to make it more useful, but it seems to be working well so far in both of the monitoring software packages I’ve tried it in. For either portion, I’m more than open to ideas, and the code is always available for anyone to play with and contribute back to.

Written by Keith

July 19th, 2012 at 10:59 am

PostgreSQL Job Logging & Monitor Extension – The Logger

Update 2012/06/05: So of course, right after publishing this blog post, I discovered a major shortcoming in 0.3.0: it requires superuser privileges to actually run. That’ll teach me to test with only my own account in the database. I’ll be pushing 0.3.1 out shortly with a fix to allow non-superusers to be able to use these logging functions. It just requires a little bit of configuration, which I’ve added to the Configuration section below and the README.md file.

Job logging and monitoring in PostgreSQL can be an invaluable tool for tracking down issues when important database tasks are automated. OmniTI has some great job logging systems in place for the databases we support, but there was no consistent monitoring of this logging that was specifically built with it in mind. I thought this would be a great opportunity to take advantage of PostgreSQL 9.1’s new extensions capability to try and bring a consistent logging and monitoring tool into use that would be easily maintained and updatable. This blog post will first concentrate on the logging portion. I’ll do a follow up post about the monitor hopefully next month.

https://github.com/omniti-labs/pg_jobmon

The first, and easier, piece to pull into an extension format was the logging. Easier mostly because previous developers here had done a good job of getting logging working. 🙂 Since Postgres functions aren’t autonomous, true job logging can be difficult. If a function fails to run, everything that it was doing, including inserts to a logging table, would be rolled back. Using dblink to have Postgres connect back to itself is a neat trick to get autonomous functions working and have intermittent commits actually persist despite a job failure. The details of doing this aren’t really the main point of this blog post, though, so if you’d like to see how it works, the source code of the extension is freely available.

Installation

First off, the new extension system makes the install (and future updates) a breeze. I set the extension up to be compatible with the PostgreSQL Extension Network, so you can use make/make install to get the files in the correct location. It is available on PGXN, but it is marked testing, so it won’t show up in searches yet. I’d like to get some feedback from others before making it v1.0 stable – http://pgxn.org/dist/pg_jobmon. (Update: version 1.0 (and greater) has been available for quite some time on both github & pgxn.) After that, it’s just a matter of using the new 9.1 extension management commands. I’d recommend creating a schema instead of installing it to public. I’m assuming the schema jobmon in all examples, but you can choose whichever one you want. Note that the schema cannot be changed after install without reinstalling the extension (it is not relocatable). This extension has table data as well, so re-installation can result in the loss of configuration and logging data if precautions aren’t taken.
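
In short, the install is roughly the following (the schema name is your choice; the dblink extension is assumed to be available already, since pg_jobmon depends on it):

    # from the source directory
    make
    make install

    -- then in the target database
    CREATE EXTENSION dblink;
    CREATE SCHEMA jobmon;
    CREATE EXTENSION pg_jobmon SCHEMA jobmon;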

How to Log Steps

pg_jobmon has some pretty basic functions for doing job logging so getting it up and working isn’t very difficult at all. To start off, some variables to hold job and step ids will need to be defined for any function that requires logging
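
In a plpgsql function, that just means something like:

    DECLARE
        v_job_id    bigint;
        v_step_id   bigint;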

To start your job logging for a function, call the add_job(text) function, storing the ID it generates in a variable. What you enter for the job name will be used in the jobmon.job_log table as the job_name column. All values for the job name are automatically capitalized for consistency and to make searching a little more predictable. I’ll be using a new, specialized replication extension I’m working on, mimeo, to show examples of how pg_jobmon is used. Specifically, the snap replication method, which does a full copy of a table from one PostgreSQL instance to another.
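
For example (the job name here is illustrative, not the exact one mimeo builds):

    v_job_id := jobmon.add_job('Refresh snap: public.mytable');
    -- stored in jobmon.job_log with job_name 'REFRESH SNAP: PUBLIC.MYTABLE'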

This creates an entry in the jobmon.job_log table. pg_jobmon has a set of “show” functions that can make reading the job logs easier. I’ll talk more about them at the end, but I’ll be using them in these examples as well.

From here on out, job logging is primarily done with two functions: add_step(bigint, text) and update_step(bigint, text, text). add_step() is used to start tracking the progress of each step of the function that you’d like to monitor. The first argument is the job_id generated from add_job(). The text should be a short description of the step. Note that this description stays constant and is not changed as part of the update. add_step() returns a step_id that should be stored.
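
For example (the step description is illustrative):

    v_step_id := jobmon.add_step(v_job_id, 'Truncating local snap table');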

This creates the first step for our job and is logged to the jobmon.job_detail table under the given job_id.

Later in the function, after this step should have completed, we update that step with the current status using update_step(). The first argument is the step_id to be updated. The next argument is a simple status. In this case everything should be OK if it reached this point, so it’s set to ‘OK’. (You can set whatever status message you wish here, but I’d recommend being consistent with the config table mentioned later.) The final argument is a message that can give more details as to the results of this step. This was a fairly simple step, so just saying ‘done’ is good enough. You can update the status or message several times for a single step if needed, showing the current progress of longer running steps.
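
So for that first step, the update is just:

    PERFORM jobmon.update_step(v_step_id, 'OK', 'done');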

When you’re done with this step and want to move on to the next, just call the add_step() function again to create a new step ID.

For this step we’ll provide a little more information to the message log. I’ve added a GET DIAGNOSTICS call to the portion where the INSERT is done to get the row count of this snap job. This is added to the message log so we can make sure rows were actually copied into the snap table.
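
Assuming a v_rowcount bigint variable has also been declared, that second step looks roughly like this (the step description and message are illustrative):

    v_step_id := jobmon.add_step(v_job_id, 'Inserting records into local snap table');

    -- ... the INSERT into the snap table happens here ...
    GET DIAGNOSTICS v_rowcount = ROW_COUNT;

    PERFORM jobmon.update_step(v_step_id, 'OK', 'Inserted '||v_rowcount||' rows');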

Handling Errors

If your job has a serious problem and needs to fail out completely, you can use the jobmon.fail_job(bigint) function. Provide it the job_id and it will set the status in jobmon.job_log to ‘CRITICAL’. Before calling this function, you should make a call to jobmon.update_step() to record the details of why the failure happened. Note that this function DOES NOT stop your function from running. It only provides the functionality to log the failure. Adding these steps to an EXCEPTION block is the common method we use, but it can be done anywhere you need it.
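
A stripped-down sketch of that pattern (the message text is illustrative):

    EXCEPTION
        WHEN OTHERS THEN
            PERFORM jobmon.update_step(v_step_id, 'CRITICAL', 'Refresh failed: '||SQLERRM);
            PERFORM jobmon.fail_job(v_job_id);
            RAISE;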

Speaking of exceptions, unless you handle the QUERY_CANCELED exception by adding logging instructions to it, manually cancelling your normally logged function will leave the final results in the log table as unfinished. If you want to cancel logged functions and properly make note of it in the logs, use the jobmon.cancel_job(bigint) function. Just provide it the job_id of the currently running function and it will do the following:

  • cancel the currently running job using pg_cancel_backend() (pid is stored in job_log table)
  • update the last step to set the status to CRITICAL
  • set the last step message to Manually cancelled via call to jobmon.cancel_job()
  • call the fail_job() function to close out the job as CRITICAL
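
Calling it is as simple as running the following from another session (the job_id value here is made up; look up the real one in jobmon.job_log):

    SELECT jobmon.cancel_job(2073);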

Configuration

A database superuser can just use these logging functions right away after extension creation. Non-superusers will require a bit of configuration. First, create a role that you can grant the following permissions to:
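
(The statements below are only a rough sketch; the README lists the exact privileges the role needs.)

    CREATE ROLE jobmon_role WITH LOGIN PASSWORD 'changeme';
    GRANT USAGE ON SCHEMA jobmon TO jobmon_role;
    GRANT SELECT, INSERT, UPDATE ON ALL TABLES IN SCHEMA jobmon TO jobmon_role;
    GRANT USAGE ON ALL SEQUENCES IN SCHEMA jobmon TO jobmon_role;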

Set your pg_hba.conf file to allow this user to connect locally to the database, then add the role name and password to the dblink_mapping table. Note that the password is stored in the clear, so give this role only the limited permissions it needs and nothing more, and control access to this table.
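
That insert is just something like the following (the column names are an assumption here; check the dblink_mapping table definition in your installed version):

    INSERT INTO jobmon.dblink_mapping (username, pwd)
    VALUES ('jobmon_role', 'changeme');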

One thing to note about the close_job, fail_job, and cancel_job functions is that they use another configuration table, job_status_text, to set statuses to either OK or CRITICAL (this extension was made for use with Nagios, hence these text values). If you’d like different default values for your jobs, just update the error_text column, keeping the following code meanings in mind.

1 always means the job ran without issue. 2 can be used as an intermediary status (next blog post on monitoring will go into more detail on this). And 3 means a complete failure of the job.
Update 2012/07/17: Version 0.3.4 has changed the column names of this table to make things a little more consistent with the monitoring portion of the extension
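
For example, to change the text logged for a complete failure (using the post-0.3.4 column names):

    UPDATE jobmon.job_status_text
    SET alert_text = 'FAILED'
    WHERE alert_code = 3;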

I’ll get into more detail on this with my monitoring post, but thought I’d at least mention this here since it involves values inserted to the logging table. I haven’t thought of any use cases for further status codes and these functions won’t use any others you may add to the table. If anyone thinks of any others that they feel should be added and used, I’m open to suggestions.

Reading Logs

Lastly for logging, you’ve seen examples of the show functions throughout this post. I won’t get into too much more detail on each of them since they are fairly self-explanatory once a few bits of info are known ahead of time. First, all job names passed as parameters are automatically capitalized to be consistent with add_job(), so you don’t have to worry about case. Next, most of them have a default limit on the number of rows returned, and this can be changed by passing an optional integer parameter. Also, all the show functions return full row sets, so you can query and filter them just like a regular table. Of course, you can always query the job_log and job_detail tables directly as well; these functions were just created to make some of the more common queries we were writing against them easier to do. See the docs for more info.
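
A couple of quick, hypothetical examples (the job name and id are made up):

    SELECT * FROM jobmon.show_job('Refresh snap: public.mytable');  -- recent runs of one job
    SELECT * FROM jobmon.show_detail(2073);                         -- step detail for a single job_log entry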

Upcoming Features

A coworker mentioned that a handy job logging function would be one that actually accepts a query to run and simply logs the number of rows affected.

(Update: The above function has been added as well as an “sql_job” function that can log an entire, single step job with a single call)

This covers the logging portion of this extension. Functions similar to these have been in use with several of our clients for quite a while; they have only been changed to be incorporated into an extension or to add some new capability. The monitoring tables and function are still fairly new, but should be in working order. Their basic use is covered in the readme file, but as I said in the beginning, I will be doing another blog post to cover their use. I’d really appreciate any feedback, good and bad, from anyone that starts using it. This is also my first venture into making an extension, so if anyone has any feedback in that area, it would be much appreciated.

Written by Keith

June 4th, 2012 at 4:11 pm