Keith's Ramblings…

WARNING: If accidentally read, induce vomiting

PG Jobmon Reaches 1.0.0

without comments

We’ve been running PG Jobmon in our production databases at OmniTI for a while now and it’s been working very well. Found an issue with the monitoring function, polished some other features up and figured it was finally time to give it 1.0.0 stable once that was all fixed.

Since my last post, I did get the simple job logging functions added that I mentioned (back in 0.4.0). To just simply log the number of rows affected by a single query call, you can use sql_step(job_id, action, sql). It returns a boolean to let you be able to test whether the step was successful or not.

The result in the job_detail table then looks like

If you just want to log a single query as a complete job, you can use sql_job(job_name, sql).

The result is similar to the simple step function, but makes an entire job log with the given name and a single step entry with how many rows were affected.

The other big change that came with 1.0.0 is with the check_job_status() monitoring function. It now no longer requires an interval argument, and it’s recommended not to pass one unless you really need to. The thing that brought this change about was if you passed an interval smaller than the highest threshold value set in the job_check_config table, it could return some confusing results. I could’ve made it return a clearer result, but honestly it really makes no sense at all to pass a smaller interval than what you’ve set to be monitored for in the configuration table. So now it throws an error if you do so. And if you just use the version that takes no argument, it automatically gets the largest interval threshold you’ve configured and uses that. So if you get a new longer job to monitor in the future, you no longer have to update your monitoring processes to account for it as long as you use the no-argument version.

The processes of monitoring for job problems has also been modified to allow check_job_status() to be able to raise notices when a job produces three level 2 (WARNING) alerts in a row as well. Previously this only happened with critical failures. The fail_job() function can now take an optional alert level argument to allow this to happen. This is useful for non-critical issues that, for example, shouldn’t cause a page to your oncall, but should still be looked into. For an example, see the next version of my Mimeo extension (>=0.9.0) that can send you a notification when a single replication batch has hit the configured row limit and may possibly cause replication to fall behind if it continues for too long. It’s an easily fixed problem and shouldn’t cause anyone to have to wake up in the middle of the night.

I’ve been using this extension extensively with several other extensions I’ve been working on that really need a good monitoring process available to make sure they are working properly. The first, mentioned above, is called Mimeo, which does specialized, per-table replication between PostgreSQL databases. The other is PG Partman, a table partition manager for time & serial based partitioning. Both of these processes would be pretty useless without some way to easily let you know they are working right and provide an audit trail when something goes wrong and you’re not around. The autonomous step logging that PG Jobmon provides, without rolling back all the logged steps when something goes wrong, does this wonderfully. Yes that kind of monitoring could be built right into those extensions, but I’ve always liked the part of the Unix philosophy of doing one thing and doing it well and allowing those smaller powerful tools to work together to a greater affect. And we’ve found that having PG Jobmon available for any other functions we want to monitor very helpful since it’s not tied directly into the other extensions.

PG Jobmon – https://github.com/omniti-labs/pg_jobmon
Mimeo – https://github.com/keithf4/mimeo
PG Partman – https://github.com/keithf4/pg_partman

Written by Keith

December 31st, 2012 at 11:05 am