Keith's Ramblings…

Archive for the ‘monitoring’ tag

Checking for PostgreSQL Bloat (old)


UPDATE: A newer version of this post has been made along with a rewrite of the script to use pgstattuple – http://www.keithf4.com/checking-for-postgresql-bloat/

One of the ongoing issues with PostgreSQL that every administrator must deal with at some point is table and/or index bloat. PostgreSQL’s MVCC architecture requires some extra monitoring and maintenance, especially in large enterprise systems. I’ll save the explanation of why bloat happens for further down in this post (for those that are curious) and get to the point of the article first.

So far the best check for bloat I’ve come across is the one contained in the check_postgres suite from Bucardo, and in places where we can get check_postgres set up and running, it works great. But sometimes we just want to check the bloat on a system and have a little more fine-grained control over the output. So I recently pulled the rather complex bloat check query out of check_postgres and adapted it into a command-line Python script: pg_bloat_check.py

When I first started working on this script, just running the query by itself on some of our client systems, I realized that it can be quite an expensive query to run on larger systems. It also brings back table and index bloat together, and other parts of the script have to split that apart to make the distinction. To let this check provide data more efficiently and separate table & index bloat, my script first creates a view to work from. On PostgreSQL 9.3+, this can be created as a materialized view, so running the script to get bloat data back is fast and causes no contention in the system.

The other thing I realized is that bloat percentage alone is a poor indicator of actual system health. Very small tables may always have a higher-than-average bloat percentage, or there may always be 1 or 2 pages considered waste, and in reality that has next to zero impact on database performance. Constantly “debloating” them wastes more time than the small amount of space is worth. So I added additional filters that also look at the number of pages in objects and the actual wasted space of those objects. This allows the final output of the bloat report to give a more accurate picture of where there may actually be problems that need to be looked into.

Another option is a filter for individual tables or indexes to be ignored. Once you understand why bloat happens, you will come across cases where a table is stuck at a certain bloat point at all times, no matter how many times you VACUUM FULL it or run pg_repack on it (those two things do remove the bloat, but it quickly comes back). This happens with tables that have a specific level of churn in the rows being inserted, updated & deleted: the rate of rows being updated/deleted is balanced by the rate of rows being inserted, along with the autovacuum schedule marking space for reuse. Removing the bloat from tables like this can actually decrease performance because, instead of re-using the space that VACUUM marks as available, Postgres has to first allocate more pages to that object from disk before the data can be added. So bloat is actually not always a bad thing, and the nature of MVCC can lead to improved write performance on some tables.

The “simple” output format is suitable for just using on the command-line or piping out to an email program like mailx to get a status report sent to you.

System tables like those you see at the bottom are extremely small and you can see the space they “waste” is negligible. I left the simple format like this so it just provides a quick glance for monitoring purposes. I also provide another output method in the form of a python dictionary that has some additional details. It’s also in a format that can be easily fed to other python scripts or anything else that needs a predictable, structured format.

You can see this provides some more detail on how many pages are actually in the given objects vs how many are considered wasted space. All the system tables have only 1 page of wasted space, so the given percentage varies because of how many pages they actually have. You have several options for filtering out those last, unnecessary entries. The first would be to filter out the entire pg_catalog schema, but I don’t recommend that because catalog bloat can actually be a problem on databases that have extensive DDL changes over time. The other option is to use --min_pages to exclude any table or index that has less than a certain number of pages. I think the best option here would actually be --min_wasted_pages set to something like 10.

You can use all these filters in combination as well to also avoid low percentage bloat like that top entry. Be careful doing this, though, since if you have an extremely large table and are trying to figure out where all potentially wasted space is, you could accidentally filter it out (10% is a low percentage but 10% of a 500GB table would be 50GB).

Why Bloat Happens

For those of you newer to PostgreSQL administration who may be hearing about bloat for the first time, I figured I’d take the time to explain why this scenario exists and why tools like this are necessary (until they’re hopefully built into the database itself someday). It’s something most people don’t understand unless someone first explains it to them, or until they run into the headaches it causes when it’s not monitored and learn about it the hard way.

MVCC (multi-version concurrency control) is how Postgres has chosen to deal with multiple transactions/sessions hitting the same rows at (nearly) the same time. The documentation, along with Wikipedia, provides excellent and extensive explanations of how it all works, so I refer you there for all the details. Bloat is a result of one particular part of MVCC: the handling of updates and deletes.

Whenever you delete a row, it’s not actually deleted; it is only marked as unavailable to all future transactions taking place after the delete occurs. The same happens with an update: the old version of a row is kept active until all currently running transactions have finished, then it is marked as unavailable. I emphasize the word unavailable because the row still exists on disk, it’s just not visible any longer. The VACUUM process in Postgres then comes along and marks any unavailable rows as space that is now available for future inserts or updates. The autovacuum process is configured to run VACUUM automatically after so many writes to a table (see the autovacuum configuration options), so it’s not something you typically have to worry about doing manually very often (at least with more modern versions of Postgres).
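As a minimal, hedged illustration of this behavior (the table name here is made up), deleted rows do not disappear right away; they show up as dead row versions waiting for VACUUM:

    CREATE TABLE bloat_demo (id int, val text);
    INSERT INTO bloat_demo SELECT g, 'some data' FROM generate_series(1,10000) g;
    DELETE FROM bloat_demo WHERE id <= 5000;

    -- The deleted rows are still on disk as dead row versions until VACUUM
    -- marks that space as reusable. (The statistics collector updates these
    -- counters asynchronously, so the numbers may take a moment to appear.)
    SELECT n_live_tup, n_dead_tup
    FROM pg_stat_user_tables
    WHERE relname = 'bloat_demo';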

People often assume that VACUUM is the process that should return the disk space to the file system. It does do this, but only in very specific cases. That used space is contained in page files that make up the tables and indexes (called objects from now on) in the Postgres database system. Page files all have the same size, and differently sized objects simply have as many page files as they need. If VACUUM happens to mark every row in a page file as unavailable, AND that page also happens to be the final page for the entire object, THEN the disk space is returned to the file system. If there is a single available row, or the page file is any page other than the last one, the disk space is never returned by a normal VACUUM. This is bloat. Hopefully this explanation of what bloat actually is shows how it can sometimes be advantageous for certain table usage patterns, and why I’ve included the option to ignore objects in the report.

If you give the VACUUM command the special flag FULL, then all of that reusable space is returned to the file system. But VACUUM FULL does this by completely rewriting the entire table (and all its indexes) to new pages, and it takes an exclusive lock on the table for the entire time it takes to run (CLUSTER does the same thing, but what that does is outside the scope of this post). For large tables in frequent use, this is problematic. pg_repack has been the most common tool we’ve used to get around that. It recreates the table in the background, tracking changes to it, and then takes a brief lock to swap the old bloated table with the new one.

Why bloat is actually a problem when it gets out of hand is not just the disk space it uses up. Every time a query is run against a table, the visibility flags on individual rows and index entries are checked to see if they are actually available to that transaction. On large tables (or small tables with a lot of bloat), the time spent checking those flags adds up. This is especially noticeable with indexes, where you expect an index scan to improve your query performance and it seems to make no difference, or is actually worse than a sequential scan of the whole table. This is why index bloat is checked independently of table bloat: a table could have little to no bloat, but one or more of its indexes could be badly bloated. Index bloat (as long as it’s not a primary key) is easier to solve because you can either just reindex that one index, or you can concurrently create a new index on the same column(s) and then drop the old one when it’s done.
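As a hedged sketch of those two approaches for a bloated, non-primary-key index (all object names here are hypothetical):

    -- Option 1: rebuild in place. Simple, but it holds locks that block writes
    -- to the table for the duration of the rebuild.
    REINDEX INDEX orders_created_at_idx;

    -- Option 2: build a replacement without blocking writes, then swap it in.
    CREATE INDEX CONCURRENTLY orders_created_at_idx_new ON orders (created_at);
    DROP INDEX orders_created_at_idx;
    ALTER INDEX orders_created_at_idx_new RENAME TO orders_created_at_idx;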

In all cases when you run VACUUM, it’s a good idea to run ANALYZE as well, either at the same time in one command or as two separate commands. This updates the internal statistics that Postgres uses when creating query plans. The number of live and dead rows in a table/index is a part of how Postgres decides to plan and run your queries. It’s a much smaller part of the plan than other statistics, but every little bit can help.
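For example (the table name is hypothetical), either form works:

    -- Vacuum and refresh planner statistics in one command
    VACUUM ANALYZE orders;

    -- Or as two separate commands
    VACUUM orders;
    ANALYZE orders;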

I hope this explanation of what bloat is, and how this tool can help with your database administration, has been helpful.

Written by Keith

August 25th, 2014 at 3:16 pm

Posted in PostgreSQL


Monitoring Streaming Replica Lag Effectively


One of the easiest ways to monitor replica lag when using streaming replication is to turn hot standby on for your replica and use pg_last_xact_replay_timestamp() and/or the other recovery information functions. Here’s an example query to run on the replica systems to get the number of seconds it is behind:
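The original query is not reproduced here, but a common form of it, built on pg_last_xact_replay_timestamp(), looks something like this:

    -- Run on the replica: seconds since the last transaction it replayed
    SELECT extract(epoch from (now() - pg_last_xact_replay_timestamp()))::int
        AS seconds_behind;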

The issue with this query is that while your replica(s) may be 100% caught up, the time interval being returned keeps increasing until new write activity occurs on the primary for the replica to replay. This can cause your monitoring to give false positives that your replica is falling behind if you have things set up to alert when your replicas are more than a few minutes behind. A side effect of this behavior is that it can also give you an indication that writes to your primary have stopped for some reason.

One of our clients has a smaller sized database that doesn’t get quite as much write traffic as our typical clients do. But it still has failover replicas and still needs to be monitored just like our other larger clients to ensure it doesn’t fall too far behind. So, my coworker introduced me to the pg_stat_replication view that was added in PostgreSQL 9.1. Querying this from the primary returns information about streaming replication replicas connected to it.

He also provided a handy query to get back a simple, easy to understand numeric value to indicate replica lag. The issue I ran into using the query is that this view uses pg_stat_activity as one of its sources. If you’re not a superuser, you’re not going to get any statistics on sessions that aren’t your own (and hopefully you’re not using a superuser role as the role for your monitoring solution). So, instead I made a function with SECURITY DEFINER set, made a superuser role the owner, and gave my monitoring role EXECUTE privileges on the function.
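The original function is not shown here, but a minimal sketch of that SECURITY DEFINER approach could look like the following (the function name, owning superuser, and monitoring role are all placeholders):

    CREATE OR REPLACE FUNCTION streaming_replication_check()
        RETURNS SETOF pg_stat_replication
        LANGUAGE sql
        SECURITY DEFINER
    AS $$
        SELECT * FROM pg_stat_replication;
    $$;

    ALTER FUNCTION streaming_replication_check() OWNER TO postgres;
    REVOKE ALL ON FUNCTION streaming_replication_check() FROM PUBLIC;
    GRANT EXECUTE ON FUNCTION streaming_replication_check() TO monitoring_role;

    -- The monitoring role can now see every connected replica:
    SELECT client_addr, state, sent_location, replay_location
    FROM streaming_replication_check();

Note that on 9.1 there is no built-in way to turn those WAL locations into a byte count without doing the hex arithmetic yourself, which is part of what the update below simplifies.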

Running this query gives back a few handy columns that should be good enough for most monitoring tools. You can easily add more columns from pg_stat_replication or any other tables you need to join against for more info.

UPDATE: If you’re running PostgreSQL 9.2+, there is a new, built-in function that avoids needing the above function altogether, so you can just query pg_stat_replication directly.
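The post does not name the function here, but pg_xlog_location_diff(), added in 9.2, fits the description: it turns two WAL locations into a byte count directly (in PostgreSQL 10+ the equivalents are pg_wal_lsn_diff() and pg_current_wal_lsn()). Run on the primary as a superuser, something like this works:

    SELECT client_addr, state,
           pg_xlog_location_diff(pg_current_xlog_location(), replay_location) AS byte_lag
    FROM pg_stat_replication;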

Unfortunately, this still requires superuser privileges to obtain all the relevant data, and most monitoring tools do not use a superuser role (I hope). So in that case you do still need a SECURITY DEFINER function, but it can be a much, much simpler one.
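A hedged sketch of that simpler wrapper (names are again placeholders):

    CREATE OR REPLACE FUNCTION streaming_replication_lag()
        RETURNS TABLE (client_addr inet, state text, byte_lag numeric)
        LANGUAGE sql
        SECURITY DEFINER
    AS $$
        SELECT s.client_addr, s.state,
               pg_xlog_location_diff(pg_current_xlog_location(), s.replay_location)
        FROM pg_stat_replication s;
    $$;

    REVOKE ALL ON FUNCTION streaming_replication_lag() FROM PUBLIC;
    GRANT EXECUTE ON FUNCTION streaming_replication_lag() TO monitoring_role;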

This can also be useful to monitor replica lag when you don’t have hot standby turned on for your replicas to allow read-only queries.

Combining both of the replication monitoring solutions mentioned in this post should give you a much better overall picture of the status of your primary/replica systems.

Written by Keith

October 30th, 2013 at 10:51 am

Posted in PostgreSQL


PG Jobmon Exception Handling


As a quick review for those unfamiliar with what PG Jobmon is, it’s an extension to allow autonomously logging steps within a function so that if the function fails, the individually logged steps are not rolled back and lost. This provides an audit trail and monitoring capabilities for functions critical to your database infrastructure. See my Projects page for the code and other posts tagged with “pg_jobmon” for more info.

One of the trickier issues I came across when making an extension out of the existing code that PG Jobmon was based on was getting useful errors back, both on the console and in the log tables. If an error happened before you logged the first step, or before job logging even started, trying to handle logging the errors in the exception block would produce some rather useless feedback, often hiding the real error. The below has become sort of a template for any function’s exception block where I use jobmon.
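Since the original code block is not reproduced here, below is a hedged sketch of what such an exception block can look like, assuming pg_jobmon is installed in a schema named jobmon and using its add_job(), add_step(), update_step(), fail_job() and close_job() calls; the function body, job name, and status strings are placeholders:

    CREATE OR REPLACE FUNCTION my_critical_function() RETURNS void
        LANGUAGE plpgsql
    AS $$
    DECLARE
        v_job_id    bigint;
        v_step_id   bigint;
    BEGIN
        v_job_id := jobmon.add_job('MY CRITICAL FUNCTION');
        v_step_id := jobmon.add_step(v_job_id, 'Doing the real work');

        -- ... the function's real work goes here ...

        PERFORM jobmon.update_step(v_step_id, 'OK', 'Done');
        PERFORM jobmon.close_job(v_job_id);

    EXCEPTION
        WHEN OTHERS THEN
            -- Error before job logging even started: start it now so the
            -- failure has something to attach to.
            IF v_job_id IS NULL THEN
                v_job_id := jobmon.add_job('MY CRITICAL FUNCTION');
            END IF;
            -- Error between add_job and the first add_step.
            IF v_step_id IS NULL THEN
                v_step_id := jobmon.add_step(v_job_id, 'EXCEPTION before first step logged');
            END IF;
            PERFORM jobmon.update_step(v_step_id, 'CRITICAL', 'ERROR: ' || coalesce(SQLERRM, 'unknown'));
            PERFORM jobmon.fail_job(v_job_id);
            -- Re-raise so the original error is still reported back normally.
            RAISE EXCEPTION '%', SQLERRM;
    END
    $$;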

This accounts for an error occurring before job logging has started (before the call to add_job) or between job logging starting and the first step being logged (between add_job and the first call to add_step). A final call to RAISE EXCEPTION with the original SQL error is made after all of that to ensure the real error is still reported back normally.

UPDATE (2013-02-07): I’ve added some additional error output to the actual raising of the exception to better show where the error is coming from. If you’d like that in your jobmon log as well, just add the variable into the update_step() call. The only downside is that GET STACKED DIAGNOSTICS only works on 9.2+.

Written by Keith

January 11th, 2013 at 11:46 am

PG Jobmon Reaches 1.0.0


We’ve been running PG Jobmon in our production databases at OmniTI for a while now and it’s been working very well. I found an issue with the monitoring function, polished up some other features, and figured it was finally time to call it 1.0.0 stable once that was all fixed.

Since my last post, I did get the simple job logging functions added that I mentioned (back in 0.4.0). To simply log the number of rows affected by a single query, you can use sql_step(job_id, action, sql). It returns a boolean so you can test whether the step was successful or not.
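A hedged example of using that return value inside a larger jobmon-logged plpgsql function (v_job_id would come from an earlier add_job() call; the step name and query are made up):

    IF NOT jobmon.sql_step(v_job_id, 'Purge expired sessions',
            'DELETE FROM sessions WHERE expires_at < now()') THEN
        RAISE EXCEPTION 'Session purge step failed';
    END IF;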

The result in the job_detail table then looks like

If you just want to log a single query as a complete job, you can use sql_job(job_name, sql).
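For example (the job name and query are again made up):

    SELECT jobmon.sql_job('NIGHTLY SESSION PURGE',
        'DELETE FROM sessions WHERE expires_at < now()');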

The result is similar to the simple step function, but makes an entire job log with the given name and a single step entry with how many rows were affected.

The other big change that came with 1.0.0 is to the check_job_status() monitoring function. It no longer requires an interval argument, and it’s recommended not to pass one unless you really need to. What brought this change about was that if you passed an interval smaller than the highest threshold value set in the job_check_config table, it could return some confusing results. I could’ve made it return a clearer result, but honestly it makes no sense at all to pass a smaller interval than what you’ve configured to be monitored in that table. So now it throws an error if you do. And if you use the version that takes no argument, it automatically uses the largest threshold interval you’ve configured. So if you get a new, longer job to monitor in the future, you no longer have to update your monitoring processes to account for it, as long as you use the no-argument version.

The process of monitoring for job problems has also been modified so that check_job_status() can raise notices when a job produces three level 2 (WARNING) alerts in a row as well. Previously this only happened with critical failures. The fail_job() function can now take an optional alert level argument to allow this. This is useful for non-critical issues that, for example, shouldn’t page your on-call, but should still be looked into. For an example, see the next version of my Mimeo extension (>=0.9.0), which can send you a notification when a single replication batch has hit the configured row limit and may cause replication to fall behind if it continues for too long. It’s an easily fixed problem and shouldn’t cause anyone to have to wake up in the middle of the night.
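A hedged sketch of that usage inside a function’s non-critical failure path, assuming the optional second argument to fail_job() is the alert level described above:

    -- Close the job as a level 2 (WARNING) failure instead of the default level 3 (CRITICAL)
    PERFORM jobmon.fail_job(v_job_id, 2);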

I’ve been using this extension extensively with several other extensions I’ve been working on that really need a good monitoring process to make sure they are working properly. The first, mentioned above, is Mimeo, which does specialized, per-table replication between PostgreSQL databases. The other is PG Partman, a table partition manager for time- & serial-based partitioning. Both of these would be pretty useless without some way to easily let you know they are working right and to provide an audit trail when something goes wrong and you’re not around. The autonomous step logging that PG Jobmon provides, without rolling back all the logged steps when something goes wrong, does this wonderfully. Yes, that kind of monitoring could be built right into those extensions, but I’ve always liked the part of the Unix philosophy about doing one thing and doing it well, and letting those smaller, powerful tools work together to greater effect. And we’ve found having PG Jobmon available for any other functions we want to monitor very helpful, since it’s not tied directly into the other extensions.

PG Jobmon – https://github.com/omniti-labs/pg_jobmon
Mimeo – https://github.com/keithf4/mimeo
PG Partman – https://github.com/keithf4/pg_partman

Written by Keith

December 31st, 2012 at 11:05 am

PostgreSQL Job Logging & Monitor Extension – The Monitor


Following up on my post last month about pg_jobmon’s logging capability, this post will discuss its ability to let you monitor your jobs’ statuses. Like last month, I will again be assuming you’ve installed pg_jobmon to the jobmon schema.

The Monitoring Function

The main function used to monitor logged jobs is check_job_status(interval). It returns row output, so it can easily be used as a query by monitoring software (more on that later). The interval argument tells the function how far back to go when checking the job logs. To help the function run faster, this should not be much greater than the longest interval between runs of any job listed in job_check_config (see below). For example, if you’re monitoring a special job that runs weekly, and nothing you’re monitoring runs with any greater interval of time (bi-weekly, monthly, etc.), then you could call the function as shown below.
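For example (the ‘8 days’ value is just an assumption to give a buffer past a weekly schedule):

    SELECT * FROM jobmon.check_job_status('8 days');
    -- If everything is fine, this should come back with the lowest alert code (1)
    -- and an OK-style alert_text.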

I didn’t give it a strict ‘1 week’ interval just to allow a little buffer time. The above is what should be returned if everything is running fine. If not, the alert_text column will contain a list of all the jobs that returned either alert code 2 or 3. The alert_code field is the primary means of communicating the status of running jobs in the database. Use it in combination with the alert_code column in job_status_text to get a plain-text meaning for the code. As I noted last month, pg_jobmon uses three alert codes to express the associated alert levels. The logger functions use the alert_text values to update the status columns in job_log when a job is closed as either successful or failed.

Setup

By default, any job that has failed 3 consecutive times will cause the above function to return alert_code 3. There’s no additional setup to do if that sort of monitoring works for your situation. If you need any alerting criteria beyond that, you can use the job_check_config table to set it up.

Only one line needs to be added to this table for each job that needs different alerting criteria.

  • job_name – This is the EXACT job name as it appears in the job_log table. It is case sensitive, so jobs entered here should be in all caps. Easiest thing to do is just copy the job name right from the job_log table.
  • warn_threshold – This is the interval of time that can pass without the job running before alert_code 2 is returned by check_job_status()
  • error_threshold – This is the interval of time that can pass without the job running before alert_code 3 is returned by check_job_status()
  • active – Set this to TRUE if you want check_job_status() to actively monitor this job. Set to FALSE to disable checking without removing the data from the config table
  • sensitivity – This is the number of times the job can fail (status column in job_log is the text value of alert_code 3, CRITICAL by default) before alert_code 3 is returned by check_job_status(). Note that if you want this to return immediately on the first failure, set it to zero, not one.

The threshold times use the end_time column value from the job_log table to determine when that job last ran successfully. That means a job will have to have run successfully at least once for check_job_status() to return an all clear. This is why the default value for active is false, to avoid accidentally setting off an alert for a new job you just added to the monitor.

Here’s an example of one that runs once a day and should return alert 2 if it hasn’t finished within 2 hours of its last run time and alert 3 if not within 3 hours. Also, a single failure of this job will cause the next run of check_job_status() to return alert_code 3 along with the failing job’s name.
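A hedged reading of that configuration as an INSERT (the job name is made up; the thresholds assume the daily schedule plus the 2 and 3 hour grace periods, and sensitivity 0 makes a single failure alert immediately, per the column descriptions above):

    INSERT INTO jobmon.job_check_config
        (job_name, warn_threshold, error_threshold, active, sensitivity)
    VALUES
        ('DAILY SALES EXPORT', '1 day 2 hours', '1 day 3 hours', true, 0);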

Here’s an example of what a call to check_job_status() returns if a job hasn’t run in its expected timeframe

Third Party Monitoring Software

If you’ve used Nagios as your monitoring solution, the alert_text values probably look familiar. This extension was initially designed for use with Nagios and kept its alert_text values from then. They seem to fit appropriately as defaults, and they still work in our current monitoring software, so they were left as-is. You’re free to update the job_status_text.alert_text values with whatever you wish, though, if your monitoring software needs something else.

The easiest way to get pg_jobmon working with Nagios is to use the check_postgres monitoring script. check_postgres has a custom_query option that works perfectly and already provides the output that Nagios needs. I’m not going to get into the finer details of configuring Nagios, but what follows are example command and service definitions that I’ve set up for my own system where Nagios and postgres run on the same machine. Adjust the paths and server values as needed for your systems.

The other monitoring service that we use, and that this monitor is even easier to set up with, is Circonus. All you need to do is define a Check with a query similar to the one Nagios used (note the extra text value as the first select field):
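A sketch of such a check query (the literal first field is just an illustrative label for Circonus to match on):

    SELECT 'pg_jobmon' AS check_name, alert_code, alert_text
    FROM jobmon.check_job_status();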

And then create a Rule using the regular expression matching options.

In the above image, the last returned value shows the full text result of running the query. So if there are several jobs that have failed, they will all show up in the Circonus alert that is sent out. For us, Severity 3 just sends an email out to the DBA team, so we can try and fix things before actually setting a page off. Sev 1 alerts both our on-call DBA and the on-call Sys Admin.

As I said before, the logging portion of this extension has existed for a while in several forms across different systems, so it’s pretty stable and probably won’t change too much in the near future. This extension is an attempt to bring some consistency and ease of maintenance across the systems we manage. The only thing I can see making a big impact on its design is if PostgreSQL ever gets built-in autonomous functions. The monitoring portion is fairly new, and I’m hoping to find ways to make it more useful, but it seems to be working well so far in both of the monitoring software packages I’ve tried it with. For either portion, I’m more than open to ideas, and the code is always available for anyone to play with and contribute back.

Written by Keith

July 19th, 2012 at 10:59 am