Checking for PostgreSQL Bloat (old)

UPDATE: A newer version of this post has been written, along with a rewrite of the script to use pgstattuple –

One of the on-going issues with PostgreSQL that every administrator must deal with at some point is table and/or index bloat. The MVCC architecture of PostgreSQL lends itself to requiring some extra monitoring & maintenance, especially in large enterprise systems. I’ll save the description of why bloat happens until further down in this post (for those that are curious) and get to the point of this article first.

So far the best check for bloat I’ve come across is the one contained in the check_postgres suite from Bucardo. And in places where we can get check_postgres set up and used, it works great. But sometimes we just want to check the bloat on a system and also have a little more fine-grained control over the output. So recently I pulled out the rather complex bloat check query from check_postgres and adapted it into a command-line Python script:

General options:
  -m {tables,indexes}, --mode {tables,indexes}
                        Provide bloat report for the following objects:
                        tables, indexes. Note that the "tables" mode does not
                        include any index bloat that may also exist in the
                        table. Default is "tables".
  -c CONNECTION, --connection CONNECTION
                        Connection string for use by psycopg. Defaults to
                        "host=" (local socket).
  -f {simple,dict}, --format {simple,dict}
                        Output formats. Simple is a plaintext version suitable
                        for any output (ex: console, pipe to email). Dict is a
                        python dictionary object, which may be useful if
                        taking input into another python script or something
                        that needs a more structured format. Dict also
                        provides more details about object pages. Default is
                        "simple".
  -a MIN_PAGES, --min_pages MIN_PAGES
                        Minimum number of pages an object must have to be
                        included in the report. Default and minimum value is
                        1.
  -A MIN_WASTED_PAGES, --min_wasted_pages MIN_WASTED_PAGES
                        Minimum number of wasted pages an object must have to
                        be included in the report. Default and minimum value
                        is 1.
  -z MIN_WASTED_SIZE, --min_wasted_size MIN_WASTED_SIZE
                        Minimum size of wasted space in bytes. Default and
                        minimum is 1.
  --min_wasted_percentage MIN_WASTED_PERCENTAGE
                        Minimum percentage of wasted space an object must have
                        to be included in the report. Default and minimum
                        value is 0.1%.
  -n SCHEMA, --schema SCHEMA
                        Comma separated list of schema to include in report.
                        All other schemas will be ignored.
  --exclude_schema EXCLUDE_SCHEMA
                        Comma separated list of schemas to exclude. If set
                        along with -n, schemas will be excluded then included.
  --exclude_object_file EXCLUDE_OBJECT_FILE
                        Full path to a file containing a newline delimited
                        list of objects to exclude from the report (tables
                        and/or indexes). All objects must be schema qualified.
                        Comments are allowed if the line is prepended with
                        "#".
  --view_schema VIEW_SCHEMA
                        Set the schema that the bloat report view is in if
                        it's not in the default search_path. Note this option
                        can also be set when running --create_view to set in
                        which schema you want the view created.

  --create_view         Create the required view that the bloat report uses.
                        Places view in default search_path schema unless
                        --view_schema is set.
  --create_mat_view     Same as --create_view, but creates it as a
                        materialized view if your version of PostgreSQL
                        supports it (9.3+).
                        Be aware that this script does not refresh the
                        materialized view automatically.

When I first started working on this script, just running the query by itself on some of our client systems, I realized that it can be quite an expensive query to run on larger systems. It also brings back both table and index bloat together, so other parts of the script split that data up to make the distinction. To allow this check to provide data more efficiently and separate table & index bloat, my script first creates a view to use. For PostgreSQL 9.3+, this can be created as a materialized view, so running the script to get bloat data back is fast and causes no contention in the system.

The other thing I realized is that bloat percentage alone is a poor indicator of actual system health. Very small tables may always have higher than average bloat, or there may always be 1 or 2 pages considered waste, and in reality that has next to zero impact on database performance. Constantly “debloating” them is more a waste of time than the space used. So I added additional filters that also look at the number of pages in objects and the actual wasted space of those objects. This allows the final output of the bloat report to give a more accurate representation of where there may actually be problems that need to be looked into.

Another option is a filter for individual tables or indexes to be ignored. Once you understand why bloat happens, you will come across cases where a table is stuck at a certain bloat point at all times, no matter how many times you VACUUM FULL it or run pg_repack on it (those two things do remove it, but it quickly comes back). This happens with tables that have a specific level of churn in the rows being inserted, updated & deleted. The number of rows being updated/deleted is balanced with the number of rows being inserted/updated, as well as with the autovacuum schedule that marks space for reuse. Removing the bloat from tables like this can actually cause decreased performance because, instead of re-using the space that VACUUM marks as available, Postgres has to allocate more pages to that object from disk before the data can be added. So bloat is not always a bad thing, and the nature of MVCC can lead to improved write performance on some tables.

The “simple” output format is suitable for just using on the command-line or piping out to an email program like mailx to get a status report sent to you.

$ ./ --create_view -c "dbname=pgbench"

$ ./ -c "dbname=pgbench"
1. public.pgbench_accounts........................................................(1.5%) 19 MB wasted
2. public.pgbench_tellers.......................................................(90.4%) 680 kB wasted
3. public.pgbench_branches......................................................(98.0%) 384 kB wasted
4. pg_catalog.pg_amop.......................................................(25.0%) 8192 bytes wasted
5. pg_catalog.pg_amproc.....................................................(33.3%) 8192 bytes wasted
6. pg_catalog.pg_conversion.................................................(33.3%) 8192 bytes wasted
7. pg_catalog.pg_description.................................................(3.2%) 8192 bytes wasted
8. pg_catalog.pg_operator....................................................(7.1%) 8192 bytes wasted

System tables like those you see at the bottom are extremely small and you can see the space they “waste” is negligible. I left the simple format like this so it just provides a quick glance for monitoring purposes. I also provide another output method in the form of a python dictionary that has some additional details. It’s also in a format that can be easily fed to other python scripts or anything else that needs a predictable, structured format.

$ ./ -c "dbname=pgbench" -f dict
{'bloat_percent': '1.5%', 'objectname': 'pgbench_accounts', 'wasted_size': '19 MB', 'wasted_pages': 2407.0, 'total_pages': 163935, 'schemaname': 'public'}
{'bloat_percent': '90.4%', 'objectname': 'pgbench_tellers', 'wasted_size': '680 kB', 'wasted_pages': 85.0, 'total_pages': 94, 'schemaname': 'public'}
{'bloat_percent': '98.0%', 'objectname': 'pgbench_branches', 'wasted_size': '384 kB', 'wasted_pages': 48.0, 'total_pages': 49, 'schemaname': 'public'}
{'bloat_percent': '25.0%', 'objectname': 'pg_amop', 'wasted_size': '8192 bytes', 'wasted_pages': 1.0, 'total_pages': 4, 'schemaname': 'pg_catalog'}
{'bloat_percent': '33.3%', 'objectname': 'pg_amproc', 'wasted_size': '8192 bytes', 'wasted_pages': 1.0, 'total_pages': 3, 'schemaname': 'pg_catalog'}
{'bloat_percent': '33.3%', 'objectname': 'pg_conversion', 'wasted_size': '8192 bytes', 'wasted_pages': 1.0, 'total_pages': 3, 'schemaname': 'pg_catalog'}
{'bloat_percent': '3.2%', 'objectname': 'pg_description', 'wasted_size': '8192 bytes', 'wasted_pages': 1.0, 'total_pages': 31, 'schemaname': 'pg_catalog'}
{'bloat_percent': '7.1%', 'objectname': 'pg_operator', 'wasted_size': '8192 bytes', 'wasted_pages': 1.0, 'total_pages': 14, 'schemaname': 'pg_catalog'}

You can see this provides some more details on how many pages are actually in the given objects vs how many are considered wasted space. All the system tables only have 1 page of wasted space, so the given percentage varies because of how many pages they actually have. You have several options for filtering out those last, unnecessary entries. The first would be to filter out the entire pg_catalog schema, but I don’t recommend that because catalog bloat can actually be a problem on databases that have extensive DDL changes over time. The other option is to use --min_pages to exclude any table or index that has less than a certain number of pages. I think the best option here would actually be --min_wasted_pages, set to something like 10.
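Since each line of the dict format is a literal Python dictionary, it’s easy to consume from another script. Here’s a minimal sketch that parses a few sample lines (copied from the report above) with ast.literal_eval and applies the same kind of wasted-pages filter:

```python
import ast

# Sample lines in the script's "dict" output format (taken from the report above)
report_lines = [
    "{'bloat_percent': '1.5%', 'objectname': 'pgbench_accounts', 'wasted_size': '19 MB', 'wasted_pages': 2407.0, 'total_pages': 163935, 'schemaname': 'public'}",
    "{'bloat_percent': '90.4%', 'objectname': 'pgbench_tellers', 'wasted_size': '680 kB', 'wasted_pages': 85.0, 'total_pages': 94, 'schemaname': 'public'}",
    "{'bloat_percent': '25.0%', 'objectname': 'pg_amop', 'wasted_size': '8192 bytes', 'wasted_pages': 1.0, 'total_pages': 4, 'schemaname': 'pg_catalog'}",
]

# Each line is a Python dict literal, so ast.literal_eval parses it safely
rows = [ast.literal_eval(line) for line in report_lines]

# Keep only objects wasting at least 10 pages, mirroring --min_wasted_pages=10
bloated = [r for r in rows if r["wasted_pages"] >= 10]

for r in bloated:
    print(f"{r['schemaname']}.{r['objectname']}: {r['wasted_size']} wasted")
```

In a real pipeline you would read these lines from the script’s stdout rather than a hard-coded list.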

$ ./ -c "dbname=pgbench" -f dict --min_wasted_pages=10
{'bloat_percent': '1.5%', 'objectname': 'pgbench_accounts', 'wasted_size': '19 MB', 'wasted_pages': 2407.0, 'total_pages': 163935, 'schemaname': 'public'}
{'bloat_percent': '90.4%', 'objectname': 'pgbench_tellers', 'wasted_size': '680 kB', 'wasted_pages': 85.0, 'total_pages': 94, 'schemaname': 'public'}
{'bloat_percent': '98.0%', 'objectname': 'pgbench_branches', 'wasted_size': '384 kB', 'wasted_pages': 48.0, 'total_pages': 49, 'schemaname': 'public'}

You can also use these filters in combination to avoid low percentage bloat like that top entry. Be careful doing this, though: if you have an extremely large table and are trying to figure out where all the potentially wasted space is, you could accidentally filter it out (10% is a low percentage, but 10% of a 500GB table would be 50GB).

$ ./ -c "dbname=pgbench" -f dict --min_wasted_pages=10 --min_wasted_percentage=5
{'bloat_percent': '90.4%', 'objectname': 'pgbench_tellers', 'wasted_size': '680 kB', 'wasted_pages': 85.0, 'total_pages': 94, 'schemaname': 'public'}
{'bloat_percent': '98.0%', 'objectname': 'pgbench_branches', 'wasted_size': '384 kB', 'wasted_pages': 48.0, 'total_pages': 49, 'schemaname': 'public'}

$ ./ -c "dbname=pgbench" --min_wasted_pages=10 --min_wasted_percentage=5
1. public.pgbench_tellers.......................................................(90.4%) 680 kB wasted
2. public.pgbench_branches......................................................(98.0%) 384 kB wasted

Why Bloat Happens

For those of you newer to PostgreSQL administration who may be hearing about bloat for the first time, I figured I’d take the time to explain why this scenario exists and why tools like this are necessary (until they’re hopefully built into the database itself someday). It’s something most people don’t understand unless someone first explains it to them, or until it goes unmonitored and you learn about the headaches it causes the hard way.

MVCC (multi-version concurrency control) is how Postgres has chosen to deal with multiple transactions/sessions hitting the same rows at (nearly) the same time. The documentation, along with Wikipedia, provides excellent and extensive explanations of how it all works, so I refer you there for the details. Bloat is a result of one particular part of MVCC: the handling of updates and deletes.

Whenever you delete a row, it’s not actually deleted; it is only marked as unavailable to all future transactions that begin after the delete occurs. The same happens with an update: the old version of a row is kept active until all currently running transactions have finished, then it is marked as unavailable. I emphasize the word unavailable because the row still exists on disk, it’s just not visible any longer. The VACUUM process in Postgres then comes along and marks any unavailable rows as space that is now available for future inserts or updates. The autovacuum process is configured to run VACUUM automatically after a certain number of writes to a table (follow the link for the configuration options), so it’s not something you typically have to worry about doing manually very often (at least with more modern versions of Postgres).

People often assume that VACUUM is the process that should return the disk space to the file system. It does do this, but only in very specific cases. That used space is contained in page files that make up the tables and indexes (called objects from now on) in the Postgres database system. Page files all have the same size, and differently sized objects just have as many page files as they need. If VACUUM happens to mark every row in a page file as unavailable AND that page also happens to be the final page for the entire object, THEN the disk space is returned to the file system. If there is a single available row, or the page file is any one other than the last, the disk space is never returned by a normal VACUUM. This is bloat. Hopefully this explanation of what bloat actually is shows you how it can sometimes be advantageous for certain usage patterns of tables, and why I’ve included the option to ignore objects in the report.
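To make the page math concrete, here’s a quick sketch assuming PostgreSQL’s default 8 kB page (block) size, which is a compile-time setting. Plugging in the pgbench_accounts numbers from the report above shows how wasted pages translate into the reported size and percentage:

```python
# Sketch of the page arithmetic, assuming the default 8 kB PostgreSQL block size
PAGE_SIZE = 8192

def wasted_bytes(wasted_pages):
    # every wasted page is a full block of reusable-but-unreturned space
    return int(wasted_pages) * PAGE_SIZE

def bloat_percent(wasted_pages, total_pages):
    return 100.0 * wasted_pages / total_pages

# pgbench_accounts from the report above: 2407 wasted pages of 163935 total
print(wasted_bytes(2407))                     # 19718144 bytes, i.e. ~19 MB
print(round(bloat_percent(2407, 163935), 1))  # 1.5
```

If your build uses a non-default block size, substitute the value reported by "SHOW block_size".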

If you give the VACUUM command the special flag FULL, then all of that reusable space is returned to the file system. But VACUUM FULL does this by completely rewriting the entire table (and all its indexes) to new pages, and it takes an exclusive lock on the table the entire time it runs (CLUSTER does the same thing, but what that does is outside the scope of this post). For large tables in frequent use, this is problematic. pg_repack has been the most common tool we’ve used to get around that. It recreates the table in the background, tracking changes to it, and then takes a brief lock to swap the old bloated table with the new one.

The reason bloat is a problem when it gets out of hand is not just the disk space it uses up. Every time a query is run against a table, the visibility flags on individual rows and index entries are checked to see if they are actually available to that transaction. On large tables (or small tables with a lot of bloat), the time spent checking those flags builds up. This is especially noticeable with indexes, where you expect an index scan to improve your query performance but it seems to make no difference, or is actually worse than a sequential scan of the whole table. This is why index bloat is checked independently of table bloat: a table could have little to no bloat, but one or more of its indexes could be badly bloated. Index bloat (as long as it’s not a primary key) is easier to solve because you can either just reindex that one index, or you can concurrently create a new index on the same column and then drop the old one when it’s done.
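That concurrent index swap can be sketched as the following sequence of statements. The table, column, and index names here are hypothetical, and you’d run the SQL through psql or any client:

```python
# Hypothetical sketch: replacing a bloated non-primary-key index without a
# long exclusive lock. Table/index/column names are made up for illustration.
table = "public.pgbench_accounts"
old_index = "accounts_abalance_idx"
new_index = old_index + "_new"

steps = [
    # build a fresh copy of the index without blocking writes
    f"CREATE INDEX CONCURRENTLY {new_index} ON {table} (abalance);",
    # drop the bloated original once the new index is ready
    f"DROP INDEX CONCURRENTLY public.{old_index};",
    # optionally take back the original name
    f"ALTER INDEX public.{new_index} RENAME TO {old_index};",
]

for sql in steps:
    print(sql)
```

Note that CREATE INDEX CONCURRENTLY can leave an invalid index behind if it fails, so check the new index is valid before dropping the old one.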

In all cases when you run VACUUM, it’s a good idea to run ANALYZE as well, either at the same time in one command or as two separate commands. This updates the internal statistics that Postgres uses when creating query plans. The number of live and dead rows in a table/index is a part of how Postgres decides to plan and run your queries. It’s a much smaller part of the plan than other statistics, but every little bit can help.

I hope this explanation of what bloat is, and how this tool can help with your database administration, has been helpful.