Per-Table Autovacuum Tuning
A pattern that drives many of my blog posts is the frequency of client questions, and that is definitely the case here again. Vacuum tuning to manage bloat and transaction ID wraparound on production systems has been a hot topic lately, and this has even been getting down to tuning autovacuum on an individual table basis. I've already discussed bloat pretty extensively in previous posts. While I'd like to get into the details of transaction ID wraparound, that really isn't the focus of this post, so I'll defer you to the documentation.
One setting I will discuss first, though, is autovacuum_freeze_max_age. Basically, when the age of a table's oldest unfrozen transaction ID reaches this value, a more aggressive "emergency" autovacuum is kicked off against that table. If many tables hit this at the same time (a common occurrence with data warehouses that have many large, sparsely written tables), these autovacuums can cause some pretty heavy and long-lasting IO. I highly recommend increasing autovacuum_freeze_max_age from the default value of 200 million to 1 billion. However, I don't recommend doing this unless you have some monitoring in place to watch for tables approaching both autovacuum_freeze_max_age and actual wraparound, the latter of which can shut your database down for some pretty extensive downtime.
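For that monitoring piece, a simple way to keep an eye on things is to compare each table's relfrozenxid age against the setting. A minimal sketch (the LIMIT and relkind filter are just examples; feed the output into whatever monitoring you already use):

SELECT c.oid::regclass AS tablename,
       age(c.relfrozenxid) AS xid_age,
       current_setting('autovacuum_freeze_max_age')::bigint AS freeze_max_age
FROM pg_class c
WHERE c.relkind IN ('r', 'm', 't')
ORDER BY age(c.relfrozenxid) DESC
LIMIT 20;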
So ideally, we want autovacuum running often enough on its own that you never encounter any of those bad situations above. The next thing I'd recommend tuning on any database, no matter its size or traffic, before getting down to the table-level methods is the set of default settings that control when autovacuum initially kicks in.
autovacuum_analyze_scale_factor = .10
autovacuum_analyze_threshold = 50
autovacuum_vacuum_scale_factor = .20
autovacuum_vacuum_threshold = 50
In my opinion, the defaults on these are not ideal for any situation. The scales are too high and the thresholds are too low. The scale factor settings are percentage values that say, “When this percentage of the table’s rows have changed (updates/deletes), run vacuum or analyze”. The threshold settings say that “When this many rows have changed, run vacuum or analyze”. These two settings are used together to determine the actual threshold that will kick in autovacuum. The formula taken from the documentation shows this:
vacuum threshold = autovacuum_vacuum_threshold + (autovacuum_vacuum_scale_factor * number of rows)
analyze threshold = autovacuum_analyze_threshold + (autovacuum_analyze_scale_factor * number of rows)
So this means that with the default settings, a 100,000 row table will have autovacuum kick in when 20,050 rows have changed and autoanalyze will kick in when 10,050 have changed. Not too bad, but my recommended default settings are these:
autovacuum_analyze_scale_factor = .05
autovacuum_analyze_threshold = 500
autovacuum_vacuum_scale_factor = .10
autovacuum_vacuum_threshold = 500
This halves the scale factor settings so autovacuum is a bit more aggressive about how often it kicks in. I increase the threshold values because, on very small tables, you don't want autovacuum becoming overly aggressive after decreasing the scale factors. Vacuuming too often isn't necessarily harmful, but at that point it's just wasted IO.
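If you manage your configuration with ALTER SYSTEM (9.4+) rather than editing postgresql.conf directly, applying these recommended defaults might look like the sketch below. All four settings only require a reload, not a restart.

ALTER SYSTEM SET autovacuum_analyze_scale_factor = 0.05;
ALTER SYSTEM SET autovacuum_analyze_threshold = 500;
ALTER SYSTEM SET autovacuum_vacuum_scale_factor = 0.10;
ALTER SYSTEM SET autovacuum_vacuum_threshold = 500;
SELECT pg_reload_conf();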
So how about when you start getting into very large tables? Can you see why this could be a problem?
(800,000,000 rows * .2) + 50  = 160,000,050
(800,000,000 rows * .1) + 500 =  80,000,500
We would have to wait until nearly 160 million rows have changed before autovacuum would ever kick in. Even with my recommended defaults, 80 million rows is still quite a bit to wait for. For a table with a high write rate this might work out fine, but in most cases it won't be ideal. This affects bloat buildup, since autovacuum may not run often enough to free up space for new/updated rows. It can also let the table's XID age grow quite high, more than likely reaching autovacuum_freeze_max_age before a normal autovacuum would ever touch the table. This is when we need to start tuning on a per-table basis, and all four of the above settings can be set that way.
So the first question to ask is: how often do we want autovacuum to run? I'll use at least once per day as a baseline starting point. This means we'll need a way to see how many rows have changed per day.
mydb=# select * from pg_stat_all_tables where relname = 'mytable';
-[ RECORD 1 ]-------+------------------------------
relid               | 32307
schemaname          | public
relname             | mytable
seq_scan            | 25
seq_tup_read        | 5042494092
idx_scan            | 199790800
idx_tup_fetch       | 3373819424
n_tup_ins           | 342119654
n_tup_upd           | 752699033
n_tup_del           | 167007870
n_tup_hot_upd       | 281342612
n_live_tup          | 286999790
n_dead_tup          | 9050790
n_mod_since_analyze | 762072
last_vacuum         | 2017-08-17 16:14:25.447345-05
last_autovacuum     | 2018-08-24 13:17:35.88936-05
last_analyze        | 2018-08-16 20:32:13.265396-05
last_autoanalyze    | 2018-08-31 07:43:45.551732-05
vacuum_count        | 1
autovacuum_count    | 43
analyze_count       | 29
autoanalyze_count   | 129
The pg_stat_all_tables view (or pg_stat_user_tables to exclude system tables) can help us get this. PostgreSQL doesn't keep long-term trending data on table writes, but it does keep counters. Specifically, the n_tup_upd & n_tup_del columns can tell us how many tuples/rows were changed since the statistics for this database were last reset. So if we collect this data once per day around the same time, we can get an idea of the write rate that determines when autovacuum kicks in. Here's a simple command you can add to a cronjob to generate some CSV data you could then pull into a spreadsheet and quickly do the calculations.
psql -U postgres -d mydb -c "COPY (select now() AS gather_time, * from pg_stat_all_tables where relname = 'mytable') To STDOUT WITH CSV DELIMITER ','" >> mytable_stats.csv
Just add n_tup_upd + n_tup_del and compare that to the previous day's total and you have your rows changed per day. If you want autovacuum to run more or less often than daily, just adjust your collection interval as needed. So let's say there were around 300,000 to 500,000 rows changed per day on our giant 800-million-row table, and we want autovacuum to run once at least 400,000 rows have changed. This means we want to ignore the scale factor setting completely and rely only on the threshold value. That's the other problem with the scale factor and large tables: as the table grows, autovacuum runs less and less often since the row count needed to hit that percentage keeps growing along with it. So to set the above settings on a per-table basis, we can use the ALTER TABLE command to adjust them for just that table.
ALTER TABLE public.mytable SET (
    autovacuum_analyze_scale_factor = 0,
    autovacuum_vacuum_scale_factor = 0,
    autovacuum_vacuum_threshold = 400000,
    autovacuum_analyze_threshold = 100000
);
We usually want analyze to run more often than a vacuum so queries can have accurate statistics. Analyzes are fairly lightweight, even on large tables, so it’s ok to be more aggressive about when those run.
Once we have those settings in place, we can then use the same stats view to see if autovacuum & autoanalyze are running when we want them to. last_autovacuum & last_autoanalyze are the timestamps of when each last completed.
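A quick query like the following (the table name is just the example from above) shows both the per-table storage options that have been set and when autovacuum & autoanalyze last completed:

SELECT c.relname,
       c.reloptions,
       s.last_autovacuum,
       s.last_autoanalyze,
       s.autovacuum_count,
       s.autoanalyze_count
FROM pg_class c
JOIN pg_stat_user_tables s ON s.relid = c.oid
WHERE c.relname = 'mytable';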
You may need it to vacuum more often than this, depending on the turnover from updates, deletes & inserts, to keep bloat under control. To determine that, I recommend setting up bloat monitoring. If you see that bloat keeps growing, you can look at vacuuming more often or even adjusting the fillfactor settings for the table or indexes.
Another suggestion I will make is for databases that have any tables that get few to no writes, especially large ones. First, upgrade to a recent version of PostgreSQL if you're not on at least 9.6. Then schedule a regular VACUUM of your database, or at least of the tables that are no longer being touched often. 9.6 introduced a feature where, if all of the rows on a page are frozen, autovacuum can skip over that page entirely and not have to evaluate each individual row. This can be a HUGE savings in both time and IO, and the larger the table, the more savings you'll see in the long run. For anyone running their databases in the cloud, where IO is money, this can save quite a bit. Typically, tuning autovacuum removes the need to schedule manual vacuums, but if one turns out to be needed in some cases, running it during off-peak hours won't hurt anything.
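As a rough sketch of what such a scheduled run could look like (database name and schedule are placeholders; --jobs requires 9.5+ if you want tables handled in parallel):

# crontab entry: weekly vacuum + analyze of the whole database during off-peak hours
0 3 * * 0  vacuumdb --analyze --quiet --jobs=2 --dbname=mydb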
The other issue that people often encounter is long running autovacuums or autovacuum incurring higher IO costs than anticipated. I’ll be working on another blog post in the future that covers these issues. Hopefully this gives you a good start on how to better tune autovacuum for your growing databases.
Removing A Lot of Old Data (But Keeping Some Recent)
I've had this situation crop up a few times with clients, and after a discussion on #postgresql on Freenode recently, I decided a blog post may be in order. The pitfalls that led me to this solution are useful to cover, and it seems a useful set of steps to have documented and be able to share again later.
There comes a time for most people when you have a table that builds up quite a lot of rows and you then realize you didn’t actually need to keep all of it. But you can’t just run a TRUNCATE because you do want to keep some of the more recent data. If it’s just a few million small rows, it’s not a huge deal to just run a simple DELETE. But when that starts getting into the billions of rows, or your rows are very large (long text, bytea, etc), a simple DELETE may not be realistic.
The first, and arguably easiest, way to deal with this would be to run the DELETE in batches instead of one large transaction. This allows you to add a pause between batches to help control I/O. Another side effect of such large delete operations can be an excessively high amount of WAL generation. This not only contributes to I/O, it can also dramatically increase disk space usage. When you're trying to delete data because of disk space constraints, this can end up making things even worse before they get better. And if you're deleting quite a lot of data spread over a long period of time from a table that didn't previously see many deletes, you likely won't get much disk space back at the end without doing a VACUUM FULL or pg_repack (see my discussion on bloat here). And the biggest issue of all when disk space is critical is that either of those options requires at least as much free disk space as a full copy of the unbloated table would take up. So if disk space is the crucial problem most people are trying to solve by removing old data, how can we easily do this?
The first option that's most easily done on almost any RDBMS is to make a secondary table and have new data copied/redirected there. The I/O and WAL generation of moving the smaller, required amount of data is much less than removing all the old data. Then you can just drop the old table. Most people do this via a trigger-based method: have every write to the old table also write to the new one as well. Then once you're sure things are working, you can take a brief lock on both the old and new tables and swap their names. PostgreSQL makes this last step much easier by having transactional DDL. I'll cover the commands to do this later since my final solution is similar to this. The main problems surrounding this come when the table you're trying to clean up is a very high-traffic table. A trigger like this basically doubles all writes, which could possibly cause some I/O issues. There's also the matter of getting that trigger function code right, otherwise all writes break. That can be a little nerve-wracking on critical production systems. But there is a way to avoid both the additional I/O of double writes and the headache of writing triggers.
The actual solution we came up for this involves using the often overlooked feature of table inheritance in PostgreSQL. The steps basically go like this:
- Create a new table exactly like the first one
- Set ownership/privileges on the new table to match the old table
- Have the old table INHERIT the new table
- Swap old and new table names
- Move the data you want to keep from the old table to the new one
- Drop the old table
As soon as you do step 4, all new data is immediately going to the new table. Also, since the old table is a child of the new table, all your old data is still visible from the original table name. Moving the data from the old table to the new one is nearly transparent to any users of the table, the only issue being that some queries may be slightly slower during the transition period since the planner has to account for two tables. You can help mitigate this slightly by placing a constraint on the new table (before the swap, when it's empty) that only allows data in the new table's window. This allows constraint exclusion to possibly ignore the old table while you get data moved. Placing a constraint on the old table probably wouldn't help much since it would have to lock it for validation. Yes, there will be additional I/O and disk usage from WAL while you move data from the old table to the new, but this can be controlled to a much greater degree since all new data isn't also being written twice by a trigger. You can move the data in batches with any necessary pauses to control those I/O & WAL spikes.
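As a sketch of that optional constraint, using the hypothetical notifications tables and cutoff date from the example commands further below, and added while the new table is still empty so validation is instant:

ALTER TABLE public.notifications_new
    ADD CONSTRAINT notifications_new_recent_only
    CHECK (updated_at >= '2016-10-18 00:00:00'::timestamp);

Once the data move is finished and the old table is dropped, you'd likely drop this constraint again so it doesn't reject legitimately older updated_at values written later.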
The one thing this method does not account for is updates to data in the old table that would technically place it in the new one while you're in the process of moving the recent data you want to keep. That update will not be moved from the old table to the new one. But in most cases where we've done this, those updates weren't that critical since, if you'd used the original method of just deleting the old data, you would've been deleting that data that was getting updated anyway. But it's something to be aware of if you go querying the old table and still see "new" data after you think you're done migrating. It's also something to be aware of if your application thinks it updated something when the old table was there and now it's suddenly gone. You may need to stick with the trigger method above if that's the case.
An example of commands to do this is below. The LIKE clause to the CREATE TABLE statement is quite useful since this can automatically include all indexes, constraints, defaults, comments, and storage options. It does NOT include ownership or privileges however, so the one critical step in this piece is definitely #2 above. You can easily see a table’s privileges with the \dp option in psql. Also, explicitly obtaining the exclusive lock on both tables before doing the name switch ensures nothing weird happens during whatever brief moment could exist between the switch.
CREATE TABLE public.notifications_new (LIKE public.notifications INCLUDING ALL);

ALTER TABLE public.notifications_new OWNER TO sysadmin;

GRANT select ON public.notifications_new TO read_only;
GRANT select, insert, update, delete ON public.notifications_new TO app_user;
GRANT all ON public.notifications_new TO admin;

ALTER TABLE public.notifications INHERIT public.notifications_new;

BEGIN;
LOCK TABLE public.notifications IN ACCESS EXCLUSIVE MODE;
LOCK TABLE public.notifications_new IN ACCESS EXCLUSIVE MODE;
ALTER TABLE public.notifications RENAME TO notifications_old;
ALTER TABLE public.notifications_new RENAME TO notifications;
COMMIT;   -- (or ROLLBACK; if there's a problem)
Once all these steps are done, you can then begin the process of moving your more recent data out of the old table and into the new via whichever method works best for you. One easy method to batch this is a CTE query that does the DELETE/INSERT with a SELECT in a single query to limit the rows moved.
WITH row_batch AS (
    SELECT id
    FROM public.notifications_old
    WHERE updated_at >= '2016-10-18 00:00:00'::timestamp
    LIMIT 20000 ),
delete_rows AS (
    DELETE FROM public.notifications_old o
    USING row_batch b
    WHERE b.id = o.id
    RETURNING o.id, account_id, created_at, updated_at, resource_id, notifier_id, notifier_type)
INSERT INTO public.notifications
SELECT * FROM delete_rows;
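One way to drive that batch repeatedly with a pause in between is a small shell loop around psql. This is only a sketch: it assumes the query above is saved in move_batch.sql, that the database is named mydb, and it has no real error handling.

#!/bin/bash
# Keep moving batches until the INSERT reports that 0 rows were moved.
while true; do
    # psql prints the command tag of the final statement, e.g. "INSERT 0 20000"
    tag=$(psql -X -d mydb -f move_batch.sql | tail -n 1)
    rows=${tag##* }
    echo "$(date) moved ${rows} rows"
    if [ "${rows}" = "0" ]; then
        break
    fi
    sleep 10   # pause between batches to smooth out I/O and WAL spikes
done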
And once that’s done, you can then DROP the old table, instantly recovering all that disk space with minimal WAL traffic and zero bloat aftermath!
Cleaning Up PostgreSQL Bloat
As a followup to my previous post on checking for bloat, I figured I’d share some methods for actually cleaning up bloat once you find it. I’ll also be providing some updates on the script I wrote due to issues I encountered and thanks to user feedback from people that have used it already.
First, as these examples will show, the most important thing you need to clean up bloat is extra disk space. This means it is critically important to monitor your disk space usage if bloat turns out to be an issue for you. And if your database is of any reasonably large size and you regularly do updates & deletes, bloat will be an issue at some point. I'd say a good goal is to always stay below 75% disk usage, either by archiving and/or pruning old data that's no longer needed, or by simply adding more disk space or migrating to new hardware altogether. Having less than 25% free can put you in a precarious situation where you may have a whole lot of disk space you could free up, but not enough room to actually do any cleanup at all, or at least not without impacting performance in big ways (ex: having to drop & recreate a bloated index instead of rebuilding it concurrently, making previously fast queries extremely slow).
The above graph (y-axis terabytes) shows my recent adventures in bloat cleanup after using this new scan, and validates that what is reported by pg_bloat_check.py is actually bloat. In both this graph and the one below, there were no data purges going on and each of the significant line changes coincided exactly with a bloat cleanup session. You can see back on May 26-27th a huge drop in size. You can see an initial tiny drop followed by a fairly big increase then the huge drop. This is me first fixing one small, but very bloated index followed by running a pg_repack to take care of both table and a lot of index bloat. This is actually the group_members table I used as the example in my previous post. Over the next week or so I worked through roughly 80 bloated objects to recover about 270GB of disk space. Now, it may turn out that some of these objects will have their bloat return to their previous values quickly again and those could be candidates for exclusion from the regular report. But I figured I’d go through everything wasting more than a few hundred MB just so I can better assess what the actual normal bloat level of this database is. Here’s another example from another client that hadn’t really had any bloat monitoring in place at all before (that I was aware of anyway). It’s showing disk space available instead of total usage, hence the line going the opposite direction, and db12 is a slave of db11.
The easiest, but most intrusive, bloat removal method is to just run a VACUUM FULL on the given table. This will take an exclusive lock on the table (blocks all reads and writes) and completely rebuild the table to new underlying files on disk. This clears out 100% of the bloat in both the table and all indexes it contains at the expense of blocking all access for the duration. If you can afford the outage, it’s the easiest, most reliable method available. For very small tables this is likely your best option.
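If you do go this route, it can be worth setting lock_timeout first so the VACUUM FULL gives up rather than sitting in the lock queue and blocking everything that comes in behind it. A minimal sketch, using the guests table from the examples further below:

SET lock_timeout = '10s';
VACUUM FULL ANALYZE public.guests;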
The next option is to use the REINDEX command. This can be run on several levels: INDEX, TABLE, DATABASE. 9.5 introduced the SCHEMA level as well. Running it on the TABLE level has the same consequence of likely locking the entire table for the duration, so if you’re going that route, you might as well just run a VACUUM FULL. Same for running at the DATABASE level, although if you’re running 9.5+, it did introduce parallel vacuuming to the vacuumdb console command, which would be much more efficient. When running on the INDEX level, things are a little more flexible. All writes are blocked to the table, but if a read-only query does not hit the index that you’re rebuilding, that is not blocked. If you can afford several shorter outages on a given table, or the index is rather small, this is the best route to take for bloat cleanup.
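For reference, the two most common forms look like this (index and table names are from the example further down):

-- Rebuild a single index: writes to the table are blocked, but reads
-- that don't use this index can continue.
REINDEX INDEX index_guests_on_state;

-- Rebuild every index on the table: writes are blocked the whole time and
-- most reads will end up waiting too, so it's close to a full outage.
REINDEX TABLE guests;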
If you’ve got tables that can’t really afford long outages, then things start getting tricky. Before getting into pg_repack, I’d like to share some methods that can be used without third-party tools. Index bloat is the most common occurrence, so I’ll start with that.
If you've just got a plain old index (btree, gin, or gist), there's a combination of three commands that can clear up bloat with minimal downtime (depending on database activity). The CONCURRENTLY flag to the CREATE INDEX command allows an index to be built without blocking any reads or writes to the table. So say we had this bloated index:
       objectname       | objecttype | dead_tuples | free_space | free_percent
------------------------+------------+-------------+------------+--------------
 index_guests_on_state  | index      | 0 bytes     | 9460 MB    | 60.04
No dead tuples (so autovacuum is running efficiently) and 60% of the total index is free space that can be reclaimed. A handy command to get the definition of an index is pg_get_indexdef(regclass). In this case it’s a very easy index definition, but when you start getting into some really complicated functional or partial indexes, having a definition you can copy-n-paste is a lot safer.
# SELECT pg_get_indexdef('index_guests_on_state'::regclass);
                          pg_get_indexdef
------------------------------------------------------------------
 CREATE INDEX index_guests_on_state ON guests USING btree (state)
Now we can write our set of commands to rebuild the index:
CREATE INDEX CONCURRENTLY index_guests_on_state_new ON guests USING btree (state);
ANALYZE guests;
DROP INDEX CONCURRENTLY index_guests_on_state;
ANALYZE guests;
ALTER INDEX index_guests_on_state_new RENAME TO index_guests_on_state;
I threw the ANALYZE calls in there just to ensure that the catalogs are up to date for any queries coming in during this rebuild. May not really be necessary, but I was doing this on a very busy table, so I’d rather be paranoid about it. Neither the CREATE nor the DROP command will block any other sessions that happen to come in while this is running. However, that final ALTER INDEX call can block other sessions coming in that try to use the given table. But the rename is optional and can be done at any time later. After the DROP command, your bloat has been cleaned up. While concurrent index creation does not block, there are some caveats with it, the major one being it can take much longer to rebuild the index. One of these for the second client above took 4.5 hours to complete. The documentation on building indexes concurrently goes into more detail on this, and how to deal with it possibly failing.
If you're running this on a UNIQUE index, you may run into an issue if it was created as a UNIQUE CONSTRAINT vs a UNIQUE INDEX. Functionally, both are the same as far as PostgreSQL is concerned. And under the hood, creating a unique constraint will just create a unique index anyway. The big difference is that you will not be able to drop a unique constraint concurrently. You will have to do an ALTER TABLE ... DROP CONSTRAINT ... call, which requires an exclusive lock, just like the RENAME above. Also, an index is more flexible than a constraint since you can make a partial unique index as well. So it's better to just make a unique index rather than a constraint if possible.
PRIMARY KEYs are another special case. Functionally, they’re no different than a unique index with a NOT NULL constraint on the column. But they are marked specially in the catalog and some applications specifically look for them. You can do something very similar to the above, taking advantage of the USING clause to the ADD PRIMARY KEY command.
postgres@production=# CREATE UNIQUE INDEX CONCURRENTLY email_records_pkey_new ON public.email_records USING btree (id);
CREATE INDEX
Time: 2794263.993 ms
postgres@production=# ANALYZE public.email_records;
ANALYZE
Time: 1658.590 ms
postgres@production=# ALTER TABLE public.email_records DROP CONSTRAINT email_records_pkey;
ALTER TABLE
Time: 43.774 ms
postgres@production=# ANALYZE public.email_records;
ANALYZE
Time: 1518.819 ms
postgres@production=# ALTER INDEX public.email_records_pkey_new RENAME TO email_records_pkey;
ALTER INDEX
Time: 0.586 ms
postgres@production=# ALTER TABLE public.email_records ADD PRIMARY KEY USING INDEX email_records_pkey;
ALTER TABLE
Time: 13.219 ms
postgres@production=# ANALYZE public.email_records;
ANALYZE
Time: 1611.376 ms
I gave full command examples here so you can see the runtimes involved. The concurrent index creation took quite a while (about 46 minutes), but everything besides the analyze commands was sub-second. Giving the command to create a primary key an already existing unique index to use allows it to skip the creation and validation usually done with that command.
As always, there are caveats to this. If the primary key, or any unique index for that matter, has any FOREIGN KEY references to it, you will not be able to drop that index without first dropping the foreign key(s). If there are only one or two of those, you can likely do this in a transaction surrounding the drop & recreation of the primary key, with commands that also drop and recreate the foreign keys. But the more foreign keys there are, the longer the outage for the foreign key validation, which locks all tables involved, and the greater the likelihood of an error in the DDL you're writing to manage recreating everything. In that case, it may just be better to take the outage and rebuild the primary key with the REINDEX command.
In all cases where I can use the above methods, I always try to use those first. They’re the native methods built into the database and, as long as you don’t typo the DDL commands, not likely to be prone to any issues cropping up later down the road. And since index bloat is primarily where I see the worst problems, it solves most cases (the second graph above was all index bloat). If you’re unable to use any of them, though, the pg_repack tool is very handy for removing table bloat or handling situations with very busy or complicated tables that cannot take extended outages. It’s gotten pretty stable over the last year or so, but just seeing some of the bugs that were encountered with it previously, I use it as a last resort for bloat removal. Also, if you’re running low on disk space, you may not have enough room for pg_repack since it requires rebuilding the entire table and all indexes in secondary tables before it can remove the original bloated table. As I said above, I did use it where you see that initial huge drop in disk space on the first graph, but before that there was a rather large spike to get there. In that case, the table had many, many foreign keys & triggers and was a very busy table, so it was easier to let pg_repack handle it.
For table bloat, Depesz wrote some blog posts a while ago that are still relevant with some interesting methods of moving data around on disk. This can also be handy when you are very low on disk space.
Bloat Removal Without Table Swapping
Bloat Removal By Tuples Moving
Since I initially wrote my blog post, I've had some great feedback from people already using pg_bloat_check.py. I've gotten several bugs fixed as well as adding some new features, with version 2.1.0 being the latest available as of this blog post. JSON is now the preferred structured output method if you need to see more details outside of querying the stats table in the database. I also added the --exclude_object_file option, which allows more fine-grained filtering when you want to ignore certain objects in the regular report, but not forever in case they get out of hand. I updated the README with some examples of that since it's a little more complex.
I also made note of the fact that this script isn't something made for real-time monitoring of bloat status. Since it does full scans on both tables and indexes, it has the potential to force data out of shared buffers, so if you run it too often you may affect query performance of things that rely on that data being readily available there. It's best to run it maybe once a month, or once a week at most, during off-peak hours. If you have particularly troublesome tables you want to keep an eye on more regularly, the --tablename option allows you to scan just that specific table and nothing else. Once you've gotten the majority of your bloat issues cleaned up after the first few runs of the script and have seen how bad things are, bloat shouldn't get out of hand so quickly that you need to run it very often. If it does, you may want to re-evaluate how you're using PostgreSQL (ex: MVCC makes it not great as a queuing system).
If anyone else has some handy tips for bloat cleanup, I’d definitely be interested in hearing them.
Checking for PostgreSQL Bloat
My post almost 2 years ago about checking for PostgreSQL bloat is still one of the most popular ones on my blog (according to Google Analytics anyway). Since that’s the case, I’ve gone and changed the URL to my old post and reused that one for this post. I’d rather people be directed to correct and current information as quickly as possible instead of adding an update to my old post pointing to a new one. I’ve included my summary on just what exactly bloat is again below since that seemed to be the most popular part.
UPDATE: I wrote a followup post on methods for actually Cleaning Up PostgreSQL Bloat once you’ve identified it.
The intent of the original post was to discuss a python script I'd written for monitoring bloat status: pg_bloat_check.py. Since that time, I've been noticing that the query used in v1.x of that script (obtained from the check_postgres.pl module) was not always accurate and was often not reporting on bloat that I knew for a fact was there (Ex: I just deleted over 300 million rows, vacuumed & analyzed the table and still no bloat? Sure it could happen, but highly unlikely). So I continued looking around and discovered the pgstattuple contrib module that comes with PostgreSQL. After discussing it with several of the core developers at recent PostgreSQL conferences (PGConfUS & PGCon) I believe this is a much, much better way to get an accurate assessment of the bloat situation. This encouraged me to do a near complete rewrite of my script and v2.0.0 is now available. It's not a drop-in replacement for v1.x, so please check the --help for new options.
pgstattuple is a very simple, but powerful, extension. It doesn't require any additional libraries to be loaded and just adds a few functions you can call on database objects to get some statistics about them. The key function for bloat is the default one, pgstattuple(regclass), which returns information about live & dead tuples and the free space contained in the given object. If you read the description below on what bloat actually is, you'll see that those data points are exactly what we're looking for. The difference between what this function is doing and what the check_postgres.pl query is doing is quite significant, though. The check_postgres query does its best to guess what is dead & free space based on the current statistics in the system catalogs. pgstattuple actually goes through and does a full scan on the given table or index to see what the actual situation is. This does mean the check can be very, very slow on large tables. The database I got the examples below from is 1.2TB, and a full bloat check on it takes just under 1 hour. But with the inaccuracies I've seen returned by the simpler query, this time can be well worth it. The script stores the statistics gathered in a table so they can be easily reviewed at any time and even used for monitoring purposes, just like check_postgres.
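If you want to see the raw numbers the script is working from, calling the extension directly is straightforward. A small sketch using the group_members table from the examples below:

CREATE EXTENSION IF NOT EXISTS pgstattuple;

SELECT tuple_count,
       dead_tuple_count,
       pg_size_pretty(dead_tuple_len) AS dead_tuple_space,
       pg_size_pretty(free_space) AS reusable_space,
       free_percent
FROM pgstattuple('public.group_members');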
Before showing what the script can do, I just want to re-iterate some things from my old post because they’re important. Bloat percentage alone is a poor indicator of actual system health. Small tables may always have a higher than average bloat, or there may always be 1 or 2 pages considered waste, and in reality that has next to zero impact on database performance. Constantly “debloating” them is more a waste of time than the space used. So the script has some filters for object size, wasted space and wasted percentage. This allows the final output of the bloat report to provide a more accurate representation of where there may actually be problems that need to be looked into.
Another option is a filter for individual tables or indexes to be ignored. If you understand why bloat happens, you will come across cases where a table is stuck at a certain bloat point at all times, no matter how many times you VACUUM FULL it or run pg_repack on it (those two things do remove it, but it quickly comes back). This happens with tables that have a specific level of churn with the rows being inserted, updated & deleted. The number of rows being updated/deleted is balanced with the number of rows being inserted/updated as well as the autovacuum schedule to mark space for reuse. Removing the bloat from tables like this can actually cause decreased performance because instead of re-using the space that VACUUM marks as available, Postgres has to again allocate more pages to that object from disk first before the data can be added. So bloat is actually not always a bad thing and the nature of MVCC can lead to improved write performance on some tables. On to the new script!
So as an example of why this new, slower method can be worth it, here's the bloat report for a table and its indexes from the old script using check_postgres:
Old Table Bloat:
2. public.group_members.........................................................(9.6%) 4158 MB wasted

Old Index Bloat:
1. public.group_members_id_pk..................................................(19.5%) 4753 MB wasted
3. public.group_members_user_id_idx.............................................(9.6%) 2085 MB wasted
5. public.group_members_deleted_at_idx..........................................(6.2%) 1305 MB wasted
Here are the results from the statistics table in the new version:
$ pg_bloat_check.py -c "dbname=prod" -t public.group_members

kfiske@prod=# select objectname, pg_size_pretty(size_bytes) as object_size, pg_size_pretty(free_space_bytes) as reusable_space, pg_size_pretty(dead_tuple_size_bytes) dead_tuple_space, free_percent from bloat_stats;
            objectname             | object_size | reusable_space | dead_tuple_space | free_percent
-----------------------------------+-------------+----------------+------------------+--------------
 group_members                     | 42 GB       | 16 GB          | 4209 kB          | 37.84
 group_members_user_id_idx         | 21 GB       | 14 GB          | 1130 kB          | 64.79
 group_members_id_pk               | 24 GB       | 16 GB          | 4317 kB          | 68.96
 group_members_deleted_at_idx      | 20 GB       | 13 GB          | 3025 kB          | 63.77
 group_members_group_id_user_id_un | 11 GB       | 4356 MB        | 6576 bytes       | 38.06
 group_members_group_id_idx        | 17 GB       | 9951 MB        | 0 bytes          | 56.8
 group_members_updated_at_idx      | 15 GB       | 7424 MB        | 0 bytes          | 49.57
Yes, all those indexes did exist before. The old query just didn’t think they had any bloat at all. There’s also a nearly 4x difference in wasted space in the table alone. It’s only 37% of the table in this case, but if you’re trying to clean up bloat due to low disk space, 12GB can be a lot. Another really nice thing pgstattuple provides is a distinction between dead tuples and reusable (free) space. You can see the dead tuple space is quite low in this example. That means autovacuum is running efficiently on this table and marking dead rows from updates & deletes as re-usable. If you see dead tuples is high, that could indicate autovacuum is not running properly and you may need to adjust some of the vacuum tuning parameters that are available. In this case, even a normal vacuum was not freeing the reusable space back to the operating system. See below for why this is. This means either a VACUUM FULL or pg_repack run is required to reclaim it. Here’s the result from making a new index on user_id:
kfiske@prod=# CREATE INDEX concurrently ON group_members USING btree (user_id);
CREATE INDEX
Time: 5308849.412 ms

$ pg_bloat_check.py -c "dbname=prod" -t public.group_members

kfiske@prod=# select objectname, pg_size_pretty(size_bytes) as object_size, pg_size_pretty(free_space_bytes) as reusable_space, pg_size_pretty(dead_tuple_size_bytes) dead_tuple_space, free_percent from bloat_stats;
            objectname             | object_size | reusable_space | dead_tuple_space | free_percent
-----------------------------------+-------------+----------------+------------------+--------------
 group_members                     | 42 GB       | 16 GB          | 2954 kB          | 37.84
 group_members_user_id_idx         | 21 GB       | 14 GB          | 1168 kB          | 64.79
 group_members_id_pk               | 24 GB       | 16 GB          | 4317 kB          | 68.96
 group_members_deleted_at_idx      | 20 GB       | 13 GB          | 3025 kB          | 63.77
 group_members_group_id_user_id_un | 11 GB       | 4356 MB        | 6784 bytes       | 38.06
 group_members_group_id_idx        | 17 GB       | 9951 MB        | 0 bytes          | 56.8
 group_members_updated_at_idx      | 15 GB       | 7424 MB        | 0 bytes          | 49.57
 group_members_user_id_idx1        | 8319 MB     | 817 MB         | 336 bytes        | 9.83
You can see the new index group_members_user_id_idx1 is now down to only 9% wasted space and much smaller. Here’s the result after running pg_repack to clear both the table and all index bloat:
kfiske@prod=# select objectname, pg_size_pretty(size_bytes) as object_size, pg_size_pretty(free_space_bytes) as reusable_space, pg_size_pretty(dead_tuple_size_bytes) dead_tuple_space, free_percent from bloat_stats;
            objectname             | object_size | reusable_space | dead_tuple_space | free_percent
-----------------------------------+-------------+----------------+------------------+--------------
 group_members                     | 25 GB       | 27 MB          | 79 kB            | 0.1
 group_members_id_pk               | 8319 MB     | 818 MB         | 0 bytes          | 9.83
 group_members_user_id_idx         | 8319 MB     | 818 MB         | 0 bytes          | 9.83
 group_members_deleted_at_idx      | 8319 MB     | 818 MB         | 0 bytes          | 9.83
 group_members_group_id_user_id_un | 7818 MB     | 768 MB         | 0 bytes          | 9.83
 group_members_group_id_idx        | 8319 MB     | 818 MB         | 0 bytes          | 9.83
 group_members_updated_at_idx      | 8318 MB     | 818 MB         | 0 bytes          | 9.83
(7 rows)
PostgreSQL 9.5 introduced the pgstattuple_approx(regclass) function, which tries to take advantage of visibility map statistics to speed up gathering tuple statistics, possibly sacrificing some accuracy since it's not hitting each individual tuple. It only works on tables, though. This option is available in the script using the --quick argument. There's also pgstatindex(regclass), which gives some more details on index pages and how the data in them is laid out, but I haven't found a use for that in the script yet.
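Calling these directly looks just like the exact version shown above (object names are again from the earlier examples):

-- Approximate table statistics using the visibility map where possible (9.5+)
SELECT * FROM pgstattuple_approx('public.group_members');

-- Btree index page statistics
SELECT * FROM pgstatindex('public.group_members_id_pk');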
The same output options the old script had are still available: --simple provides a text summary useful for emails, & --dict is a python dictionary that provides a structured output and also greater detail on the raw statistics (basically just the data straight from the table). UPDATE: As of version 2.1.0 of the script, the --json & --jsonpretty options have been added and are the preferred structured output format unless you actually need a python dictionary. The table inside the database provides a new, easy method for reviewing the bloat information as well, but just be aware it is rebuilt from scratch every time the script runs. There's also a new option which I used above (-t, --tablename) that you can use to get the bloat information on just a single table. See the --help for more information on all the options that are available.
Why Bloat Happens
For those of you newer to PostgreSQL administration, and this is the first time you may be hearing about bloat, I figured I’d take the time to explain why this scenario exists and why tools like this are necessary (until they’re hopefully built into the database itself someday). It’s something most don’t understand unless someone first explains it to them or you run into the headaches it causes when it’s not monitored and you learn about it the hard way.
MVCC (multi-version concurrency control) is how Postgres has chosen to deal with multiple transactions/sessions hitting the same rows at (nearly) the same time. The documentation, along with wikipedia provide excellent and extensive explanations of how it all works, so I refer you there for all the details. Bloat is a result of one particular part of MVCC, concentrated around the handling of updates and deletes.
Whenever you delete a row, it’s not actually deleted, it is only marked as unavailable to all future transactions taking place after the delete occurs. The same happens with an update: the old version of a row is kept active until all currently running transactions have finished, then it is marked as unavailable. I emphasize the word unavailable because the row still exists on disk, it’s just not visible any longer. The VACUUM process in Postgres then comes along and marks any unavailable rows as space that is now available for future inserts or updates. The auto-vacuum process is configured to run VACUUM automatically after so many writes to a table (follow the link for the configuration options), so it’s not something you typically have to worry about doing manually very often (at least with more modern versions of Postgres).
People often assume that VACUUM is the process that should return the disk space to the file system. It does do this but only in very specific cases. That used space is contained in page files that make up the tables and indexes (called objects from now on) in the Postgres database system. Page files all have the same size and differently sized objects just have as many page files as they need. If VACUUM happens to mark every row in a page file as unavailable AND that page also happens to be the final page for the entire object, THEN the disk space is returned to the file system. If there is a single available row, or the page file is any other but the last one, the disk space is never returned by a normal VACUUM. This is bloat. Hopefully this explanation of what bloat actually is shows you how it can sometimes be advantageous for certain usage patterns of tables as well, and why I’ve included the option to ignore objects in the report.
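A tiny, throwaway demonstration of that trailing-page behavior (exact sizes will vary):

CREATE TABLE bloat_demo AS
    SELECT g AS id, repeat('x', 100) AS filler
    FROM generate_series(1, 100000) g;

SELECT pg_size_pretty(pg_relation_size('bloat_demo'));  -- note the starting size

DELETE FROM bloat_demo WHERE id <= 50000;  -- empties pages at the front of the file
VACUUM bloat_demo;
SELECT pg_size_pretty(pg_relation_size('bloat_demo'));  -- size barely changes: that's bloat

DELETE FROM bloat_demo WHERE id > 50000;   -- now the trailing pages are empty too
VACUUM bloat_demo;
SELECT pg_size_pretty(pg_relation_size('bloat_demo'));  -- trailing empty pages get truncated

DROP TABLE bloat_demo;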
If you give the VACUUM command the special flag FULL, then all of that reusable space is returned to the file system. But VACUUM FULL does this by completely rewriting the entire table (and all its indexes) to new pages and takes an exclusive lock on the table the entire time it takes to run (CLUSTER does the same thing, but what that does is outside the scope of this post). For large tables in frequent use, this is problematic. pg_repack has been the most common tool we’ve used to get around that. It recreates the table in the background, tracking changes to it, and then takes a brief lock to swap the old bloated table with the new one.
Why bloat is actually a problem when it gets out of hand is not just the disk space it uses up. Every time a query is run against a table, the visibility information of the individual rows and index entries it touches has to be checked to see whether each one is actually available to that transaction. On large tables (or small tables with a lot of bloat), the time spent checking those flags builds up. This is especially noticeable with indexes, where you expect an index scan to improve your query performance and it seems to be making no difference, or is actually worse than a sequential scan of the whole table. And this is why index bloat is checked independently of table bloat, since a table could have little to no bloat but one or more of its indexes could be badly bloated. Index bloat (as long as it's not a primary key) is easier to solve because you can either just reindex that one index, or you can concurrently create a new index on the same column and then drop the old one when it's done.
In all cases when you run VACUUM, it’s a good idea to run ANALYZE as well, either at the same time in one command or as two separate commands. This updates the internal statistics that Postgres uses when creating query plans. The number of live and dead rows in a table/index is a part of how Postgres decides to plan and run your queries. It’s a much smaller part of the plan than other statistics, but every little bit can help.
I hope this explanation of what bloat is, and how this tool can help with your database administration, has been helpful.
Checking for PostgreSQL Bloat (old)
UPDATE: A newer version of this post has been made along with a rewrite of the script to use pgstattuple – http://www.keithf4.com/checking-for-postgresql-bloat/
One of the on-going issues with PostgreSQL that every administrator must deal with at some point is table and/or index bloat. The MVCC architecture of PostgreSQL lends itself to requiring some extra monitoring & maintenance, especially in large enterprise systems. I’ll save the description of why bloat happens until further down in this post (for those that are curious) and get to the point of this article first.
So far the best check for bloat I’ve come across is the one contained in the check_postgres suite from Bucardo. And in places where we can get check_postgres set up and used, it works great. But sometimes we just want to check the bloat on a system and also have a little more fine grained control of the output. So recently I pulled out the rather complex bloat check query from check_postgres and adapted it to a command-line python script: pg_bloat_check.py
General options:
  -m {tables,indexes}, --mode {tables,indexes}
        Provide bloat report for the following objects: tables, indexes. Note
        that the "tables" mode does not include any index bloat that may also
        exist in the table. Default is "tables".
  -c CONNECTION, --connection CONNECTION
        Connection string for use by psycopg. Defaults to "host=" (local socket).
  -f {simple,dict}, --format {simple,dict}
        Output formats. Simple is a plaintext version suitable for any output
        (ex: console, pipe to email). Dict is a python dictionary object, which
        may be useful if taking input into another python script or something
        that needs a more structured format. Dict also provides more details
        about object pages. Default is simple.
  -a MIN_PAGES, --min_pages MIN_PAGES
        Minimum number of pages an object must have to be included in the
        report. Default and minimum value is 1.
  -A MIN_WASTED_PAGES, --min_wasted_pages MIN_WASTED_PAGES
        Minimum number of wasted pages an object must have to be included in
        the report. Default and minimum value is 1.
  -z MIN_WASTED_SIZE, --min_wasted_size MIN_WASTED_SIZE
        Minimum size of wasted space in bytes. Default and minimum is 1.
  -p MIN_WASTED_PERCENTAGE, --min_wasted_percentage MIN_WASTED_PERCENTAGE
        Minimum percentage of wasted space an object must have to be included
        in the report. Default and minimum value is 0.1%.
  -n SCHEMA, --schema SCHEMA
        Comma separated list of schemas to include in the report. All other
        schemas will be ignored.
  -N EXCLUDE_SCHEMA, --exclude_schema EXCLUDE_SCHEMA
        Comma separated list of schemas to exclude. If set along with -n,
        schemas will be excluded then included.
  -e EXCLUDE_OBJECT_FILE, --exclude_object_file EXCLUDE_OBJECT_FILE
        Full path to a file containing a return delimited list of objects to
        exclude from the report (tables and/or indexes). All objects must be
        schema qualified. Comments are allowed if the line is prepended with "#".
  --view_schema VIEW_SCHEMA
        Set the schema that the bloat report view is in if it's not in the
        default search_path. Note this option can also be set when running
        --create_view to set in which schema you want the view created.

Setup:
  --create_view
        Create the required view that the bloat report uses. Places the view in
        the default search_path schema unless --view_schema is set.
  --create_mat_view
        Same as --create_view, but creates it as a materialized view if your
        version of PostgreSQL supports it (9.3+). Be aware that this script
        does not refresh the materialized view automatically.
While I was first working on this script, just running the query by itself on some of our client systems, I realized that it can be quite an expensive query to run on larger systems. It also brings back both table and index bloat together, and other parts of the script then split that apart to make the distinction. To allow this check to provide data more efficiently and to separate table & index bloat, my script first creates a view to use. For PostgreSQL 9.3+, this can be created as a materialized view so running the script to get bloat data back is fast and causes no contention in the system.
The other thing I realized is that bloat percentage alone is a poor indicator of actual system health. Very small tables may always have a higher than average bloat, or there may always be 1 or 2 pages considered waste, and in reality that has next to zero impact on database performance. Constantly “debloating” them is more a waste of time than the space used. So I added in additional filters that also look at the number of pages in objects and the actual wasted space of those objects. This allows the final output of the bloat report to provide a more accurate representation of where there may actually be problems that need to be looked into.
Another option is a filter for individual tables or indexes to be ignored. If you understand why bloat happens, you will come across cases where a table is stuck at a certain bloat point at all times, no matter how many times you VACUUM FULL it or run pg_repack on it (those two things do remove it, but it quickly comes back). This happens with tables that have a specific level of churn with the rows being inserted, updated & deleted. The number of rows being updated/deleted is balanced with the number of rows being inserted/updated as well as the autovacuum schedule to mark space for reuse. Removing the bloat from tables like this can actually cause decreased performance because instead of re-using the space that VACUUM marks as available, Postgres has to again allocate more pages to that object from disk first before the data can be added. So bloat is actually not always a bad thing and the nature of MVCC can lead to improved write performance on some tables.
The “simple” output format is suitable for just using on the command-line or piping out to an email program like mailx to get a status report sent to you.
$ ./pg_bloat_check.py --create_view -c "dbname=pgbench"
$ ./pg_bloat_check.py -c "dbname=pgbench"
1. public.pgbench_accounts........................................................(1.5%) 19 MB wasted
2. public.pgbench_tellers.......................................................(90.4%) 680 kB wasted
3. public.pgbench_branches......................................................(98.0%) 384 kB wasted
4. pg_catalog.pg_amop.......................................................(25.0%) 8192 bytes wasted
5. pg_catalog.pg_amproc.....................................................(33.3%) 8192 bytes wasted
6. pg_catalog.pg_conversion.................................................(33.3%) 8192 bytes wasted
7. pg_catalog.pg_description.................................................(3.2%) 8192 bytes wasted
8. pg_catalog.pg_operator....................................................(7.1%) 8192 bytes wasted
System tables like those you see at the bottom are extremely small and you can see the space they “waste” is negligible. I left the simple format like this so it just provides a quick glance for monitoring purposes. I also provide another output method in the form of a python dictionary that has some additional details. It’s also in a format that can be easily fed to other python scripts or anything else that needs a predictable, structured format.
$ ./pg_bloat_check.py -c "dbname=pgbench" -f dict
{'bloat_percent': '1.5%', 'objectname': 'pgbench_accounts', 'wasted_size': '19 MB', 'wasted_pages': 2407.0, 'total_pages': 163935, 'schemaname': 'public'}
{'bloat_percent': '90.4%', 'objectname': 'pgbench_tellers', 'wasted_size': '680 kB', 'wasted_pages': 85.0, 'total_pages': 94, 'schemaname': 'public'}
{'bloat_percent': '98.0%', 'objectname': 'pgbench_branches', 'wasted_size': '384 kB', 'wasted_pages': 48.0, 'total_pages': 49, 'schemaname': 'public'}
{'bloat_percent': '25.0%', 'objectname': 'pg_amop', 'wasted_size': '8192 bytes', 'wasted_pages': 1.0, 'total_pages': 4, 'schemaname': 'pg_catalog'}
{'bloat_percent': '33.3%', 'objectname': 'pg_amproc', 'wasted_size': '8192 bytes', 'wasted_pages': 1.0, 'total_pages': 3, 'schemaname': 'pg_catalog'}
{'bloat_percent': '33.3%', 'objectname': 'pg_conversion', 'wasted_size': '8192 bytes', 'wasted_pages': 1.0, 'total_pages': 3, 'schemaname': 'pg_catalog'}
{'bloat_percent': '3.2%', 'objectname': 'pg_description', 'wasted_size': '8192 bytes', 'wasted_pages': 1.0, 'total_pages': 31, 'schemaname': 'pg_catalog'}
{'bloat_percent': '7.1%', 'objectname': 'pg_operator', 'wasted_size': '8192 bytes', 'wasted_pages': 1.0, 'total_pages': 14, 'schemaname': 'pg_catalog'}
You can see this provides some more details on how many pages are actually in the given objects vs how many are considered wasted space. All the system tables only have 1 page of wasted space, so the given percentage varies because of how many pages they actually have. You have several options for filtering out those last, unnecessary entries. The first would be to filter out the entire pg_catalog schema, but I don't recommend that because catalog bloat can actually be a problem on databases that have extensive DDL changes over time. The other option is to use --min_pages to exclude any table or index that has fewer than a certain number of pages. I think the best option here would actually be --min_wasted_pages, set to something like 10.
$ ./pg_bloat_check.py -c "dbname=pgbench" -f dict --min_wasted_pages=10
{'bloat_percent': '1.5%', 'objectname': 'pgbench_accounts', 'wasted_size': '19 MB', 'wasted_pages': 2407.0, 'total_pages': 163935, 'schemaname': 'public'}
{'bloat_percent': '90.4%', 'objectname': 'pgbench_tellers', 'wasted_size': '680 kB', 'wasted_pages': 85.0, 'total_pages': 94, 'schemaname': 'public'}
{'bloat_percent': '98.0%', 'objectname': 'pgbench_branches', 'wasted_size': '384 kB', 'wasted_pages': 48.0, 'total_pages': 49, 'schemaname': 'public'}
You can use all these filters in combination as well to also avoid low percentage bloat like that top entry. Be careful doing this, though, since if you have an extremely large table and are trying to figure out where all potentially wasted space is, you could accidentally filter it out (10% is a low percentage but 10% of a 500GB table would be 50GB).
$ ./pg_bloat_check.py -c "dbname=pgbench" -f dict --min_wasted_pages=10 --min_wasted_percentage=5
{'bloat_percent': '90.4%', 'objectname': 'pgbench_tellers', 'wasted_size': '680 kB', 'wasted_pages': 85.0, 'total_pages': 94, 'schemaname': 'public'}
{'bloat_percent': '98.0%', 'objectname': 'pgbench_branches', 'wasted_size': '384 kB', 'wasted_pages': 48.0, 'total_pages': 49, 'schemaname': 'public'}

$ ./pg_bloat_check.py -c "dbname=pgbench" --min_wasted_pages=10 --min_wasted_percentage=5
1. public.pgbench_tellers.......................................................(90.4%) 680 kB wasted
2. public.pgbench_branches......................................................(98.0%) 384 kB wasted
Why Bloat Happens
For those of you newer to PostgreSQL administration, and this is the first time you may be hearing about bloat, I figured I’d take the time to explain why this scenario exists and why tools like this are necessary (until they’re hopefully built into the database itself someday). It’s something most don’t understand unless someone first explains it to them or you run into the headaches it causes when it’s not monitored and you learn about it the hard way.
MVCC (multi-version concurrency control) is how Postgres has chosen to deal with multiple transactions/sessions hitting the same rows at (nearly) the same time. The documentation, along with wikipedia provide excellent and extensive explanations of how it all works, so I refer you there for all the details. Bloat is a result of one particular part of MVCC, concentrated around the handling of updates and deletes.
Whenever you delete a row, it’s not actually deleted, it is only marked as unavailable to all future transactions taking place after the delete occurs. The same happens with an update: the old version of a row is kept active until all currently running transactions have finished, then it is marked as unavailable. I emphasize the word unavailable because the row still exists on disk, it’s just not visible any longer. The VACUUM process in Postgres then comes along and marks any unavailable rows as space that is now available for future inserts or updates. The auto-vacuum process is configured to run VACUUM automatically after so many writes to a table (follow the link for the configuration options), so it’s not something you typically have to worry about doing manually very often (at least with more modern versions of Postgres).
People often assume that VACUUM is the process that should return the disk space to the file system. It does do this but only in very specific cases. That used space is contained in page files that make up the tables and indexes (called objects from now on) in the Postgres database system. Page files all have the same size and differently sized objects just have as many page files as they need. If VACUUM happens to mark every row in a page file as unavailable AND that page also happens to be the final page for the entire object, THEN the disk space is returned to the file system. If there is a single available row, or the page file is any other but the last one, the disk space is never returned by a normal VACUUM. This is bloat. Hopefully this explanation of what bloat actually is shows you how it can sometimes be advantageous for certain usage patterns of tables as well, and why I’ve included the option to ignore objects in the report.
If you give the VACUUM command the special flag FULL, then all of that reusable space is returned to the file system. But VACUUM FULL does this by completely rewriting the entire table (and all its indexes) to new pages and takes an exclusive lock on the table the entire time it takes to run (CLUSTER does the same thing, but what that does is outside the scope of this post). For large tables in frequent use, this is problematic. pg_repack has been the most common tool we’ve used to get around that. It recreates the table in the background, tracking changes to it, and then takes a brief lock to swap the old bloated table with the new one.
Why bloat is actually a problem when it gets out of hand is not just the disk space it uses up. Every time a query is run against a table, the visibility information of the individual rows and index entries it touches has to be checked to see whether each one is actually available to that transaction. On large tables (or small tables with a lot of bloat), the time spent checking those flags builds up. This is especially noticeable with indexes, where you expect an index scan to improve your query performance and it seems to be making no difference, or is actually worse than a sequential scan of the whole table. And this is why index bloat is checked independently of table bloat, since a table could have little to no bloat but one or more of its indexes could be badly bloated. Index bloat (as long as it's not a primary key) is easier to solve because you can either just reindex that one index, or you can concurrently create a new index on the same column and then drop the old one when it's done.
In all cases when you run VACUUM, it’s a good idea to run ANALYZE as well, either at the same time in one command or as two separate commands. This updates the internal statistics that Postgres uses when creating query plans. The number of live and dead rows in a table/index is a part of how Postgres decides to plan and run your queries. It’s a much smaller part of the plan than other statistics, but every little bit can help.
I hope this explanation of what bloat is, and how this tool can help with your database administration, has been helpful.