Per-Table Autovacuum Tuning

A pattern that drives my blog posts seems to be the frequency of client questions, and that is definitely the case here again. Vacuum tuning to manage bloat and transaction ID wraparound on production systems has been a hot topic, and lately this has even been getting down to tuning autovacuum on an individual table basis. I’ve already discussed bloat pretty extensively in previous posts. While I’d like to get into the details of transaction ID wraparound, that really isn’t the focus of this post, so I’ll defer you to the documentation.

One setting I will discuss first, though, is autovacuum_freeze_max_age. Basically, when the transaction ID age of any table reaches this value, a more aggressive “emergency” autovacuum is kicked off against it. If many tables hit this at the same time (a common occurrence with data warehouses that have many large, sparsely written tables), those autovacuums can cause some pretty high and long-lasting IO. I highly recommend increasing autovacuum_freeze_max_age from the default value of 200 million to 1 billion. However, I don’t recommend doing this unless you have some monitoring in place to watch for tables reaching both autovacuum_freeze_max_age and wraparound, the latter of which can shut your database down for some pretty extensive downtime.
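A simple way to keep an eye on that is to compare each table’s transaction ID age to the setting itself. This is just a minimal monitoring sketch using the standard catalog columns; adjust the filters and LIMIT to your needs:

SELECT c.oid::regclass AS table_name,
       age(c.relfrozenxid) AS xid_age,
       current_setting('autovacuum_freeze_max_age')::bigint AS freeze_max_age
FROM pg_class c
WHERE c.relkind IN ('r', 'm', 't')  -- ordinary tables, materialized views, TOAST tables
ORDER BY age(c.relfrozenxid) DESC
LIMIT 10;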

So ideally, we want autovacuum running often enough on its own that you never encounter any of those bad situations above. The next thing I’d recommend tuning on any database, no matter its size or traffic, before getting down to the table-level methods, is the set of default settings that control when autovacuum initially kicks in.

autovacuum_analyze_scale_factor = .10
autovacuum_analyze_threshold = 50
autovacuum_vacuum_scale_factor = .20
autovacuum_vacuum_threshold = 50

In my opinion, the defaults on these are not ideal for any situation. The scale factors are too high and the thresholds are too low. The scale factor settings are percentage values that say, "When this percentage of the table’s rows have changed (updates/deletes), run vacuum or analyze". The threshold settings say, "When at least this many rows have changed, run vacuum or analyze". These two settings are used together to determine the actual threshold that will kick in autovacuum. The formula taken from the documentation shows this:

vacuum threshold = autovacuum_vacuum_threshold + (autovacuum_vacuum_scale_factor * number of rows)
analyze threshold = autovacuum_analyze_threshold + (autovacuum_analyze_scale_factor * number of rows)

So this means that with the default settings, a 100,000 row table will have autovacuum kick in when 20,050 rows have changed and autoanalyze will kick in when 10,050 have changed. Not too bad, but my recommended default settings are these:

autovacuum_analyze_scale_factor = .05
autovacuum_analyze_threshold = 500
autovacuum_vacuum_scale_factor = .10
autovacuum_vacuum_threshold = 500

This halves the scale factor settings so autovacuum kicks in a bit more often. My reason for increasing the threshold values is that once you decrease the scale settings, you really don’t want autovacuum being overly aggressive on very small tables. Vacuuming too often may not necessarily be a bad thing, but it’s just wasted IO at that point.
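These can be set in postgresql.conf, or on 9.4+ with ALTER SYSTEM. A quick sketch of the latter (they are all reload-level settings, so no restart is needed):

ALTER SYSTEM SET autovacuum_analyze_scale_factor = 0.05;
ALTER SYSTEM SET autovacuum_analyze_threshold = 500;
ALTER SYSTEM SET autovacuum_vacuum_scale_factor = 0.10;
ALTER SYSTEM SET autovacuum_vacuum_threshold = 500;
SELECT pg_reload_conf();  -- picks up the new values without a restart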

So what happens when you start getting into very large tables? Can you see why this could be a problem?

(800,000,000 rows * 0.20) + 50 = 160,000,050 (default settings)
(800,000,000 rows * 0.10) + 500 = 80,000,500 (recommended settings)

We would have to wait until nearly 160 million rows have changed before autovacuum would ever kick in. Even with my recommended defaults, 80 million rows is still quite a bit to wait for. For a table with a high write rate this might work out okay, but for most cases it likely won’t be ideal. It affects bloat buildup, since autovacuum may not run often enough to mark the space from old rows as available for new/updated rows. It can also let the table’s transaction ID age climb quite high, more than likely reaching autovacuum_freeze_max_age before a normal autovacuum process would ever touch it. This is when we need to start tuning on a per-table basis, and all four of the above settings can be set that way.
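To spot the tables where this is happening, you can compare the current dead tuple count against the trigger value that the formula above works out to. This is a rough sketch that only looks at the global settings and ignores any per-table overrides:

SELECT s.relname,
       s.n_dead_tup,
       round(current_setting('autovacuum_vacuum_threshold')::numeric
             + current_setting('autovacuum_vacuum_scale_factor')::numeric * c.reltuples) AS vacuum_trigger
FROM pg_stat_user_tables s
JOIN pg_class c ON c.oid = s.relid
ORDER BY c.reltuples DESC
LIMIT 10;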

So the first question to ask is: how often do we want autovacuum to run? I’ll use at least once per day as a starting point. This means we’ll need a way to see how many rows have changed per day.

mydb=# select * from pg_stat_all_tables where relname = 'mytable';
-[ RECORD 1 ]-------+------------------------------
relid               | 32307
schemaname          | public
relname             | mytable
seq_scan            | 25
seq_tup_read        | 5042494092
idx_scan            | 199790800
idx_tup_fetch       | 3373819424
n_tup_ins           | 342119654
n_tup_upd           | 752699033
n_tup_del           | 167007870
n_tup_hot_upd       | 281342612
n_live_tup          | 286999790
n_dead_tup          | 9050790
n_mod_since_analyze | 762072
last_vacuum         | 2017-08-17 16:14:25.447345-05
last_autovacuum     | 2018-08-24 13:17:35.88936-05
last_analyze        | 2018-08-16 20:32:13.265396-05
last_autoanalyze    | 2018-08-31 07:43:45.551732-05
vacuum_count        | 1
autovacuum_count    | 43
analyze_count       | 29
autoanalyze_count   | 129

The pg_stat_all_tables view (or pg_stat_user_tables to exclude system tables) can help us get this. PostgreSQL doesn’t keep long-term trending data on table writes, but it does keep counters. Specifically, the n_tup_upd & n_tup_del columns can tell us how many tuples/rows were changed since the statistics for this database were last reset. So if we collect this data once per day around the same time, we can get an idea of the write rate that would affect when autovacuum kicks in. Here’s a simple command you can add to a cronjob to generate some CSV data you can then pull into a spreadsheet and quickly do calculations on.

psql -U postgres -d mydb -c "COPY (select now() AS gather_time, * from pg_stat_all_tables where relname = 'mytable') To STDOUT WITH CSV DELIMITER ','" >> mytable_stats.csv
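If you’d rather keep the history in the database instead of a spreadsheet, the same idea works with a snapshot table and a window function. This is just a sketch; the mytable_stats table here and its columns are hypothetical:

-- Hypothetical snapshot table, populated once per day (e.g. from cron)
CREATE TABLE IF NOT EXISTS mytable_stats (
    gather_time timestamptz,
    n_tup_upd   bigint,
    n_tup_del   bigint
);

INSERT INTO mytable_stats
SELECT now(), n_tup_upd, n_tup_del
FROM pg_stat_all_tables
WHERE relname = 'mytable';

-- Rows changed per day = difference between consecutive snapshots
SELECT gather_time,
       (n_tup_upd + n_tup_del)
         - lag(n_tup_upd + n_tup_del) OVER (ORDER BY gather_time) AS rows_changed
FROM mytable_stats
ORDER BY gather_time;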

Just add n_tup_upd + n_tup_del and compare that to the previous day’s total and you have your rows changed per day. If you want it to run more or less often, just adjust your collection interval as needed. So let’s say there were around 300,000 to 500,000 rows changed per day on our giant 800-million-row table, and we want autovacuum to run when at least 400,000 rows have changed. This means we want to ignore the scale factor setting completely and rely only on the threshold value. That’s the other problem with the scale factor and large tables: as they grow larger and larger, autovacuum runs less and less, since the same percentage of the table represents more and more rows. So to set the above settings on a per-table basis, we can use the ALTER TABLE command to adjust them for just that table.

ALTER TABLE public.mytable SET (autovacuum_analyze_scale_factor = 0, autovacuum_vacuum_scale_factor = 0, autovacuum_vacuum_threshold = 400000, autovacuum_analyze_threshold = 100000 );

We usually want analyze to run more often than a vacuum so queries can have accurate statistics. Analyzes are fairly lightweight, even on large tables, so it’s ok to be more aggressive about when those run.

Once we have those settings in place, we can then use the same stats view to see if autovacuum & autoanalyze are running when we want them to. last_autovacuum & last_autoanalyze are the timestamps of when they last completed.
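For example, a trivial spot-check for a single table:

SELECT relname, last_autovacuum, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'mytable';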

You may need it to vacuum more often than this, depending on the turnover of updates, deletes & inserts, to keep bloat under control. To determine that, I recommend setting up bloat monitoring. If you see bloat keep growing, you can look at vacuuming more often or even adjusting the fillfactor settings for the table or indexes.

Another suggestion I will make is for databases that have any tables that get few to no writes, especially large ones. First, upgrade to a recent version of PostgreSQL if you’re not on at least 9.6. Then, schedule a regular VACUUM of your database, or at least of the tables that are no longer being touched often. 9.6 introduced a feature where, if all of the rows on a page are frozen, autovacuum is able to completely skip over that page and not have to evaluate each individual row. This can be a HUGE savings in both time and IO, and the larger the table, the more savings you’ll see in the long run. For anyone running their databases in the cloud, where IO is money, this can save you quite a bit. Typically, tuning autovacuum will preclude needing to schedule a manual vacuum, but if it turns out one is needed in some cases, running it during off-peak hours won’t hurt anything.
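A simple way to schedule that is a cron entry running vacuumdb against the rarely-written tables during off-peak hours. Just a sketch; the table names are placeholders:

# Weekly VACUUM/ANALYZE of rarely-written tables, Sunday at 3am
0 3 * * 0 vacuumdb -U postgres -d mydb --analyze -t public.archive_2016 -t public.archive_2017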

The other issue that people often encounter is long running autovacuums or autovacuum incurring higher IO costs than anticipated. I’ll be working on another blog post in the future that covers these issues. Hopefully this gives you a good start on how to better tune autovacuum for your growing databases.

Written by Keith

October 1st, 2018 at 2:43 pm

Cleaning Up PostgreSQL Bloat

As a followup to my previous post on checking for bloat, I figured I’d share some methods for actually cleaning up bloat once you find it. I’ll also be providing some updates on the script I wrote due to issues I encountered and thanks to user feedback from people that have used it already.

First, as these examples will show, the most important thing you need to clean up bloat is extra disk space. This means it is critically important to monitor your disk space usage if bloat turns out to be an issue for you. And if your database is of any reasonably large size and you regularly do updates & deletes, bloat will be an issue at some point. I’d say a good goal is to always stay below 75% disk usage, either by archiving and/or pruning old data that’s no longer needed, or by simply adding more disk space or migrating to new hardware altogether. Having less than 25% free can put you in a precarious situation where you may have a whole lot of disk space you could free up, but not enough room to actually do any of the cleanup, or at least not without impacting performance in big ways (e.g. having to drop & recreate a bloated index instead of rebuilding it concurrently, making previously fast queries extremely slow).

[Graph: total database size in terabytes over time, dropping with each bloat cleanup session]

The above graph (y-axis in terabytes) shows my recent adventures in bloat cleanup after using this new scan, and validates that what is reported by pg_bloat_check.py is actually bloat. In both this graph and the one below, there were no data purges going on, and each of the significant line changes coincided exactly with a bloat cleanup session. Back on May 26-27th you can see a huge drop in size: an initial tiny drop, followed by a fairly big increase, then the huge drop. That was me first fixing one small but very bloated index, followed by running pg_repack to take care of both table bloat and a lot of index bloat. This is actually the group_members table I used as the example in my previous post. Over the next week or so I worked through roughly 80 bloated objects to recover about 270GB of disk space. Now, it may turn out that some of these objects will quickly return to their previous bloat levels, and those could be candidates for exclusion from the regular report. But I figured I’d go through everything wasting more than a few hundred MB just so I can better assess what the actual normal bloat level of this database is. Here’s another example from another client that hadn’t really had any bloat monitoring in place at all before (that I was aware of, anyway). It’s showing disk space available instead of total usage, hence the line going the opposite direction, and db12 is a slave of db11.

[Graph: available disk space over time on db11 and db12, rising with each bloat cleanup session]

The easiest, but most intrusive, bloat removal method is to just run a VACUUM FULL on the given table. This will take an exclusive lock on the table (blocks all reads and writes) and completely rebuild the table to new underlying files on disk. This clears out 100% of the bloat in both the table and all indexes it contains at the expense of blocking all access for the duration. If you can afford the outage, it’s the easiest, most reliable method available. For very small tables this is likely your best option.

The next option is to use the REINDEX command. This can be run at several levels: INDEX, TABLE, DATABASE, and 9.5 introduced the SCHEMA level as well. Running it at the TABLE level has the same consequence of likely locking the entire table for the duration, so if you’re going that route, you might as well just run a VACUUM FULL. The same goes for running it at the DATABASE level, although if you’re running 9.5+, the vacuumdb console command did gain parallel jobs, which would be much more efficient. When running at the INDEX level, things are a little more flexible: all writes to the table are blocked, but a read-only query that does not hit the index you’re rebuilding is not blocked. If you can afford several shorter outages on a given table, or the index is rather small, this is the best route to take for bloat cleanup.

If you’ve got tables that can’t really afford long outages, then things start getting tricky. Before getting into pg_repack, I’d like to share some methods that can be used without third-party tools. Index bloat is the most common occurrence, so I’ll start with that.

If you’ve just got a plain old index (b-tree, gin or gist), there’s a combination of 3 commands that can clear up bloat with minimal downtime (depending on database activity). The CONCURRENTLY flag to the CREATE INDEX command allows an index to be built without blocking any reads or writes to the table. So say we had this bloated index

      objectname                          | objecttype | dead_tuples | free_space | free_percent
 index_guests_on_state                    | index      | 0 bytes     | 9460 MB    |        60.04

No dead tuples (so autovacuum is running efficiently), and 60% of the total index is free space that can be reclaimed. A handy function for getting the definition of an index is pg_get_indexdef(regclass). In this case it’s a very easy index definition, but when you start getting into some really complicated functional or partial indexes, having a definition you can copy-n-paste is a lot safer.

# SELECT pg_get_indexdef('index_guests_on_state'::regclass);
                         pg_get_indexdef
------------------------------------------------------------------
 CREATE INDEX index_guests_on_state ON guests USING btree (state)

Now we can write our set of commands to rebuild the index

CREATE INDEX CONCURRENTLY index_guests_on_state_new ON guests USING btree (state);
ANALYZE guests;
DROP INDEX CONCURRENTLY index_guests_on_state;
ANALYZE guests;
ALTER INDEX index_guests_on_state_new RENAME TO index_guests_on_state;

I threw the ANALYZE calls in there just to ensure that the catalogs are up to date for any queries coming in during this rebuild. That may not really be necessary, but I was doing this on a very busy table, so I’d rather be paranoid about it. Neither the CREATE nor the DROP command will block any other sessions that happen to come in while this is running. However, that final ALTER INDEX call can block other sessions coming in that try to use the given table. But the rename is optional and can be done at any time later. After the DROP command, your bloat has been cleaned up. While concurrent index creation does not block, there are some caveats with it, the major one being that it can take much longer to rebuild the index. One of these for the second client above took 4.5 hours to complete. The documentation on building indexes concurrently goes into more detail on this, and on how to deal with it possibly failing.

If you’re running this on a UNIQUE index, you may run into an issue if it was created as a UNIQUE CONSTRAINT vs a UNIQUE INDEX. Functionally, both are the same as far as PostgreSQL is concerned, and under the hood, creating a unique constraint will just create a unique index anyway. The big difference is that you will not be able to drop a unique constraint concurrently. You will have to do an ALTER TABLE […] DROP CONSTRAINT […] call, which requires an exclusive lock, just like the RENAME above. The index is also more flexible, since you can make a partial unique index as well. So it’s better to just make a unique index vs a constraint if possible.
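The rebuild for that case looks something like this sketch (table, column and constraint names are hypothetical). The expensive index build still happens concurrently; only the DROP CONSTRAINT takes a brief exclusive lock, and afterwards the uniqueness is enforced by a plain unique index rather than a named constraint:

CREATE UNIQUE INDEX CONCURRENTLY guests_email_key_new ON guests USING btree (email);
ALTER TABLE guests DROP CONSTRAINT guests_email_key;  -- brief exclusive lock; also drops its backing index
ALTER INDEX guests_email_key_new RENAME TO guests_email_key;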

PRIMARY KEYs are another special case. Functionally, they’re no different than a unique index with a NOT NULL constraint on the column. But they are marked specially in the catalog and some applications specifically look for them. You can do something very similar to the above, taking advantage of the USING clause to the ADD PRIMARY KEY command.

postgres@production=# CREATE UNIQUE INDEX CONCURRENTLY email_records_pkey_new ON public.email_records USING btree (id);
CREATE INDEX
Time: 2794263.993 ms
postgres@production=# ANALYZE public.email_records;
ANALYZE
Time: 1658.590 ms
postgres@production=# ALTER TABLE public.email_records DROP CONSTRAINT email_records_pkey;
ALTER TABLE
Time: 43.774 ms
postgres@production=# ANALYZE public.email_records;
ANALYZE
Time: 1518.819 ms
postgres@production=# ALTER INDEX public.email_records_pkey_new RENAME TO email_records_pkey;
ALTER INDEX
Time: 0.586 ms
postgres@production=# ALTER TABLE public.email_records ADD PRIMARY KEY USING INDEX email_records_pkey;
ALTER TABLE
Time: 13.219 ms
postgres@production=# ANALYZE public.email_records;
ANALYZE
Time: 1611.376 ms

I gave full command examples here so you can see the runtimes involved. The concurrent index creation took quite a while (about 46 minutes), but everything besides the analyze commands was sub-second. Giving the command to create a primary key an already existing unique index to use allows it to skip the creation and validation usually done with that command.

As always, there are caveats to this. If the primary key, or any unique index for that matter, has any FOREIGN KEY references to it, you will not be able to drop that index without first dropping the foreign key(s). If there are only 1 or 2 of those, you can likely do this in a transaction surrounding the drop & recreation of the primary key with commands that also drop and recreate the foreign keys. But the more of them there are, the longer the outage for the foreign key validation, which will lock all tables involved, and the greater the likelihood of an error in the DDL you’re writing to manage recreating everything. In that case, it may just be better to take the outage and rebuild the primary key with the REINDEX command.
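For the simple one-or-two foreign key case, the transaction looks roughly like this sketch. All object names here are hypothetical, and the new unique index (customers_pkey_new) is assumed to have already been built concurrently beforehand:

BEGIN;
-- Drop the foreign key(s) pointing at the old primary key
ALTER TABLE orders DROP CONSTRAINT orders_customer_id_fkey;
-- Swap the primary key over to the concurrently built index
ALTER TABLE customers DROP CONSTRAINT customers_pkey;
ALTER TABLE customers ADD CONSTRAINT customers_pkey PRIMARY KEY USING INDEX customers_pkey_new;
-- Recreate the foreign key; validation scans "orders" while holding locks on both tables
ALTER TABLE orders ADD CONSTRAINT orders_customer_id_fkey
    FOREIGN KEY (customer_id) REFERENCES customers (id);
COMMIT;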

In all cases where I can use the above methods, I always try to use those first. They’re the native methods built into the database and, as long as you don’t typo the DDL commands, they’re not likely to cause any issues later down the road. And since index bloat is primarily where I see the worst problems, this solves most cases (the second graph above was all index bloat). If you’re unable to use any of them, though, the pg_repack tool is very handy for removing table bloat or handling situations with very busy or complicated tables that cannot take extended outages. It’s gotten pretty stable over the last year or so, but having seen some of the bugs that were encountered with it previously, I use it as a last resort for bloat removal. Also, if you’re running low on disk space, you may not have enough room for pg_repack, since it rebuilds the entire table and all of its indexes in secondary tables before it can remove the original bloated one. As I said above, I did use it where you see that initial huge drop in disk space on the first graph, but before that there was a rather large spike to get there. In that case, the table had many, many foreign keys & triggers and was a very busy table, so it was easier to let pg_repack handle it.

For table bloat, Depesz wrote some blog posts a while ago that are still relevant with some interesting methods of moving data around on disk. This can also be handy when you are very low on disk space.

Bloat Removal Without Table Swapping

Bloat Removal By Tuples Moving

Since I initially wrote my blog post, I’ve had some great feedback from people already using pg_bloat_check.py. I’ve gotten several bugs fixed as well as added some new features, with version 2.1.0 being the latest available as of this blog post. JSON is now the preferred structured output method if you need to see more details outside of querying the stats table in the database. I also added the --exclude_object_file option, which allows for more fine-grained filtering when you want to ignore certain objects in the regular report, but not forever in case they get out of hand. I updated the README with some examples of that since it’s a little more complex.

I also made note of the fact that this script isn’t something made for real-time monitoring of bloat status. Since it does full scans on both tables and indexes, it has the potential to force data out of shared buffers. So if you keep running it often, you may affect query performance of things that rely on that data being readily available there. It’s best to run it maybe once a month, or once a week at most, during off-peak hours. If you have particularly troublesome tables you want to keep an eye on more regularly, the --tablename option allows you to scan just that specific table and nothing else. Once you’ve gotten the majority of your bloat issues cleaned up after your first few runs of the script and have seen how bad things may be, bloat shouldn’t get out of hand so quickly that you need to run it that often. If it does, you may want to re-evaluate how you’re using PostgreSQL (e.g. MVCC makes it not great as a queuing system).

If anyone else has some handy tips for bloat cleanup, I’d definitely be interested in hearing them.

Written by Keith

June 8th, 2016 at 5:25 pm

Checking for PostgreSQL Bloat

My post from almost 2 years ago about checking for PostgreSQL bloat is still one of the most popular ones on my blog (according to Google Analytics, anyway). Since that’s the case, I’ve gone and changed the URL of my old post and reused it for this one. I’d rather people be directed to correct and current information as quickly as possible instead of adding an update to my old post pointing to a new one. I’ve included my summary of just what exactly bloat is again below, since that seemed to be the most popular part.

UPDATE: I wrote a followup post on methods for actually Cleaning Up PostgreSQL Bloat once you’ve identified it.

The intent of the original post was to discuss a Python script I’d written for monitoring bloat status: pg_bloat_check.py. Since that time, I’ve been noticing that the query used in v1.x of that script (obtained from the check_postgres.pl module) was not always accurate and often was not reporting on bloat that I knew for a fact was there (e.g. I just deleted over 300 million rows, vacuumed & analyzed the table, and still no bloat? Sure, it could happen, but it’s highly unlikely). So I continued looking around and discovered the pgstattuple contrib module that comes with PostgreSQL. After discussing it with several of the core developers at recent PostgreSQL conferences (PGConf US & PGCon), I believe this is a much, much better way to get an accurate assessment of the bloat situation. This encouraged me to do a near-complete rewrite of my script, and v2.0.0 is now available. It’s not a drop-in replacement for v1.x, so please check the --help for new options.

pgstattuple is a very simple but powerful extension. It doesn’t require any additional libraries to be loaded and just adds a few functions you can call on database objects to get some statistics about them. The key function for bloat is the default one, pgstattuple(regclass), which returns information about live & dead tuples and free space contained in the given object. If you read the description below of what bloat actually is, you’ll see that those data points are exactly what we’re looking for. The difference between what this function does and what the check_postgres.pl query does is quite significant, though. The check_postgres query does its best to guess what is dead & free space based on the current statistics in the system catalogs. pgstattuple actually goes through and does a full scan on the given table or index to see what the actual situation is. This does mean it can be very, very slow on large tables. The database I got the examples below from is 1.2TB, and a full bloat check on it takes just under 1 hour. But with the inaccuracies I’ve seen returned by the simpler query, that time can be well worth it. The script stores the statistics it gathers in a table so they can be easily reviewed at any time and even used for monitoring purposes, just like check_postgres.
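If you want to see what the script is working with under the hood, you can call the extension’s functions directly. A minimal sketch (the object names are just examples from later in this post):

CREATE EXTENSION IF NOT EXISTS pgstattuple;

-- Full scan of the table: live/dead tuple counts and sizes, plus free space
SELECT * FROM pgstattuple('public.group_members');

-- btree-specific statistics for an index
SELECT * FROM pgstatindex('public.group_members_id_pk');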

Before showing what the script can do, I just want to reiterate some things from my old post because they’re important. Bloat percentage alone is a poor indicator of actual system health. Small tables may always have higher-than-average bloat, or there may always be 1 or 2 pages considered waste, and in reality that has next to zero impact on database performance. Constantly “debloating” them wastes more time than the space is worth. So the script has some filters for object size, wasted space and wasted percentage. This allows the final output of the bloat report to provide a more accurate representation of where there may actually be problems that need to be looked into.

Another option is a filter for individual tables or indexes to be ignored. If you understand why bloat happens, you will come across cases where a table is stuck at a certain bloat point at all times, no matter how many times you VACUUM FULL it or run pg_repack on it (those two things do remove the bloat, but it quickly comes back). This happens with tables that have a specific level of churn in the rows being inserted, updated & deleted: the number of rows being updated/deleted is balanced with the number of rows being inserted, as well as with the autovacuum schedule that marks space for reuse. Removing the bloat from tables like this can actually cause decreased performance, because instead of reusing the space that VACUUM marks as available, Postgres first has to allocate more pages to that object from disk before the data can be added. So bloat is actually not always a bad thing, and the nature of MVCC can lead to improved write performance on some tables. On to the new script!

So as an example of why this new, slower method can be worth it, here’s the bloat report for a table and its indexes from the old script using check_postgres

Old Table Bloat:
2. public.group_members.........................................................(9.6%) 4158 MB wasted

Old Index Bloat:
1. public.group_members_id_pk..................................................(19.5%) 4753 MB wasted
3. public.group_members_user_id_idx.............................................(9.6%) 2085 MB wasted
5. public.group_members_deleted_at_idx..........................................(6.2%) 1305 MB wasted

Here are the results from the statistics table in the new version:

$ pg_bloat_check.py -c "dbname=prod" -t public.group_members

kfiske@prod=# select objectname, pg_size_pretty(size_bytes) as object_size, pg_size_pretty(free_space_bytes) as reusable_space, pg_size_pretty(dead_tuple_size_bytes) dead_tuple_space, free_percent from bloat_stats ;
            objectname             | object_size | reusable_space | dead_tuple_space | free_percent
-----------------------------------+-------------+----------------+------------------+--------------
 group_members                     | 42 GB       | 16 GB          | 4209 kB          |        37.84
 group_members_user_id_idx         | 21 GB       | 14 GB          | 1130 kB          |        64.79
 group_members_id_pk               | 24 GB       | 16 GB          | 4317 kB          |        68.96
 group_members_deleted_at_idx      | 20 GB       | 13 GB          | 3025 kB          |        63.77
 group_members_group_id_user_id_un | 11 GB       | 4356 MB        | 6576 bytes       |        38.06
 group_members_group_id_idx        | 17 GB       | 9951 MB        | 0 bytes          |         56.8
 group_members_updated_at_idx      | 15 GB       | 7424 MB        | 0 bytes          |        49.57

Yes, all those indexes did exist before; the old query just didn’t think they had any bloat at all. There’s also a nearly 4x difference in wasted space in the table alone. It’s only 37% of the table in this case, but if you’re trying to clean up bloat due to low disk space, 12GB can be a lot. Another really nice thing pgstattuple provides is a distinction between dead tuples and reusable (free) space. You can see the dead tuple space is quite low in this example. That means autovacuum is running efficiently on this table and marking dead rows from updates & deletes as reusable. If you see dead tuple space staying high, that could indicate autovacuum is not running properly and you may need to adjust some of the vacuum tuning parameters that are available. In this case, even a normal vacuum was not freeing the reusable space back to the operating system (see below for why), which means either a VACUUM FULL or a pg_repack run is required to reclaim it. Here’s the result from making a new index on user_id:

kfiske@prod=# CREATE INDEX concurrently ON group_members USING btree (user_id);
CREATE INDEX
Time: 5308849.412 ms

$ pg_bloat_check.py -c "dbname=prod" -t public.group_members

kfiske@prod=# select objectname, pg_size_pretty(size_bytes) as object_size, pg_size_pretty(free_space_bytes) as reusable_space, pg_size_pretty(dead_tuple_size_bytes) dead_tuple_space, free_percent from bloat_stats ;
            objectname             | object_size | reusable_space | dead_tuple_space | free_percent 
-----------------------------------+-------------+----------------+------------------+--------------
 group_members                     | 42 GB       | 16 GB          | 2954 kB          |        37.84
 group_members_user_id_idx         | 21 GB       | 14 GB          | 1168 kB          |        64.79
 group_members_id_pk               | 24 GB       | 16 GB          | 4317 kB          |        68.96
 group_members_deleted_at_idx      | 20 GB       | 13 GB          | 3025 kB          |        63.77
 group_members_group_id_user_id_un | 11 GB       | 4356 MB        | 6784 bytes       |        38.06
 group_members_group_id_idx        | 17 GB       | 9951 MB        | 0 bytes          |         56.8
 group_members_updated_at_idx      | 15 GB       | 7424 MB        | 0 bytes          |        49.57
 group_members_user_id_idx1        | 8319 MB     | 817 MB         | 336 bytes        |         9.83

You can see the new index group_members_user_id_idx1 is now down to only 9% wasted space and much smaller. Here’s the result after running pg_repack to clear both the table and all index bloat:

kfiske@prod=# select objectname, pg_size_pretty(size_bytes) as object_size, pg_size_pretty(free_space_bytes) as reusable_space, pg_size_pretty(dead_tuple_size_bytes) dead_tuple_space, free_percent from bloat_stats ;
            objectname             | object_size | reusable_space | dead_tuple_space | free_percent 
-----------------------------------+-------------+----------------+------------------+--------------
 group_members                     | 25 GB       | 27 MB          | 79 kB            |          0.1
 group_members_id_pk               | 8319 MB     | 818 MB         | 0 bytes          |         9.83
 group_members_user_id_idx         | 8319 MB     | 818 MB         | 0 bytes          |         9.83
 group_members_deleted_at_idx      | 8319 MB     | 818 MB         | 0 bytes          |         9.83
 group_members_group_id_user_id_un | 7818 MB     | 768 MB         | 0 bytes          |         9.83
 group_members_group_id_idx        | 8319 MB     | 818 MB         | 0 bytes          |         9.83
 group_members_updated_at_idx      | 8318 MB     | 818 MB         | 0 bytes          |         9.83
(7 rows)

PostgreSQL 9.5 introduced the pgstattuple_approx(regclass) function, which tries to take advantage of the visibility map to speed up gathering tuple statistics, possibly sacrificing some accuracy since it doesn’t hit each individual tuple. It only works on tables, though. This option is available in the script via the --quick argument. There’s also pgstatindex(regclass), which gives some more details on index pages and how the data in them is laid out, but I haven’t found a use for that in the script yet.
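For comparison, the approximate version is called the same way as the full-scan function shown earlier (9.5+, tables only):

SELECT * FROM pgstattuple_approx('public.group_members');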

The same output options the old script had are still available: --simple provides a text summary useful for emails, and --dict provides a Python dictionary as structured output with greater detail on the raw statistics (basically just the data straight from the table). UPDATE: As of version 2.1.0 of the script, the --json & --jsonpretty options have been added and are the preferred structured output format unless you actually need a Python dictionary. The table inside the database provides a new, easy method for reviewing the bloat information as well, but just be aware it is rebuilt from scratch every time the script runs. There’s also a new option which I used above (-t, --tablename) that gets the bloat information for just a single table. See the --help for more information on all the options that are available.

Why Bloat Happens

For those of you newer to PostgreSQL administration, for whom this may be the first time hearing about bloat, I figured I’d take the time to explain why this scenario exists and why tools like this are necessary (until they’re hopefully built into the database itself someday). It’s something most people don’t understand unless someone first explains it to them, or until they run into the headaches it causes when it’s not monitored and learn about it the hard way.

MVCC (multi-version concurrency control) is how Postgres has chosen to deal with multiple transactions/sessions hitting the same rows at (nearly) the same time. The documentation, along with wikipedia provide excellent and extensive explanations of how it all works, so I refer you there for all the details. Bloat is a result of one particular part of MVCC, concentrated around the handling of updates and deletes.

Whenever you delete a row, it’s not actually deleted, it is only marked as unavailable to all future transactions taking place after the delete occurs. The same happens with an update: the old version of a row is kept active until all currently running transactions have finished, then it is marked as unavailable. I emphasize the word unavailable because the row still exists on disk, it’s just not visible any longer. The VACUUM process in Postgres then comes along and marks any unavailable rows as space that is now available for future inserts or updates. The auto-vacuum process is configured to run VACUUM automatically after so many writes to a table (follow the link for the configuration options), so it’s not something you typically have to worry about doing manually very often (at least with more modern versions of Postgres).

People often assume that VACUUM is the process that should return the disk space to the file system. It does do this, but only in very specific cases. That used space is contained in page files that make up the tables and indexes (called objects from now on) in the Postgres database system. Page files all have the same size, and differently sized objects just have as many page files as they need. If VACUUM happens to mark every row in a page file as unavailable AND that page also happens to be the final page for the entire object, THEN the disk space is returned to the file system. If there is a single available row, or the page file is anything but the last one, the disk space is never returned by a normal VACUUM. This is bloat. Hopefully this explanation of what bloat actually is shows you how it can sometimes be advantageous for certain usage patterns of tables as well, and why I’ve included the option to ignore objects in the report.

If you give the VACUUM command the special flag FULL, then all of that reusable space is returned to the file system. But VACUUM FULL does this by completely rewriting the entire table (and all its indexes) to new pages and takes an exclusive lock on the table the entire time it takes to run (CLUSTER does the same thing, but what that does is outside the scope of this post). For large tables in frequent use, this is problematic. pg_repack has been the most common tool we’ve used to get around that. It recreates the table in the background, tracking changes to it, and then takes a brief lock to swap the old bloated table with the new one.

Why bloat is actually a problem when it gets out of hand is not just the disk space it uses up. Every time a query is run against a table, the visibility flags on individual rows and index entries are checked to see if they are actually available to that transaction. On large tables (or small tables with a lot of bloat), the time spent checking those flags builds up. This is especially noticeable with indexes, where you expect an index scan to improve your query performance and it seems to be making no difference, or is actually worse than a sequential scan of the whole table. And this is why index bloat is checked independently of table bloat, since a table could have little to no bloat while one or more of its indexes are badly bloated. Index bloat (as long as it’s not a primary key) is easier to solve because you can either just reindex that one index, or you can concurrently create a new index on the same column and then drop the old one when it’s done.

In all cases when you run VACUUM, it’s a good idea to run ANALYZE as well, either at the same time in one command or as two separate commands. This updates the internal statistics that Postgres uses when creating query plans. The number of live and dead rows in a table/index is a part of how Postgres decides to plan and run your queries. It’s a much smaller part of the plan than other statistics, but every little bit can help.

I hope this explanation of what bloat is, and how this tool can help with your database administration, has been helpful.

Written by Keith

May 27th, 2016 at 11:55 am

A Small Database Does Not Mean Small shared_buffers

As a followup to my previous blog post, A Large Database Does Not Mean Large shared_buffers, I had some more interesting findings when applying the queries from that post to another client recently. I assume you have read that one already and won’t repeat any of what I explained previously, so if you haven’t read it and aren’t familiar with the pg_buffercache extension, I highly recommend you go read that one first.

Another mantra often heard in PostgreSQL circles is that you usually don’t want to set shared_buffers higher than 8GB. I will admit that for a majority of users, that is great advice and a good starting point (and a whole lot more useful than the default 32MB). There are also issues around double-buffering and allowing the kernel to do what it can probably do better than PostgreSQL as far as managing page reads/writes (a topic way outside the scope of this blog post). But if you investigate further into how PostgreSQL is using its shared memory and what your high-demand data blocks actually are, you can possibly find benefit in setting it higher, especially when you can clearly see what PostgreSQL thinks it needs most often. Or if you can just fit the whole database into memory, as I stated before.

The client in these examples has shared_buffers set to 24GB, and the total database size is 145GB (111GB in the primary database, followed by others of 28GB, 5GB, 270MB & 150MB). I say small in the title of this post, but both large and small are relative terms, and for my typical work this is a small database. And a setting that is 17% of the total size is larger than normal, so along with being a catchy followup name, the results do fit the title.

So I ran the basic query at the end of my previous post to see what the “ideal” minimum is. I ran it several times over about a half-hour period and, unlike the databases in my previous post, it did not deviate much.

database=# SELECT pg_size_pretty(count(*) * 8192) as ideal_shared_buffers
FROM pg_buffercache b
WHERE usagecount >= 3;
 ideal_shared_buffers 
----------------------
 18 GB

Much higher than I previously encountered, and with a much smaller database too. The value did deviate slightly, but it never changed from the rounded, pretty value of 18GB. So I investigated further, starting with the primary 111GB database:

database=# SELECT c.relname
   , pg_size_pretty(count(*) * 8192) as buffered
   , round(100.0 * count(*) / ( SELECT setting FROM pg_settings WHERE name='shared_buffers')::integer,1) AS buffers_percent
   , round(100.0 * count(*) * 8192 / pg_relation_size(c.oid),1) AS percent_of_relation
 FROM pg_class c
 INNER JOIN pg_buffercache b ON b.relfilenode = c.relfilenode
 INNER JOIN pg_database d ON (b.reldatabase = d.oid AND d.datname = current_database())
 WHERE pg_relation_size(c.oid) > 0
 GROUP BY c.oid, c.relname
 ORDER BY 3 DESC
 LIMIT 10;
          relname          | buffered | buffers_percent | percent_of_relation 
---------------------------+----------+-----------------+---------------------
 group_members             | 8697 MB  |            35.4 |                73.9
 order_items               | 1391 MB  |             5.7 |               100.0
 orders                    | 1258 MB  |             5.1 |               100.0
 users                     | 812 MB   |             3.3 |               100.0
 units                     | 801 MB   |             3.3 |               100.0
 images                    | 599 MB   |             2.4 |                71.5
 group_members_user_id_idx | 481 MB   |             2.0 |                10.9
 user_list_map             | 264 MB   |             1.1 |               100.0
 products                  | 202 MB   |             0.8 |               100.0

A good amount of the large tables had a significant portion of themselves in shared buffers. I looked at the top table here to see if it might be having problems keeping its high-demand blocks in memory:

database=# SELECT pg_size_pretty(count(*) * 8192)
FROM pg_class c
INNER JOIN pg_buffercache b ON b.relfilenode = c.relfilenode
INNER JOIN pg_database d ON (b.reldatabase = d.oid AND d.datname = current_database())
WHERE c.oid::regclass = 'group_members'::regclass
AND usagecount >= 3;
 pg_size_pretty 
----------------
 6606 MB

Actually looks ok. It’s got about 2GB of space to be able to swap out lower priority blocks for higher ones if needed. How about those next two 100% tables?

database=# SELECT pg_size_pretty(count(*) * 8192)
FROM pg_class c
INNER JOIN pg_buffercache b ON b.relfilenode = c.relfilenode
INNER JOIN pg_database d ON (b.reldatabase = d.oid AND d.datname = current_database())
WHERE c.oid::regclass = 'order_items'::regclass
AND usagecount >= 3;
 pg_size_pretty 
----------------
 1391 MB
(1 row)

database=# SELECT pg_size_pretty(count(*) * 8192)
FROM pg_class c
INNER JOIN pg_buffercache b ON b.relfilenode = c.relfilenode
INNER JOIN pg_database d ON (b.reldatabase = d.oid AND d.datname = current_database())
WHERE c.oid::regclass = 'orders'::regclass
AND usagecount >= 3;
 pg_size_pretty 
----------------
 1258 MB
(1 row)

I actually increased the usagecount parameter for both of these tables all the way up to 5, and that only lowered the amount by 2-3MB. So these are some pretty heavily used tables. For a client that does online order processing, this seems to make sense in the context of these tables. But it could also indicate a problem: it could mean there are queries doing a whole lot of sequential scans on these tables that might not need to be. If that’s not something that’s readily apparent in the code accessing the database, I would then suggest turning to something like pgbadger for more in-depth query analysis to see where problems may be.

You may have noticed this doesn’t account for all the memory usage seen in the first query. Time to dive into the other databases (the 28GB one).

database=# \c mailer 
mailer=# SELECT c.relname
  , pg_size_pretty(count(*) * 8192) as buffered
  , round(100.0 * count(*) / ( SELECT setting FROM pg_settings WHERE name='shared_buffers')::integer,1) AS buffers_percent
  , round(100.0 * count(*) * 8192 / pg_relation_size(c.oid),1) AS percent_of_relation
FROM pg_class c
INNER JOIN pg_buffercache b ON b.relfilenode = c.relfilenode
INNER JOIN pg_database d ON (b.reldatabase = d.oid AND d.datname = current_database())
WHERE pg_relation_size(c.oid) > 0
GROUP BY c.oid, c.relname
ORDER BY 3 DESC
LIMIT 10;
            relname             | buffered | buffers_percent | percent_of_relation 
--------------------------------+----------+-----------------+---------------------
 messages_pkey                  | 1769 MB  |             7.2 |                88.7
 messages                       | 1200 MB  |             4.9 |               100.0
 subject_text                   | 261 MB   |             1.1 |                41.9
 messages_mailing_id_idx        | 259 MB   |             1.1 |                15.4
 subject_text_pkey              | 104 MB   |             0.4 |               100.0
 messages_created_at_idx        | 26 MB    |             0.1 |                 1.2
 messages_recipient_id_idx      | 30 MB    |             0.1 |                 1.7
 pg_attrdef_adrelid_adnum_index | 16 kB    |             0.0 |               100.0
 pg_index_indrelid_index        | 40 kB    |             0.0 |                35.7
 pg_namespace_oid_index         | 16 kB    |             0.0 |               100.0
(10 rows)

That primary key is taking up a lot of space and almost all of it seems to be in memory. But again, how much of it is really high usage?

mailer=# SELECT pg_size_pretty(count(*) * 8192)
 FROM pg_class c
 INNER JOIN pg_buffercache b ON b.relfilenode = c.relfilenode
 INNER JOIN pg_database d ON (b.reldatabase = d.oid AND d.datname = current_database())
 WHERE c.oid::regclass = 'messages_pkey'::regclass
 AND usagecount >= 3;
 pg_size_pretty 
----------------
 722 MB

Not nearly as much as is in shared_buffers. So no justification for an increase here. How about the messages table?

mailer=# SELECT pg_size_pretty(count(*) * 8192)
FROM pg_class c
INNER JOIN pg_buffercache b ON b.relfilenode = c.relfilenode
INNER JOIN pg_database d ON (b.reldatabase = d.oid AND d.datname = current_database())
WHERE c.oid::regclass = 'messages'::regclass
AND usagecount >= 5;
 pg_size_pretty 
----------------
 1200 MB

The whole thing is in very high demand! And there’s plenty of space for it to be there. The remainder of the space was mostly taken up by a similar table in yet another one of the databases in the cluster.

So this PostgreSQL cluster seems to have some pretty good justification for having a shared_buffers 3x higher than what is typically suggested. It’s not actually using all of what’s available (only 18 of 24GB), and there’s still a significant amount in shared_buffers with a usagecount below 3. My guidance to the client was to leave shared_buffers where it was, but to keep an eye on tables like orders, order_items & messages. If the high usage of those tables is justified and they start increasing in size significantly, then this evaluation should be done again to see if shared_buffers should be increased to keep that high-demand data readily available in memory.

The pg_buffercache extension has been a great help with fine-tuning one of the more important settings in PostgreSQL. Hopefully this helps clarify how to evaluate shared_buffers usage and figure out an ideal setting. And to be honest, I’m hoping that someone who reads this is in a position to better experiment with actually changing the shared_buffers value in situations like this, to see if it really can make a difference in performance. As someone commented on my previous post, shared_buffers is a pretty invasive setting to change, not only because it requires a restart, but because you don’t want to screw up performance on an active production machine. Yet you need the kind of activity that will be on an active production machine to accurately evaluate such settings. Reproducing such activity outside of production is really challenging.

So, looking for feedback and for anyone else to either validate or find fault with my experimentations so far.

 

Written by Keith

October 8th, 2014 at 2:25 pm

A Large Database Does Not Mean Large shared_buffers

A co-worker of mine did a blog post last year that I’ve found incredibly useful when assisting clients with getting shared_buffers tuned accurately.

Setting shared_buffers the hard way

You can follow his queries there for using pg_buffercache to find out how your shared_buffers are actually being used. But I had an incident recently that I thought would be interesting to share, showing how shared_buffers may not need to be set nearly as high as you believe it should be. Or it can equally show you that you definitely need to increase it. Object names have been sanitized to protect the innocent.

IMPORTANT NOTE: An important caveat to using this extension is that when you query pg_buffercache, it takes locks on the shared buffers in order to get an accurate assessment of usage. For the most part these locks are brief and you probably won’t even notice anything, but on extremely busy systems you may notice some queries taking a bit longer than usual. This is not something to be monitored in real time for any extended period. It’s mostly something to check occasionally when you’re noticing performance problems. If you do set up automatic monitoring for it, do not be aggressive.
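If you don’t already have it, the extension ships with the standard contrib package and just needs to be created in the database you want to inspect. A quick sketch (the size math assumes the default 8kB block size, as do the queries below):

CREATE EXTENSION IF NOT EXISTS pg_buffercache;

-- Sanity check: one row per buffer in shared_buffers
SELECT count(*) AS total_buffers,
       pg_size_pretty(count(*) * 8192) AS total_size
FROM pg_buffercache;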

To set the stage, the total database size is roughly 260GB, and the use case is high data ingestion with some reporting done on just the most recent data at the time. shared_buffers is set to 8GB. The other thing to note is that this is the only database in the cluster. pg_buffercache has info for all databases in the cluster, but when you join against pg_class to get object information, you can only do so for one database at a time.

database=# SELECT c.relname
  , pg_size_pretty(count(*) * 8192) as buffered
  , round(100.0 * count(*) / ( SELECT setting FROM pg_settings WHERE name='shared_buffers')::integer,1) AS buffers_percent
  , round(100.0 * count(*) * 8192 / pg_relation_size(c.oid),1) AS percent_of_relation
 FROM pg_class c
 INNER JOIN pg_buffercache b ON b.relfilenode = c.relfilenode
 INNER JOIN pg_database d ON (b.reldatabase = d.oid AND d.datname = current_database())
 WHERE pg_relation_size(c.oid) > 0
 GROUP BY c.oid, c.relname
 ORDER BY 3 DESC
 LIMIT 10;
               relname               | buffered | buffers_percent | percent_of_relation
-------------------------------------+----------+-----------------+---------------------
 table1                              | 7479 MB  |            91.3 |                 9.3
 table2                              | 362 MB   |             4.4 |               100.0
 table3                              | 311 MB   |             3.8 |                 0.8
 table4                              | 21 MB    |             0.3 |               100.0
 pg_attrdef_adrelid_adnum_index      | 16 kB    |             0.0 |               100.0
 table4                              | 152 kB   |             0.0 |                 7.7
 index5                              | 16 kB    |             0.0 |                14.3
 pg_index_indrelid_index             | 40 kB    |             0.0 |                 8.8
 pg_depend_depender_index            | 56 kB    |             0.0 |                 1.0
 pg_cast_source_target_index         | 16 kB    |             0.0 |               100.0

You can see that table1 is taking up a vast majority of the space here and it’s a large table, so only 9% of it is actually in shared_buffers. What’s more interesting though is how much of the space for that table is actually in high demand.

database=# SELECT pg_size_pretty(count(*) * 8192) 
FROM pg_class c
INNER JOIN pg_buffercache b ON b.relfilenode = c.relfilenode
INNER JOIN pg_database d ON (b.reldatabase = d.oid AND d.datname = current_database())
WHERE c.oid::regclass = 'table1'::regclass
AND usagecount >= 2;
  pg_size_pretty
----------------------
 2016 kB

Data blocks that go into and come out of Postgres all go through shared_buffers. Just to review the blog post I linked to: whenever a block is used in shared memory, it increments a clock-sweep counter that ranges from 1-5, with 5 being extremely high-use data blocks. This means high-usage blocks are likely to be kept in shared_buffers (if there’s room), and low-usage blocks will get moved out if space for higher-usage ones is needed. We believe that a simple insert or update sets a usagecount of 1. So now we look at the difference when the usagecount is dropped to that.
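A quick way to see that distribution for yourself is to group the buffers by usagecount. A simple sketch (a NULL usagecount means the buffer is currently unused):

SELECT usagecount,
       count(*) AS buffers,
       pg_size_pretty(count(*) * 8192) AS buffer_space
FROM pg_buffercache
GROUP BY usagecount
ORDER BY usagecount;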

database=# SELECT pg_size_pretty(count(*) * 8192) 
FROM pg_class c
INNER JOIN pg_buffercache b ON b.relfilenode = c.relfilenode
INNER JOIN pg_database d ON (b.reldatabase = d.oid AND d.datname = current_database())
WHERE c.oid::regclass = 'table1'::regclass
AND usagecount >= 1;
 pg_size_pretty
----------------------
 4946 MB

So shared_buffers is actually getting filled mostly by the data ingestion process, but relatively little of that data is of any further use afterwards. If anything of greater importance were needed in shared_buffers, there’s plenty of room currently occupied by low-priority blocks, and that inserted data would quickly get flushed out of shared memory due to having a low usagecount.

So, with pg_buffercache installed, we’ve found that the query below seems to be a good estimate of an optimal, minimum shared_buffers setting:

database=# SELECT pg_size_pretty(count(*) * 8192) as ideal_shared_buffers
 FROM pg_class c
 INNER JOIN pg_buffercache b ON b.relfilenode = c.relfilenode
 INNER JOIN pg_database d ON (b.reldatabase = d.oid AND d.datname = current_database())
 WHERE usagecount >= 3;
 ideal_shared_buffers
----------------------
 640 MB

This is the sort of query you would run after your database has been running through its expected workload for a while. Also, note my use of the key word minimum. This does not account for unexpected spikes in shared_buffers usage that may occur during a session of reporting queries or something like that. So you definitely want to set it higher than this, but it can at least show you how effectively Postgres is using its shared memory. In general, we’ve found the typical suggestion of 8GB to be a great starting point for shared_buffers.

So, in the end, the purpose of this post was to show that shared_buffers is something that needs further investigation to really set optimally and there is a pretty easy method to figuring it out once you know where to look.

UPDATE:

I’ve written a followup post to this one as well: A Small Database Does Not Mean Small shared_buffers.

So, as someone commented below, you don’t really need to join against pg_class & pg_database to get the suggested ideal minimum. This also avoids having to manually total things up across multiple databases in the cluster. The reason for joining against those two was to be able to identify which databases and objects the blocks in shared buffers were associated with; pg_class can only identify objects in the database you’re currently connected to.

Also, for really high-traffic databases with fluctuating query activity, the suggested-minimum query isn’t something you can run just once. It has to be run multiple times, because the values can vary drastically. Below are the results of running the shorter query just a few times, in less than a 1-minute period, on a different client of ours that has a much different traffic pattern (OLTP) than the one above. There are 46 databases in the cluster with a total size of roughly 900GB, with 800GB in one database, 30GB in the next largest, and quickly getting smaller from there. For this one we actually have shared_buffers set down to 4GB, and it’s been working great for years.

kfiske@database=# SELECT pg_size_pretty(count(*) * 8192) as ideal_shared_buffers
FROM pg_buffercache b
WHERE usagecount >= 3;
 ideal_shared_buffers 
----------------------
 1431 MB
(1 row)

Time: 259.196 ms
kfiske@database=# SELECT pg_size_pretty(count(*) * 8192) as ideal_shared_buffers
FROM pg_buffercache b
WHERE usagecount >= 3;
 ideal_shared_buffers 
----------------------
 1566 MB
(1 row)

Time: 495.255 ms
kfiske@database=# SELECT pg_size_pretty(count(*) * 8192) as ideal_shared_buffers
FROM pg_buffercache b
WHERE usagecount >= 3;
 ideal_shared_buffers 
----------------------
 1217 MB
(1 row)

Time: 278.755 ms
kfiske@database=# SELECT pg_size_pretty(count(*) * 8192) as ideal_shared_buffers
FROM pg_buffercache b
WHERE usagecount >= 3;
 ideal_shared_buffers 
----------------------
 1092 MB
(1 row)

Time: 260.278 ms
kfiske@database=# SELECT pg_size_pretty(count(*) * 8192) as ideal_shared_buffers
FROM pg_buffercache b
WHERE usagecount >= 3;
 ideal_shared_buffers 
----------------------
 999 MB
(1 row)

Time: 251.809 ms

 

Written by Keith

September 11th, 2014 at 2:53 pm
