Keith's Ramblings…

Managing Transaction ID Exhaustion (Wraparound) in PostgreSQL


One of the most critical topics to understand when administering a PostgreSQL database is the concept of transaction IDs (TXID) and the fact that they can be exhausted if not monitored properly. However, this blog post isn’t going to go into the details of what TXID exhaustion actually is. The Routine Vacuuming section of the documentation is probably one of the most important to read and understand, so I will refer you there. What this blog post will cover is an easy way to monitor for it and what can be done to prevent it ever being a problem.


Most people initially aware of the issue consider the TXID wraparound itself to be the problem they’re monitoring for, but it’s technically the exhaustion of TXIDs that’s the real issue. PostgreSQL is perfectly capable of handling the TXID value wrapping around. The reason wraparound is such a cause for concern is that, by the time the wraparound point is reached, the available TXIDs are typically close to being used up.

The following query is the one we use at Crunchy Data in our pgMonitor tool to provide very simple data points to trend/alert on.

WITH max_age AS ( 
    SELECT 2000000000 as max_old_xid
        , setting AS autovacuum_freeze_max_age 
        FROM pg_catalog.pg_settings 
        WHERE name = 'autovacuum_freeze_max_age' )
, per_database_stats AS ( 
    SELECT datname
        , m.max_old_xid::int
        , m.autovacuum_freeze_max_age::int
        , age(d.datfrozenxid) AS oldest_current_xid 
    FROM pg_catalog.pg_database d 
    JOIN max_age m ON (true) 
    WHERE d.datallowconn ) 
SELECT max(oldest_current_xid) AS oldest_current_xid
    , max(ROUND(100*(oldest_current_xid/max_old_xid::float))) AS percent_towards_wraparound
    , max(ROUND(100*(oldest_current_xid/autovacuum_freeze_max_age::float))) AS percent_towards_emergency_autovac 
FROM per_database_stats

The percent_towards_wraparound metric is the one it is really critical to set up an alert for. Since it uses the age() function to determine the TXID age, it takes into account whether the cluster is actually approaching exhaustion, which is when wraparound becomes a real concern. If exhaustion is ever reached, the database will be forced to shut down, and that can mean an indeterminate amount of downtime to fix. This query has a bit of a buffer built in since the upper boundary it checks against (2 billion exactly) is less than the actual maximum integer value at which exhaustion occurs. But it’s close enough that this alert hitting 100% should be acted upon immediately.

The percent_towards_emergency_autovac metric is an additional value we recommend monitoring, especially on systems that have never had this monitored before (see the notes on Recent Freezing Benefits below about when this alert’s priority can be lowered or the alert removed). It watches for the database’s highest TXID age reaching autovacuum_freeze_max_age. This is a user-tunable setting with a default of 200 million; when any table’s TXID age reaches it, a higher priority autovacuum kicks in on that table. You’ll recognize this special vacuum session because in pg_stat_activity it will be labelled (to prevent wraparound). It is higher priority in the sense that it will run even if autovacuum is disabled, and if that vacuum is manually cancelled, it will almost immediately restart. It also takes some different internal, low-level locks, so it could cause slightly higher contention on those tables depending on how they’re being used during the emergency vacuum. If you do run into contention/locking issues and they can be narrowed down to the emergency vacuum being the cause, it’s perfectly safe to cancel it to allow your other transactions to finish. Just be aware that it will keep restarting until either that wraparound vacuum completes successfully or a manual vacuum is run.
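If you want to check whether one of these emergency autovacuums is currently running, something along these lines can find them in pg_stat_activity (a sketch; the LIKE pattern matches the label mentioned above):

```sql
-- Find any autovacuum workers currently running in wraparound-prevention mode
SELECT pid, datname, xact_start, query
FROM pg_stat_activity
WHERE query LIKE 'autovacuum:%to prevent wraparound%';
```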

For databases with a high transaction rate, it could be beneficial to increase autovacuum_freeze_max_age to avoid that emergency vacuum period coming around quite so often. The main concern with increasing this is that it can increase the storage requirements in the pg_xact and pg_commit_ts folders in the data directory. Again, please read the Routine Vacuuming documentation linked above for what these storage requirements are when you adjust this setting. I’ve often set this value to 1 billion without much issue, but only when I’m sure wraparound is being monitored for and the disk space is available.
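As a sketch, raising the setting could look like the following. Note that autovacuum_freeze_max_age can only be changed with a server restart, and the 1 billion value here is just the example from above, not a universal recommendation.

```sql
-- Persist the new value to It only takes effect
-- after the server is restarted, since this setting requires a restart.
ALTER SYSTEM SET autovacuum_freeze_max_age = 1000000000;
```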

So when either of these alerts go off, how do you fix it?

Easy Fix

The easiest (but not necessarily the quickest) way to get the highest TXID age value back down is to force a vacuum on the entire database cluster. And the best way to do this cluster-wide vacuum is the vacuumdb binary utility that comes with PostgreSQL.

vacuumdb --all --freeze --jobs=2 --echo --analyze 

The --all option ensures all databases are vacuumed since the TXID is a global value. The --freeze option runs a more aggressive vacuum so that as many tuples as possible are frozen in each table (see Routine Vacuuming for details on what freezing is). --jobs=2 allows multiple vacuums to be run in parallel. This can be set as high as your system can handle to speed things up, but be careful setting it too high since it causes additional IO and faster WAL generation (increased disk usage). --echo provides some minimal feedback so you can see progress. --analyze ensures that statistics are updated. If getting the vacuum run finished quickly is the concern, this can be left off and run as a separate step later using the --analyze-only option.

Recent Freezing Benefits

Another benefit of the --freeze option I’ll mention here can be a huge reduction in IO & WAL generation during future vacuum operations. PostgreSQL 9.6 introduced a feature that allows vacuum to be able to skip over a page if all tuples inside of it are marked as frozen. And PG11 improved on this even more for indexes. So if you have a lot of old tables that don’t get writes anymore, this makes it so that when they do need to be vacuumed for any reason, it’s a much, much less expensive operation. It also makes the percent_towards_emergency_autovac alert less of a concern since it won’t be quite as much of an unexpected activity spike. So once you have things tuned well, you could consider this alert a low priority warning or maybe even remove it and just worry about monitoring for wraparound itself.

Per-Table Fix

If you cannot afford a cluster-wide vacuum and want to get the TXID age under control as quickly as possible, that’s certainly possible; it just involves more steps than calling a single binary command. This is definitely the case if you’ve reached wraparound or are getting close.

The first thing to check is the high TXID age for each database to see which need attention. The query below provides that along with the current autovacuum_freeze_max_age value for comparison.

SELECT datname
    , age(datfrozenxid)
    , current_setting('autovacuum_freeze_max_age') 
FROM pg_database 
ORDER BY 2 DESC;

  datname  |    age    | current_setting 
-----------+-----------+-----------------
 postgres  | 170604895 | 200000000
 mydb      | 169088197 | 200000000
 template0 | 142915024 | 200000000
 template1 | 142914999 | 200000000

Here you can see both the default postgres database and the user’s database are the most immediate concern, being closest to the emergency vacuum threshold. Thankfully they’re nowhere near exhaustion, but you’d also see that here if the age were near 2.1 billion. Next we need to connect to the relevant database and see the state of the tables themselves.

postgres=# SELECT c.oid::regclass
    , age(c.relfrozenxid)
    , pg_size_pretty(pg_total_relation_size(c.oid)) 
FROM pg_class c
JOIN pg_namespace n on c.relnamespace = n.oid
WHERE c.relkind IN ('r', 't', 'm') 
AND n.nspname NOT IN ('pg_toast')
ORDER BY age(c.relfrozenxid) DESC
LIMIT 100;

                    oid                     |    age    | pg_size_pretty 
--------------------------------------------+-----------+----------------
 pg_proc                                    | 170606208 | 936 kB
 pg_description                             | 170606208 | 480 kB
 pg_depend                                  | 109192353 | 1336 kB
 pg_attribute                               | 109192353 | 728 kB
 pg_shdepend                                |  89774141 | 2976 MB
 pg_db_role_setting                         |  77594841 | 16 kB
 pg_tablespace                              |  77594841 | 72 kB
 pg_pltemplate                              |  77594841 | 56 kB
 pg_auth_members                            |  77594841 | 16 kB
 pg_authid                                  |  77594841 | 72 kB

Here you can see there are only a few tables with a high age, but on a more realistic system you’d likely see results closer to the 100 row limit, with higher ages. Next we’ll want to vacuum just these specific tables. If it’s just a few tables, manually typing out the VACUUM commands isn’t a big deal. But if you have 100+, that could be tedious and prone to typos. So we can use some string concatenation along with some psql commands to generate the statements and place them in a file to then be run automatically for us.

\t
\o /tmp/vacuum.sql
select 'vacuum freeze analyze verbose ' || oid::regclass || ';' from pg_class where relkind in ('r', 't', 'm') order by age(relfrozenxid) desc limit 100;
\o
\t
\set ECHO all
\i /tmp/vacuum.sql

I highly recommend checking out the psql documentation for what all these slash commands do, but basically what it does is:

  • Turn off column headers
  • Send all output after this to the /tmp/vacuum.sql file
  • Generate at most 100 VACUUM statements, one for each table returned by the query
  • Turn off file output and turn column headers back on
  • Output each command run after this
  • Run the vacuum statements, one at a time, contained in the vacuum.sql file

Even if there’s more than 100 tables to clean up, I usually do it in batches of 100 just to give IO and WAL generation some time to calm down. I’ll usually try to get the max XID age for any table down to at least 50% of autovacuum_freeze_max_age, usually even down to 30-40% if it’s not going to be too much trouble. Also note that you will have to log into each database to run these VACUUM commands. You can log into template1 to fix it if you want, but you won’t be able to log into template0. template0 can safely be ignored and left to reach autovacuum_freeze_max_age on its own since it’s extremely small and its vacuum will finish nearly instantaneously. If you’re curious what these template databases are, check the documentation.
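To track progress toward that 50% target, a per-table variation of the earlier monitoring query can be used (a sketch; run it in each database being cleaned up):

```sql
-- How far each table is toward triggering the emergency autovacuum
SELECT c.oid::regclass AS table_name
    , age(c.relfrozenxid) AS xid_age
    , ROUND(100 * (age(c.relfrozenxid) / current_setting('autovacuum_freeze_max_age')::float)) AS percent_towards_emergency_autovac
FROM pg_class c
WHERE c.relkind IN ('r', 't', 'm')
ORDER BY age(c.relfrozenxid) DESC
LIMIT 100;
```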


An earlier blog post of mine on Per-Table Autovacuum Tuning is another way to help prevent needing manual fixes like this for frequently written tables. And as I said earlier, if you’re on PG9.6 or later, you’ll likely only rarely need this sort of manual intervention, hopefully only ever once. So once you’ve gotten autovacuum tuned properly and your old, static data marked frozen, reaching autovacuum_freeze_max_age is not really much of a concern anymore since:

  • Your frequently used tables are being vacuumed when they’re supposed to be getting vacuumed
  • Your static tables have been marked frozen and vacuum can skip over them rather quickly

That means all you have to worry about monitoring for is the actual exhaustion. It won’t likely be an issue anymore unless the usage patterns of the database change, but it should ALWAYS be monitored for no matter what since reaching it is a guaranteed outage that you may not be ready for.

Written by Keith

April 12th, 2021 at 2:56 am

Posted in PostgreSQL

Per-Table Autovacuum Tuning


A pattern that definitely seems to drive my blog posts is the frequency of client questions. And that is the case here again. Vacuum tuning to manage bloat and transaction ID wraparound on production systems has been a hot topic, and lately this has even been getting down to tuning autovacuum on an individual table basis. I’ve already discussed bloat pretty extensively in previous posts. While I’d like to get into the details of transaction ID wraparound, that really isn’t the focus of this post, so I’ll defer you to the documentation.

One setting I will discuss first though is autovacuum_freeze_max_age. Basically when any table’s max XID value reaches this, a more aggressive “emergency” autovacuum is kicked off. If many tables hit this at the same time (a common occurrence with data-warehouses that have many large, sparsely written tables), that can kick off some pretty high and long lasting IO caused by these autovacuums. I highly recommend increasing autovacuum_freeze_max_age from the default value of 200 million to 1 billion. However, I don’t recommend doing this unless you have some monitoring in place to watch for tables reaching both autovacuum_freeze_max_age and wraparound, the latter of which can shut your database down for some pretty extensive downtime.

So ideally, we want autovacuum running often enough on its own so you never encounter any of those bad situations above. The next thing I’d recommend tuning on any database before getting down to the table-level methods, no matter its size or traffic, are the default settings that control when autovacuum initially kicks in.

autovacuum_analyze_scale_factor = .10
autovacuum_analyze_threshold = 50
autovacuum_vacuum_scale_factor = .20
autovacuum_vacuum_threshold = 50

In my opinion, the defaults on these are not ideal for any situation. The scales are too high and the thresholds are too low. The scale factor settings are percentage values that say, “When this percentage of the table’s rows have changed (updates/deletes), run vacuum or analyze”. The threshold settings say that “When this many rows have changed, run vacuum or analyze”. These two settings are used together to determine the actual threshold that will kick in autovacuum. The formula taken from the documentation shows this:

vacuum initialization = autovacuum_vacuum_threshold + (autovacuum_vacuum_scale_factor * number of rows)
analyze initialization = autovacuum_analyze_threshold + (autovacuum_analyze_scale_factor * number of rows)

So this means with the default settings, a 100,000 row table will have autovacuum kick in when 20050 rows have changed and autoanalyze will kick in when 10050 have changed. Not too bad, but my recommended default settings are this:

autovacuum_analyze_scale_factor = .05
autovacuum_analyze_threshold = 500
autovacuum_vacuum_scale_factor = .10
autovacuum_vacuum_threshold = 500

This halves the scale factor settings so autovacuum is a bit more aggressive about how often it kicks in. My reason for increasing the threshold values is that when you have very small tables, you really don’t want autovacuum being overly aggressive on them after decreasing the scale settings. Vacuuming too often may not necessarily be a bad thing, but it’s just wasted IO at that point.

So how about when you start getting into very large tables? Can you see why this could be a problem?

(800,000,000 rows * .2) + 50 = 160,000,050
(800,000,000 rows * .1) + 500 = 80,000,500

We would have to wait until nearly 160 million rows have changed before autovacuum would ever kick in. Even with my recommended defaults, 80 million rows is still quite a bit to wait for. For a high write rate table, this might work out ok, but for most cases this likely won’t be ideal. This affects bloat buildup since autovacuum may not run often enough to clear up available space for new/updated rows. And it could also cause the global XID value to increase quite high, more than likely reaching autovacuum_freeze_max_age before a normal autovacuum process would ever touch it. This is when we need to start tuning on a per-table basis and all four of the above settings can be set that way.

So the first question to ask is how often do we want autovacuum to run? I’ll start from a baseline of at least once per day. This means we’ll need a way to see how many rows have changed per day.

mydb=# select * from pg_stat_all_tables where relname = 'mytable';
-[ RECORD 1 ]-------+------------------------------
relid | 32307
schemaname | public
relname | mytable
seq_scan | 25
seq_tup_read | 5042494092
idx_scan | 199790800
idx_tup_fetch | 3373819424
n_tup_ins | 342119654
n_tup_upd | 752699033
n_tup_del | 167007870
n_tup_hot_upd | 281342612
n_live_tup | 286999790
n_dead_tup | 9050790
n_mod_since_analyze | 762072
last_vacuum | 2017-08-17 16:14:25.447345-05
last_autovacuum | 2018-08-24 13:17:35.88936-05
last_analyze | 2018-08-16 20:32:13.265396-05
last_autoanalyze | 2018-08-31 07:43:45.551732-05
vacuum_count | 1
autovacuum_count | 43
analyze_count | 29
autoanalyze_count | 129

The pg_stat_all_tables view (or pg_stat_user_tables to exclude system tables) can help us get this. PostgreSQL doesn’t keep long term trending data on table writes, but it does keep counters. Specifically, the n_tup_upd & n_tup_del columns can tell us how many tuples/rows were changed since the statistics for this database were last reset. So if we can collect this data once per day around the same time, we can get an idea of the write rate that would affect when autovacuum would kick in. Here’s a simple command you can add to a cronjob to generate some CSV data you could then pull into a spreadsheet and quickly do calculations.

psql -U postgres -d mydb -c "COPY (select now() AS gather_time, * from pg_stat_all_tables where relname = 'mytable') To STDOUT WITH CSV DELIMITER ','" >> mytable_stats.csv

Just add n_tup_upd+n_tup_del and compare that to the previous day’s total value and you have your rows changed per day. If you want it to run more/less often, then just adjust your collection interval as needed. So let’s say there were around 300,000 to 500,000 rows changed per day on our giant 800mil row table. We would want autovacuum to run when at least 400,000 rows changed. This means we want to ignore the scale factor setting completely and rely only on the threshold value. That’s the other problem with the scale factor and large tables: as they grow larger and larger, autovacuum runs less and less since the % value of that table grows larger. So to set the above settings on a per table basis, we can use the ALTER TABLE command to adjust the above settings for just that table.

ALTER TABLE public.mytable SET (autovacuum_analyze_scale_factor = 0, autovacuum_vacuum_scale_factor = 0, autovacuum_vacuum_threshold = 400000, autovacuum_analyze_threshold = 100000 );

We usually want analyze to run more often than a vacuum so queries can have accurate statistics. Analyzes are fairly lightweight, even on large tables, so it’s ok to be more aggressive about when those run.

Once we have those settings in place, we can then use the same stats view to see if autovacuum & autoanalyze are running when we want them to. last_autovacuum and last_autoanalyze are the timestamps of when they last completed.
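Checking both the stored per-table options and the last run times might look like this (a sketch using the mytable example from above):

```sql
-- Confirm the per-table autovacuum options were stored
SELECT relname, reloptions
FROM pg_class
WHERE relname = 'mytable';

-- See when autovacuum/autoanalyze last completed and how many rows
-- have changed since the last analyze
SELECT relname, last_autovacuum, last_autoanalyze, n_mod_since_analyze
FROM pg_stat_user_tables
WHERE relname = 'mytable';
```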

You may need it to vacuum more often than this depending on the turn-over of updates, deletes & inserts to keep bloat under control. To determine that I recommend setting up bloat monitoring. If you see that it keeps growing, you can look at vacuuming more often or even adjusting the fillfactor settings for the table or indexes.

Another suggestion that I will make is for databases that have any tables that get few-to-no writes, especially large ones. First, upgrade to a recent version of PostgreSQL if you’re not on at least 9.6. Then, schedule a regular VACUUM of your database, or at least the tables that are no longer being touched often. 9.6 introduced a feature that if all of the rows on a page are frozen, then autovacuum is able to completely skip over that page and not have to evaluate each individual row. This can be a HUGE saving on both time and IO. And the larger the table, the more savings you’ll see in the long run. For anyone running their databases in the cloud where IO is money, this can save you quite a bit there. Typically tuning autovacuum will preclude needing to schedule a manual vacuum, but if it turns out it’s needed in some cases, running during off-peak hours won’t hurt anything.

The other issue that people often encounter is long running autovacuums or autovacuum incurring higher IO costs than anticipated. I’ll be working on another blog post in the future that covers these issues. Hopefully this gives you a good start on how to better tune autovacuum for your growing databases.

Written by Keith

October 1st, 2018 at 2:43 pm

Posted in PostgreSQL


Removing A Lot of Old Data (But Keeping Some Recent)


I’ve had this situation crop up a few times with clients, and after a recent discussion on #postgresql on Freenode, decided a blog post may be in order. The pitfalls that led me to this solution are useful to cover, and it seems a useful set of steps to have documented and be able to share later.

There comes a time for most people when you have a table that builds up quite a lot of rows and you then realize you didn’t actually need to keep all of it. But you can’t just run a TRUNCATE because you do want to keep some of the more recent data. If it’s just a few million small rows, it’s not a huge deal to just run a simple DELETE. But when that starts getting into the billions of rows, or your rows are very large (long text, bytea, etc), a simple DELETE may not be realistic.

The first, and arguably easiest, way to deal with this would be to run the DELETE in batches instead of one large transaction. This allows you to add a pause in between the batches to help control I/O. Another side effect of such large delete operations can be an excessively high amount of WAL generation. This not only contributes to I/O, it can also dramatically increase disk space usage. When you’re trying to delete data due to disk space constraints, this can end up making things even worse before they get better. And if you’re deleting quite a lot of data from over a long period of time that didn’t receive many deletes before, you likely won’t get much disk space back at the end without doing a VACUUM FULL or pg_repack (see my discussion on bloat here). And the biggest issue of all when disk space is critical is that either one of those options requires at least as much disk space available as a full copy of the unbloated table would take up. So if disk space is the crucial problem most people are trying to solve with removing old data, how can we easily do this?

The first option that’s most easily done on almost any RDBMS is to make a secondary table and have new data copied/redirected there. The I/O and WAL generation of moving the smaller, required amount of data is much less than removing all the old data. Then you can just drop the old table. And most people do this via a trigger-based method: have every write to the old table also write to the new one as well. Then once you’re sure things are working, you can take a brief lock on both the old and new tables and swap their names. PostgreSQL makes this last step much easier, having transactional DDL. I’ll cover the commands to do this later since my final solution is similar to this. The main problems surrounding this come when the table you’re trying to clean up is a very high-traffic table. Doing a trigger like this basically doubles all writes, which could possibly cause some I/O issues. There’s also the fact of making sure you get that trigger function code right, otherwise all writes break. That can be a little nerve-wracking on critical production systems. But there is a way to avoid both the additional I/O of double writes and the headache of writing triggers.

The actual solution we came up for this involves using the often overlooked feature of table inheritance in PostgreSQL. The steps basically go like this:

  1. Create a new table exactly like the first one
  2. Set ownership/privileges on the new table to match the old table
  3. Have the old table INHERIT the new table
  4. Swap old and new table names
  5. Move the data you want to keep from the old table to the new one
  6. Drop the old table

As soon as you do step 4, all new data is immediately going to the new table. Also, since the old table is a child of the new table, all your old data is still visible from the original table name. Moving the data from the old table to the new one is nearly transparent to any users of the table, the only issue being there may be some slightly slower queries during that transition period since the planner has to account for 2 tables. You can help mitigate this slightly by placing a constraint on the new table (before the swap when it’s empty) that only allows data in the new table’s window. This allows constraint exclusion to possibly ignore the old table while you get data moved. Placing a constraint on the old table probably wouldn’t help much since it would have to lock it for validation. Yes, there will be additional I/O and disk usage from WAL while you move data from the old table to the new, but this can be controlled to a much greater degree since all new data isn’t also being written twice by a trigger. You can move the data in batches with any necessary pauses to control those I/O & WAL spikes.
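The constraint mentioned above could be added like this (a sketch using the notifications example covered later; the column name and cutoff timestamp are just illustrations):

```sql
-- Added to the new table while it is still empty, before the name swap.
-- Lets constraint exclusion skip the old child table for queries that
-- only touch data newer than the cutoff.
ALTER TABLE public.notifications_new
    ADD CONSTRAINT notifications_new_window_check
    CHECK (updated_at >= '2016-10-18 00:00:00'::timestamp);
```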

The one thing this method does not account for is if there are updates to data in the old table that would technically place it in the new one while you’re in the process of moving the recent data you want to keep. That update will not be moved from the old table to the new one. But in most cases where we’ve done this, those updates weren’t that critical since, if you’d done the original method of just deleting the old data, you would’ve been deleting that data that was getting updated anyway. But it’s something to be aware of if you go querying the old table and still see “new” data after you think you’re done migrating. It’s also something to be aware of if your application thinks it updated something when the old table was there and now it’s suddenly gone. You may need to stick with the trigger method above if that’s the case.

An example of commands to do this is below. The LIKE clause to the CREATE TABLE statement is quite useful since this can automatically include all indexes, constraints, defaults, comments, and storage options. It does NOT include ownership or privileges however, so the one critical step in this piece is definitely #2 above. You can easily see a table’s privileges with the \dp option in psql. Also, explicitly obtaining the exclusive lock on both tables before doing the name switch ensures nothing weird happens during whatever brief moment could exist between the switch.

BEGIN;

CREATE TABLE public.notifications_new (LIKE public.notifications INCLUDING ALL);

ALTER TABLE public.notifications_new OWNER TO sysadmin;

GRANT select ON public.notifications_new TO read_only;
GRANT select, insert, update, delete ON public.notifications_new TO app_user;
GRANT all ON public.notifications_new TO admin;

ALTER TABLE public.notifications INHERIT public.notifications_new;

LOCK TABLE public.notifications IN ACCESS EXCLUSIVE MODE;
LOCK TABLE public.notifications_new IN ACCESS EXCLUSIVE MODE;
ALTER TABLE public.notifications RENAME TO notifications_old;
ALTER TABLE public.notifications_new RENAME TO notifications;

COMMIT;  (or ROLLBACK; if there's a problem)

Once all these steps are done, you can then begin the process of moving your more recent data out of the old table and into the new via whichever method works best for you. One easy method to batch this is a CTE query that does the DELETE/INSERT with a SELECT in a single query to limit the rows moved.

WITH row_batch AS (
    SELECT id FROM public.notifications_old WHERE updated_at >= '2016-10-18 00:00:00'::timestamp LIMIT 20000 ),
delete_rows AS (
    DELETE FROM public.notifications_old o USING row_batch b WHERE =
    RETURNING, o.account_id, o.created_at, o.updated_at, o.resource_id, o.notifier_id, o.notifier_type)
INSERT INTO public.notifications SELECT * FROM delete_rows;

And once that’s done, you can then DROP the old table, instantly recovering all that disk space with minimal WAL traffic and zero bloat aftermath!

Written by Keith

March 15th, 2017 at 11:32 am

Posted in PostgreSQL


PostgreSQL 10 Built-in Partitioning


Since I have a passing interest in partitioning in PostgreSQL, I figured I’d check out a recent commit to the development branch of PostgreSQL 10.

Implement table partitioning –;a=commitdiff;h=f0e44751d7175fa3394da2c8f85e3ceb3cdbfe63

Table partitioning is like table inheritance and reuses much of the
existing infrastructure, but there are some important differences.
The parent is called a partitioned table and is always empty; it may
not have indexes or non-inherited constraints, since those make no
sense for a relation with no data of its own.  The children are called
partitions and contain all of the actual data.  Each partition has an
implicit partitioning constraint.  Multiple inheritance is not
allowed, and partitioning and inheritance can't be mixed.  Partitions
can't have extra columns and may not allow nulls unless the parent
does.  Tuples inserted into the parent are automatically routed to the
correct partition, so tuple-routing ON INSERT triggers are not needed.
Tuple routing isn't yet supported for partitions which are foreign
tables, and it doesn't handle updates that cross partition boundaries.

Currently, tables can be range-partitioned or list-partitioned.  List
partitioning is limited to a single column, but range partitioning can
involve multiple columns.  A partitioning "column" can be an expression.

Because table partitioning is less general than table inheritance, it
is hoped that it will be easier to reason about properties of
partitions, and therefore that this will serve as a better foundation
for a variety of possible optimizations, including query planner
optimizations.  The tuple routing based which this patch does based on
the implicit partitioning constraints is an example of this, but it
seems likely that many other useful optimizations are also possible.

Amit Langote, reviewed and tested by Robert Haas, Ashutosh Bapat,
Amit Kapila, Rajkumar Raghuwanshi, Corey Huinker, Jaime Casanova,
Rushabh Lathia, Erik Rijkers, among others.  Minor revisions by me.

After many years of waiting, one of the major features missing from PostgreSQL is finally getting its first major step forward with the inclusion of a built-in partitioning option. The syntax and usage is fairly straightforward, so let’s jump straight into it with the examples from the documentation (slightly modified).

The basic syntax has two parts, one for initially creating a partitioned set

CREATE TABLE table_name ( ... )
[ PARTITION BY { RANGE | LIST } ( { column_name | ( expression ) } [ COLLATE collation ] [ opclass ] [, ... ] ) ]

And another for adding a child table to that partition set

CREATE TABLE table_name
    PARTITION OF parent_table [ (
  { column_name [ column_constraint [ ... ] ]
    | table_constraint }
    [, ... ]
) ] FOR VALUES partition_bound_spec

and partition_bound_spec is:

{ IN ( expression [, ...] ) |
  FROM ( { expression | UNBOUNDED } [, ...] ) TO ( { expression | UNBOUNDED } [, ...] ) }

A ranged partition set is likely the most common use-case and is what pg_partman provides for time & id partitioning. Each child table is given a specific upper and lower bound of data. One nice thing is that it can support multiple columns as well.

CREATE TABLE measurement (
    logdate         date not null,
    peaktemp        int default 1,
    unitsales       int
) PARTITION BY RANGE (logdate);
ALTER TABLE measurement ADD CHECK (peaktemp > 0);
CREATE TABLE measurement_y2016m07
    PARTITION OF measurement (
    unitsales DEFAULT 0
) FOR VALUES FROM ('2016-07-01') TO ('2016-08-01');

keith@keith=# \d+ measurement
                                 Table "public.measurement"
  Column   |  Type   | Collation | Nullable | Default | Storage | Stats target | Description 
-----------+---------+-----------+----------+---------+---------+--------------+-------------
 logdate   | date    |           | not null |         | plain   |              | 
 peaktemp  | integer |           |          | 1       | plain   |              | 
 unitsales | integer |           |          |         | plain   |              | 
Partition key: RANGE (logdate)
Check constraints:
    "measurement_peaktemp_check" CHECK (peaktemp > 0)
Partitions: measurement_y2016m07 FOR VALUES FROM ('2016-07-01') TO ('2016-08-01')

keith@keith=# \d measurement_y2016m07 
         Table "public.measurement_y2016m07"
  Column   |  Type   | Collation | Nullable | Default 
-----------+---------+-----------+----------+---------
 logdate   | date    |           | not null | 
 peaktemp  | integer |           |          | 1
 unitsales | integer |           |          | 0
Partition of: measurement FOR VALUES FROM ('2016-07-01') TO ('2016-08-01')
Check constraints:
    "measurement_peaktemp_check" CHECK (peaktemp > 0)

You can see that unlike all partitioning methods up until now, there is no user-visible trigger involved. All the data routing is handled internally which greatly simplifies setting things up and opens up greater opportunities for optimization. Also, constraints and some other properties are automatically inherited from the parent. But since indexes cannot be defined on the parent, they have to be defined per child. Above you can see you can define different defaults per child as well as other properties.
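As a quick sketch of the multi-column support mentioned above (the table and bound values here are hypothetical, not from the docs), the bounds are compared row-wise across the key columns:

```sql
-- Hypothetical two-column range partition key: rows are routed by
-- comparing (logdate, device_id) row-wise against the child bounds
CREATE TABLE readings (
    logdate     date not null,
    device_id   int not null,
    reading     numeric
) PARTITION BY RANGE (logdate, device_id);

CREATE TABLE readings_y2016m07_low
    PARTITION OF readings
    FOR VALUES FROM ('2016-07-01', 0) TO ('2016-08-01', 1000);
```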

The other partitioning type is LIST and this is for explicitly defining values to go into specific children.

CREATE TABLE cities (
    city_id         bigserial not null,
    name         text not null,
    population   int
) PARTITION BY LIST (initcap(name));

CREATE TABLE cities_west
    PARTITION OF cities (
    CONSTRAINT city_id_nonzero CHECK (city_id != 0)
) FOR VALUES IN ('Los Angeles', 'San Francisco');

keith@keith=# \d+ cities
                                                     Table "public.cities"
   Column   |  Type   | Collation | Nullable |                 Default                 | Storage  | Stats target | Description 
------------+---------+-----------+----------+-----------------------------------------+----------+--------------+-------------
 city_id    | bigint  |           | not null | nextval('cities_city_id_seq'::regclass) | plain    |              | 
 name       | text    |           | not null |                                         | extended |              | 
 population | integer |           |          |                                         | plain    |              | 
Partition key: LIST (initcap(name))
Partitions: cities_west FOR VALUES IN ('Los Angeles', 'San Francisco')

keith@keith=# \d+ cities_west 
                                                  Table "public.cities_west"
   Column   |  Type   | Collation | Nullable |                 Default                 | Storage  | Stats target | Description 
------------+---------+-----------+----------+-----------------------------------------+----------+--------------+-------------
 city_id    | bigint  |           | not null | nextval('cities_city_id_seq'::regclass) | plain    |              | 
 name       | text    |           | not null |                                         | extended |              | 
 population | integer |           |          |                                         | plain    |              | 
Partition of: cities FOR VALUES IN ('Los Angeles', 'San Francisco')
Check constraints:
    "city_id_nonzero" CHECK (city_id <> 0)

Here you can see the default of the sequence is inherited, as well as an example of defining a constraint on the child. So what happens if you try to insert data for which no child table is defined?

keith@keith=# insert into cities (name, population) values ('Baltimore', 2003);
2016-12-09 17:36:57.999 EST [15839] ERROR:  no partition of relation "cities" found for row
2016-12-09 17:36:57.999 EST [15839] DETAIL:  Failing row contains (1, Baltimore, 2003).
2016-12-09 17:36:57.999 EST [15839] STATEMENT:  insert into cities (name, population) values ('Baltimore', 2003);
ERROR:  no partition of relation "cities" found for row
DETAIL:  Failing row contains (1, Baltimore, 2003).
Time: 0.641 ms
keith@keith=# insert into cities (name, population) values ('Los Angeles', 2003);
Time: 3.127 ms
keith@keith=# select * from cities;
 city_id |    name     | population 
---------+-------------+------------
       2 | Los Angeles |       2003
(1 row)

Time: 0.525 ms

Invalid data is currently rejected. It also seems to have incremented the sequence (I re-ran this to be sure). Currently I'm not sure if there's a way around the data rejection, and defining a trigger with an exception handler is not a good idea. As I found out the hard way in pg_partman, the mere existence of an exception block in a trigger increases xid consumption, since every row inserted will increment the global xid value. If such a table has a high insert rate, you can quickly reach xid wraparound. I made a suggestion to the core devs to allow defining a child table as the default and have any data that doesn't match another child go there instead.

Sub-partitioning is also possible and you can easily mix and match both range and list partitioning types. Here we redefine the cities_west table to be partitioned as well.

CREATE TABLE cities_west
    PARTITION OF cities (
    CONSTRAINT city_id_nonzero CHECK (city_id != 0)
) FOR VALUES IN ('Los Angeles', 'San Francisco') PARTITION BY RANGE (population);

CREATE TABLE cities_west_10000_to_100000
    PARTITION OF cities_west FOR VALUES FROM (10000) TO (100000);

keith@keith=# \d+ cities
                                                     Table "public.cities"
   Column   |  Type   | Collation | Nullable |                 Default                 | Storage  | Stats target | Description 
------------+---------+-----------+----------+-----------------------------------------+----------+--------------+-------------
 city_id    | bigint  |           | not null | nextval('cities_city_id_seq'::regclass) | plain    |              | 
 name       | text    |           | not null |                                         | extended |              | 
 population | integer |           |          |                                         | plain    |              | 
Partition key: LIST (initcap(name))
Partitions: cities_west FOR VALUES IN ('Los Angeles', 'San Francisco')

keith@keith=# \d+ cities_west
                                                  Table "public.cities_west"
   Column   |  Type   | Collation | Nullable |                 Default                 | Storage  | Stats target | Description 
------------+---------+-----------+----------+-----------------------------------------+----------+--------------+-------------
 city_id    | bigint  |           | not null | nextval('cities_city_id_seq'::regclass) | plain    |              | 
 name       | text    |           | not null |                                         | extended |              | 
 population | integer |           | not null |                                         | plain    |              | 
Partition of: cities FOR VALUES IN ('Los Angeles', 'San Francisco')
Partition key: RANGE (population)
Check constraints:
    "city_id_nonzero" CHECK (city_id <> 0)
Partitions: cities_west_10000_to_100000 FOR VALUES FROM (10000) TO (100000)

keith@keith=# \d+ cities_west_10000_to_100000 
                                          Table "public.cities_west_10000_to_100000"
   Column   |  Type   | Collation | Nullable |                 Default                 | Storage  | Stats target | Description 
------------+---------+-----------+----------+-----------------------------------------+----------+--------------+-------------
 city_id    | bigint  |           | not null | nextval('cities_city_id_seq'::regclass) | plain    |              | 
 name       | text    |           | not null |                                         | extended |              | 
 population | integer |           | not null |                                         | plain    |              | 
Partition of: cities_west FOR VALUES FROM (10000) TO (100000)
Check constraints:
    "city_id_nonzero" CHECK (city_id <> 0)

Global indexes do not yet exist in PostgreSQL, so primary/unique keys still cannot be enforced across an entire partition set. And while you can define foreign keys on individual children to other tables, defining them back to a partitioned set will not work. Foreign keys also cannot be defined on the parent table with this new feature, so they'll have to be set per child. Thankfully these cases just return an error if you try them, instead of allowing them and causing confusion the way most current partition setups do.

keith@keith=# CREATE TABLE fk_test (unitsales integer);
Time: 3.818 ms
keith@keith=# CREATE TABLE measurement (
keith(#     logdate         date not null,
keith(#     peaktemp        int default 1,
keith(#     unitsales       int REFERENCES fk_test (unit_sales)
keith(# ) PARTITION BY RANGE (logdate);
2016-12-12 11:34:14.485 EST [5051] ERROR:  foreign key constraints are not supported on partitioned tables at character 122
2016-12-12 11:34:14.485 EST [5051] STATEMENT:  CREATE TABLE measurement (
	    logdate         date not null,
	    peaktemp        int default 1,
	    unitsales       int REFERENCES fk_test (unit_sales)
	) PARTITION BY RANGE (logdate);
ERROR:  foreign key constraints are not supported on partitioned tables
LINE 4:     unitsales       int REFERENCES fk_test (unit_sales)

keith@keith=# CREATE TABLE fk_test (unitsales integer REFERENCES measurement (unitsales));
2016-12-12 11:06:51.741 EST [5051] ERROR:  cannot reference partitioned table "measurement"
2016-12-12 11:06:51.741 EST [5051] STATEMENT:  CREATE TABLE fk_test (unitsales integer REFERENCES measurement (unitsales));
ERROR:  cannot reference partitioned table "measurement"
Time: 1.794 ms

To get rid of a partitioned set, you'd use the CASCADE option just like with an inheritance set.

keith@keith=# drop table cities cascade;
NOTICE:  drop cascades to 2 other objects
DETAIL:  drop cascades to table cities_west
drop cascades to table cities_west_10000_to_100000

Currently, running the TRUNCATE command on the parent table will truncate the entire partition set, but I did see discussion on the mailing lists of a WHERE clause to target only specific children. That would be a great feature, but even better would be a WHERE clause on the DROP TABLE command to make dropping old tables for retention maintenance much easier.
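In the meantime, you can of course run TRUNCATE directly against a single child; using the tables from the examples above:

```sql
TRUNCATE cities;                       -- empties every child in the partition set
TRUNCATE cities_west_10000_to_100000;  -- empties only that one child
```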

You may have noticed there's no mention of ongoing creation of new partitions when needed. This is where I see pg_partman still being useful at this time. I've only just barely started playing with this feature to see how I can incorporate it. My main method of determining what characteristics were inherited by children (indexes, constraints, defaults, FKs, etc.) was defining them on the parent, but the current limitations on what can be defined on the parent with this new feature make that trickier to handle. Whether I'll keep my old methods and just add this on as a new feature, or do another major code revision, remains to be seen. It will all depend on whether I can find a way to implement all the existing features I've added to pg_partman, since I'd rather not take any steps backwards if I don't have to. Reading the relevant threads on the mailing lists, it seems they have a lot more partitioning features to be added, so I may wait a bit before incorporating this into pg_partman too tightly, since I may overcomplicate things on my part due to a current lack of features. This email from Robert Haas gives me hope that this will be a well maintained core feature.

From the moment I started writing pg_partman, I knew built-in partitioning wasn’t too far around the corner. But we build the tools that we need at that time. I’m looking forward to seeing how this feature develops and I hope I can help out where possible to overcome any limitations encountered. Many thanks to Amit Langote and the review team!

Written by Keith

December 12th, 2016 at 11:48 am

Posted in PostgreSQL


Cleaning Up PostgreSQL Bloat

with 7 comments

As a followup to my previous post on checking for bloat, I figured I’d share some methods for actually cleaning up bloat once you find it. I’ll also be providing some updates on the script I wrote due to issues I encountered and thanks to user feedback from people that have used it already.

First, as these examples will show, the most important thing you need to clean up bloat is extra disk space. This means it is critically important to monitor your disk space usage if bloat turns out to be an issue for you. And if your database is of any reasonably large size, and you regularly do updates & deletes, bloat will be an issue at some point. I'd say a good goal is to always stay below 75% disk usage, either by archiving and/or pruning old data that's no longer needed, or by simply adding more disk space or migrating to new hardware altogether. Having less than 25% free can put you in a precarious situation where you may have a whole lot of disk space you can free up, but not enough room to actually do any cleanup at all, or not without impacting performance in big ways (e.g. you have to drop & recreate a bloated index instead of rebuilding it concurrently, making previously fast queries extremely slow).


The above graph (y-axis in terabytes) shows my recent adventures in bloat cleanup after using this new scan, and validates that what is reported by the script is actually bloat. In both this graph and the one below, there were no data purges going on and each of the significant line changes coincided exactly with a bloat cleanup session. You can see back on May 26-27th a huge drop in size: an initial tiny drop, followed by a fairly big increase, then the huge drop. That is me first fixing one small but very bloated index, followed by running a pg_repack to take care of both table bloat and a lot of index bloat. This is actually the group_members table I used as the example in my previous post. Over the next week or so I worked through roughly 80 bloated objects to recover about 270GB of disk space. Now, it may turn out that some of these objects will quickly return to their previous bloat levels, and those could be candidates for exclusion from the regular report. But I figured I'd go through everything wasting more than a few hundred MB just so I can better assess what the actual normal bloat level of this database is. Here's another example from another client that hadn't really had any bloat monitoring in place at all before (that I was aware of, anyway). It's showing disk space available instead of total usage, hence the line going the opposite direction, and db12 is a slave of db11.


The easiest, but most intrusive, bloat removal method is to just run a VACUUM FULL on the given table. This will take an exclusive lock on the table (blocks all reads and writes) and completely rebuild the table to new underlying files on disk. This clears out 100% of the bloat in both the table and all indexes it contains at the expense of blocking all access for the duration. If you can afford the outage, it’s the easiest, most reliable method available. For very small tables this is likely your best option.

The next option is to use the REINDEX command. This can be run on several levels: INDEX, TABLE, DATABASE. 9.5 introduced the SCHEMA level as well. Running it on the TABLE level has the same consequence of likely locking the entire table for the duration, so if you’re going that route, you might as well just run a VACUUM FULL. Same for running at the DATABASE level, although if you’re running 9.5+, it did introduce parallel vacuuming to the vacuumdb console command, which would be much more efficient. When running on the INDEX level, things are a little more flexible. All writes are blocked to the table, but if a read-only query does not hit the index that you’re rebuilding, that is not blocked. If you can afford several shorter outages on a given table, or the index is rather small, this is the best route to take for bloat cleanup.
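As a rough sketch of those levels, reusing the index and table from the example further down (the database name here is just a placeholder):

```sql
REINDEX INDEX index_guests_on_state;  -- blocks writes; reads that don't use this index continue
REINDEX TABLE guests;                 -- rebuilds all indexes on the table, locking it throughout
REINDEX SCHEMA public;                -- 9.5+ only
REINDEX DATABASE mydb;                -- every index in the database
```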

If you’ve got tables that can’t really afford long outages, then things start getting tricky. Before getting into pg_repack, I’d like to share some methods that can be used without third-party tools. Index bloat is the most common occurrence, so I’ll start with that.

If you’ve just got a plain old index (b-tree, gin or gist), there’s a combination of 3 commands that can clear up bloat with minimal downtime (depending on database activity). The CONCURRENTLY flag to the CREATE INDEX command allows an index to be built without blocking any reads or writes to the table. So say we had this bloated index

      objectname                          | objecttype | dead_tuples | free_space | free_percent
------------------------------------------+------------+-------------+------------+-------------
 index_guests_on_state                    | index      | 0 bytes     | 9460 MB    |        60.04

No dead tuples (so autovacuum is running efficiently) and 60% of the total index is free space that can be reclaimed. A handy command to get the definition of an index is pg_get_indexdef(regclass). In this case it’s a very easy index definition, but when you start getting into some really complicated functional or partial indexes, having a definition you can copy-n-paste is a lot safer.

# SELECT pg_get_indexdef('index_guests_on_state'::regclass);
 CREATE INDEX index_guests_on_state ON guests USING btree (state)

Now we can write our set of commands to rebuild the index

CREATE INDEX CONCURRENTLY index_guests_on_state_new ON guests USING btree (state);
ANALYZE guests;
DROP INDEX CONCURRENTLY index_guests_on_state;
ANALYZE guests;
ALTER INDEX index_guests_on_state_new RENAME TO index_guests_on_state;

I threw the ANALYZE calls in there just to ensure that the catalogs are up to date for any queries coming in during this rebuild. May not really be necessary, but I was doing this on a very busy table, so I’d rather be paranoid about it. Neither the CREATE nor the DROP command will block any other sessions that happen to come in while this is running. However, that final ALTER INDEX call can block other sessions coming in that try to use the given table. But the rename is optional and can be done at any time later. After the DROP command, your bloat has been cleaned up. While concurrent index creation does not block, there are some caveats with it, the major one being it can take much longer to rebuild the index. One of these for the second client above took 4.5 hours to complete. The documentation on building indexes concurrently goes into more detail on this, and how to deal with it possibly failing.
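One side effect of a failed concurrent build worth knowing about: it leaves behind an INVALID index that still has to be dropped manually. A query along these lines against the catalogs will find any such leftovers:

```sql
-- Find indexes left INVALID by a failed CREATE INDEX CONCURRENTLY
SELECT n.nspname, c.relname
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE NOT i.indisvalid;
```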

If you're running this on a UNIQUE index, you may run into an issue if it was created as a UNIQUE CONSTRAINT vs a UNIQUE INDEX. Functionally, both are the same as far as PostgreSQL is concerned. And under the hood, creating a unique constraint will just create a unique index anyway. The big difference is you will not be able to drop a unique constraint concurrently. You will have to do an ALTER TABLE ... DROP CONSTRAINT ... call, which will require an exclusive lock, just like the RENAME above. Also, the index is more flexible since you can make a partial unique index as well. So it's better to just make a unique index vs a constraint if possible.
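A quick sketch of the difference (table and column names here are hypothetical):

```sql
-- As a constraint: dropping it requires ALTER TABLE, which takes
-- an exclusive lock on the table
ALTER TABLE guests ADD CONSTRAINT guests_email_key UNIQUE (email);
ALTER TABLE guests DROP CONSTRAINT guests_email_key;

-- As an index: it can be dropped concurrently and can also be partial
CREATE UNIQUE INDEX CONCURRENTLY guests_email_idx ON guests (email)
    WHERE email IS NOT NULL;
DROP INDEX CONCURRENTLY guests_email_idx;
```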

PRIMARY KEYs are another special case. Functionally, they’re no different than a unique index with a NOT NULL constraint on the column. But they are marked specially in the catalog and some applications specifically look for them. You can do something very similar to the above, taking advantage of the USING clause to the ADD PRIMARY KEY command.

postgres@production=# CREATE UNIQUE INDEX CONCURRENTLY email_records_pkey_new ON public.email_records USING btree (id);
Time: 2794263.993 ms
postgres@production=# ANALYZE public.email_records;
Time: 1658.590 ms
postgres@production=# ALTER TABLE public.email_records DROP CONSTRAINT email_records_pkey;
Time: 43.774 ms
postgres@production=# ANALYZE public.email_records;
Time: 1518.819 ms
postgres@production=# ALTER INDEX public.email_records_pkey_new RENAME TO email_records_pkey;
Time: 0.586 ms
postgres@production=# ALTER TABLE public.email_records ADD PRIMARY KEY USING INDEX email_records_pkey;
Time: 13.219 ms
postgres@production=# ANALYZE public.email_records;
Time: 1611.376 ms

I gave full command examples here so you can see the runtimes involved. The concurrent index creation took quite a while (about 46 minutes), but everything besides the analyze commands was sub-second. Giving the command to create a primary key an already existing unique index to use allows it to skip the creation and validation usually done with that command.

As always, there are caveats to this. If the primary key, or any unique index for that matter, has any FOREIGN KEY references to it, you will not be able to drop that index without first dropping the foreign key(s). If there's only 1 or 2 of those, you can likely do this in a transaction surrounding the drop & recreation of the primary key, with commands that also drop and recreate the foreign keys. But as more of them pile up, that means a longer and longer outage for the foreign key validation, which will lock all tables involved. It also increases the likelihood of an error in the DDL you're writing to manage recreating everything. In that case, it may just be better to take the outage to rebuild the primary key with the REINDEX command.

In all cases where I can use the above methods, I always try them first. They're the native methods built into the database and, as long as you don't typo the DDL commands, they're not likely to cause any issues cropping up later down the road. And since index bloat is primarily where I see the worst problems, they solve most cases (the second graph above was all index bloat). If you're unable to use any of them, though, the pg_repack tool is very handy for removing table bloat or handling situations with very busy or complicated tables that cannot take extended outages. It's gotten pretty stable over the last year or so, but having seen some of the bugs that were encountered with it previously, I use it as a last resort for bloat removal. Also, if you're running low on disk space, you may not have enough room for pg_repack, since it requires rebuilding the entire table and all of its indexes in secondary tables before it can remove the original bloated one. As I said above, I did use it where you see that initial huge drop in disk space on the first graph, but before that there was a rather large spike to get there. In that case, the table had many, many foreign keys & triggers and was a very busy table, so it was easier to let pg_repack handle it.

For table bloat, Depesz wrote some blog posts a while ago that are still relevant with some interesting methods of moving data around on disk. This can also be handy when you are very low on disk space.

Bloat Removal Without Table Swapping

Bloat Removal By Tuples Moving

Since I initially wrote my blog post, I've had some great feedback from people using the script already. I've gotten several bugs fixed as well as added some new features, with version 2.1.0 being the latest available as of this blog post. json is now the preferred, structured output method if you need to see more details outside of querying the stats table in the database. I also added an --exclude_object_file option that allows more fine-grained filtering when you want to ignore certain objects in the regular report, but not forever in case they get out of hand. I updated the README with some examples of that since it's a little more complex.

I also made note of the fact that this script isn't something that's made for real-time monitoring of bloat status. Since it's doing full scans on both tables and indexes, it has the potential to force data out of shared buffers. So if you keep running it often, you may affect query performance of things that rely on data being readily available there. It's best to run it maybe once a month, or once a week at most, during off-peak hours. If you have particularly troublesome tables you want to keep an eye on more regularly, the --tablename option allows you to scan just that specific table and nothing else. Once you've gotten the majority of your bloat issues cleaned up after your first few runs of the script and have seen how bad things may be, bloat shouldn't get out of hand so quickly that you need to run it that often. If it does, you may want to re-evaluate how you're using PostgreSQL (e.g. MVCC makes it not great as a queuing system).

If anyone else has some handy tips for bloat cleanup, I’d definitely be interested in hearing them.

Written by Keith

June 8th, 2016 at 5:25 pm

Posted in PostgreSQL
