Keith's Ramblings…

Archive for the ‘partitioning’ tag

Managing Constraint Exclusion in Table Partitioning

with 8 comments

One of the biggest advantages of table partitioning in databases is taking advantage of a feature called constraint exclusion. The PostgreSQL docs explain this best:

With constraint exclusion enabled, the planner will examine the constraints of each partition and try to prove that the partition need not be scanned because it could not contain any rows meeting the query’s WHERE clause. When the planner can prove this, it excludes the partition from the query plan.

Using the following table partitioned on col1, some examples of how this works will follow

Without constraint exclusion, doing a simple SELECT * for a smaller subset of the partition set still does a scan across all tables (constraint exclusion is turned on by default with the setting “partition”).

Even though this is doing INDEX scans, doing this against a partition set with much larger amounts of data would be extremely expensive and cause significant delays. With constraint exclusion on, you can see the plan is greatly simplified and only the tables that need to be queried actually are

However, when you create a partitioned set, usually only the column that has a constraint on it is the column that is controlling the partitioning. As soon as you do a query with a WHERE condition on one of the other columns, you lose all the advantages you thought you gained. In this case its even worse because it causes full sequential scans across the whole set. Indexes could help, but again, not if your partition set is very large.

You could try and add constraints before you insert the data, but often that is either hard or impossible. Another alternative is to go back and create constraints on old partitions that are no longer being edited based on the data they contain. That is the solution I built into pg_partman with the 1.5.0 release.

The second constraint is the new one that was created based on the data currently in the child table. The configuration column constraint_cols is an ARRAY that can contain as many columns for the given table that you’d like to have constraints set for. The above is how you can go back and add additional constraints to an already existing partition set. The python script will apply the constraint to all relevant child partitions. It determines which children to apply the constraints to by the optimize_constraint pg_partman configuration value for that partition set (prior to v2.2.0 and in the old 1.x series, this was based on the premake value). The default is 30, so it will not apply constraints to child tables newer than the current and the previous 30. In this example, I set this value to 4 so it will apply constraints to child tables older than the previous 4 .

So, while it won’t apply constraints to all the old children, it can at least allow constraint exclusion to potentially exclude a huge majority of them. In this case it was able to exclude partitions _p30, _p40 & _p50. As more partitions are added, more constraints would get added to older partitions and at most the tables that would need to be scanned would be the current one, the last 4, the empty future ones and whichever partitions contained the relevant data. On a partition set with thousands of partitions and/or partitions with millions of rows each, that would have significant performance benefits.

The big caveat with this is that it would prevent edits on any of these older tables if the value would change the constraint boundaries. If there comes a time when you must edit this data, some included functions can help. The drop_constraints() and apply_constraints() functions can drop then reapply all constraints managed by pg_partman on a given child table (see docs for their parameters). They’ve been designed to cleanly handle if constraints created by pg_partman already exist or if the columns contain only NULL values. These functions only work on a single child table at a time. If you need to drop/apply to the entire partition set, the python script used  above can make that much easier.

You can set this up during partition creation with a parameter to the create_parent() function. Or it can be added at any time later like I did above. Once the additional constraint columns are configured, the normal partition maintenance procedures that pg_partman runs will take care of creating the constraints for you.

This is a pretty tricky situation to handle and its one of the things that makes partitioning a big challenge to set up correctly. And it’s something most people don’t realize until they’ve already gotten their partitioning set up and only encounter the issue when they start noticing the performance impacts of scanning their now very large partition set. Hopefully this feature will allow more people to take advantage of partitioning when long term storage of data is required within PostgreSQL. If anyone has any feedback or suggestions for making this feature better, I’d greatly appreciate it.

Written by Keith

January 8th, 2014 at 11:21 am

Table Partitioning & Long Names

with 2 comments

While working on a way to reapply indexes to existing partitions in my partition management extension, it dawned on me that PostgreSQL’s default object name length limit of 63 characters could cause some interesting issues for child table names when setting up partitioning. Since most partitioning relies on a suffix on the end of the child table names for uniqueness and to easily identify their contents, the automatic truncation of a long child name could really cause some big problems. So I set about seeing if I could handle this with pg_partman.

After several weeks of rewriting a lot of the core functionality of object naming, I think I’ve handled this as best as I can. This applied not only to the child table names, but also to the trigger function and trigger name. Below is a simplified version of the function that’s included in the extension for general use in any situation.

This takes whatever object name is given, checks the length with or without the provided suffix, and returns an object name with a valid length. If the object name would be too long with the suffix, the original name is truncated and the suffix is added. You can also just provide an object name and no suffix and it will check that as well. If no truncation is necessary, you get your original name back with the schema and/or suffix applied if given. The schema name of an object does not count against this length limit, but a parameter is provided so you can get a schema qualified name returned.

Here is an example when no truncation is needed:

Here’s an example where truncation would be needed

The only issue I’ve seen with this method is that if you have a lot of really long, similarly named tables, you can still run into naming conflicts. Especially with serial based partitioning where the original table name is slowly getting truncated more and more over time. But I think handling it like this is preferred to having the suffix unexpectedly truncated, which could cause conflicts within the same partition set.

It’s edge cases like this that drove me to try and make something to handle partitioning easier in PostgreSQL until it’s built in. I’ll be giving a talk at PG Open next week about the development I’ve been doing and how the new extensions system has made it much easier to contribute to PostgreSQL without knowing a single line of C!


Written by Keith

September 9th, 2013 at 12:18 pm

PostgreSQL Partition Manager

with 9 comments

Update June 11, 2015: Version 2.0.0 of pg_partman has been released and changed some of what is discussed below, most significantly there are no longer distinct Static & Dynamic partition trigger functions. Please see my updated post,, as well as the CHANGELOG included with the extension.

I’ve finally gotten a lot of the planned features implemented for another extension, so it’s about time I wrote a blog post on it. One of the bigger features missing from PostgreSQL that some other database systems have is table partition management. That’s not to say there isn’t a way to do it, or that it isn’t being worked on, but at the current time it’s not the easiest thing to get going from scratch.

The most common method of partitioning that I’ve seen is either time or serial/id based. There are of course other partitioning methods used, such as alphabetically or via UUID. But those methods are usually a once-and-done setup and don’t require as much on-going maintenance. Time & serial based partitioning require new tables and/or functions to be created indefinitely and it can be quite a challenge to set it up and handle all edge cases consistently.  So that is what my PG Partition Manager extension does. It currently only handles inserts to the parent table, but I’ll be trying to get update support in the future. It automates several processes involved in partitioning:

  • Parent & child table creation
  • Trigger function creation and updating
  • Dropping unneeded partitions
  • Moving existing data into partitions
  • Undoing partitioning and moving data back into the parent.

I went with triggers vs rules for handling inserts on the parent. Another blog post better explains the differences, so please read there if you’re more curious about them. Child tables are always pre-created before they are needed to handle the race condition with triggers and creating partitions on the fly.

Static vs Dynamic

The other major difference of methods for partitioning with triggers is whether you statically list each child table in the trigger function or use a dynamic function that determines which child table to insert to based on the data in the statement. The first, static method leads to better performance since the query plan in the function can be cached. However, the more child tables that are added to the conditional statement, the slower performance becomes, so this method is usually limited to partition sets where only recent data is inserted. The dynamic method uses an EXECUTE statement in the function to designate which child table to use on the fly, so it cannot be cached. But, it is capable of handling inserts to any child table in the set at any time. This extension supports both of these methods.

Time & Serial Partitioning

For time-based partitioning, the following time periods are supported: yearly, quarterly, monthly, weekly, daily, hourly, half-hourly, quarter-hourly. If anyone needs decade, century, or a smaller regular time interval I can probably get that added fairly easily. Just put in a request on github. Outside of that, dynamic or non-regular time periods are bit more challenging, so I don’t know if I’ll ever be doing that. The tables are pre-created using a cronjob to call a maintenance function. The number of child table to pre-create is configurable. That premake number also designates how many child tables the static partitioning method will handle, both forward and backward in time. So if you partition daily with a premake value of 4 (the default) on July 4th, it will handle inserting data from June 30th to July 8th.

Serial/id-based partitioning is done using a fixed integer interval that each child table will contain based on an integer column’s value (typically the primary key). Pre-creation is not managed via cron since it’s not predictable how quickly or slowly a partition will fill up. At the cost of a slight performance hit vs the time-based method, the current partition column value is checked at each insert and when it reaches 50% of maximum, the next partition is made. Partitions are still pre-made in the same manner, though, so the next partition being created is typically several intervals ahead to avoid race conditions of the next child table not existing for an insert.

For both time & serial based partitioning, if for some reason the required child doesn’t exist, it is inserted into the parent. A check_parent() function is included so you can monitor all partition sets managed by the extension for data getting put into the parent table.

To make configuring indexes, constraints, defaults & permissions easier, all of those table properties are inherited from the parent. Just set what you need on the parent, and all future partitions will obtain those properties. Due to performance issues, privileges are not retroactively added to all prior partitions if you change them on an existing partition set. But there is a function included to do that for you when your systems can handle it.

Partitioning Existing Data

Tables with existing data can also be partitioned with minimal impact to a production system. When you create a partition set, a brief, exclusive lock is taken on the given parent table to add the required trigger and child tables. Any new data inserted from then on is properly partitioned. All existing data is left in the new parent table and an included python script can move that data in batches, creating all child tables that are needed.

To give you an example of how well this works, I successfully partitioned a high-traffic, 74 million row table with zero noticeable interruption. It took about a 2-3 second lock on parent table which caused inserts to wait but they all successfully completed once new partitions & triggers were in place. I then partitioned the existing data using a 1 hour interval to avoid large, long insert batches and the 5 second wait option to ease the load on the streaming slave a bit. This averaged 50,000 – 200,000 rows per batch and the slave never fell more than 100 seconds behind and quickly caught up once it was done. Here is the script command that did that:

There is also an included python script that can undo a partition set, moving all data from child tables back into the parent. This script can actually work on any inheritance table setup in postgres, not just ones managed by the extension.


One of the biggest advantages of partitioning is being able to remove unneeded data without expensive delete commands and/or table bloat. Using the extension’s retention configuration options, the same maintenance function that manages pre-making time-based partitions can also automatically drop partitions you no longer need. By default, tables are just uninherited from the parent and not actually dropped, but you can also have it drop just the indexes from the old child table to recover space, move the table to a different schema or actually drop the table. Moving the table to a new schema also allows the use of another included python script that can use pg_dump to extract the tables in that schema out along with a SHA-512 hash before actually dropping them from the database. This allows you to keep old data around without having it take up space in the database itself.


To provide monitoring of any issues with partition creation and and audit trail, my pg_jobmon extension works with pg_partman as well. It is optional and will automatically be used if it is available. The premake value was set to 4 by default because pg_jobmon’s default is to alert when a function fails to run 3 times in a row. That way it just works by default to raise an alert before the last partition is used up.


I’ve gotten a lot of positive feedback from people when I’ve given Lightning Talks at recent PostgreSQL conferences about this extension. It’s also made common partitioning solutions for clients a lot easier to implement. Not having partitioning built into PostgreSQL still has its shortcomings, one of the biggest being primary/unique keys not being enforced across the entire partition set unless that column happens to be the one the partition constraints are based on. Or lack of constraint exclusion support outside of the partition column. But, along with update support, I’m working on some other options that won’t provide a perfect solution to these issues, but will hopefully make trying to manage them easier.

Written by Keith

July 8th, 2013 at 11:48 am