PostgreSQL Partition Manager

2013-07-08 1419 words 7 minutes

Contents

Update June 11, 2015: Version 2.0.0 of pg_partman has been released and changed some of what is discussed below, most significantly there are no longer distinct Static & Dynamic partition trigger functions. Please see my updated post, http://www.keithf4.com/pg-partition-manager-v2-0-0, as well as the CHANGELOG included with the extension.

I’ve finally gotten a lot of the planned features implemented for another extension, so it’s about time I wrote a blog post on it. One of the bigger features missing from PostgreSQL that some other database systems have is table partition management. That’s not to say there isn’t a way to do it, or that it isn’t being worked on, but at the current time it’s not the easiest thing to get going from scratch.

The most common method of partitioning that I’ve seen is either time or serial/id based. There are of course other partitioning methods used, such as alphabetically or via UUID. But those methods are usually a once-and-done setup and don’t require as much on-going maintenance. Time & serial based partitioning require new tables and/or functions to be created indefinitely and it can be quite a challenge to set it up and handle all edge cases consistently. So that is what my PG Partition Manager extension does. It currently only handles inserts to the parent table, but I’ll be trying to get update support in the future. It automates several processes involved in partitioning:

Parent & child table creation
Trigger function creation and updating
Dropping unneeded partitions
Moving existing data into partitions
Undoing partitioning and moving data back into the parent.

I went with triggers vs rules for handling inserts on the parent. Another blog post better explains the differences, so please read there if you’re more curious about them. Child tables are always pre-created before they are needed to handle the race condition with triggers and creating partitions on the fly.

Static vs Dynamic

The other major difference of methods for partitioning with triggers is whether you statically list each child table in the trigger function or use a dynamic function that determines which child table to insert to based on the data in the statement. The first, static method leads to better performance since the query plan in the function can be cached. However, the more child tables that are added to the conditional statement, the slower performance becomes, so this method is usually limited to partition sets where only recent data is inserted. The dynamic method uses an EXECUTE statement in the function to designate which child table to use on the fly, so it cannot be cached. But, it is capable of handling inserts to any child table in the set at any time. This extension supports both of these methods.

Time & Serial Partitioning

For time-based partitioning, the following time periods are supported: yearly, quarterly, monthly, weekly, daily, hourly, half-hourly, quarter-hourly. If anyone needs decade, century, or a smaller regular time interval I can probably get that added fairly easily. Just put in a request on github. Outside of that, dynamic or non-regular time periods are bit more challenging, so I don’t know if I’ll ever be doing that. The tables are pre-created using a cronjob to call a maintenance function. The number of child table to pre-create is configurable. That premake number also designates how many child tables the static partitioning method will handle, both forward and backward in time. So if you partition daily with a premake value of 4 (the default) on July 4th, it will handle inserting data from June 30th to July 8th.

Serial/id-based partitioning is done using a fixed integer interval that each child table will contain based on an integer column’s value (typically the primary key). Pre-creation is not managed via cron since it’s not predictable how quickly or slowly a partition will fill up. At the cost of a slight performance hit vs the time-based method, the current partition column value is checked at each insert and when it reaches 50% of maximum, the next partition is made. Partitions are still pre-made in the same manner, though, so the next partition being created is typically several intervals ahead to avoid race conditions of the next child table not existing for an insert.

For both time & serial based partitioning, if for some reason the required child doesn’t exist, it is inserted into the parent. A check_parent() function is included so you can monitor all partition sets managed by the extension for data getting put into the parent table.

To make configuring indexes, constraints, defaults & permissions easier, all of those table properties are inherited from the parent. Just set what you need on the parent, and all future partitions will obtain those properties. Due to performance issues, privileges are not retroactively added to all prior partitions if you change them on an existing partition set. But there is a function included to do that for you when your systems can handle it.

Partitioning Existing Data

Tables with existing data can also be partitioned with minimal impact to a production system. When you create a partition set, a brief, exclusive lock is taken on the given parent table to add the required trigger and child tables. Any new data inserted from then on is properly partitioned. All existing data is left in the new parent table and an included python script can move that data in batches, creating all child tables that are needed.

To give you an example of how well this works, I successfully partitioned a high-traffic, 74 million row table with zero noticeable interruption. It took about a 2-3 second lock on parent table which caused inserts to wait but they all successfully completed once new partitions & triggers were in place. I then partitioned the existing data using a 1 hour interval to avoid large, long insert batches and the 5 second wait option to ease the load on the streaming slave a bit. This averaged 50,000 – 200,000 rows per batch and the slave never fell more than 100 seconds behind and quickly caught up once it was done. Here is the script command that did that:

python partition_data.py -p "tracking.hits" -t time -c "host=localhost dbname=mydb" -i "1 hour" -w 5

There is also an included python script that can undo a partition set, moving all data from child tables back into the parent. This script can actually work on any inheritance table setup in postgres, not just ones managed by the extension.

Retention

One of the biggest advantages of partitioning is being able to remove unneeded data without expensive delete commands and/or table bloat. Using the extension’s retention configuration options, the same maintenance function that manages pre-making time-based partitions can also automatically drop partitions you no longer need. By default, tables are just uninherited from the parent and not actually dropped, but you can also have it drop just the indexes from the old child table to recover space, move the table to a different schema or actually drop the table. Moving the table to a new schema also allows the use of another included python script that can use pg_dump to extract the tables in that schema out along with a SHA-512 hash before actually dropping them from the database. This allows you to keep old data around without having it take up space in the database itself.

Monitoring

To provide monitoring of any issues with partition creation and and audit trail, my pg_jobmon extension works with pg_partman as well. It is optional and will automatically be used if it is available. The premake value was set to 4 by default because pg_jobmon’s default is to alert when a function fails to run 3 times in a row. That way it just works by default to raise an alert before the last partition is used up.

Conclusion

I’ve gotten a lot of positive feedback from people when I’ve given Lightning Talks at recent PostgreSQL conferences about this extension. It’s also made common partitioning solutions for clients a lot easier to implement. Not having partitioning built into PostgreSQL still has its shortcomings, one of the biggest being primary/unique keys not being enforced across the entire partition set unless that column happens to be the one the partition constraints are based on. Or lack of constraint exclusion support outside of the partition column. But, along with update support, I’m working on some other options that won’t provide a perfect solution to these issues, but will hopefully make trying to manage them easier.