Keith's Ramblings…


Managing Constraint Exclusion in Table Partitioning


One of the biggest advantages of table partitioning in databases is a feature called constraint exclusion. The PostgreSQL docs explain this best:

With constraint exclusion enabled, the planner will examine the constraints of each partition and try to prove that the partition need not be scanned because it could not contain any rows meeting the query’s WHERE clause. When the planner can prove this, it excludes the partition from the query plan.

Using the following table partitioned on col1, here are some examples of how this works.
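The table definition itself isn't reproduced here, but a minimal sketch of a time-based partition set (table and schema names are illustrative, not from the original post) looks like:

```sql
-- Parent table; col1 is the partitioning column
CREATE TABLE partman_test.time_table (
    col1 timestamptz NOT NULL DEFAULT now()
    , col2 text
);

-- Child tables inherit from the parent; the CHECK constraints on col1
-- are what constraint exclusion uses to prune children from a plan
CREATE TABLE partman_test.time_table_p2014_01_07 (
    CHECK (col1 >= '2014-01-07' AND col1 < '2014-01-08')
) INHERITS (partman_test.time_table);

CREATE TABLE partman_test.time_table_p2014_01_08 (
    CHECK (col1 >= '2014-01-08' AND col1 < '2014-01-09')
) INHERITS (partman_test.time_table);
```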

Without constraint exclusion, doing a simple SELECT * for even a small subset of the partition set still scans every child table. (Constraint exclusion is turned on by default with the setting "partition".)

Even though this is doing INDEX scans, doing this against a partition set with much larger amounts of data would be extremely expensive and cause significant delays. With constraint exclusion on, you can see the plan is greatly simplified and only the tables that need to be queried actually are.
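You can compare the two behaviors yourself with something like the following sketch (table name is hypothetical):

```sql
-- With exclusion off, the plan scans every child, regardless of the WHERE clause
SET constraint_exclusion = off;
EXPLAIN SELECT * FROM partman_test.time_table WHERE col1 >= '2014-01-08';

-- 'partition' is the default: children whose CHECK constraints cannot
-- possibly match the WHERE clause are excluded from the plan
SET constraint_exclusion = partition;
EXPLAIN SELECT * FROM partman_test.time_table WHERE col1 >= '2014-01-08';
```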

However, when you create a partitioned set, usually only the column that controls the partitioning has a constraint on it. As soon as you do a query with a WHERE condition on one of the other columns, you lose all the advantages you thought you gained. In this case it's even worse because it causes full sequential scans across the whole set. Indexes could help, but again, not if your partition set is very large.

You could try and add constraints before you insert the data, but often that is either hard or impossible. Another alternative is to go back and create constraints on old partitions that are no longer being edited based on the data they contain. That is the solution I built into pg_partman with the 1.5.0 release.

The second constraint is the new one that was created based on the data currently in the child table. The configuration column constraint_cols is an ARRAY that can contain as many columns of the given table as you'd like to have constraints set for. The above is how you can go back and add additional constraints to an already existing partition set. The python script will apply the constraint to all relevant child partitions. It determines which children to apply the constraints to using the optimize_constraint pg_partman configuration value for that partition set (prior to v2.2.0 and in the old 1.x series, this was based on the premake value). The default is 30, so it will not apply constraints to child tables newer than the current one and the previous 30. In this example, I set this value to 4 so it will apply constraints to child tables older than the previous 4.
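As a sketch, configuring this for an existing set and applying it with the included script could look like the following. The script name and flags come from pg_partman's extras directory; verify them against your version's documentation before relying on them:

```shell
# Tell pg_partman which additional column(s) to build constraints on
# (constraint_cols is the configuration column mentioned above)
psql -d mydb -c "UPDATE partman.part_config
    SET constraint_cols = ARRAY['col2']
    WHERE parent_table = 'partman_test.id_table'"

# Then apply the constraints to all eligible older child tables
# (flag names may vary by pg_partman version)
python reapply_constraints.py -p partman_test.id_table --add
```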

So, while it won't apply constraints to all the old children, it can at least allow constraint exclusion to potentially exclude a huge majority of them. In this case it was able to exclude partitions _p30, _p40 & _p50. As more partitions are added, more constraints would get added to older partitions, and at most the tables that would need to be scanned would be the current one, the previous 4, the empty future ones, and whichever partitions contained the relevant data. On a partition set with thousands of partitions and/or partitions with millions of rows each, that would have significant performance benefits.

The big caveat with this is that it prevents edits on any of these older tables if the new value would fall outside the constraint boundaries. If there comes a time when you must edit this data, some included functions can help. The drop_constraints() and apply_constraints() functions can drop and then reapply all constraints managed by pg_partman on a given child table (see docs for their parameters). They've been designed to cleanly handle the cases where constraints created by pg_partman already exist or where the columns contain only NULL values. These functions only work on a single child table at a time. If you need to drop/apply constraints across the entire partition set, the python script used above can make that much easier.

You can set this up during partition creation with a parameter to the create_parent() function. Or it can be added at any time later like I did above. Once the additional constraint columns are configured, the normal partition maintenance procedures that pg_partman runs will take care of creating the constraints for you.

This is a pretty tricky situation to handle, and it's one of the things that makes partitioning a big challenge to set up correctly. It's also something most people don't realize until they've already gotten their partitioning set up, and they only encounter the issue when they start noticing the performance impact of scanning their now very large partition set. Hopefully this feature will allow more people to take advantage of partitioning when long term storage of data is required within PostgreSQL. If anyone has any feedback or suggestions for making this feature better, I'd greatly appreciate it.

Written by Keith

January 8th, 2014 at 11:21 am

PostgreSQL Extension Code Organization


I was going to title this “Why is my extension failing to install in 9.3?”, but I figured a more general post about how I’ve been organizing my extension code would be more helpful. But I will also be answering that question along the way since my organization method provided a solution for me.

For extension installation, the code that installs objects within the database must be in a file in the format extension--version.sql. If your extension is primarily C code with just a few functions to provide calling methods, the SQL is usually pretty easy to maintain in a single file. You can organize the pre-compiled source code as needed and the Makefile can then designate the files that get compiled and installed during installation. But if your entire extension is PL/pgSQL (or any of the other scripting languages), it can really start to get hard to manage all that code in a single file, which is what I see most people doing.

Extensions were my first venture into even looking at the contents of a Makefile. I used the template from David Wheeler's PGXN HowTo to get started. Learning what the bit of code below does gave me a clue to a way to manage things a little more easily.
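The relevant bit from the PGXN HowTo's Makefile is a rule along these lines:

```make
# Build the versioned install file by copying the single source file
sql/$(EXTENSION)--$(EXTVERSION).sql: sql/$(EXTENSION).sql
	cp $< $@
```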

Essentially that tells make to copy the contents of whatever is after the colon into the filename given before the colon.  So my solution was to split all my functions, tables, types, etc out into their own files organized into folders and then have make pull it all together into a single file.
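A sketch of such a rule, with hypothetical folder names, assuming types must be created before tables and tables before functions:

```make
# Concatenate all object files, in dependency order, into the single
# versioned file that CREATE EXTENSION actually reads
sql/$(EXTENSION)--$(EXTVERSION).sql: sql/types/*.sql sql/tables/*.sql sql/functions/*.sql
	cat $^ > $@
```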

The order of the files listed after the colon is important since make pulls things together in the order given. In the single sql file for my pg_partman extension, types must be created before tables and tables must be created before functions. Your extension may need things done in a different order, so the important thing is to just organize your files in a manner that you can give make a specific order to merge things.

The main extension file (extension--version.sql) doesn’t necessarily have to contain all the installation code for the extension. You can include commands to read in your code stored in other files instead of having make merge it all together. But then whenever you have a new file, you have to remember to include it in the master file and possibly add it to the Makefile as well to ensure it gets copied over. This also puts more files in the PGSHARED folder. More of a pet peeve, but I like keeping that folder less cluttered.

The other thing I do to organize my extension code is to put the updates into a separate folder. For a while, I hadn’t realized that the following line in the PGXN Makefile was for copying all update files (filename format extension--oldversion--newversion.sql) along with the core extension file to the PGSHARED folder.
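That line from the PGXN template is:

```make
DATA = $(wildcard sql/*--*.sql)
```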

So my extensions had just included instructions to manually copy updates to PGSHARED. But, when the 9.3 beta came out and I started testing things, I (along with many other extension developers) started getting this error during make install.

This pgsql-hackers mailing list thread contains a discussion of what is causing this error & possible patches for PGXS to fix it back to the way it used to be. From what I’m understanding, it’s because the wildcard in the DATA value ($(wildcard sql/*--*.sql)) matches the second file’s naming pattern (sql/$(EXTENSION)--$(EXTVERSION).sql) and it’s trying to copy the same file over. A change in 9.3 PGXS is now causing an error to be thrown instead of either just ignoring it or copying it anyway (not sure what it originally did before 9.3).

Because I have my updates in a separate folder from where the core extension file is created, I can have a different DATA value and avoid this problem altogether.
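A sketch of what that looks like in the Makefile (paths are illustrative):

```make
# Update files live in their own folder, so the wildcard can no longer
# match the freshly built sql/$(EXTENSION)--$(EXTVERSION).sql file
DATA = $(wildcard updates/*--*.sql) sql/$(EXTENSION)--$(EXTVERSION).sql
```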

Unfortunately since many people used the PGXN Makefile template, a lot of third-party extensions are running into this error after they upgrade to 9.3. Hopefully clarifying what is causing the error can help you fix your Makefile so your extensions can be installed on 9.3 without any problems.

Some of my extensions also contain scripts or alternate versions of functions for different purposes. For those I just make an “extras” folder and keep them there. And then for the scripts, I’ve added something like the following line to the Makefile to have them installed along with the extension code.
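PGXS provides a SCRIPTS variable for exactly this. A hypothetical example (the script names here are made up):

```make
# Anything listed in SCRIPTS is installed into the PostgreSQL bin directory
SCRIPTS = extras/partition_data.py extras/undo_partition.py
```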

For more Makefile options see the PostgreSQL documentation on the Extension Building Infrastructure.

I’m not sure this code organization method for my extensions is the best way to do things. But the extension system is still fairly new, so the standard methods for doing things are still in flux. The only real issue I run into now is having to make sure the code in the update files matches the code in the individual object files. I use Meld to compare my update file to the object files to make sure things match up. Not the most elegant way, but it’s worked for quite a few updates of my extensions over the last year or so.

If anyone has any suggestions or sees any issues with this organization method, I’d appreciate feedback.

 

Written by Keith

October 25th, 2013 at 10:43 am

Posted in PostgreSQL


Table Partitioning & Long Names


While working on a way to reapply indexes to existing partitions in my partition management extension, it dawned on me that PostgreSQL’s default identifier length limit of 63 bytes could cause some interesting issues for child table names when setting up partitioning. Since most partitioning relies on a suffix at the end of the child table names for uniqueness and to easily identify their contents, the automatic truncation of a long child table name could really cause some big problems. So I set about seeing if I could handle this with pg_partman.

After several weeks of rewriting a lot of the core functionality of object naming, I think I’ve handled this as best as I can. This applied not only to the child table names, but also to the trigger function and trigger name. Below is a simplified version of the function that’s included in the extension for general use in any situation.
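A simplified sketch of that function follows. The version actually shipped with pg_partman differs in details, and the parameter names here are illustrative:

```sql
CREATE OR REPLACE FUNCTION check_name_length(
      p_object_name text
    , p_object_schema text DEFAULT NULL
    , p_suffix text DEFAULT NULL
) RETURNS text
LANGUAGE plpgsql IMMUTABLE
AS $$
DECLARE
    v_suffix    text := COALESCE(p_suffix, '');
    v_new_name  text;
BEGIN
    IF char_length(p_object_name || v_suffix) > 63 THEN
        -- Truncate the base name, never the suffix, so the suffix that
        -- identifies the child table's contents always survives intact
        v_new_name := substring(p_object_name FROM 1 FOR 63 - char_length(v_suffix)) || v_suffix;
    ELSE
        v_new_name := p_object_name || v_suffix;
    END IF;

    -- The schema name doesn't count against the length limit, but
    -- return a schema-qualified name if one was given
    IF p_object_schema IS NOT NULL THEN
        v_new_name := p_object_schema || '.' || v_new_name;
    END IF;

    RETURN v_new_name;
END
$$;
```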

This takes whatever object name is given, checks the length with or without the provided suffix, and returns an object name with a valid length. If the object name would be too long with the suffix, the original name is truncated and the suffix is added. You can also just provide an object name and no suffix and it will check that as well. If no truncation is necessary, you get your original name back with the schema and/or suffix applied if given. The schema name of an object does not count against this length limit, but a parameter is provided so you can get a schema qualified name returned.

Here is an example when no truncation is needed:
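A hypothetical call (using check_name_length, the name this function goes by in pg_partman):

```sql
SELECT check_name_length('short_table', p_suffix := '_p2013_09');
-- name plus suffix fits within 63 characters, so it comes back unchanged
```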

Here’s an example where truncation would be needed:
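Again hypothetical, with a base name long enough to exceed the limit once the suffix is added:

```sql
SELECT check_name_length(
    'a_very_long_table_name_that_was_not_written_with_partition_suffixes_in_mind'
    , p_suffix := '_p2013_09');
-- the base name is truncated so that name + suffix totals 63 characters,
-- keeping the '_p2013_09' suffix intact at the end
```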

The only issue I’ve seen with this method is that if you have a lot of really long, similarly named tables, you can still run into naming conflicts, especially with serial-based partitioning, where the original table name slowly gets truncated more and more over time. But I think handling it like this is preferable to having the suffix unexpectedly truncated, which could cause conflicts within the same partition set.

It’s edge cases like this that drove me to try and make something to handle partitioning easier in PostgreSQL until it’s built in. I’ll be giving a talk at PG Open next week about the development I’ve been doing and how the new extensions system has made it much easier to contribute to PostgreSQL without knowing a single line of C!

 

Written by Keith

September 9th, 2013 at 12:18 pm

PostgreSQL Partition Manager


Update June 11, 2015: Version 2.0.0 of pg_partman has been released and changed some of what is discussed below, most significantly there are no longer distinct Static & Dynamic partition trigger functions. Please see my updated post, http://www.keithf4.com/pg-partition-manager-v2-0-0, as well as the CHANGELOG included with the extension.

I’ve finally gotten a lot of the planned features implemented for another extension, so it’s about time I wrote a blog post on it. One of the bigger features missing from PostgreSQL that some other database systems have is table partition management. That’s not to say there isn’t a way to do it, or that it isn’t being worked on, but at the current time it’s not the easiest thing to get going from scratch.

The most common method of partitioning that I’ve seen is either time or serial/id based. There are of course other partitioning methods used, such as alphabetically or via UUID. But those methods are usually a once-and-done setup and don’t require as much on-going maintenance. Time & serial based partitioning require new tables and/or functions to be created indefinitely and it can be quite a challenge to set it up and handle all edge cases consistently.  So that is what my PG Partition Manager extension does. It currently only handles inserts to the parent table, but I’ll be trying to get update support in the future. It automates several processes involved in partitioning:

  • Parent & child table creation
  • Trigger function creation and updating
  • Dropping unneeded partitions
  • Moving existing data into partitions
  • Undoing partitioning and moving data back into the parent.

I went with triggers instead of rules for handling inserts on the parent. Another blog post better explains the differences, so please read there if you’re more curious about them. Child tables are always pre-created before they are needed, to avoid the race condition that comes with triggers creating partitions on the fly.

Static vs Dynamic

The other major difference of methods for partitioning with triggers is whether you statically list each child table in the trigger function or use a dynamic function that determines which child table to insert to based on the data in the statement. The first, static method leads to better performance since the query plan in the function can be cached. However, the more child tables that are added to the conditional statement, the slower performance becomes, so this method is usually limited to partition sets where only recent data is inserted. The dynamic method uses an EXECUTE statement in the function to designate which child table to use on the fly, so it cannot be cached. But, it is capable of handling inserts to any child table in the set at any time. This extension supports both of these methods.
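Illustrative fragments of the two trigger-function styles (simplified, with hypothetical table and column names):

```sql
-- Static: each covered child gets its own branch, so the plans can be cached
IF NEW.col1 >= '2013-07-08' AND NEW.col1 < '2013-07-09' THEN
    INSERT INTO partman_test.time_table_p2013_07_08 VALUES (NEW.*);
ELSIF NEW.col1 >= '2013-07-07' AND NEW.col1 < '2013-07-08' THEN
    INSERT INTO partman_test.time_table_p2013_07_07 VALUES (NEW.*);
-- ... one branch per covered child ...
END IF;

-- Dynamic: the child name is built from the row's data, so any child
-- in the set can receive inserts, but the plan cannot be cached
v_partition_name := 'partman_test.time_table_p' || to_char(NEW.col1, 'YYYY_MM_DD');
EXECUTE 'INSERT INTO ' || v_partition_name || ' VALUES ($1.*)' USING NEW;
```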

Time & Serial Partitioning

For time-based partitioning, the following time periods are supported: yearly, quarterly, monthly, weekly, daily, hourly, half-hourly, quarter-hourly. If anyone needs decade, century, or a smaller regular time interval I can probably get that added fairly easily. Just put in a request on github. Outside of that, dynamic or non-regular time periods are a bit more challenging, so I don’t know if I’ll ever be doing that. The tables are pre-created using a cronjob to call a maintenance function. The number of child tables to pre-create is configurable. That premake number also designates how many child tables the static partitioning method will handle, both forward and backward in time. So if you partition daily with a premake value of 4 (the default) on July 4th, it will handle inserting data from June 30th to July 8th.
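Setting up a set like that is a single function call. This sketch follows the 1.x-era signature; check the docs for your version's exact parameters:

```sql
-- Daily, static partitioning on col1, pre-making 4 future children (the default)
SELECT partman.create_parent('partman_test.time_table', 'col1', 'time-static', 'daily');
```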

Serial/id-based partitioning is done using a fixed integer interval that each child table will contain, based on an integer column’s value (typically the primary key). Pre-creation is not managed via cron since it’s not predictable how quickly or slowly a partition will fill up. At the cost of a slight performance hit vs the time-based method, the current partition column value is checked at each insert, and when it reaches 50% of the current partition’s maximum, the next partition is made. Partitions are still pre-made in the same manner, though, so the next partition being created is typically several intervals ahead to avoid the race condition of the next child table not existing for an insert.

For both time & serial based partitioning, if for some reason the required child table doesn’t exist, the row is inserted into the parent. A check_parent() function is included so you can monitor all partition sets managed by the extension for data getting put into the parent table.

To make configuring indexes, constraints, defaults & permissions easier, all of those table properties are inherited from the parent. Just set what you need on the parent, and all future partitions will obtain those properties. Due to performance issues, privileges are not retroactively added to all prior partitions if you change them on an existing partition set. But there is a function included to do that for you when your systems can handle it.

Partitioning Existing Data

Tables with existing data can also be partitioned with minimal impact to a production system. When you create a partition set, a brief exclusive lock is taken on the given parent table to add the required trigger and child tables. Any new data inserted from then on is properly partitioned. All existing data is left in the parent table, and an included python script can move that data in batches, creating all the child tables that are needed.

To give you an example of how well this works, I successfully partitioned a high-traffic, 74 million row table with zero noticeable interruption. It took about a 2-3 second lock on the parent table, which caused inserts to wait, but they all successfully completed once the new partitions & trigger were in place. I then partitioned the existing data using a 1 hour interval to avoid large, long insert batches, with a 5 second wait between batches to ease the load on the streaming slave a bit. This averaged 50,000-200,000 rows per batch, and the slave never fell more than 100 seconds behind, quickly catching up once it was done. Here is the script command that did that:
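The exact command isn't shown, but with pg_partman's partition_data.py script it would look something like the sketch below (flag names are from the script's help output; verify against your version, and the table name here is hypothetical):

```shell
# Move existing rows into child tables in 1-hour batches,
# pausing 5 seconds between batches
python partition_data.py -p tracking.hits -t time -i "1 hour" -w 5
```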

There is also an included python script that can undo a partition set, moving all data from child tables back into the parent. This script can actually work on any inheritance table setup in postgres, not just ones managed by the extension.

Retention

One of the biggest advantages of partitioning is being able to remove unneeded data without expensive delete commands and/or table bloat. Using the extension’s retention configuration options, the same maintenance function that manages pre-making time-based partitions can also automatically drop partitions you no longer need. By default, tables are just uninherited from the parent and not actually dropped, but you can also have it drop just the indexes from the old child table to recover space, move the table to a different schema or actually drop the table. Moving the table to a new schema also allows the use of another included python script that can use pg_dump to extract the tables in that schema out along with a SHA-512 hash before actually dropping them from the database. This allows you to keep old data around without having it take up space in the database itself.
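As a sketch, retention is configured per partition set in the part_config table (column names here assume a recent pg_partman; check your version's docs):

```sql
-- Keep 30 days of partitions; older ones are handled during the same
-- maintenance run that pre-creates new partitions
UPDATE partman.part_config
SET retention = '30 days'
    , retention_keep_table = false   -- actually drop old children instead of just uninheriting
WHERE parent_table = 'partman_test.time_table';
```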

Monitoring

To provide monitoring of any issues with partition creation and an audit trail, my pg_jobmon extension works with pg_partman as well. It is optional and will automatically be used if it is available. The premake value was set to 4 by default because pg_jobmon’s default is to alert when a function fails to run 3 times in a row. That way it just works by default to raise an alert before the last partition is used up.

Conclusion

I’ve gotten a lot of positive feedback from people when I’ve given Lightning Talks at recent PostgreSQL conferences about this extension. It’s also made common partitioning solutions for clients a lot easier to implement. Not having partitioning built into PostgreSQL still has its shortcomings, one of the biggest being primary/unique keys not being enforced across the entire partition set unless that column happens to be the one the partition constraints are based on. Or lack of constraint exclusion support outside of the partition column. But, along with update support, I’m working on some other options that won’t provide a perfect solution to these issues, but will hopefully make trying to manage them easier.

Written by Keith

July 8th, 2013 at 11:48 am

Mimeo – Repulling Incremental Replication Data


With other per-table replication methods, if the source and destination become out of sync, you typically have to repull the entire table. One of the nice things about using an incremental replication method (based on an incrementing time or serial column; see previous blog post) is that it can make repulling smaller batches of that data much easier.

One of our clients had been having some issues with their website hits tracking table. Some of the hits had been missed via the normal tracking method and had to be re-obtained via other means and re-inserted into the hits tracking table on production. This table is also replicated to a data warehouse system for reporting. Since this table uses incremental replication based on time, the old data that was reinserted to the source with the old timestamp values would never make it to the reporting database on its own.

All of mimeo’s refresh functions have a p_repull boolean parameter that can be set to true and have it purge the destination table and repull all the data from the source. But the incremental refresh functions have two additional parameters: p_repull_start & p_repull_end. Right now I’m only supporting time-based incremental replication, so both of these values are timestamps. They let you set a starting and/or ending value for a block of data that you’d like purged on the destination and repulled from the source. If one or the other is left off, it just sets a boundary for the start or end and gets everything before or after the timestamps set. For very large tables (which most inserter/updater tables seem to be from my experience working on this tool) this can be a gigantic time-saver for getting the source and destination tables back in sync. If you do use this, just keep in mind that these start and end times are exclusive (< & >, not <= & >=).

Here is an example of it in use. I also set the p_debug option so I can follow, in real-time, the repull process. This information is also available via pg_jobmon, with the number of rows done kept up to date in the details log table as it runs. The data missing was between April 8th and 11th, so I set the start and end days a few minutes before and after each day just to make sure I got everything.
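A sketch of the call (the destination table name is hypothetical; the parameter names are the ones described above, and the window is padded a few minutes on each side because the boundaries are exclusive):

```sql
SELECT mimeo.refresh_inserter('dblink_test.hits'
    , p_repull := true
    , p_repull_start := '2013-04-07 23:55:00'
    , p_repull_end := '2013-04-12 00:05:00'
    , p_debug := true);
```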

 

Written by Keith

April 16th, 2013 at 2:34 pm