Keith's Ramblings…

WARNING: If accidentally read, induce vomiting

Checking for PostgreSQL Bloat


One of the on-going issues with PostgreSQL that every administrator must deal with at some point is table and/or index bloat. The MVCC architecture of PostgreSQL lends itself to requiring some extra monitoring & maintenance, especially in large enterprise systems. I’ll save the description of why bloat happens until further down in this post (for those that are curious) and get to the point of this article first.

So far the best check for bloat I’ve come across is the one contained in the check_postgres suite from Bucardo, and in places where we can get check_postgres set up and used, it works great. But sometimes we just want to check the bloat on a system and also have a little more fine-grained control over the output. So recently I pulled the rather complex bloat check query out of check_postgres and adapted it into a command-line Python script: pg_bloat_check.py

When I first started working on this script, just running the query by itself on some of our client systems, I realized it can be quite an expensive query to run on larger systems. The query also returns table and index bloat together, and other parts of the script have to split them apart to make the distinction. To provide this data more efficiently and keep table & index bloat separate, my script first creates a view to work from. On PostgreSQL 9.3+, this can be created as a materialized view, so running the script to get bloat data back is fast and causes no contention in the system.
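
To give a rough idea of the structure (this is only a minimal sketch using a simple dead-tuple summary from pg_stat_user_tables, not the adapted check_postgres estimation query the script actually builds its view from, and the object names here are illustrative):

    -- Minimal sketch only; the real view is built from the much more involved
    -- check_postgres bloat estimation query. Requires 9.3+ for materialized views.
    CREATE MATERIALIZED VIEW bloat_quick_check AS
    SELECT schemaname,
           relname,
           n_live_tup,
           n_dead_tup,
           round(100.0 * n_dead_tup / greatest(n_live_tup + n_dead_tup, 1), 2) AS pct_dead
    FROM pg_stat_user_tables;

    -- Refresh on whatever schedule suits the system, then report from it cheaply.
    REFRESH MATERIALIZED VIEW bloat_quick_check;
    SELECT * FROM bloat_quick_check ORDER BY pct_dead DESC LIMIT 20;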

The other thing I realized is that bloat percentage alone is a poor indicator of actual system health. Very small tables may always have higher than average bloat, or there may always be 1 or 2 pages considered waste, and in reality that has next to zero impact on database performance. Constantly “debloating” them wastes more time than the space they use. So I added additional filters that also look at the number of pages in an object and the actual wasted space of that object. This allows the final output of the bloat report to give a more accurate picture of where there may actually be problems that need to be looked into.

Another option is a filter for ignoring individual tables or indexes. If you understand why bloat happens, you will come across cases where a table is stuck at a certain bloat point at all times, no matter how many times you VACUUM FULL it or run pg_repack on it (those two things do remove it, but it quickly comes back). This happens with tables that have a specific level of churn in the rows being inserted, updated & deleted. The number of rows being updated/deleted stays balanced against the number of rows being inserted, along with the autovacuum schedule that marks space for reuse. Removing the bloat from tables like this can actually decrease performance, because instead of re-using the space that VACUUM marks as available, Postgres first has to allocate more pages for that object from disk before the data can be added. So bloat is not always a bad thing, and the nature of MVCC can lead to improved write performance on some tables.

The “simple” output format is suitable for just using on the command-line or piping out to an email program like mailx to get a status report sent to you.

System tables like those you see at the bottom are extremely small and you can see the space they “waste” is negligible. I left the simple format like this so it just provides a quick glance for monitoring purposes. I also provide another output method in the form of a python dictionary that has some additional details. It’s also in a format that can be easily fed to other python scripts or anything else that needs a predictable, structured format.

You can see this provides some more details on how many pages are actually in the given objects vs how many are considered wasted space. All the system tables have only 1 page of wasted space, so the given percentage varies based on how many pages they actually have. You have several options for filtering out those last, unnecessary entries. The first would be to filter out the entire pg_catalog schema, but I don’t recommend that because catalog bloat can actually be a problem on databases that have extensive DDL changes over time. The other option is to use --min_pages to exclude any table or index that has less than a certain number of pages. I think the best option here would actually be --min_wasted_pages set to something like 10.

You can use all these filters in combination to also avoid low-percentage bloat like that top entry. Be careful doing this, though: if you have an extremely large table and are trying to figure out where all the potentially wasted space is, you could accidentally filter it out (10% is a low percentage, but 10% of a 500GB table is 50GB).

 

Why Bloat Happens

For those of you newer to PostgreSQL administration, for whom this may be the first time hearing about bloat, I figured I’d take the time to explain why this scenario exists and why tools like this are necessary (until they’re hopefully built into the database itself someday). It’s something most people don’t understand unless someone first explains it to them, or until they run into the headaches it causes when it’s not monitored and learn about it the hard way.

MVCC (multi-version concurrency control) is how Postgres has chosen to deal with multiple transactions/sessions hitting the same rows at (nearly) the same time. The documentation, along with Wikipedia, provides excellent and extensive explanations of how it all works, so I refer you there for the details. Bloat is a result of one particular part of MVCC, concentrated around the handling of updates and deletes.

Whenever you delete a row, it’s not actually deleted; it is only marked as unavailable to all future transactions taking place after the delete occurs. The same happens with an update: the old version of a row is kept active until all currently running transactions have finished, then it is marked as unavailable. I emphasize the word unavailable because the row still exists on disk, it’s just not visible any longer. The VACUUM process in Postgres then comes along and marks any unavailable rows as space that is now available for future inserts or updates. The autovacuum process runs VACUUM automatically after a configurable number of writes to a table (see the autovacuum configuration options), so it’s not something you typically have to worry about doing manually very often (at least with more modern versions of Postgres).

People often assume that VACUUM is the process that should return the disk space to the file system. It does do this, but only in very specific cases. That used space is contained in page files that make up the tables and indexes (called objects from now on) in the Postgres database system. Page files all have the same size, and differently sized objects simply have as many page files as they need. If VACUUM happens to mark every row in a page file as unavailable AND that page also happens to be the final page for the entire object, THEN the disk space is returned to the file system. If there is a single available row, or the page file is any page other than the last one, the disk space is never returned by a normal VACUUM. This is bloat. Hopefully this explanation of what bloat actually is shows how it can sometimes be advantageous for certain usage patterns of tables, and why I’ve included the option to ignore objects in the report.

If you give the VACUUM command the special flag FULL, then all of that reusable space is returned to the file system. But VACUUM FULL does this by completely rewriting the entire table (and all its indexes) to new pages, and it takes an exclusive lock on the table the entire time it runs (CLUSTER does the same thing, but what it does is outside the scope of this post). For large tables in frequent use, this is problematic. pg_repack has been the most common tool we’ve used to get around that. It recreates the table in the background, tracking changes to it, and then takes a brief lock to swap the old bloated table with the new one.

Bloat getting out of hand is a problem for more than just the disk space it uses up. Every time a query is run against a table, the visibility flags on individual rows and index entries are checked to see if they are actually available to that transaction. On large tables (or small tables with a lot of bloat), the time spent checking those flags builds up. This is especially noticeable with indexes, where you expect an index scan to improve your query performance and it seems to be making no difference, or is actually worse than a sequential scan of the whole table. This is why index bloat is checked independently of table bloat: a table could have little to no bloat, but one or more of its indexes could be badly bloated. Index bloat (as long as it’s not a primary key) is easier to solve because you can either just reindex that one index, or you can concurrently create a new index on the same column and then drop the old one when it’s done.
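
For example (all object names here are illustrative), a bloated non-primary-key index can be swapped out with minimal locking like this:

    -- Build a replacement index without blocking writes, then swap it in.
    -- Names are illustrative; match the column list of the old index.
    CREATE INDEX CONCURRENTLY my_table_col1_idx_new ON my_table (col1);
    DROP INDEX CONCURRENTLY my_table_col1_idx;
    ALTER INDEX my_table_col1_idx_new RENAME TO my_table_col1_idx;

A plain REINDEX of the old index works too, but it blocks writes to the table for the duration, so the concurrent build-and-swap is usually the friendlier option on busy systems.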

In all cases when you run VACUUM, it’s a good idea to run ANALYZE as well, either at the same time in one command or as two separate commands. This updates the internal statistics that Postgres uses when creating query plans. The number of live and dead rows in a table/index is a part of how Postgres decides to plan and run your queries. It’s a much smaller part of the plan than other statistics, but every little bit can help.
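
Both forms look like this (the table name is just an example):

    -- As a single command:
    VACUUM ANALYZE my_table;

    -- Or as two separate commands:
    VACUUM my_table;
    ANALYZE my_table;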

I hope this explanation of what bloat is, and how this tool can help with your database administration, has been helpful.

Written by Keith

August 25th, 2014 at 3:16 pm

Posted in PostgreSQL


Table Partitioning and Foreign Keys


Table partitioning & foreign keys don’t get along very well in databases, and PostgreSQL’s lack of built-in support shows it very clearly in the workarounds that are necessary to avoid the issues. The latest release of pg_partman deals with the lesser of the two shortcomings that must be dealt with: child tables in a partition set do not automatically inherit foreign keys created on the parent table. I’ll be using my other extension, pg_jobmon, as the reference example here since it works well to illustrate both the issues and possible solutions. You can see here that the job_detail table, which contains the individual steps of a logged job, references the job_log table for the main job_id values.

With version <= 1.7.0 of pg_partman, turning this table into a partition set illustrates the issue.
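
The call was along these lines (a hedged example; check the create_parent() documentation for the exact argument list in your version, the interval of 10000 ids per child is arbitrary, and pg_jobmon is assumed to be installed in a schema named jobmon):

    -- Hedged example; see the pg_partman docs for the exact create_parent() signature.
    SELECT partman.create_parent('jobmon.job_detail', 'job_id', 'id-static', '10000', p_jobmon := false);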

The table is now partitioned, but if you look at any of the children, none of them have the FK back to the main job_log table.

As a side note, notice I set the p_jobmon parameter to false in create_parent(). By default, pg_partman uses pg_jobmon when it is installed to log everything it does and provide monitoring that your partitioning is working. Since this would mean pg_jobmon trying to log the partitioning steps of its own table, it puts itself into a permanent lock-wait state because it’s trying to write to the very table it is partitioning. Turning pg_jobmon off for the initial creation avoids this compatibility issue between the two extensions. It can be turned back on for monitoring of future child table creation by modifying the jobmon column in pg_partman’s part_config table. Creation of partitions ahead of the current one does not interfere since a lock on the parent table is no longer required.

Back to the foreign key issue… Let’s undo the partitioning here, upgrade pg_partman, and try again.
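
Roughly like this (hedged: the undo_partition_id() argument list and the version string are illustrative, so check the docs for your installation):

    -- Move the data back to the parent and drop the now-empty children,
    -- then upgrade the extension and partition the table again.
    SELECT partman.undo_partition_id('jobmon.job_detail', p_keep_table := false);
    ALTER EXTENSION pg_partman UPDATE TO '1.7.1';
    SELECT partman.create_parent('jobmon.job_detail', 'job_id', 'id-static', '10000', p_jobmon := false);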

Now our child table has the parent’s foreign key! The apply_foreign_keys() plpgsql function and the reapply_foreign_keys.py script that are part of version 1.7.1 can actually be used on any table inheritance set, not just the ones managed by pg_partman, so some may find them useful elsewhere as well. So, what happens if we now partition the reference table, job_log, as well?

It partitions the table without any errors and you can see all the child table foreign keys referencing the parent. But notice the job_log_p0 child table? It has no references from any of the children. And this is the bigger issue that pg_partman does not solve, and most likely never will…

Foreign key reference checks to the parent table in an inheritance set do not propagate to the children

Since the parent table in an inheritance set is typically either empty or only contains a fraction of the total data, the table referencing the partition set will either fail on every insert or fail when it hits a value that exists only in a child table. The SQL statements below illustrate the issue.
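
(The following is a simplified, self-contained version, with plain tables standing in for the actual pg_jobmon schema.)

    -- Stand-in tables, not the real pg_jobmon schema.
    CREATE TABLE job_log (job_id int PRIMARY KEY);
    CREATE TABLE job_log_p0 () INHERITS (job_log);
    CREATE TABLE job_detail (job_id int REFERENCES job_log (job_id), detail text);

    INSERT INTO job_log_p0 VALUES (2);   -- the row lives in the child table
    SELECT job_id FROM job_log;          -- returns 2 via inheritance

    INSERT INTO job_detail VALUES (2, 'step one');
    -- ERROR:  insert or update on table "job_detail" violates foreign key constraint ...
    -- The FK check looks only at job_log itself, never at its children.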

You can clearly see the job_log table has the job_id value “2”, but trying to insert that value into the table that uses it as a reference fails. This is because that value lives in job_log_p0, not job_log, and the FK reference check does not propagate to the child tables.

I’m not sure of all the reasons why PostgreSQL doesn’t allow FK checks to propagate down inheritance trees, but I do know one of the consequences of doing so could be some heavy performance hits for the source table if the inheritance set is very large. Every insert would have to scan down all tables in the inheritance tree, and even with indexes this could be very expensive.

There is a way to write a trigger and “fake” the foreign key if this is needed. (Update: As someone commented below, there are still race conditions here that are hard to deal with) I looked into this because I do want to be able to partition the pg_jobmon tables and keep referential integrity. To see how this works, I’m starting with a clean installation of pg_jobmon (no partitions). First the original foreign key on job_detail has to be removed, then a trigger is created in its place.
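
A hedged sketch of that replacement is below. The constraint, function, and trigger names are illustrative (the default FK constraint name is assumed), and pg_jobmon’s tables are assumed to live in a schema named jobmon:

    -- Remove the real FK, then enforce it with a BEFORE trigger instead.
    ALTER TABLE jobmon.job_detail DROP CONSTRAINT job_detail_job_id_fkey;

    CREATE OR REPLACE FUNCTION jobmon.job_detail_fk_check() RETURNS trigger
        LANGUAGE plpgsql AS $$
    BEGIN
        -- A plain select on the partitioned reference table sees all child rows.
        PERFORM 1 FROM jobmon.job_log WHERE job_id = NEW.job_id;
        IF NOT FOUND THEN
            RAISE EXCEPTION 'Invalid job_id: %. No matching row exists in job_log', NEW.job_id;
        END IF;
        RETURN NEW;
    END
    $$;

    -- "aa_" prefix so it fires ahead of the partitioning trigger (alphabetical order).
    CREATE TRIGGER aa_job_detail_fk_check
        BEFORE INSERT OR UPDATE OF job_id ON jobmon.job_detail
        FOR EACH ROW EXECUTE PROCEDURE jobmon.job_detail_fk_check();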

This MUST be a BEFORE trigger, and I gave the trigger name a prefix of “aa_” because PostgreSQL fires triggers of the same type in alphabetical order and I want to ensure it goes first as best I can. Now we partition job_detail & job_log the same as before.

Note that triggers are not inherited by child tables, which is why it must be a BEFORE trigger on the job_detail parent. The insert does not actually happen on the job_detail parent table, so the event must be caught before any insert is actually done. Also, this isn’t quite as flexible as a real foreign key since there are no CASCADE options to handle data being removed on the parent, and it causes much heavier locks than a real foreign key. Let’s see what happens if we try the same inserts that failed above.

No errors! And what happens if we try and insert invalid data to the child table?

The insert is rejected, just as a real foreign key violation would be. Since the trigger function is doing a normal select on the parent table of the job_log partition set, it sees data across all the child partitions. AND, since job_id is the partition column of job_log, the trigger function can actually take advantage of constraint exclusion and will only touch the single partition that value could be in. So this works very well in this case, even if the partition set grows extremely large. Now, if you create an FK trigger like this on any other column that doesn’t have constraints, you will begin to notice performance issues as the reference table grows in size. If your tables contain static, unchanging data, pg_partman has some additional options that can help here as well (see my previous post about constraint exclusion).

The other issue with this is exclusive to pg_jobmon being an extension: the lack of a foreign key and presence of a trigger differs from the default extension code. There is the potential that a future extension update could either remove the trigger or re-add the foreign key. There’s currently no way that I’m aware of to give extension installation options for different code branches and keep things consistent. In the case of pg_jobmon, the extension is mostly feature complete and I don’t foresee any updates breaking the above fix, but it is something to be aware of if you have to change the default code in any extension.

This is a complicated issue and one that many people don’t realize when trying to plan out table partitioning for more complex schemas. Hopefully I’ve helped clarify things and shown why partitioning is such a tricky issue to get right.

Written by Keith

May 28th, 2014 at 10:27 am

Partitioning With Custom Time Intervals


This has been something I’ve had in the works since I started on this extension, but it took a while to settle on the best method to use. The pre-determined time ranges for PG Partition Manager were done because they were easy to implement and, as it turns out, provide a very noticeable performance improvement when used with the static partitioning method. Note that custom time partitioning, while the most flexible, is the least performant of the 3 methods the extension now supports.

The biggest roadblock to implementing this was trying to find an answer to the following question:

If an interval can be any value, how do you determine the partition that the currently inserted timestamp goes into?

Seems a simple enough question, but when you try to implement it, you realize the impact solving it can have on the performance of every insert to the table. My solution in the end (which I was trying to avoid) was to create a lookup table. Thankfully, the range types added in PostgreSQL 9.2 made this a whole lot easier than it could’ve been, but it does mean this feature is only supported on the latest versions. If anyone has an efficient solution as an algorithm that can be implemented solely in the partitioning function, I’d be extremely grateful for a contribution.
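
The general shape of the idea is something like this (a sketch only; the table and column names don’t necessarily match pg_partman’s internal configuration table):

    -- Map each child table to the tstzrange it covers (9.2+ range type).
    CREATE TABLE custom_time_partitions (
        parent_table    text NOT NULL,
        child_table     text NOT NULL,
        partition_range tstzrange NOT NULL,
        PRIMARY KEY (parent_table, child_table)
    );

    -- The partitioning trigger can then find the target child for an incoming
    -- row with a simple containment test on the range.
    SELECT child_table
    FROM custom_time_partitions
    WHERE parent_table = 'public.my_parent'
      AND partition_range @> now();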

Until then, pg_partman 1.6.0 hopefully now supports any interval value from 1 second up through the minimum and maximum values the timestamp type supports. Smaller than that is potentially possible, but even a partitioning interval of 1 second seemed ridiculous enough to me, so I doubt I’ll pursue it. And I say hopefully because it’s impossible for me to make unit tests for every possible interval there is. I test for intervals less than 1 minute (due to the new child table naming pattern) and greater than 100 years (due to the seemingly non-standard century partition boundaries (2001, 2101, 2201, etc.)), so hopefully that will be enough for now. I would greatly appreciate feedback.

This version also brings the ability to manually designate your starting partition at creation. Be aware that if you’re using static partitioning, the trigger function on the parent may not cover all the partitions this could create. The partitioning functions (partition_data_time() & partition_data_id()) will create these tables automatically when you migrate data out of the parent. But if you’re starting from scratch, I thought this might make things easier: if you’re loading data externally, you can have the tables already set up so data can be loaded directly into them. Below is an example of using both the new “time-custom” type and setting a custom start partition (by default it would’ve only created 4 partitions behind CURRENT_DATE).
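
(Hedged example: the parameter name for the starting partition is a guess at the 1.6.0 interface, and the table, column, and date are illustrative, so check the create_parent() docs for the exact call.)

    -- A 9-week interval with an explicit starting partition well behind CURRENT_DATE.
    SELECT partman.create_parent('public.my_table', 'created_at', 'time-custom', '9 weeks',
                                 p_start_partition := '2013-06-01');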

I’ve gotten a lot of positive responses (and bug fixes!) from the community so far and am glad that so many people are finding this a useful tool. It’s nice to see that the effort has been worth it.

Written by Keith

February 13th, 2014 at 11:34 am

Posted in PostgreSQL


Version Specific PostgreSQL Extension Features


When I say version specific here, I’m speaking of the PostgreSQL version (9.1, 9.2, etc.), not the extension version. An upcoming update to PG Partition Manager takes advantage of the range data type, which is a 9.2 feature. But I also wanted to keep backward compatibility with 9.1 by excluding this new feature.

The more important thing is to get the extension installed in the first place, so the Makefile was where I started.

Keep in mind, I know very little about Makefiles in the first place, so my solution just uses what I already knew how to do in the original Makefile. There may be a better way, and if there is, I’d love feedback. I just expanded on the already existing lines that check for a minimal version of 9.1 to even allow the extension to install, and I organized my code folders to have one specifically for 9.2+ features (sql/92/tables/). After that, it’s just another check to see if the PG92 variable is true and, if so, grab the code from that additional folder.

Now the next challenge is checking the version when installing an update to an already existing installation. First, we need a means of checking if the current version is new enough. I made up the below function for inclusion in my extension since I’ll actually be doing version checking elsewhere as well.
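
A minimal sketch of that kind of check is below; the function that actually ships with the extension may differ, and this version simply leans on server_version_num (e.g. 90204 for 9.2.4):

    -- Sketch only; the extension's real version-check function may differ.
    CREATE OR REPLACE FUNCTION check_pg_version(p_min_version int) RETURNS boolean
        LANGUAGE sql STABLE
        AS $$
            SELECT current_setting('server_version_num')::int >= $1;
        $$;

    -- Example: true on 9.2 or newer
    SELECT check_pg_version(90200);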

Now we can just take advantage of the DO block syntax to allow script code in a plain SQL file.
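
For example (a sketch reusing the version-check sketch above, with the 9.2-only work reduced to a placeholder):

    -- In a plain SQL update script, run the 9.2+ portion only when the server
    -- is new enough; older servers simply skip it.
    DO $$
    BEGIN
        IF check_pg_version(90200) THEN
            -- 9.2+ only objects (e.g. tables using range types) would be created here
            RAISE NOTICE 'Applying 9.2+ features';
        END IF;
    END
    $$;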

The full implementation of this will be available in the 1.6.0 update for pg_partman that I’ll be releasing in the near future. It contains a feature I’ve wanted to get done for a while now: custom time intervals!

Written by Keith

January 22nd, 2014 at 10:44 am

Posted in PostgreSQL


Managing Constraint Exclusion in Table Partitioning


One of the biggest advantages of table partitioning in databases is taking advantage of a feature called constraint exclusion. The PostgreSQL docs explain this best:

With constraint exclusion enabled, the planner will examine the constraints of each partition and try to prove that the partition need not be scanned because it could not contain any rows meeting the query’s WHERE clause. When the planner can prove this, it excludes the partition from the query plan.

Using the following table partitioned on col1, some examples of how this works follow.
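
A set along these lines (names and boundaries illustrative) is the kind being discussed:

    -- An inheritance-based partition set on col1; the CHECK constraints are what
    -- constraint exclusion uses to rule children out.
    CREATE TABLE id_test_table (col1 int NOT NULL, col2 text, col3 timestamptz DEFAULT now());
    CREATE TABLE id_test_table_p0  (CHECK (col1 >= 0  AND col1 < 10)) INHERITS (id_test_table);
    CREATE TABLE id_test_table_p10 (CHECK (col1 >= 10 AND col1 < 20)) INHERITS (id_test_table);
    CREATE TABLE id_test_table_p20 (CHECK (col1 >= 20 AND col1 < 30)) INHERITS (id_test_table);
    CREATE INDEX ON id_test_table_p0  (col1);
    CREATE INDEX ON id_test_table_p10 (col1);
    CREATE INDEX ON id_test_table_p20 (col1);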

With constraint exclusion turned off, doing a simple SELECT * for a smaller subset of the partition set still scans every table (it is on by default with the setting “partition”, so it has to be disabled to see this behavior).

Even though this is doing index scans, running this against a partition set with much larger amounts of data would be extremely expensive and cause significant delays. With constraint exclusion on, the plan is greatly simplified and only the tables that actually need to be queried are scanned.
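
Against a set like the sketch above, the two behaviors can be compared like this (plan output omitted):

    -- Force the planner to consider every child table.
    SET constraint_exclusion = off;
    EXPLAIN SELECT * FROM id_test_table WHERE col1 BETWEEN 5 AND 15;

    -- Back to the default: children whose CHECK constraints can't possibly match
    -- (here id_test_table_p20) drop out of the plan entirely.
    SET constraint_exclusion = partition;
    EXPLAIN SELECT * FROM id_test_table WHERE col1 BETWEEN 5 AND 15;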

However, when you create a partitioned set, usually the only column that has a constraint on it is the column controlling the partitioning. As soon as you do a query with a WHERE condition on one of the other columns, you lose all the advantages you thought you gained. In this case it’s even worse because it causes full sequential scans across the whole set. Indexes could help, but again, not if your partition set is very large.

You could try and add constraints before you insert the data, but often that is either hard or impossible. Another alternative is to go back and create constraints on old partitions that are no longer being edited, based on the data they contain. That is the solution I built into pg_partman with the 1.5.0 release.
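
Configuring it for an existing set is just an update to pg_partman’s configuration table, something along these lines (a hedged example; the parent table and constraint column are illustrative):

    -- Add col3 as an additional constraint column for an existing partition set.
    UPDATE partman.part_config
    SET constraint_cols = ARRAY['col3']
    WHERE parent_table = 'public.id_test_table';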

Once applied, each old child table gets a second constraint, created from the data it currently contains, alongside the original partitioning constraint. The configuration column constraint_cols is an ARRAY that can contain as many columns of the given table as you’d like constraints set for. The above is how you can go back and add additional constraints to an already existing partition set. The python script will apply the constraint to all relevant child partitions. It determines which children to apply the constraints to using the premake pg_partman configuration value for that partition set (the default is 4, so it will not apply constraints to child tables newer than the current one and the previous 4; see the pg_partman docs for more details on what the premake value is used for).

So, while it won’t apply constraints to all the old children, it can at least allow constraint exclusion to potentially exclude a huge majority of them. In this case it was able to exclude partitions _p30, _p40 & _p50. As more partitions are added, more constraints get added to older partitions, and at most the tables that would need to be scanned would be the current one, the last 4, the empty future ones, and whichever partitions contain the relevant data. On a partition set with thousands of partitions and/or partitions with millions of rows each, that has significant performance benefits.

The big caveat with this is that it prevents edits on any of these older tables if the new value would fall outside the constraint boundaries. If there comes a time when you must edit this data, some included functions can help. The drop_constraints() and apply_constraints() functions can drop and then reapply all constraints managed by pg_partman on a given child table (see the docs for their parameters). They’ve been designed to cleanly handle cases where constraints created by pg_partman already exist or where the columns contain only NULL values. These functions only work on a single child table at a time. If you need to drop/apply across the entire partition set, the python script used above can make that much easier.

You can set this up during partition creation with a parameter to the create_parent() function. Or it can be added at any time later like I did above. Once the additional constraint columns are configured, the normal partition maintenance procedures that pg_partman runs will take care of creating the constraints for you.

This is a pretty tricky situation to handle, and it’s one of the things that makes partitioning a big challenge to get right. It’s also something most people don’t realize until they’ve already gotten their partitioning set up and only encounter the issue when they start noticing the performance impact of scanning their now very large partition set. Hopefully this feature will allow more people to take advantage of partitioning when long-term storage of data is required within PostgreSQL. If anyone has any feedback or suggestions for making this feature better, I’d greatly appreciate it.

Written by Keith

January 8th, 2014 at 11:21 am