Keith's Ramblings…

"Fixing a complex system sounds pretty easy when you're an idiot."

PG Partition Manager v2.0.0 – Background Worker, Better Triggers & Extension Versioning Woes

with 11 comments

PG Partition Manager has been the most popular project I’ve ever done a significant amount of work on and I really appreciate everyone’s feedback for the roughly 2 years it’s been out there. I’ve got plenty more ideas for development and features and look forward to being able to move forward on them with this new major version released.

PostgreSQL 9.3 introduced the ability for user created, programmable background workers (BGW). 9.4 then introduced the ability to dynamically start & stop these with an already running cluster. The first thing that popped into my mind when I heard about this was hopefully having some sort of built-in scheduling system. There still hasn’t been a generalized version of anything like this, so in the mean time I studied the worker_spi contrib module. This is a very simple BGW example with a basic scheduler that runs a process with a configurable interval. This is basically all pg_partman needs for partition maintenance, and what required an external scheduler like cron before.

The above are the new postgresql.conf options that pg_partman v2.0.0 can use to control the background worker it now comes with. These options can be changed at any time with a simple reload, but to start the BGW requires a restart of the database. This is my first venture into writing C, so it’s still very simplistic, and essentially does nothing different than an external scheduler would do. But now you no longer need that external scheduler if you don’t require calling run_maintenance() on any individual partition sets!

The BGW is completely optional and the Makefile has an option to not compile it if you’d just like to stick with the plain, plpgsql version of the extension.

Another major change with v2 is the contents of the partitioning trigger function. If you look back at my first post about pg_partman, you’ll see me mention there are two possible methods to writing the trigger: static & dynamic. You can read that post for more details. All you need to know now is that there are no longer distinct static & dynamic partitioning types. The different trigger methods have been combined into one trigger (thanks to a suggestion from Andrew Dunstan). An example below from the daily partitioning unit test:

You can see that the trigger function starts out like the static method, explicitly giving the SQL statements for inserting into each child table that it handles. But then the final condition falls back to using the dynamic method to first see if the relevant child table exists and then inserting it there (otherwise it goes to the parent). So this hybrid trigger now allows the performance benefit of the static method and the flexibility of the dynamic method, only losing performance when you try and insert data outside of the premake range. The time-custom type still exists and still uses the lookup-table method from v1, so nothing has changed there. This new hybrid function is only available for the standard time intervals and serial partitioning.

For those updating from the v1.x.x version of pg_partman, this new trigger function will be put in place automatically the next time a new child table is created for a partition set. If you’d like to convert all your trigger functions immediately, see the v2.0.0 extension update file for some simple plpgsql that can do that for you. I couldn’t make updating the trigger functions part of the extension update itself because doing so would then make all the trigger functions part of the extension (that was a fun accident to run into during testing).

There have been some other fixes to the extension as well, but these are the biggest and most relevant to most users. Please see the CHANGELOG for more info.

I’ve also taken this time to make pg_partman only compatible with PostgreSQL 9.4+. It uses the dynamic BGW feature, so I couldn’t go with 9.3. And while you can compile the extension without the BGW feature, I did include some new code that is only compatibly with 9.2+. I know this will inconvenience some users, but with the way extension versions are managed within the database, I couldn’t see any easy way to maintain both a v1 & v2 branch. For every new v1 release I’d make, I’d also have to maintain a separate update file to get that version to v2 as well as having to go back and update every previous v1 -> v2 update to include any relevant changes for people sticking with v1 that want to change to v2 in the future. I just don’t have that kind of time for extension management and I could see it very easily leading to a mess if I didn’t find a way to keep all that versioning straight. I will still fix bugs in v1, but all new feature development will be on v2 only. Drawing the line at 9.4 also gives me the freedom to use the latest and greatest features in Postgres and not have to compromise for backward compatibility, at least for the near future.

Upgrading to 9.4 is definitely worth the time to do if you’d like to keep using this extension easily and get the new features it will have in the future. If you’re 2 or 3 major versions behind (or more) and are curious why upgrading is worth the time, I’ll be giving an updated version of my Don’t Forget the Elephant talk at this year’s PGOpen in Dallas, Tx.

Written by Keith

June 11th, 2015 at 3:30 pm

PG Partman – Sub-partitioning

without comments

After my talk at PGCon 2014 where I discussed pg_partman, someone I met at the bar track said they’d use it in a heartbeat if it supported sub-partitioning. Discussing this with others and reading online, I found that there is quite a demand for this feature and the partitioning methods in MySQL & Oracle both support this as well. So I set out to see if I could incorporate it. I thought I’d had it figured out pretty easily and started writing this blog post a while ago (last October) to precede the release of version 1.8.0. Then I started working on the examples here and realized this is a trickier problem to manage than I anticipated. The tricky part being managing the context relationship between the top level parent and their child sub-partitions in a general manner that would work for all partitioning types pg_partman supports. When I first started working on the feature, I’d get things like this:

Obviously, having 2014 child sub-partitions in the 2013 parent partition set doesn’t make sense. I believe I’ve gotten this figured out and handled now in version 1.8.0 and fixed several issues encountered since then in 1.8.1 and 1.8.2 (thanks to the users that reported them!). Also if the parent is serial and the child is time (or vice versa), there’s no contextual relationship and all the child tables will be created in every partition set.

When I first saw that all sub-partitioning did was move the data further down into the inheritance tree, I at first wondered at the expected gains from this, outside of just organizing the data better. The use case of the person I mentioned in the first sentence gave a bit of a hint to the gains. If I remember correctly, they had an extremely large amount of time-series data that needed to be queried as efficiently as possible. One of the advantages of partitioning is the constraint exclusion feature (see my other post for more details on this) which allows the query plan to skip tables that it knows don’t contain that data. But postgres still has to do some work in order to figure out that those tables can be excluded in the first place. For very large partition sets, even this is a noticeable performance hit. Or so I’m told, as I do not have data sets near big enough to really evaluate this claim. But knowing how constraint exclusion works, I could see that as a possibility. Sub-partitioning, with a known naming pattern to the child tables, allows an application to target directly the exact child tables it needs and avoid even the minor overhead of constraint exclusion in the query plan.

Let’s see how it works.

First I create a standard, yearly partitioned table set

Next some data is added and I check that everything looks right

Say now we want to subpartition by day to better organize our data because we’re expecting to get A LOT of it. The new create_sub_parent() function works just like create_parent() except the first parameter is instead an already existing parent table whose children we want to partition. In this case, we’ll be telling it we want each yearly child table to be further partitioned by day.

Hopefully you’ve realized that all the data we inserted isn’t yet partitioned to the new daily tables yet. It all still resides in each one of the yearly sub-parent tables and the only tables that were created in 2015 are the ones around the current date of March 6th, 2015 +/- 4 days (since the premake config value is set to 4). For previous and future years, only a single partition was created for the lowest possible values. All parent tables in a partition set managed by pg_partman, at all partitioning levels, have at least one child, even if they have no data. You can see that for 2014 below. I don’t yet have things figured out for the data partitioning functions & scripts to handle sub-partitioning, but in the mean time, a query like the the one below the table definitions can generate the script lines for every sub-parent table for a given parent.

After running the partitioning script for each parent, you can see it automatically created 365 child partitions for 2014 (because there was data in the tables for every day) and only 69 for 2015 since we’re only partway into the year. It did so for the other years as well, but I figured showing one should be proof enough it worked.

Since this is the first sub-partition level, that parent table argument to create_sub_parent() just happens to be the same as we originally used for create_parent(). If you then wanted to again further sub-partition one of the new child tables, you would feed that to create_sub_partition() and it would be different.

I’ve also included a file in pg_partman now that gives some more detailed instructions on this and also how to undo such partitioning as well. If anyone has any issues with this feature, I’d appreciate the feedback.

Also, as a sneak preview for what’s currently in development, I believe I’ve gotten a very simple background worker process to handle partition maintenance working. This means, for the general maintenance where you call run_maintenance() with no parent table argument, you will no longer need an external scheduler such as cron! Just set a few variables in postgresql.conf and pg_partman will take care of things all within postgres itself. This does mean the next major version of pg_partman (2.0.0) will be 9.4+ only (I’m creating a dynamic BGW), but it also allows me to simplify a lot of code I’d been keeping around for 9.1 compatibility and add more features that are only available in later versions. So, think of this as more motivation to get your systems upgraded if you want to keep up with new features in this extension!

Written by Keith

March 9th, 2015 at 11:40 am

A Small Database Does Not Mean Small shared_buffers

without comments

As a followup to my previous blog post, A Large Database Does Not Mean Large shared_buffers, I had some more interesting findings applying the queries in that blog post to another client recently. I assume you have read that one already and don’t repeat any of what I explained previously, so if you haven’t read that one and aren’t familiar with the pg_buffercache extension, I highly recommend you go read that one first.

Another mantra often heard in PostgreSQL circles is that you usually don’t want to set shared_buffers higher than 8GB. I will admit, that for a majority of users, that is great advice and a good starting point (and a whole lot more useful than the default 32MB). There are also issues around double-buffering and allowing the kernel to do what it can probably do better than PostgreSQL as far as managing page reads/writes (a topic way out of the scope of this blog post). But if you investigate further into how PostgreSQL is using its shared memory and what your high demand data blocks actually are, you can possibly find benefit in setting it higher. Especially when you can clearly see what PostgreSQL thinks it needs most often. Or if you can just fit the whole thing into memory, as I stated before.

The client in these examples has shared_buffers set to 24Gb and the total database size is 145GB (111GB in the primary followed by 28GB, 5GB,  270MB & 150MB). I say small in the title of this post, but both large and small are relative terms and for my typical work this is a small database. And a setting that is 17% of the total size is larger than normal, so along with being a catchy followup name, the results do fit the title.

So I ran the basic query at the end of my previous post to see what the “ideal” minimal is. I ran this several times over about a half-hour period and, unlike the databases in my previous post, it did not deviate much.

Much higher than I previously encountered and with a much smaller database too. The value did deviate slightly, but it never changed from the rounded, pretty value of 18GB. So I investigated further. First the primary, 111GB database:

A good amount of the large tables had a significant amount of themselves in shared buffers. I looked at the top table here to see if it may be having problems keeping its high demand usage blocks in memory

Actually looks ok. It’s got about 2GB of space to be able to swap out lower priority blocks for higher ones if needed. How about those next two 100% tables?

I actually increased the usagecount parameter for both these tables all the way up to 5 and that only lowered the amount by a 2-3MB. So these are some pretty heavily used tables. For a client that does online order processing, this would seem to make sense for the context of this table. But it could also indicate a problem as well. This could mean there are queries doing a whole lot of sequential scans on this table and they might not need to be doing so. If that’s not something that’s readily apparent in the code accessing the database, I would then suggest turning to something like pgbadger for more in-depth query analysis to see where problems may be.

You may have noticed this doesn’t account for all the memory usage seen in the first query. Time to dive into the other databases (the 28GB one).

That primary key is taking up a lot of space and almost all of it seems to be in memory. But again, how much of it is really high usage?

Not nearly as much as is in shared_buffers. So no justification for an increase here. How about the messages table?

The whole thing is in very high demand! And there’s plenty of space for it it be there. The remainder of the majority of the space was a table similar to this in yet another one of the databases in the cluster.

So this PostgreSQL cluster seems to have some pretty good justification for having a shared_buffers 3x higher than what is typically suggested. It’s not actually using all of what’s available (only 18 of 24GB) and there’s still a significant amount in shared_buffers that’s got a usagecount below 3. My guidance to the client was to leave shared_buffers where it was, but to keep an eye on the tables like orders, order_items & messages. If the high usage of those tables is justified and they start increasing in size significantly, then this evaluation should be done again to see if shared_buffers should possibly be increased to keep that high demand data readily available in memory.

The pg_buffercache extension has been a great help with fine tuning one of the more important settings in PostgreSQL. Hopefully this helps clarify more how to evaluate shared_buffers usage and figuring out an ideal setting. And to be honest, I’m hoping that someone that reads this is in a position to better experiment with actually changing the shared_buffers value in situations like this to see if it really can make a difference in performance. As someone commented on my previous post, shared_buffers is a pretty invasive setting to change, not only because it requires a restart, but because you don’t want to screw up your performance on an active production machine. But you need the kind of activity that will be on an active production machine to accurately evaluate such settings. Reproducing such activity outside of production is really challenging.

So, looking for feedback and for anyone else to either validate or find fault with my experimentations so far.


Written by Keith

October 8th, 2014 at 2:25 pm

A Large Database Does Not Mean Large shared_buffers

with 12 comments

A co-worker of mine did a blog post last year that I’ve found incredibly useful when assisting clients with getting shared_buffers tuned accurately.

Setting shared_buffers the hard way

You can follow his queries there for using pg_buffercache to find out how your shared_buffers are actually being used. But I had an incident recently that I thought would be interesting to share that shows how shared_buffers may not need to be set nearly as high as you believe it should. Or it can equally show you that you that you definitely need to increase it. Object names have been sanitized to protect the innocent.

IMPORTANT NOTE: An important caveat to using this extension is that when you query pg_buffercache it puts a lock on the shared buffers in order to get an accurate assessment of usage. For the most part, this lock is brief and you probably won’t even notice anything. But on extremely busy systems you may notice some queries take a bit longer than usual. This is not something to be monitored in real-time for any extended period of time. It’s mostly something to be checked on occasionally when you’re noticing performance problems. If you do set up automatic monitoring for it, do not be aggressive.

To set the stage, the database total size is roughly 260GB and the use case is high data ingestion with some reporting done on just the most recent data at the time. shared_buffers is set to 8GB. The other thing to note is that this is the only database in the cluster. pg_buffercache has info for all databases in the cluster, but when you join against pg_class to get object information, you can only do this on individual database at a time.

You can see that table1 is taking up a vast majority of the space here and it’s a large table, so only 9% of it is actually in shared_buffers. What’s more interesting though is how much of the space for that table is actually in high demand.

Data blocks that go into and come out of postgres all go through shared_buffers. Just to review the blog post I linked to, whenever a block is used in shared memory, it increments a clock-sweep algorithm that ranges from 1-5, 5 being extremely high use data blocks. This means high usage blocks are likely to be kept in shared_buffers (if there’s room) and low usage blocks will get moved out if space for higher usage ones is needed. We believe that a simple insert or update sets a usagecount of 1. So, now we look at the difference when usage count is dropped to that.

So the shared_buffers is actually getting filled mostly by the data ingestion process, but relatively very little of it is of any further use afterwards. If anything of greater importance was needed in shared_buffers, there’s plenty of higher priority space and that inserted data would quickly get flushed out of shared memory due to having a low usagecount.

So with having pg_buffercache installed, we’ve found that the below query seems to be a good estimate on an optimal, minimum shared_buffers setting

This is the sort of query you would run after you have had your database running through your expected workload for a while. Also, note my use of the key word minimal. This does not account for unexpected spikes in shared_buffers usage that may occur during a session of reporting queries or something like that. So you definitely want to set it higher than this, but it can at least show you how effectively postgres is using its shared memory. In general we’ve found the typical suggestion of 8GB to be a great starting point for shared_buffers.

So, in the end, the purpose of this post was to show that shared_buffers is something that needs further investigation to really set optimally and there is a pretty easy method to figuring it out once you know where to look.


I’ve written a followup post to this one as well: A Small Database Does Not Mean Small shared_buffers.

So, as someone commented below, you don’t really need to join against pg_class & pg_database to get the ideal suggested minimum. This also avoids having to manually do totals across multiple databases in the cluster. The reason for joining against those two was to be able to identify which databases and objects the blocks in shared buffers were associated with. pg_class can only identify the objects of in the database you’re in.

Also, for really high traffic databases with fluctuating query activity, the suggested minimum query isn’t something you can run just once. It has to be run multiple times because the values can vary drastically.  Below are the results of running the shorter query just a few times in less than a 1 minute time period on a different client of ours that has a much different traffic pattern (OLTP) than the one above. There’s 46 databases in the cluster with a total size of roughly 900GB, with 800GB in one database, 30GB in the next largest and quickly getting smaller from there. For this one we actually have shared_buffers set down to 4GB and it’s been working great for years.


Written by Keith

September 11th, 2014 at 2:53 pm

Posted in PostgreSQL

Tagged with , ,

Checking for PostgreSQL Bloat (old)

with 2 comments

UPDATE: A newer version of this post has been made along with a rewrite of the script to use pgstattuple –

One of the on-going issues with PostgreSQL that every administrator must deal with at some point is table and/or index bloat. The MVCC architecture of PostgreSQL lends itself to requiring some extra monitoring & maintenance, especially in large enterprise systems. I’ll save the description of why bloat happens until further down in this post (for those that are curious) and get to the point of this article first.

So far the best check for bloat I’ve come across is the one contained in the check_postgres suite from Bucardo. And in places where we can get check_postgres set up and used, it works great. But sometimes we just want to check the bloat on a system and also have a little more fine grained control of the output. So recently I pulled out the rather complex bloat check query from check_postgres and adapted it to a command-line python script:

While I first started working on this script, just running the query by itself on some of our client systems, I also realized that it can be quite an expensive query to run on larger systems. It also brings back both table and index bloat together and other parts of the script split that to make the distinction. To allow this check to provide data more efficiently and separate table & index bloat, my script first creates a view to use. For postgresql 9.3+, this can be created as a materialized view so running the script to get bloat data back is fast and causes no contention in the system.

The other thing I realized is that bloat percentage alone is a poor indicator of actual system health. Very small tables may always have a higher than average bloat, or there may always be 1 or 2 pages considered waste, and in reality that  has next to zero impact on database performance. Constantly “debloating” them is more a waste of time than the space used. So I added in additional filters that also look at the number of pages in objects and the actual wasted space of those objects. This allows the final output of the bloat report to provide a more accurate representation of where there may actually be problems that need to be looked into.

Another option is a filter for individual tables or indexes to be ignored. If you understand why bloat happens, you will come across cases where a table is stuck at a certain bloat point at all times, no matter how many times you VACUUM FULL it or run pg_repack on it (those two things do remove it, but it quickly comes back). This happens with tables that have a specific level of churn with the rows being inserted, updated & deleted. The number of rows being updated/deleted is balanced with the number of rows being inserted/updated as well as the autovacuum schedule to mark space for reuse. Removing the bloat from tables like this can actually cause decreased performance because instead of re-using the space that VACUUM marks as available, Postgres has to again allocate more pages to that object from disk first before the data can be added. So bloat is actually not always a bad thing and the nature of MVCC can lead to improved write performance on some tables.

The “simple” output format is suitable for just using on the command-line or piping out to an email program like mailx to get a status report sent to you.

System tables like those you see at the bottom are extremely small and you can see the space they “waste” is negligible. I left the simple format like this so it just provides a quick glance for monitoring purposes. I also provide another output method in the form of a python dictionary that has some additional details. It’s also in a format that can be easily fed to other python scripts or anything else that needs a predictable, structured format.

You can see this provides some more details on how many pages are actually in the given objects vs how many are considered wasted space. All the system tables only have 1 page of wasted space so the given percentage varies because of how many pages they actually have. You have several options for filtering those last, unnecessary entries. The first would be to filter out the entire pg_catalog schema, but I don’t recommend that because catalog bloat can actually be a problem on databases that have extensive DDL changes over time. The other option is to use –min_pages to exclude any table or index that has less than a certain number of pages. I think the best option here would actually be –min_wasted_pages and set to something like 10.

You can use all these filters in combination as well to also avoid low percentage bloat like that top entry. Be careful doing this, though, since if you have an extremely large table and are trying to figure out where all potentially wasted space is, you could accidentally filter it out (10% is a low percentage but 10% of a 500GB table would be 50GB).

Why Bloat Happens

For those of you newer to PostgreSQL administration, and this is the first time you may be hearing about bloat, I figured I’d take the time to explain why this scenario exists and why tools like this are necessary (until they’re hopefully built into the database itself someday). It’s something most don’t understand unless someone first explains it to them or you run into the headaches it causes when it’s not monitored and you learn about it the hard way.

MVCC (multi-version concurrency control) is how Postgres has chosen to deal with multiple transactions/sessions hitting the same rows at (nearly) the same time. The documentation, along with wikipedia provide excellent and extensive explanations of how it all works, so I refer you there for all the details. Bloat is a result of one particular part of MVCC, concentrated around the handling of updates and deletes.

Whenever you delete a row, it’s not actually deleted, it is only marked as unavailable to all future transactions taking place after the delete occurs. The same happens with an update: the old version of a row is kept active until all currently running transactions have finished, then it is marked as unavailable. I emphasize the word unavailable because the row still exists on disk, it’s just not visible any longer. The VACUUM process in Postgres then comes along and marks any unavailable rows as space that is now available for future inserts or updates. The auto-vacuum process is configured to run VACUUM automatically after so many writes to a table (follow the link for the configuration options), so it’s not something you typically have to worry about doing manually very often (at least with more modern versions of Postgres).

People often assume that VACUUM is the process that should return the disk space to the file system. It does do this but only in very specific cases. That used space is contained in page files that make up the tables and indexes (called objects from now on) in the Postgres database system. Page files all have the same size and differently sized objects just have as many page files as they need. If VACUUM happens to mark every row in a page file as unavailable AND that page also happens to be the final page for the entire object, THEN the disk space is returned to the file system. If there is a single available row, or the page file is any other but the last one, the disk space is never returned by a normal VACUUM. This is bloat. Hopefully this explanation of what bloat actually is shows you how it can  sometimes be advantageous for certain usage patterns of tables as well, and why I’ve included the option to ignore objects in the report.

If you give the VACUUM command the special flag FULL, then all of that reusable space is returned to the file system. But VACUUM FULL does this by completely rewriting the entire table (and all its indexes) to new pages and takes an exclusive lock on the table the entire time it takes to run (CLUSTER does the same thing, but what that does is outside the scope of this post). For large tables in frequent use, this is problematic.  pg_repack has been the most common tool we’ve used to get around that. It recreates the table in the background, tracking changes to it, and then takes a brief lock to swap the old bloated table with the new one.

Why bloat is actually a problem when it gets out of hand is not just the disk space it uses up. Every time a query is run against a table, the visibility flags on individual rows and index entries is checked to see if is actually available to that transaction. On large tables (or small tables with a lot of bloat) that time spent checking those flags builds up. This is especially noticeable with indexes where you expect an index scan to improve your query performance and it seems to be making no difference or is actually worse than a sequential scan of the whole table. And this is why index bloat is checked independently of table bloat since a table could have little to no bloat, but one or more of its indexes could be badly bloated. Index bloat (as long as it’s not a primary key) is easier to solve because you can either just reindex that one index, or you can concurrently create a new index on the same column and then drop the old one when it’s done.

In all cases when you run VACUUM, it’s a good idea to run ANALYZE as well, either at the same time in one command or as two separate commands. This updates the internal statistics that Postgres uses when creating query plans. The number of live and dead rows in a table/index is a part of how Postgres decides to plan and run your queries. It’s a much smaller part of the plan than other statistics, but every little bit can help.

I hope this explanation of what bloat is, and how this tool can help with your database administration, has been helpful.

Written by Keith

August 25th, 2014 at 3:16 pm

Posted in PostgreSQL

Tagged with , ,