Keith's Ramblings…

"Fixing a complex system sounds pretty easy when you're an idiot."

Removing A Lot of Old Data (But Keeping Some Recent)

with 5 comments

I’ve had this situation crop up a few times with clients, and after a recent discussion on #postgresql on Freenode, I decided a blog post may be in order. The pitfalls that led me to this solution are worth covering, and it seems like a useful set of steps to have documented so I can share it again later.

There comes a time for most people when you have a table that builds up quite a lot of rows and you then realize you didn’t actually need to keep all of it. But you can’t just run a TRUNCATE because you do want to keep some of the more recent data. If it’s just a few million small rows, it’s not a huge deal to just run a simple DELETE. But when that starts getting into the billions of rows, or your rows are very large (long text, bytea, etc), a simple DELETE may not be realistic.

The first, and arguably easiest, way to deal with this would be to run the DELETE in batches instead of one large transaction. This allows you to add a pause in between the batches to help control I/O. Another side effect of such large delete operations can be an excessively high amount of WAL generation. Not only does this contribute to I/O, it can also dramatically increase disk space usage. When you’re trying to delete data due to disk space constraints, this can end up making things even worse before they get better. And if you’re deleting quite a lot of data spanning a long period of time from a table that didn’t receive many deletes before, you likely won’t get much disk space back at the end without doing a VACUUM FULL or pg_repack (see my discussion on bloat here). And the biggest issue of all when disk space is critical: either of those options requires at least as much free disk space as a full copy of the unbloated table would take up. So if disk space is the crucial problem most people are trying to solve by removing old data, how can we easily do this?
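
A minimal sketch of that batched approach, assuming a hypothetical big_table with an id primary key and a created_at column (the batch size and cutoff here are arbitrary):

    -- Re-run this (with a pause in between) until it reports DELETE 0.
    DELETE FROM big_table
    WHERE id IN (
        SELECT id
        FROM big_table
        WHERE created_at < now() - interval '1 year'  -- the rows you no longer need
        LIMIT 10000
    );

    -- An occasional manual VACUUM lets the freed space be reused between batches,
    -- though it won't shrink the files on disk.
    VACUUM big_table;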

The first option, and the one most easily done on almost any RDBMS, is to make a secondary table and have new data copied/redirected there. The I/O and WAL generation of moving the smaller, required amount of data is much less than removing all the old data. Then you can just drop the old table. Most people do this via a trigger-based method: have every write to the old table also write to the new one. Then once you’re sure things are working, you can take a brief lock on both the old and new tables and swap their names. PostgreSQL makes this last step much easier since it has transactional DDL. I’ll cover the commands to do this later since my final solution is similar to this. The main problems surface when the table you’re trying to clean up is a very high-traffic table. A trigger like this basically doubles all writes, which could possibly cause some I/O issues. There’s also the matter of making sure you get that trigger function code right, otherwise all writes break. That can be a little nerve-wracking on critical production systems. But there is a way to avoid both the additional I/O of double writes and the headache of writing triggers.
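
For illustration, a minimal sketch of that trigger-based approach could look like the following, using hypothetical events/events_new table names and only handling INSERTs (UPDATEs and DELETEs would need similar treatment):

    -- Hypothetical names; the new table copies the old one's structure.
    CREATE TABLE events_new (LIKE events INCLUDING ALL);

    CREATE FUNCTION events_double_write() RETURNS trigger AS $$
    BEGIN
        -- write every new row to the replacement table as well
        INSERT INTO events_new VALUES (NEW.*);
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER events_double_write_trig
        AFTER INSERT ON events
        FOR EACH ROW EXECUTE PROCEDURE events_double_write();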

The actual solution we came up with for this involves using the often overlooked feature of table inheritance in PostgreSQL. The steps basically go like this:

  1. Create a new table exactly like the first one
  2. Set ownership/privileges on the new table to match the old table
  3. Have the old table INHERIT the new table
  4. Swap old and new table names
  5. Move the data you want to keep from the old table to the new one
  6. Drop the old table

As soon as you do step 4, all new data immediately goes to the new table. Also, since the old table is a child of the new table, all your old data is still visible from the original table name. Moving the data from the old table to the new one is nearly transparent to any users of the table; the only issue is that there may be some slightly slower queries during the transition period since the planner has to account for two tables. You can help mitigate this slightly by placing a constraint on the new table (before the swap, while it’s still empty) that only allows data in the new table’s window. This allows constraint exclusion to possibly ignore the old table while you get the data moved. Placing a constraint on the old table probably wouldn’t help much since it would have to lock it for validation. Yes, there will be additional I/O and disk usage from WAL while you move data from the old table to the new one, but this can be controlled to a much greater degree since all new data isn’t also being written twice by a trigger. You can move the data in batches with any necessary pauses to control those I/O & WAL spikes.
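
That constraint could be something like the following sketch, using the hypothetical events_new name for the new, still-empty table, a created_at column, and an arbitrary cutoff for the data you intend to keep:

    -- Added while events_new is still empty, so validation is instant.
    ALTER TABLE events_new
        ADD CONSTRAINT events_new_keep_window
        CHECK (created_at >= '2017-01-01 00:00:00+00');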

The one thing this method does not account for is updates to data in the old table that would technically place it in the new one while you’re in the process of moving the recent data you want to keep. That update will not be moved from the old table to the new one. But in most cases where we’ve done this, those updates weren’t that critical since, if you’d gone with the original method of just deleting the old data, you would’ve been deleting the data that was getting updated anyway. But it’s something to be aware of if you go querying the old table and still see “new” data after you think you’re done migrating. It’s also something to be aware of if your application thinks it updated something when the old table was there and now that update is suddenly gone. If that’s the case, you may need to stick with the trigger method above.

An example of the commands to do this is below. The LIKE clause to the CREATE TABLE statement is quite useful since it can automatically include all indexes, constraints, defaults, comments, and storage options. It does NOT include ownership or privileges, however, so the one critical step in this piece is definitely #2 above. You can easily see a table’s privileges with the \dp command in psql. Also, explicitly obtaining the exclusive lock on both tables before doing the name switch ensures nothing weird happens during whatever brief moment could exist between the renames.
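
Something along these lines, using the hypothetical events table again; the role, privilege, and name details are placeholders that depend on your own table:

    -- Steps 1 & 2: create the new table and match ownership/privileges
    -- (check the old table's privileges with \dp; the names below are placeholders)
    CREATE TABLE events_new (LIKE events INCLUDING ALL);
    ALTER TABLE events_new OWNER TO some_owner;
    GRANT SELECT, INSERT, UPDATE, DELETE ON events_new TO app_user;
    -- (optionally add the CHECK constraint described above while the table is empty)

    -- Steps 3 & 4: make the old table a child of the new one, then swap the names,
    -- all under an explicit lock in a single transaction
    BEGIN;
    LOCK TABLE events, events_new IN ACCESS EXCLUSIVE MODE;
    ALTER TABLE events INHERIT events_new;
    ALTER TABLE events RENAME TO events_old;
    ALTER TABLE events_new RENAME TO events;
    COMMIT;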

Once all these steps are done, you can then begin the process of moving your more recent data out of the old table and into the new via whichever method works best for you. One easy method to batch this is a CTE query that does the DELETE/INSERT with a SELECT in a single query to limit the rows moved.
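
One sketch of that kind of batched move, continuing with the hypothetical names from above (events is now the new table, events_old the inherited child) and assuming an id primary key and a created_at column:

    -- Repeat (with pauses as needed) until no rows are left to move.
    WITH moved AS (
        DELETE FROM events_old
        WHERE id IN (
            SELECT id
            FROM events_old
            WHERE created_at >= now() - interval '30 days'  -- the data to keep
            LIMIT 10000
        )
        RETURNING *
    )
    INSERT INTO events
    SELECT * FROM moved;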

And once that’s done, you can then DROP the old table, instantly recovering all that disk space with minimal WAL traffic and zero bloat aftermath!


Also published on Medium.

Written by Keith

March 15th, 2017 at 11:32 am

Posted in PostgreSQL


  • Rakesh Kumar

    Hi Keith,

    There is another way to do this:

    BEGIN TRANSACTION ;
    SELECT * INTO TEMP TABLE TMP123
    FROM LARGE_TABLE
    WHERE where_clause_with_the_rows_you_want_to_keep ;

    TRUNCATE TABLE LARGE_TABLE ;

    INSERT INTO LARGE_TABLE
    SELECT * from TMP123 ;

    COMMIT TRANSACTION ;

    • keithf4

      Doesn’t this pretty much lock all the data from the point of the TRUNCATE forward? The one I described above only requires a very brief heavy lock for the table rename. There may be some row locks while moving the data as well, depending on usage, but that can be controlled by committing in smaller batches.

      Guess it all depends on how much data is “recent”. If it’s only a few rows that can be moved in a few seconds, your method could work. If it’s still millions of rows, that could be problematic.

      • Rakesh Kumar

        I am not sure. The whole point of encapsulating this within a transaction is to allow PG to truncate the table in MVCC mode, where only this session sees the truncation while others will not notice it. PG even allows a truncate to be rolled back, so in my code, if I rollback instead of commit, no one else will notice.
        So in the above case, if other sessions are inserting into the table, they won’t wait, but the above session will wait on the truncate statement until all other sessions have no more open transactions. Then this session will have the truncate complete and committed. At that point the truncated table will be visible to all.

        Yes, you are right. If “recent” rows means millions of rows, the last step will take some time to insert, but due to MVCC, it won’t affect others.

        Overall I do agree that my approach works best only during a quiet time. Your approach can work in more varied situations.

        • keithf4

          Have you actually tested this? It appears doing a TRUNCATE locks the table completely, so you can neither read new rows coming in once that’s run, nor write any new ones. Below is the pg_stat_activity view for a query that is just doing a simple SELECT on a table I have being truncated in another open transaction. You can see the TRUNCATE is idle in transaction and the read query is stuck behind a heavyweight lock (thank you 9.6 for the greater lock visibility!). The whole point of using the method I described above was to avoid locking the table for any noticeable length of time.

          -[ RECORD 2 ]----+------------------------------
          datid            | 16384
          datname          | keith
          pid              | 4790
          usesysid         | 10
          usename          | keith
          application_name | psql
          client_addr      | «NULL»
          client_hostname  | «NULL»
          client_port      | -1
          backend_start    | 2017-03-16 15:10:25.629205-04
          xact_start       | 2017-03-16 15:11:13.156526-04
          query_start      | 2017-03-16 15:11:22.491366-04
          state_change     | 2017-03-16 15:11:22.49469-04
          wait_event_type  | Client
          wait_event       | ClientRead
          state            | idle in transaction
          backend_xid      | 21584
          backend_xmin     | «NULL»
          query            | truncate testing_trunc ;
          -[ RECORD 3 ]----+------------------------------
          datid            | 16384
          datname          | keith
          pid              | 4813
          usesysid         | 10
          usename          | keith
          application_name | psql
          client_addr      | «NULL»
          client_hostname  | «NULL»
          client_port      | -1
          backend_start    | 2017-03-16 15:11:24.709798-04
          xact_start       | 2017-03-16 15:14:03.067949-04
          query_start      | 2017-03-16 15:14:03.067949-04
          state_change     | 2017-03-16 15:14:03.067982-04
          wait_event_type  | Lock
          wait_event       | relation
          state            | active
          backend_xid      | «NULL»
          backend_xmin     | 21584
          query            | select * from testing_trunc ;

          • Rakesh Kumar

            You may be right. My test involved one truncate session and one insert session. I need to mix in a select session as well.