Keith's Ramblings…

Table Partitioning and Foreign Keys

with 21 comments

Table partitioning & foreign keys don’t get along very well in databases and PostgreSQL’s lack of having it built in shows it very clearly with the workarounds that are necessary to avoid the issues. The latest release of pg_partman deals with the lesser of two shortcomings that must be dealt with, that being where child tables in a partition set do not automatically inherit foreign keys created on the parent table. I’ll be using my other extension pg_jobmon as a reference for example here since it works well to illustrate both the issues and possible solutions. You can see here the job_detail table, which contains the individual steps of a logged job, references the the job_log table for the main job_id values.

keith=# \d jobmon.job_detail
                                           Table "jobmon.job_detail"
    Column    |           Type           |                              Modifiers                              
--------------+--------------------------+---------------------------------------------------------------------
 job_id       | bigint                   | not null
 step_id      | bigint                   | not null default nextval('jobmon.job_detail_step_id_seq'::regclass)
 action       | text                     | not null
 start_time   | timestamp with time zone | not null
 end_time     | timestamp with time zone | 
 elapsed_time | real                     | 
 status       | text                     | 
 message      | text                     | 
Indexes:
    "job_detail_step_id_pkey" PRIMARY KEY, btree (step_id)
    "job_detail_job_id_idx" btree (job_id)
Foreign-key constraints:
    "job_detail_job_id_fkey" FOREIGN KEY (job_id) REFERENCES jobmon.job_log(job_id) ON DELETE CASCADE

With version <= 1.7.0 of pg_partman, turning this table into a partition set illustrates the issue.

keith=# select partman.create_parent('jobmon.job_detail', 'job_id', 'id-static', '10000', p_jobmon := false);
 create_parent 
---------------
 
(1 row)

keith=# \d+ jobmon.job_detail
                                                               Table "jobmon.job_detail"
    Column    |           Type           |                              Modifiers                              | Storage  | Stats target | Description 
--------------+--------------------------+---------------------------------------------------------------------+----------+--------------+-------------
 job_id       | bigint                   | not null                                                            | plain    |              | 
 step_id      | bigint                   | not null default nextval('jobmon.job_detail_step_id_seq'::regclass) | plain    |              | 
 action       | text                     | not null                                                            | extended |              | 
 start_time   | timestamp with time zone | not null                                                            | plain    |              | 
 end_time     | timestamp with time zone |                                                                     | plain    |              | 
 elapsed_time | real                     |                                                                     | plain    |              | 
 status       | text                     |                                                                     | extended |              | 
 message      | text                     |                                                                     | extended |              | 
Indexes:
    "job_detail_step_id_pkey" PRIMARY KEY, btree (step_id)
    "job_detail_job_id_idx" btree (job_id)
Foreign-key constraints:
    "job_detail_job_id_fkey" FOREIGN KEY (job_id) REFERENCES jobmon.job_log(job_id) ON DELETE CASCADE
Triggers:
    job_detail_part_trig BEFORE INSERT ON jobmon.job_detail FOR EACH ROW EXECUTE PROCEDURE jobmon.job_detail_part_trig_func()
Child tables: jobmon.job_detail_p0,
              jobmon.job_detail_p10000,
              jobmon.job_detail_p20000,
              jobmon.job_detail_p30000,
              jobmon.job_detail_p40000
Has OIDs: no

keith=# \d+ jobmon.job_detail_p0
                                                             Table "jobmon.job_detail_p0"
    Column    |           Type           |                              Modifiers                              | Storage  | Stats target | Description 
--------------+--------------------------+---------------------------------------------------------------------+----------+--------------+-------------
 job_id       | bigint                   | not null                                                            | plain    |              | 
 step_id      | bigint                   | not null default nextval('jobmon.job_detail_step_id_seq'::regclass) | plain    |              | 
 action       | text                     | not null                                                            | extended |              | 
 start_time   | timestamp with time zone | not null                                                            | plain    |              | 
 end_time     | timestamp with time zone |                                                                     | plain    |              | 
 elapsed_time | real                     |                                                                     | plain    |              | 
 status       | text                     |                                                                     | extended |              | 
 message      | text                     |                                                                     | extended |              | 
Indexes:
    "job_detail_p0_pkey" PRIMARY KEY, btree (step_id)
    "job_detail_p0_job_id_idx" btree (job_id)
Check constraints:
    "job_detail_p0_partition_check" CHECK (job_id >= 0::bigint AND job_id < 10000::bigint)
Inherits: jobmon.job_detail
Has OIDs: no

You can see it is now a partitioned table, but if you look at any of the children, none of them have the FK back to the main job_log table.

As a side note, notice I set the p_jobmon parameter to false in create_parent(). By default pg_partman uses pg_jobmon when it is installed to log everything it does and provide monitoring that your partitioning is working. Since this would mean pg_jobmon is trying to log the partitioning steps of its own table, it puts it into a permanent lockwait state since it’s trying to write to the table it is partitioning. Turning pg_jobmon off for the initial creation avoids this compatibility issue between these two extensions. It can be turned back on for monitoring of future child table creation by modifying the jobmon column in pg_partman’s part_config table. Creation of partitions ahead of the current one does not interfere since a lock on the parent table is no longer required.

Back to the foreign key issue… Lets undo the partitioning here, upgrade pg_partman, and try again

keith=# select partman.undo_partition_id('jobmon.job_detail', 20, p_keep_table := false);
NOTICE:  Copied 0 row(s) to the parent. Removed 5 partitions.
 undo_partition_id 
-------------------
                 0
(1 row)

keith=# alter extension pg_partman update to '1.7.1';
ALTER EXTENSION

keith=# select partman.create_parent('jobmon.job_detail', 'job_id', 'id-static', '10000', p_jobmon := false);
 create_parent 
---------------
 
(1 row)

keith=# \d jobmon.job_detail_p0
                                         Table "jobmon.job_detail_p0"
    Column    |           Type           |                              Modifiers                              
--------------+--------------------------+---------------------------------------------------------------------
 job_id       | bigint                   | not null
 step_id      | bigint                   | not null default nextval('jobmon.job_detail_step_id_seq'::regclass)
 action       | text                     | not null
 start_time   | timestamp with time zone | not null
 end_time     | timestamp with time zone | 
 elapsed_time | real                     | 
 status       | text                     | 
 message      | text                     | 
Indexes:
    "job_detail_p0_pkey" PRIMARY KEY, btree (step_id)
    "job_detail_p0_job_id_idx" btree (job_id)
Check constraints:
    "job_detail_p0_partition_check" CHECK (job_id >= 0::bigint AND job_id < 10000::bigint)
Foreign-key constraints:
    "job_detail_p0_job_id_fkey" FOREIGN KEY (job_id) REFERENCES jobmon.job_log(job_id)
Inherits: jobmon.job_detail

Now our child table has the parent foreign key! The apply_foreign_keys() plpgsql function and the reapply_foreign_keys.py script that are part of the version 1.7.1 can actually be used on any table inheritance set, not just the ones managed by pg_partman. So some may find it useful elsewhere as well. So, what happens if we now partition the reference table, job_log, as well?

keith=# select partman.create_parent('jobmon.job_log', 'job_id', 'id-static', '10000', p_jobmon := false);
 create_parent 
---------------
 
(1 row)

keith=# \d+ jobmon.job_log
                                                             Table "jobmon.job_log"
   Column   |           Type           |                            Modifiers                            | Storage  | Stats target | Description 
------------+--------------------------+-----------------------------------------------------------------+----------+--------------+-------------
 job_id     | bigint                   | not null default nextval('jobmon.job_log_job_id_seq'::regclass) | plain    |              | 
 owner      | text                     | not null                                                        | extended |              | 
 job_name   | text                     | not null                                                        | extended |              | 
 start_time | timestamp with time zone | not null                                                        | plain    |              | 
 end_time   | timestamp with time zone |                                                                 | plain    |              | 
 status     | text                     |                                                                 | extended |              | 
 pid        | integer                  | not null                                                        | plain    |              | 
Indexes:
    "job_log_job_id_pkey" PRIMARY KEY, btree (job_id)
    "job_log_job_name_idx" btree (job_name)
    "job_log_pid_idx" btree (pid)
    "job_log_start_time_idx" btree (start_time)
    "job_log_status_idx" btree (status)
Referenced by:
    TABLE "jobmon.job_detail" CONSTRAINT "job_detail_job_id_fkey" FOREIGN KEY (job_id) REFERENCES jobmon.job_log(job_id) ON DELETE CASCADE
    TABLE "jobmon.job_detail_p0" CONSTRAINT "job_detail_p0_job_id_fkey" FOREIGN KEY (job_id) REFERENCES jobmon.job_log(job_id)
    TABLE "jobmon.job_detail_p10000" CONSTRAINT "job_detail_p10000_job_id_fkey" FOREIGN KEY (job_id) REFERENCES jobmon.job_log(job_id)
    TABLE "jobmon.job_detail_p20000" CONSTRAINT "job_detail_p20000_job_id_fkey" FOREIGN KEY (job_id) REFERENCES jobmon.job_log(job_id)
    TABLE "jobmon.job_detail_p30000" CONSTRAINT "job_detail_p30000_job_id_fkey" FOREIGN KEY (job_id) REFERENCES jobmon.job_log(job_id)
    TABLE "jobmon.job_detail_p40000" CONSTRAINT "job_detail_p40000_job_id_fkey" FOREIGN KEY (job_id) REFERENCES jobmon.job_log(job_id)
Triggers:
    job_log_part_trig BEFORE INSERT ON jobmon.job_log FOR EACH ROW EXECUTE PROCEDURE jobmon.job_log_part_trig_func()
    trg_job_monitor AFTER UPDATE ON jobmon.job_log FOR EACH ROW EXECUTE PROCEDURE jobmon.job_monitor()
Child tables: jobmon.job_log_p0,
              jobmon.job_log_p10000,
              jobmon.job_log_p20000,
              jobmon.job_log_p30000,
              jobmon.job_log_p40000
Has OIDs: no


keith=# \d jobmon.job_log_p0
                                        Table "jobmon.job_log_p0"
   Column   |           Type           |                            Modifiers                            
------------+--------------------------+-----------------------------------------------------------------
 job_id     | bigint                   | not null default nextval('jobmon.job_log_job_id_seq'::regclass)
 owner      | text                     | not null
 job_name   | text                     | not null
 start_time | timestamp with time zone | not null
 end_time   | timestamp with time zone | 
 status     | text                     | 
 pid        | integer                  | not null
Indexes:
    "job_log_p0_pkey" PRIMARY KEY, btree (job_id)
    "job_log_p0_job_name_idx" btree (job_name)
    "job_log_p0_pid_idx" btree (pid)
    "job_log_p0_start_time_idx" btree (start_time)
    "job_log_p0_status_idx" btree (status)
Check constraints:
    "job_log_p0_partition_check" CHECK (job_id >= 0::bigint AND job_id < 10000::bigint)
Inherits: jobmon.job_log

It partitions the table without any errors and you can see all the child table foreign keys referencing the parent. But notice the job_log_p0 child table? It has no references from any of the children. And this is the bigger issue that pg_partman does not solve, and most likely never will…

Foreign key reference checks to the parent table in an inheritance set do not propagate to the children

Since the parent table in an inheritance set is typically either empty, or only contains a fraction of the total data, the table referencing the partition set will either fail on every insert or when it hits a value that is only in a child table. The below SQL statements illustrate the issue

keith=# INSERT INTO jobmon.job_log (owner, job_name, start_time, pid) values ('keith', 'FK FAILURE TEST', now(), pg_backend_pid());
INSERT 0 0

keith=# select * from jobmon.job_log;
 job_id | owner |                     job_name                     |          start_time           | end_time | status |  pid  
--------+-------+--------------------------------------------------+-------------------------------+----------+--------+-------
      2 | keith | FK FAILURE TEST                                  | 2014-05-26 23:14:35.830266-04 | «NULL»   | «NULL» | 25286

keith=# insert into jobmon.job_detail (job_id, action, start_time) values (2, 'FK FAILURE TEST STEP 1', now());
ERROR:  insert or update on table "job_detail_p0" violates foreign key constraint "job_detail_p0_job_id_fkey"
DETAIL:  Key (job_id)=(2) is not present in table "job_log".
CONTEXT:  SQL statement "INSERT INTO jobmon.job_detail_p0 VALUES (NEW.*)"
PL/pgSQL function jobmon.job_detail_part_trig_func() line 11 at SQL statement

You can clearly see the job_log table has the job_id value “2”, but trying to insert that value into the table that uses it as a reference fails. This is because that value lives in job_log_p0, not job_log and the FK reference check does not propagate to the child tables.

keith=# select * from only jobmon.job_log;
 job_id | owner | job_name | start_time | end_time | status | pid 
--------+-------+----------+------------+----------+--------+-----
(0 rows)

keith=# select * from only jobmon.job_log_p0;
 job_id | owner |    job_name     |          start_time           | end_time | status |  pid  
--------+-------+-----------------+-------------------------------+----------+--------+-------
      2 | keith | FK FAILURE TEST | 2014-05-26 23:14:35.830266-04 | «NULL»   | «NULL» | 25286
(1 row)

I’m not sure of all of the reasons why PostgreSQL doesn’t allow FK checks to propagate down inheritance trees, but I do know one of the consequences of doing so could be some heavy performance hits for the source table if the inheritance set is very large. Every insert would have to scan down all tables in the inheritance tree. Even with indexes, this could be a very expensive.

There is a way to write a trigger and “fake” the foreign key if this is needed. (Update: As someone commented below, there are still race conditions here that are hard to deal with) I looked into this because I do want to be able to partition the pg_jobmon tables and keep referential integrity. To see how this works, I’m starting with a clean installation of pg_jobmon (no partitions). First the original foreign key on job_detail has to be removed, then a trigger is created in its place.

keith=# alter table jobmon.job_detail drop constraint job_detail_job_id_fkey;
ALTER TABLE

keith=# CREATE OR REPLACE FUNCTION jobmon.job_detail_fk_trigger() RETURNS trigger
keith-#     LANGUAGE plpgsql
keith-#     AS $$
keith$# DECLARE
keith$# v_job_id    bigint;
keith$# BEGIN
keith$#     SELECT l.job_id INTO v_job_id
keith$#     FROM jobmon.job_log l
keith$#     WHERE l.job_id = NEW.job_id;
keith$# 
keith$#     IF v_job_id IS NULL THEN
keith$#         RAISE foreign_key_violation USING 
keith$#             MESSAGE='Insert or update on table "jobmon.job_detail" violates custom foreign key trigger "job_detail_fk_trigger" ',
keith$#             DETAIL='Key (job_id='||NEW.job_id||') is not present in jobmon.job_log';
keith$#     END IF;
keith$#     RETURN NEW;
keith$# END
keith$# $$;
CREATE FUNCTION

keith=# 
keith=# 
keith=# CREATE TRIGGER aa_job_detail_fk_trigger 
keith-# BEFORE INSERT OR UPDATE OF job_id
keith-# ON jobmon.job_detail
keith-# FOR EACH ROW
keith-# EXECUTE PROCEDURE jobmon.job_detail_fk_trigger();
CREATE TRIGGER

This MUST be a BEFORE trigger and I gave the trigger name a prefix of “aa_” because PostgreSQL fires triggers off in alphabetical order and I want to ensure it goes first as best I can.  Now we partition job_detail & job_log the same as before.

keith=# select partman.create_parent('jobmon.job_detail', 'job_id', 'id-static', '10000', p_jobmon := false);
 create_parent 
---------------
 
(1 row)

Time: 168.831 ms
keith=# \d+ jobmon.job_detail
                                                               Table "jobmon.job_detail"
    Column    |           Type           |                              Modifiers                              | Storage  | Stats target | Description 
--------------+--------------------------+---------------------------------------------------------------------+----------+--------------+-------------
 job_id       | bigint                   | not null                                                            | plain    |              | 
 step_id      | bigint                   | not null default nextval('jobmon.job_detail_step_id_seq'::regclass) | plain    |              | 
 action       | text                     | not null                                                            | extended |              | 
 start_time   | timestamp with time zone | not null                                                            | plain    |              | 
 end_time     | timestamp with time zone |                                                                     | plain    |              | 
 elapsed_time | real                     |                                                                     | plain    |              | 
 status       | text                     |                                                                     | extended |              | 
 message      | text                     |                                                                     | extended |              | 
Indexes:
    "job_detail_step_id_pkey" PRIMARY KEY, btree (step_id)
    "job_detail_job_id_idx" btree (job_id)
Triggers:
    aa_job_detail_fk_trigger BEFORE INSERT OR UPDATE OF job_id ON jobmon.job_detail FOR EACH ROW EXECUTE PROCEDURE jobmon.job_detail_fk_trigger()
    job_detail_part_trig BEFORE INSERT ON jobmon.job_detail FOR EACH ROW EXECUTE PROCEDURE jobmon.job_detail_part_trig_func()
Child tables: jobmon.job_detail_p0,
              jobmon.job_detail_p10000,
              jobmon.job_detail_p20000,
              jobmon.job_detail_p30000,
              jobmon.job_detail_p40000
Has OIDs: no

keith=# \d+ jobmon.job_detail_p0
                                                             Table "jobmon.job_detail_p0"
    Column    |           Type           |                              Modifiers                              | Storage  | Stats target | Description 
--------------+--------------------------+---------------------------------------------------------------------+----------+--------------+-------------
 job_id       | bigint                   | not null                                                            | plain    |              | 
 step_id      | bigint                   | not null default nextval('jobmon.job_detail_step_id_seq'::regclass) | plain    |              | 
 action       | text                     | not null                                                            | extended |              | 
 start_time   | timestamp with time zone | not null                                                            | plain    |              | 
 end_time     | timestamp with time zone |                                                                     | plain    |              | 
 elapsed_time | real                     |                                                                     | plain    |              | 
 status       | text                     |                                                                     | extended |              | 
 message      | text                     |                                                                     | extended |              | 
Indexes:
    "job_detail_p0_pkey" PRIMARY KEY, btree (step_id)
    "job_detail_p0_job_id_idx" btree (job_id)
Check constraints:
    "job_detail_p0_partition_check" CHECK (job_id >= 0::bigint AND job_id < 10000::bigint)
Inherits: jobmon.job_detail
Has OIDs: no

keith=# select partman.create_parent('jobmon.job_log', 'job_id', 'id-static', '10000', p_jobmon := false);
 create_parent 
---------------
 
(1 row)

Time: 197.390 ms
keith=# \d+ jobmon.job_log
                                                             Table "jobmon.job_log"
   Column   |           Type           |                            Modifiers                            | Storage  | Stats target | Description 
------------+--------------------------+-----------------------------------------------------------------+----------+--------------+-------------
 job_id     | bigint                   | not null default nextval('jobmon.job_log_job_id_seq'::regclass) | plain    |              | 
 owner      | text                     | not null                                                        | extended |              | 
 job_name   | text                     | not null                                                        | extended |              | 
 start_time | timestamp with time zone | not null                                                        | plain    |              | 
 end_time   | timestamp with time zone |                                                                 | plain    |              | 
 status     | text                     |                                                                 | extended |              | 
 pid        | integer                  | not null                                                        | plain    |              | 
Indexes:
    "job_log_job_id_pkey" PRIMARY KEY, btree (job_id)
    "job_log_job_name_idx" btree (job_name)
    "job_log_pid_idx" btree (pid)
    "job_log_start_time_idx" btree (start_time)
    "job_log_status_idx" btree (status)
Triggers:
    job_log_part_trig BEFORE INSERT ON jobmon.job_log FOR EACH ROW EXECUTE PROCEDURE jobmon.job_log_part_trig_func()
    trg_job_monitor AFTER UPDATE ON jobmon.job_log FOR EACH ROW EXECUTE PROCEDURE jobmon.job_monitor()
Child tables: jobmon.job_log_p0,
              jobmon.job_log_p10000,
              jobmon.job_log_p20000,
              jobmon.job_log_p30000,
              jobmon.job_log_p40000
Has OIDs: no

You can see that triggers are not inherited to child tables, so that is why it must be a BEFORE trigger on the job_detail parent. The insert does not actually happen on the job_detail parent table, so the event must be caught before any insert is actually done. Also, this isn’t quite as flexible as a real foreign key since there are no CASCADE options to handle data being removed on the parent. This also causes much heavier locks than a real foreign key. Lets see what happens if we try the same inserts that failed above

keith=# INSERT INTO jobmon.job_log (owner, job_name, start_time, pid) values ('keith', 'FK FAILURE TEST', now(), pg_backend_pid());
INSERT 0 0

keith=# select * from jobmon.job_log;
 job_id | owner |    job_name     |          start_time          | end_time | status | pid  
--------+-------+-----------------+------------------------------+----------+--------+------
      1 | keith | FK FAILURE TEST | 2014-05-27 12:59:03.06901-04 | «NULL»   | «NULL» | 3591
(1 row)

keith=# insert into jobmon.job_detail (job_id, action, start_time) values (1, 'FK FAILURE TEST STEP 1', now());
INSERT 0 0

keith=# select * from jobmon.job_detail;
 job_id | step_id |         action         |          start_time          | end_time | elapsed_time | status | message 
--------+---------+------------------------+------------------------------+----------+--------------+--------+---------
      1 |       1 | FK FAILURE TEST STEP 1 | 2014-05-27 12:59:40.03766-04 | «NULL»   |       «NULL» | «NULL» | «NULL»
(1 row)

keith=# select * from only jobmon.job_detail;
 job_id | step_id | action | start_time | end_time | elapsed_time | status | message 
--------+---------+--------+------------+----------+--------------+--------+---------
(0 rows)

keith=# select * from only jobmon.job_detail_p0;
 job_id | step_id |         action         |          start_time          | end_time | elapsed_time | status | message 
--------+---------+------------------------+------------------------------+----------+--------------+--------+---------
      1 |       1 | FK FAILURE TEST STEP 1 | 2014-05-27 12:59:40.03766-04 | «NULL»   |       «NULL» | «NULL» | «NULL»
(1 row)

No errors! And what happens if we try and insert invalid data to the child table?

keith=# insert into jobmon.job_detail (job_id, action, start_time) values (2, 'FK FAILURE TEST STEP 1', now());
ERROR:  Insert or update on table "jobmon.job_detail" violates custom foreign key trigger "job_detail_fk_trigger" 
DETAIL:  Key (job_id=2) is not present in jobmon.job_log

Since the trigger function is doing a normal select on the parent table of the job_log partition set, it is seeing data across all the child partitions. AND, since job_id is the partition column of job_log, the trigger function will actually be able to take advantage of constraint exclusion and will only actually touch the one single partition that value could be in. So this works very well in this case, even if the partition set grows extremely large. Now, if you create a FK trigger like this on any other column that doesn’t have constraints, you will begin to notice performance issues as the reference table grows in size. If your tables contain static, unchanging data, pg_partman has some additional options that can help here as well (see my previous post about constraint exclusion).

The other issue with this is exclusive to pg_jobmon being an extension. The lack of a foreign key and presence of a trigger is different than the default extension code. There is the potential that a future extension update could either remove the trigger or replace the foreign key. There’s currently no way to give extension installation options for different code branches that I’m aware of and keep things consistent. In the case of pg_jobmon, the extension is mostly feature complete and I don’t foresee any updates breaking the above fix. But it is something to be aware of if you have to change the default code in any extension.

This is a complicated issue and one that many people don’t realize when trying to plan out table partitioning for more complex schemas. Hopefully I’ve helped clarify things and shown why partitioning is such a tricky issue to get right.

Written by Keith

May 28th, 2014 at 10:27 am