PG Extractor - A smarter pg_dump


UPDATE: This post still gets a lot of views due to search results for pg_extractor. Please note that the perl version this post discusses is no longer in development and a complete rewrite in Python was done for version 2.0. The version control features were removed as well. Please see the README file for more information.

For my debut blog post, I’ll be explaining a tool, pg_extractor, that I began working on at my current job. I’d like to first give a big thank you to depesz and the DBA team at OmniTI for their help in making this tool.

We had already had a tool in use for doing a one-file-per-object schema dump of PostgreSQL databases for use in sending to version control. It was originally designed with some complex queries to fetch the internal table, view & function schema, but with the release of 9.x, many of those queries broke with internal structure changes. It also didn’t account for partitioned tables with inheritance, overloaded functions and several other more advanced internal structures. I’d originally set out to just fix the schema queries to account for these changes and shortcomings but depesz mentioned his idea for a complete reworking of the tool using pg_dump/pg_restore instead of complex queries. This would hopefully keep it as future proof as possible and keep the output in a format that was better guaranteed to be a representation of how PostgreSQL sees its own object structure, no matter how it may change in the future.

SVN options have been integrated into the tool and I have plans on getting Git in as well. While the ideal way to track a database’s schema in VCS would be before you actually commit it to your production system, this was an easy way to just have the schema available for reference and automatically track changes outside of the normal development cycle.

Working on this tool has shown many shortcomings with the built in pg_dump/restore. One of my hopes with this project is to show what improvements can be made and possibly built in sometime in the future. One of the biggest is the lack of outputting of ACLs and Comments for anything but tables. And if you have many individual objects you’d like to export, formatting that for pg_dump can be quite tedious. And there is a lack of filtering for anything other than tables and schemas with pg_dump. The filter options list for pg_extractor has become quite long and will probably only continue to grow. Just do a -help or see the documentation on GitHub. I think one of the nicest features is the ability to use external files as input to the filter options and also filter with regex!

Probably best to move on to some examples instead of continuing this wall of text…

perl -U postgres --dbname=mydb --getall --sqldump

This is a very basic use of the tool using many assumed defaults. It uses the postgres database user, grabs all objects (tables, views, functions/aggregates, custom types, & roles), and also makes a permanent copy of the pg_dump file used to create the individual schema files. I tried to use options that were similar to, or exactly the same as, the options for pg_dump/restore. There’s only so many letters in the alphabet though, so please check the documentation to be sure. Just like pg_dump, it will also use the PG environment variables $PGDATABASE, $PGPORT, $PGUSER, $PGHOST, $PGPASSFILE, and $PGCLIENTENCODING if they are set (they are actually used internally by the script as well if the associated options are used).

perl -U postgres --dbname=mydb --getfuncs --n=keith

This will extract only the functions from the keith schema. Any overloaded functions are put into the same file. I’d thought about trying to use the parameters to somehow make unique names for each version of an overloaded function, but I found this much easier for now, especially when going back and removing files if the -delete cleanup option is set.

perl -U postgres --dbname=mydb --getfuncs -p_file=/home/postgres/func_incl

This will extract only specifically named functions in the given filename. You must ensure that the full function signature is given with only the variable types for arguments. When using include files, it’s best to explicitly name the schemas that the objects in the file are in as well (it makes the temporary dump file that’s created smaller). The contents of any external file list are newline separated like so

dblink_exec(text, text)
dblink_exec(text, text, boolean)
dblink_exec(text, boolean)

One of the things I ensured was possible for external file lists was that you can use an object list created with psql when using the \t and \o options

postgres=# \t
Showing only tuples.
postgres=# \o table_filter
postgres=# select schemaname || '.' || tablename from pg_tables;

You could then use the table_filter file with the -t_file or -T_file options. I use this ability at work for a cronjob that generates a specific list of views I want exported and tracked in SVN. But that view list is not static and can change at any time. So the cronjob generates a new view filter file every day.

perl -U postgres --dbname=mydb --gettables -Fc
--t_file=/home/postgres/tbl_incl --getdata

This extracts only the tables listed in the given filename along with the data in the pg_dump custom format. By default, pg_extractor’s output will all be in plain (-Fp) format since the intended purpose of the tool is to create output in a human-readable format.

perl -U postgres --dbname=mydb --pgdump=/opt/pgsql/bin/pg_dump
--pgrestore=/opt/pgsql/bin/pg_restore --pgdumpall=/opt/pgsql/bin/pg_dumpall --getall
--regex_excl_file=/home/postgres/part_exclude --N=schema1,schema2,schema3,schema4

This excludes objects (most likely table partitions) that have the pattern objectname_pYYYY_MM or fairly similar. Binaries are also not in the $PATH and it EXCLUDES several schemas.

part_exclude file contains: p?(20|19)\d\d(_?\d+)*$

perl -U postgres --dbname=mydb --svn --svncmd=/opt/svn/bin/svn
--commitmsg=Weekly svn commit of postgres schema --svndel

This is an example using svn. To keep the svn password out of system logs, the svn username & password have to be manually entered into the script file. It’s the only option that requires any manual editing of the source code. May see about having this as an option that points to an external file instead. The -svndel option cleans up any objects that are removed from the database from the svn repository.

Update: Writing up this blog post got me motivated to fix having to edit any source to get a certain option to work. There is now an -svn_userfile option where you can give it the path to a file containing the svn username and password to use. See the help for formatting. Just set permissions appropriately on this file for security and ensure the user running pg_extractor can read it.

One of the first fun issues I came across was handling objects that have special characters in their names. It’s rare, but since individual objects are all in their own files based on their object names, the OS doesn’t particularly like that, and caused some rather annoying errors every time the script ran. This is now handled by hex-encoding the special character and preceding that character with a comma in the file name.

For example: table|name becomes table,7cname.sql

This hopefully makes it easy to decode if needed by any other tools. It’s actually done internally in the tool when you want objects that were deleted to also have their files deleted on subsequent runs.

One issue that I still haven’t solved (and make note of in the source where it would possibly be handled) is that the signature for a function when you do a pg_restore -l is different for any comments that are also associated with that function if variables are named in the parameter list.

pg_restore -l pgdump.pgr
14663; 1255 16507 FUNCTION keith do_something(integer, text) postgres
48818; 0 0 COMMENT keith FUNCTION do_something(data_source_id integer, query text) postgres

Since looping back over the pg_restore -l list is how I find overloaded functions, ACLs and comments, it means it’s very hard to match those comments with those functions in all cases since PostgreSQL has variable types with spaces in them. If anyone has any solutions for this, I’d appreciate feedback. But this honestly seems like a bug to me since only the variable type in the parameter list is actually needed to uniquely identify a function. If it’s ok for the function definition itself, it should be ok for the associated comments as well.

As I said in the beginning, I will hopefully get Git integration done as well. Another option I’m working on right now is a filter for objects where a role has permissions on it, not just ownership (which there is an option for!). For example, dump all objects that the role keith has permission on. Maybe even more specific permissions.

I hope others will find this as useful as I have. It’s one of my first large projects released into the public, so I’d appreciate any constructive feedback on code quality and bugs found. I’ve enjoyed working on it and learned a great deal in the process.