Keith's Ramblings…

Archive for the ‘pg_extractor’ tag

PG Extractor – Finally a version number…

without comments

So it’s been a while since I released pg_extractor and made my first blog post. It’s had quite a few improvements in that time, and actually made version 1.0.0! (currently at 1.2.0) Many thanks to Joby Walker and Heather Sherman for the new features and bug fixes!

The the most significant and the one I’m most thankful for getting done is

        Use a configuration file to list all the options you'd like to use. Each option is
        listed on its own line exactly as it would appear on the command line. This can be
        used in combination with command line options, but preference will be given to
        whichever is listed LAST on the command line. Also note that unlike other options
        here, there is NO equal sign between the option and the path to the options file.

The available options for this script just kept growing and growing and if you needed to use a significant amount of them, the command line got pretty crowded. This uses the Getopt::ArgvFile perl module, so please make sure you have that installed if you’re going to use this. It is an optional module, so the script will still work if you don’t have it and can’t use this option.

In the order things were added since my last blog post about adding Git support:

* Added support for getting roles from versions <=8.2 (thanks to Depesz)

        Adds DROP commands to the SQL output of all objects. Allows the same behavior as
        OR REPLACE since ACLs are included with all objects. WARNING: For overloaded
        function/aggregates, this adds drop commands for all versions to the single output
        export schema ddl
        breakout each schema's content into subdirectories under the database directory
        name of the directory under database name to place the export file with role data.
        no impact without the --getroles or --getall option.
        name of the directory under the database name directory to place the pg_dump file.
        has no impact without the --sqldump option

* Automatically add newly created folders to svn. This will make it easier when, for example, you might not be outputting views but once some are added then exported, they’ll automatically get added into svn.

        database name (replaces --dbname as the directory name); used as directory under
        --hostname (If you are extracting multiple databases this is unsafe)
        Modifies the function and view ddl files to replace CREATE with CREATE OR REPLACE.
        location of the required postgresql binaries (pg_dump, pg_restore, pg_dumpall). If
        these are all in the same location, you can use this single option instead of the
        individual --pgdump, --pgrestore and --pgdumpall options.

One of the biggest shortcomings with the tool that I’ve come across is with the exporting of individual table data. This is really noticeable on databases with large numbers of tables. In order to get all the table dependencies included in a single file (indexes, constraints, permissions, etc) I have to run the pg_dump command for each table back against the database again. Using pg_restore against the temporary dump file (as I do with all the other objects) with -t does not include all that table information in the output. And the object list that pg_restore -L outputs provides no link between the table and its dependencies. I got to do a Lightning Talk at PGCon 2012 this year and brought this shortcoming up. I was able to talk to Robert Haas briefly about it and I’m hopeful that now that the issue is known, this can be cleared up in future versions. One of my goals in making this app was to point out the shortcomings of pg_dump and help make the core app better. Who knows, maybe one of these days I’ll get familiar enough with C and the PostgreSQL source and be able to help improve it myself!

Written by Keith

July 2nd, 2012 at 10:47 am

Posted in PostgreSQL

Tagged with , ,

PG Extractor – A smarter pg_dump

without comments

UPDATE: This post still gets a lot of views due to search results for pg_extractor. Please note that the perl version this post discusses is no longer in development and a complete rewrite in Python was done for version 2.0. The version control features were removed as well. Please see the README file for more information.

For my debut blog post, I’ll be explaining a tool, pg_extractor, that I began working on at my current job. I’d like to first give a big thank you to depesz and the DBA team at OmniTI for their help in making this tool.

We had already had a tool in use for doing a one-file-per-object schema dump of PostgreSQL databases for use in sending to version control. It was originally designed with some complex queries to fetch the internal table, view & function schema, but with the release of 9.x, many of those queries broke with internal structure changes. It also didn’t account for partitioned tables with inheritance, overloaded functions and several other more advanced internal structures. I’d originally set out to just fix the schema queries to account for these changes and shortcomings but depesz mentioned his idea for a complete reworking of the tool using pg_dump/pg_restore instead of complex queries. This would hopefully keep it as future proof as possible and keep the output in a format that was better guaranteed to be a representation of how PostgreSQL sees its own object structure, no matter how it may change in the future.

SVN options have been integrated into the tool and I have plans on getting Git in as well. While the ideal way to track a database’s schema in VCS would be before you actually commit it to your production system, this was an easy way to just have the schema available for reference and automatically track changes outside of the normal development cycle.

Working on this tool has shown many shortcomings with the built in pg_dump/restore. One of my hopes with this project is to show what improvements can be made and possibly built in sometime in the future. One of the biggest is the lack of outputting of ACLs and Comments for anything but tables. And if you have many individual objects you’d like to export, formatting that for pg_dump can be quite tedious. And there is a lack of filtering for anything other than tables and schemas with pg_dump. The filter options list for pg_extractor has become quite long and will probably only continue to grow. Just do a –help or see the documentation on GitHub. I think one of the nicest features is the ability to use external files as input to the filter options and also filter with regex!
Probably best to move on to some examples instead of continuing this wall of text…

perl -U postgres --dbname=mydb --getall --sqldump

This is a very basic use of the tool using many assumed defaults. It uses the postgres database user, grabs all objects (tables, views, functions/aggregates, custom types, & roles), and also makes a permanent copy of the pg_dump file used to create the individual schema files. I tried to use options that were similar to, or exactly the same as, the options for pg_dump/restore. There’s only so many letters in the alphabet though, so please check the documentation to be sure. Just like pg_dump, it will also use the PG environment variables $PGDATABASE, $PGPORT, $PGUSER, $PGHOST, $PGPASSFILE, $PGCLIENTENCODING if they are set (they are actually used internally by the script as well if the associated options are used).

perl -U postgres --dbname=mydb --getfuncs --n=keith

This will extract only the functions from the “keith” schema. Any overloaded functions are put into the same file. I’d thought about trying to use the parameters to somehow make unique names for each version of an overloaded function, but I found this much easier for now, especially when going back and removing files if the –delete cleanup option is set.

perl -U postgres --dbname=mydb --getfuncs -p_file=/home/postgres/func_incl --n=dblink

This will extract only specifically named functions in the given filename. You must ensure that the full function signature is given with only the variable types for arguments. When using include files, it’s best to explicitly name the schemas that the objects in the file are in as well (it makes the temporary dump file that’s created smaller). The contents of any external file list are newline separated like so

dblink_exec(text, text)
dblink_exec(text, text, boolean)
dblink_exec(text, boolean)

One of the things I ensured was possible for external file lists was that you can use an object list created with psql when using the \t and \o options

postgres=# \t
Showing only tuples.
postgres=# \o table_filter
postgres=# select schemaname || '.' || tablename from pg_tables;

You could then use the “table_filter” file with the –t_file or –T_file options. I use this ability at work for a cronjob that generates a specific list of views I want exported and tracked in SVN. But that view list is not static and can change at any time. So the cronjob generates a new view filter file every day.

perl -U postgres --dbname=mydb --gettables -Fc --t_file=/home/postgres/tbl_incl --getdata

This extracts only the tables listed in the given filename along with the data in the pg_dump “custom” format. By default, pg_extractor’s output will all be in “plain” (-Fp) format since the intended purpose of the tool is to create output in a human-readable format.

perl -U postgres --dbname=mydb --pgdump=/opt/pgsql/bin/pg_dump --pgrestore=/opt/pgsql/bin/pg_restore --pgdumpall=/opt/pgsql/bin/pg_dumpall --getall --regex_excl_file=/home/postgres/part_exclude --N=schema1,schema2,schema3,schema4

This excludes objects (most likely table partitions) that have the pattern objectname_pYYYY_MM or fairly similar. Binaries are also not in the $PATH and it EXCLUDES several schemas.
part_exclude file contains: _p_?(20|19)\d\d(_?\d+)*$

perl -U postgres --dbname=mydb --svn --svncmd=/opt/svn/bin/svn --commitmsg="Weekly svn commit of postgres schema" --svndel

This is an example using svn. To keep the svn password out of system logs, the svn username & password have to be manually entered into the script file. It’s the only option that requires any manual editing of the source code. May see about having this as an option that points to an external file instead. The –svndel option cleans up any objects that are removed from the database from the svn repository.
Update: Writing up this blog post got me motivated to fix having to edit any source to get a certain option to work. There is now an –svn_userfile option where you can give it the path to a file containing the svn username and password to use. See the help for formatting. Just set permissions appropriately on this file for security and ensure the user running pg_extractor can read it.

One of the first fun issues I came across was handling objects that have special characters in their names. It’s rare, but since individual objects are all in their own files based on their object names, the OS doesn’t particularly like that, and caused some rather annoying errors every time the script ran. This is now handled by hex-encoding the special character and preceding that character with a comma in the file name.
For example: table|name becomes table,7cname.sql
This hopefully makes it easy to decode if needed by any other tools. It’s actually done internally in the tool when you want objects that were deleted to also have their files deleted on subsequent runs.

One issue that I still haven’t solved (and make note of in the source where it would possibly be handled) is that the signature for a function when you do a pg_restore -l is different for any comments that are also associated with that function if variables are named in the parameter list.

keith@pgsql:~$ pg_restore -l pgdump.pgr 
14663; 1255 16507 FUNCTION keith do_something(integer, text) postgres
48818; 0 0 COMMENT keith FUNCTION do_something(data_source_id integer, query text) postgres

Since looping back over the pg_restore -l list is how I find overloaded functions, ACLs and comments, it means it’s very hard to match those comments with those functions in all cases since PostgreSQL has variable types with spaces in them. If anyone has any solutions for this, I’d appreciate feedback. But this honestly seems like a bug to me since only the variable type in the parameter list is actually needed to uniquely identify a function. If it’s ok for the function definition itself, it should be ok for the associated comments as well.

As I said in the beginning, I will hopefully get Git integration done as well. Another option I’m working on right now is a filter for objects where a role has permissions on it, not just ownership (which there is an option for!). For example, dump all objects that the role “keith” has permission on. Maybe even more specific permissions.

I hope others will find this as useful as I have. It’s one of my first large projects released into the public, so I’d appreciate any constructive feedback on code quality and bugs found. I’ve enjoyed working on it and learned a great deal in the process.

Written by Keith

January 5th, 2012 at 1:39 pm

Posted in PostgreSQL

Tagged with , ,