Keith's Ramblings…

Archive for October, 2012

PostgreSQL Full Text Search – An Unexpected Use

with 5 comments

One of our clients recently asked for a way to manage a stopwords table in PostgreSQL so that they could try to shorten the URLs for the titles of their web pages. In all honesty, I had no idea what a stopwords list even was at the time the question was asked. The client was previously a MySQL user, so they’d sent me a link to its documentation on them as an example. Once I saw that it was related to full-text search, I immediately started reading up on PostgreSQL’s abilities with that.

The client’s original thinking was to just get a table of the words like the ones listed in the MySQL docs and use that to join against to shorten the URL. That seemed inefficient in the long run, so I started playing with some of the full-text search queries that PostgreSQL has. While you can add more to the internal dictionaries if you really need to, I just started using what was built in to see how it would work. Using all the words in the story’s title (Bob Woodward On John Boehner’s Refusal To Take Obama’s Call), the unchanged URL would be something like this

The client was hoping to just simplify it to something like

PostgreSQL turned out to be even better at this than the client was expecting

SELECT title, plainto_tsquery(title) AS slug FROM table WHERE id = 123;
-[ RECORD 1 ]------------------------------------------------------------------------
title | Bob Woodward On John Boehner's Refusal To Take Obama's Call
slug  | 'bob' & 'woodward' & 'john' & 'boehner' & 'refus' & 'take' & 'obama' & 'call'

Doing a little text formatting on that result gives something usable as a URL string

 SELECT title, replace(plainto_tsquery(title)::text, ''' & ''', '-') AS slug FROM table WHERE id = 123;
-[ RECORD 1 ]------------------------------------------------------
title | Bob Woodward On John Boehner's Refusal To Take Obama's Call
slug  | 'bob-woodward-john-boehner-refus-take-obama-call'

So, not only does it easily apply a stopwords filter with no additional maintenance required, it also reduces some words to their base formats to make the string even shorter. To avoid possible duplications in the reduced titles, the client also puts a short, unique identifier string on the end of the URL.

This was my first introduction to using the full-text search capabilities in PostgreSQL. Both the client and myself were pretty impressed with this. Probably not how people typically use it, but I thought it was an interesting use case to share.

Written by Keith

October 15th, 2012 at 12:47 pm

Posted in PostgreSQL

Tagged with ,

PostgreSQL Extension Developer Tips – Part 2

without comments

Following up on my previous tip, there’s a way you can make other extensions optional using the last tip’s method of finding another extension’s schema. A use case for this is my pg_jobmon extension.

First, you do the same check for the other extension’s schema

SELECT nspname INTO v_jobmon_schema FROM pg_namespace n, pg_extension e WHERE e.extname = 'pg_jobmon' AND e.extnamespace = n.oid;

Then, wherever you would optionally use your other extension, just check and see if the schema variable is null

IF v_jobmon_schema IS NOT NULL THEN
    v_job_id := add_job('PARTMAN CREATE FUNCTION: '||p_parent_table);
    v_step_id := add_step(v_job_id, 'Creating partition function for table '||p_parent_table);

If you want to see a working example, the partition management extension I’m working on does this. If pg_jobmon is installed, it just automatically does the logging. If it’s not there, all logging steps are simply ignored and the extension does only its core functionality.

You can also use this method to check for a specific version of another extension and allow you to support multiple versions of the other extension over time

SELECT replace(extversion,'.','')::int FROM pg_extension e WHERE e.extname = 'pg_jobmon';

IF v_jobmon_version >= 30 THEN
    ... do some stuff only in 0.3.0 or later...
    ... do stuff the pre 0.3.0 way...

Written by Keith

October 7th, 2012 at 10:44 pm

Posted in PostgreSQL

Tagged with , ,

PostgreSQL Extension Developer Tips – Part 1

without comments

I’ve been working on several extensions for PostgreSQL over the last few months. I was going to do one big blog post of some developer tips, but figured I’d just do this as an ongoing series as I come across something useful to share.

One of the first issues I came across was when my extension required another extension to work. I was designing my extension to be released to the public, so I couldn’t assume what schema someone may put those other extensions in. After some discussion with coworkers and at PGCon 2012 in the Royal Oak track, I managed to find a method to dynamically use any other extension’s schema.

I’ve been working on a per-table replication extension that requires both dblink and another extension I’ve blogged about called pg_jobmon. I can’t control where someone may install dblink and I didn’t set any requirements for pg_jobmon’s schema either. The first step is to grab what schema these extensions were installed to and save that for use later.

SELECT nspname INTO v_dblink_schema FROM pg_namespace n, pg_extension e WHERE e.extname = 'dblink' AND e.extnamespace = n.oid;
SELECT nspname INTO v_jobmon_schema FROM pg_namespace n, pg_extension e WHERE e.extname = 'pg_jobmon' AND e.extnamespace = n.oid;

Now, I could just use those variables and dynamically write EXECUTE statements where those schema names are needed for the rest of the function. I found this to be quite a pain when using pg_jobmon frequently to log different steps of a function’s execution. Also, using a lot of dynamic EXECUTE statements isn’t that great for efficiency. So, just for the running of this function, I change the search path, first saving the original one so I can restore it later.

SELECT current_setting('search_path') INTO v_old_search_path;
EXECUTE 'SELECT set_config(''search_path'',''@extschema@,'||v_jobmon_schema||','||v_dblink_schema||''',''false'')';

Note I added the extension’s own schema in there as well to make calling its own functions easier too. Then after the function body, I restore the search path to what it was originally

EXECUTE 'SELECT set_config(''search_path'','''||v_old_search_path||''',''false'')';

So, it’s still using EXECUTE statements, but at most it’s two instead of potentially many more. And it makes calling the functions in other schemas much easier since I can just use SELECT or PERFORM statements most of the time.

One issue I ran into with this is during exceptions. The exception block resets the path, so if you have to use any of the other extensions’ functions in the exception block you either have to write execute statements using the schema variables or reset the search path again the same way.

        EXECUTE 'SELECT set_config(''search_path'',''@extschema@,'||v_jobmon_schema||','||v_dblink_schema||''',''false'')';
       ... do stuff with other extensions...
        EXECUTE 'SELECT set_config(''search_path'','''||v_old_search_path||''',''false'')';

Now, if you look at the documentation for set_config, if you set the last parameter to ‘true’ instead of false, it should only apply the change to the current session. I was running into some weird edge cases if/when the function failed that would not reset the path properly and potentially cause other issues. So I just went with ‘false’ and manually resetting the path to ensure it doesn’t interfere with other queries in the session.

Of course this can be applied to function authoring in general for PostgreSQL. But extension development is where I’ve come across it the most so far since this is most likely where you’re going to be sharing code that cannot assume the destination’s environment.

Written by Keith

October 7th, 2012 at 12:57 pm

Posted in PostgreSQL

Tagged with , ,