Keith's Ramblings…

Archive for the ‘full-text search’ tag

PostgreSQL Full Text Search – An Unexpected Use

with 5 comments

One of our clients recently asked for a way to manage a stopwords table in PostgreSQL so that they could try to shorten the URLs for the titles of their web pages. In all honesty, I had no idea what a stopwords list even was at the time the question was asked. The client was previously a MySQL user, so they’d sent me a link to its documentation on them as an example. Once I saw that it was related to full-text search, I immediately started reading up on PostgreSQL’s abilities with that.

The client’s original thinking was to just get a table of the words like the ones listed in the MySQL docs and use that to join against to shorten the URL. That seemed inefficient in the long run, so I started playing with some of the full-text search queries that PostgreSQL has. While you can add more to the internal dictionaries if you really need to, I just started using what was built in to see how it would work. Using all the words in the story’s title (Bob Woodward On John Boehner’s Refusal To Take Obama’s Call), the unchanged URL would be something like this

The client was hoping to just simplify it to something like

PostgreSQL turned out to be even better at this than the client was expecting

SELECT title, plainto_tsquery(title) AS slug FROM table WHERE id = 123;
-[ RECORD 1 ]------------------------------------------------------------------------
title | Bob Woodward On John Boehner's Refusal To Take Obama's Call
slug  | 'bob' & 'woodward' & 'john' & 'boehner' & 'refus' & 'take' & 'obama' & 'call'

Doing a little text formatting on that result gives something usable as a URL string

 SELECT title, replace(plainto_tsquery(title)::text, ''' & ''', '-') AS slug FROM table WHERE id = 123;
-[ RECORD 1 ]------------------------------------------------------
title | Bob Woodward On John Boehner's Refusal To Take Obama's Call
slug  | 'bob-woodward-john-boehner-refus-take-obama-call'

So, not only does it easily apply a stopwords filter with no additional maintenance required, it also reduces some words to their base formats to make the string even shorter. To avoid possible duplications in the reduced titles, the client also puts a short, unique identifier string on the end of the URL.

This was my first introduction to using the full-text search capabilities in PostgreSQL. Both the client and myself were pretty impressed with this. Probably not how people typically use it, but I thought it was an interesting use case to share.

Written by Keith

October 15th, 2012 at 12:47 pm

Posted in PostgreSQL

Tagged with ,