Document Storage in PostgreSQL & Open Source Benefits

Update 2016-07-20: Since writing this post, I’ve gone back and updated pg_doc_store to take advantage of the new INSERT ON CONFLICT (upsert) feature in 9.5. So the extension is much more ready for possible production use if anyone finds it useful.

This past week I’ve had two experiences that show the amazing benefits of having your code be part of an open source community. The first involves my pg_partman extension. People have been asking me for quite some time about having a more generalized partitioning solution beyond just time/serial. I’ve resisted because that’s really not the focus of the tool; outside of those two types, partitioning is usually a once-and-done setup that only rarely needs further maintenance. It would also add quite a bit more complexity if I wanted to support, for example, everything that MySQL does (range, list, hash, key) and all the variations possible within those. I’ve already been feeling feature creep as it is, so I wasn’t in a hurry for this. But a user of pg_partman submitted some great work toward a generalized range partitioning setup, with hopes of possibly integrating it with pg_partman. I instead thought it would be better released as its own tool. You can find his repo here and give it a try if you’ve been needing that feature. (UPDATE: The Range Partitioning project has been discontinued since native partitioning was included in PG10+. Link removed since the repo is gone.)

I told that story to tell another similar story with the roles reversed. Around the same time I was testing out that range partitioning code, I came across this talk from Rob Conery about the new JSONB document storage features in PostgreSQL 9.4 (and also learned about his interesting Star Wars theory; possible Episode 7 spoiler alert).

I still hadn’t had any chance to put JSON to practical use in my own work, but the pg_docs_api he demos at the end caught my interest. And his note at the bottom of the README requesting a plpgsql version gave me something practical to work on while learning a bit more.

The closest I’ve come to actually using any NoSQL has been Redis as a caching frontend, and even then it was for a third-party tool where I don’t really look at the contents much. I did do the MongoDB tutorial program, so I’m familiar with how it basically works, but I’ve never used it in production. One thing I always thought was neat was how you can just throw schemaless data at it and it (mostly) stores it without argument, adding some extra metadata to make it more useful later. The pg_docs_api is the first attempt I’ve seen to bring something similar to PostgreSQL. And even if it turns out there’s something like this already out there and I’m duplicating work, or it doesn’t end up being very useful for others, it was still useful for me as a learning opportunity.

So I forked the repo and did the work, turning it into an extension as well, and submitted it back as a pull request. Turns out the author decided to do exactly what I did above: he suggested I release it as my own project since he felt it was distinct enough from his. So I give you pg_doc_store.

Just as a quick demo, there are some examples below. More details can be found in the docs. The original author looked to be replicating the create, save, & find commands in MongoDB, so I tried to keep that theme going. Just throw your json at it with the save_document() command and things should “just work”. The only caveat at this time is that it requires the UPSERT feature coming in PostgreSQL 9.5 to be truly atomic and avoid consistency errors. For now I just use an exception loop (sketched below), so beware of race conditions.
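
To be concrete about that exception loop, it’s roughly the classic UPDATE-then-INSERT retry pattern from the PostgreSQL documentation. Here’s a minimal sketch against the mycollection table used in the examples below, with a hypothetical function name; this is not the extension’s actual save_document() code:

CREATE OR REPLACE FUNCTION save_document_sketch(p_id uuid, p_body jsonb)
RETURNS void LANGUAGE plpgsql AS $$
BEGIN
    LOOP
        -- Try to update an existing document first.
        UPDATE mycollection SET body = p_body, updated_at = now() WHERE id = p_id;
        IF FOUND THEN
            RETURN;
        END IF;
        -- Not there yet, so try to insert it. Another session could insert the
        -- same id between our UPDATE and INSERT, hence the exception handler.
        BEGIN
            INSERT INTO mycollection (id, body) VALUES (p_id, p_body);
            RETURN;
        EXCEPTION WHEN unique_violation THEN
            -- Lost the race; loop back around and retry the UPDATE.
        END;
    END LOOP;
END;
$$;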

And while this functionality is currently all based on MongoDB, I don’t have any plans to restrict the interface or underlying structure to the way MongoDB does things. Like Rob found, it provides a good template for getting an API going, and it’s widely known, so its basic functionality isn’t so foreign.

The create_document() function can be thought of like the create command in MongoDB. The tablename given is your collection. All functions return a set, so you can either call them with just SELECT or do SELECT * FROM to get more formatted output.

keith=# select create_document('public.mycollection');
    create_document    
-----------------------
 (mycollection,public)
(1 row)

Time: 99.392 ms
keith=# drop table mycollection;
DROP TABLE
Time: 16.149 ms
keith=# select * from create_document('public.mycollection');
  tablename   | schemaname 
--------------+------------
 mycollection | public
(1 row)

Time: 12.085 ms

keith=# \d public.mycollection
                        Table "public.mycollection"
   Column   |           Type           |             Modifiers              
------------+--------------------------+------------------------------------
 id         | uuid                     | not null default gen_random_uuid()
 body       | jsonb                    | not null
 search     | tsvector                 | 
 created_at | timestamp with time zone | not null default now()
 updated_at | timestamp with time zone | not null default now()
Indexes:
    "mycollection_pkey" PRIMARY KEY, btree (id)
    "mycollection_body_idx" gin (body jsonb_path_ops)
    "mycollection_search_idx" gin (search)
Triggers:
    mycollection_trig BEFORE INSERT OR UPDATE OF body ON mycollection FOR EACH ROW EXECUTE PROCEDURE update_search()

The save_document() function is the main workhorse, just as the save command is in Mongo.

keith=# drop table mycollection;
DROP TABLE
Time: 5.846 ms

keith=# select save_document('public.mycollection', '{"FirstName": "Keith", "LastName":"Fiske", "Company":"OmniTI"}');
                                                 save_document                                                  
----------------------------------------------------------------------------------------------------------------
 {"id": "3225d06c-1f49-44e8-b32f-a29234a7a5fb", "Company": "OmniTI", "LastName": "Fiske", "FirstName": "Keith"}
(1 row)

Time: 29.499 ms
keith=# select * from mycollection ;
-[ RECORD 1 ]--------------------------------------------------------------------------------------------------------------
id         | 3225d06c-1f49-44e8-b32f-a29234a7a5fb
body       | {"id": "3225d06c-1f49-44e8-b32f-a29234a7a5fb", "Company": "OmniTI", "LastName": "Fiske", "FirstName": "Keith"}
search     | 'fisk':2 'keith':3 'omniti':1
created_at | 2015-10-29 19:00:43.413318-04
updated_at | 2015-10-29 19:00:43.413318-04

Time: 0.880 ms

You can see I dropped the table so it didn’t exist anymore. Just pass the tablename anyway, and it will create the table if it’s not already there. Notice it also added the primary key id value into the document itself. These are always kept in sync so you can update your document later. If you pass an id value that does not exist, this will insert it as a new row. Keep in mind the id value here is the uuid data type, so it must be a valid UUID and can’t just be any random number/string; it’s better to just let it be generated on its own for a new document. This function is where the UPSERT is really needed.
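
For reference, here’s roughly what that write becomes as a single atomic statement once 9.5’s INSERT ... ON CONFLICT is available (the upsert mentioned in the update note at the top). Just a sketch against the table above, not the extension’s actual code:

-- The extension also keeps the id key inside body in sync; omitted here for brevity.
INSERT INTO mycollection (id, body)
VALUES ('3225d06c-1f49-44e8-b32f-a29234a7a5fb',
        '{"FirstName": "Keith", "LastName": "Fiske", "Company": "OmniTI", "Job": "DBA"}')
ON CONFLICT (id) DO UPDATE
    SET body = EXCLUDED.body,
        updated_at = now();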

keith=# select save_document('public.mycollection', '{"id":"3225d06c-1f49-44e8-b32f-a29234a7a5fb", "FirstName": "Keith", "LastName":"Fiske", "Company":"OmniTI", "Job":"DBA"}');
                                                        save_document                                                         
------------------------------------------------------------------------------------------------------------------------------
 {"id": "3225d06c-1f49-44e8-b32f-a29234a7a5fb", "Job": "DBA", "Company": "OmniTI", "LastName": "Fiske", "FirstName": "Keith"}
(1 row)

Time: 5.129 ms
keith=# select * from mycollection ;
-[ RECORD 1 ]----------------------------------------------------------------------------------------------------------------------------
id         | 3225d06c-1f49-44e8-b32f-a29234a7a5fb
body       | {"id": "3225d06c-1f49-44e8-b32f-a29234a7a5fb", "Job": "DBA", "Company": "OmniTI", "LastName": "Fiske", "FirstName": "Keith"}
search     | 'dba':1 'fisk':3 'keith':4 'omniti':2
created_at | 2015-10-29 19:00:43.413318-04
updated_at | 2015-10-29 19:00:43.413318-04

Time: 1.091 ms

Here I insert another row and use the find_document() function to look for documents with a matching json key/value pair.

keith=# select save_document('public.mycollection', '{"FirstName": "Robert", "LastName":"Treat", "Company":"OmniTI", "Job":"CEO"}');
                                                         save_document                                                         
-------------------------------------------------------------------------------------------------------------------------------
 {"id": "ac7b12fe-0627-48e0-85fc-ef0e3c19a813", "Job": "CEO", "Company": "OmniTI", "LastName": "Treat", "FirstName": "Robert"}
(1 row)

Time: 5.767 ms

keith=# select find_document('public.mycollection', '{"Company":"OmniTI"}');
                                                         find_document                                                         
-------------------------------------------------------------------------------------------------------------------------------
 {"id": "3225d06c-1f49-44e8-b32f-a29234a7a5fb", "Job": "DBA", "Company": "OmniTI", "LastName": "Fiske", "FirstName": "Keith"}
 {"id": "ac7b12fe-0627-48e0-85fc-ef0e3c19a813", "Job": "CEO", "Company": "OmniTI", "LastName": "Treat", "FirstName": "Robert"}
(2 rows)

Time: 5.061 ms

By default the result is sorted by id; there are parameters to find_document() for sorting by a different key and order. There is also a search_document() function that tries to make use of full-text search. I’ve not really done much with full-text search before, and it seems pretty limited to me right now: it only populates the tsvector search column with data based on the document values, and a trigger keeps that up to date after every insert/update. I hope to learn more about this and make it more effective. Both of these methods take advantage of GIN indexing, so finding document data either way should be extremely fast and efficient. A rough sketch of how those two lookup paths work is below.
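
To give a rough idea of what’s happening under the hood, here’s a sketch of what a trigger function along the lines of update_search() might do, along with the two kinds of lookups those GIN indexes support. The function name and body here are illustrative guesses, not the extension’s actual code:

-- Build the tsvector from the document's values (skipping the id key).
CREATE OR REPLACE FUNCTION update_search_sketch() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    SELECT to_tsvector('english', coalesce(string_agg(value, ' '), ''))
      INTO NEW.search
      FROM jsonb_each_text(NEW.body)
     WHERE key <> 'id';
    RETURN NEW;
END;
$$;

-- Containment lookups like find_document() does can use the gin (body jsonb_path_ops) index ...
SELECT body FROM mycollection WHERE body @> '{"Company": "OmniTI"}';

-- ... and full-text lookups like search_document() does can use the gin (search) index.
SELECT body FROM mycollection WHERE search @@ plainto_tsquery('english', 'omniti');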

So that’s two different PostgreSQL projects taking advantage of others building off their ideas. Some other interesting news along the same theme: Greenplum has open sourced its PostgreSQL fork and looks to contribute some of its features back to the core system. Hopefully the potential patent issues can be resolved and we’ll see some great new features in future versions of Postgres.

So thank you to Rob for releasing your code under the BSD license and letting me do this!