Thursday, February 26, 2015

The Long Tail - vertical table partitioning I


DISCLAIMER: This is just an idea, I don't have tried this in a production environment!

Having said that, any input is welcome. :-)

Storing tons of rows in a table of which only a small percentage of rows are frequently queried is a common scenario for RDBMS, especially with databases that have to keep historical information just in case they might be audited, e.g. in environments regulated by law.

With the advent of the first affordable 8TB harddisk and fast SSDs still being much more expensive per GB, I wondered if such long-tailed tables could be split over a SSD holding the frequently accessed pages and a near-line storage HDD keeping the archive - elegantly - with PostgreSQL.

With elegant, I mean without fiddling around with VIEWs, INSTEAD OF triggers and exposing a clean and familiar interface to the developer.

OK, since PostgreSQL already supports horizontal partitioning, spreading one table transparently over many parallel tables, how about vertical partitioning, spreading one table over a hierarchy of speed?

The speed zones can be mapped to tablespaces:

CREATE TABLESPACE fast LOCATION '/mnt/fastdisk';
CREATE TABLESPACE slow LOCATION '/mnt/slowdisk';

Next comes the table(s):

CREATE TABLE the_table
(
  id integer NOT NULL,
  value real
)
WITH (
  OIDS=FALSE
)
TABLESPACE fast;


CREATE TABLE the_table_archive
(
)
INHERITS (the_table)
WITH (
  OIDS=FALSE
)
TABLESPACE slow;


Table inheritance in PostgreSQL is so cool...

And a function to move data from fast to slow:

CREATE OR REPLACE FUNCTION move_longtail()
  RETURNS boolean AS
$BODY$
DECLARE worked BOOLEAN;
DECLARE rowcount INTEGER;
BEGIN
worked := false;
rowcount := count(*) FROM ONLY the_table WHERE id >= 5000000;
IF (rowcount > 100000) THEN
INSERT INTO the_table_archive SELECT * FROM ONLY the_table WHERE id >= 5000000;
DELETE FROM ONLY the_table WHERE id >= 5000000;
worked := true;
END IF;
RETURN worked;
END;
$BODY$
  LANGUAGE plpgsql VOLATILE STRICT;


This function runs only if a minimum of movable rows qualify. This is recommended since SMR disks like large contiguous writes due to how SMR technically works.

Notice the ONLY keyword. This allows to control precisely which partition of the table is affected by the DML statement. Did I say already that table inheritance in PostgreSQL is so cool?

And basically that's it. All the developer sees is SELECT * FROM the _table; or something, being oblivious to the underlying machinery.

Some kind of automatic maintenance is missing. On to part II...

No comments:

Post a Comment