Tuesday, October 11, 2016

pg_sentinel - Protecting data with sentinel values

According to OWASP, SQL injection is the #1 attack vector against web applications. Well at least it was in 2013, but a 2016 survey is under way. But since it also was #1 in 2010, it might make the triple.

But even if injection does not work, an amazing number of database servers can be discovered by Shodan - and it's not only MongoDB...

Quite a few years ago, I read about the idea to detect leaking data by using sentinel values. The concept is to put values into the data that are never queried and thus never emitted if a certain application uses the data as it should.
But if an attacker bypasses security and slurps all of the data, e.g. by SELECT * FROM sometable; the sentinels are emitted, detected, and countermeasures can be triggered.

As far as I can remember, the original idea was to put sensors on the network to detect the sentinels, but since then I always wondered if it could be applied to the PostgreSQL server itself.

Where to put the sensor then?

The application is certainly the wrong place, since the attacker might bypass it altogether. The driver isn't an option also, because the attacker might bring his own driver. Even a middleman like pgbouncer or pgpool-II could be bypassed, so the only way is to sink the detector as deep as possible into the heart of the server.

To cut a long story short, as proof-of-concept I chose to write a PostgreSQL module that hooks into ExecutorRun(): pg_sentinel.

Documentation comes with the module.

If you find things to improve, feel free to do so. After all, except for a minuscule contribution to pg_plan_filter, I had no previous experience programming modules and hooks and started eight hours ago from zero.

Friday, September 23, 2016

pgchem::tigress vs. OpenBabel 2.4.0

OpenBabel 2.4.0 is released, the build process worked flawlessly, 100% tests passed.

Now I have to see if the Tigress still likes it...

Well, pgchem::tigress builds against OpenBabel 2.4.0 without errors, but will it work?

Yes, pgchem::tigress works with OpenBabel 2.4.0 without any notable alterations, except changing:


in the Makefile to:


As always with .0 releases, subtle issues might be lurking below the surface, though.

Thursday, September 15, 2016

Blending gene sequence variation data into PostgreSQL

VCF, the Variant Call Format, is a clever idea. Instead of storing all genetic information of a sampled genome, it only stores the delta against some reference genome. This squeezes a lot of redundancy out of the data and thus occupies a lot less storage space.

Unfortunately, VCF is also a unwieldy format. Only a part is fixed, with metadata in the header describing fields in the body which in turn describe the actual data format of the samples.

This makes VCF especially hard to grasp for systems like Hadoop that work on chunked files spread over many compute nodes. With VCF, every chunk has to carry a copy of the header to make sense of the data it carries.

Formats like ADAM are under development that tackle this (any many more) problems, but when I was given the task to make VCF files accessible to a database last year, I took the direct route:

Since there have been already quite powerful tools and libraries to work with VCF files around for some time, I used Multicorn and PyVCF and wrote a foreign data wrapper for PostgreSQL that understands the VCF format.

After it was finished, I realized that it had been done before...

However, my implementation is different in some parts. Most notably it abstracts more from the actual storage of the VCF files and it works with vanilla PyVCF and does not need any special modifications.

A few days ago, I was granted permission to release the code into the wild by the people who paid for it in the first place, so now you have the freedom of choice. ;-)

Ladies and Gentlemen, please take a look at just another multicorn based foreign data wrapper for VCF files for PostgreSQL, the one and only truly integrarelational DBMS.

Thursday, August 25, 2016

Timeseries: How long can the elephant remember?

Frankly, I don't know where the practical limit for the number of rows in a single PostgreSQL table is from experience, but the interwebs seems to agree on 10^9 for narrow tables.

After a lively discussion with a NoSQL afficionado yesterday about the (in)ability to effectively store timeseries data in a RDBMS I made a quick calculation.

Timeseries data is usually a triple of the form key timestamp value, so it can be stored in a pretty narrow table, hence I stick to the 10^9 rows limit.

If we get a data point every second, we can store 10^9 seconds worth of data. 10^9 seconds is 16666666.6667 minutes, which is 277777.777778 hours, which is 11574.0740741 days, which is good for about 31 years of recording.

Every second of 31 years. Per table.

Thursday, August 18, 2016

Hexastores are easy

Did you know that you can make a Hexastore from a RDF triple in just one line of SQL? (This needs PostgreSQL 9.4 or better, because of the multi-array unnest)

    IN sub text,
    IN pred text,
    IN obj text)
  RETURNS TABLE(ord text, a text, b text, c text) AS
$$select A.t || B.t || C.t as ord, A.v, B.v, C.v from (select * from unnest(ARRAY[sub, pred, obj],ARRAY['s', 'p', 'o'])) as A(v, t) cross join (select * from unnest(ARRAY[sub, pred, obj],ARRAY['s', 'p', 'o'])) as B(v, t) cross join (select * from unnest(ARRAY[sub, pred, obj],ARRAY['s', 'p', 'o'])) as C(v, t) where a.v != b.v and a.v != c.v and b.v != c.v order by ord desc$$
  COST 100
  ROWS 6;

SELECT * FROM hexify('subject','predicate','object');

Sometimes, PostgreSQL SQL is just awesome...

More on Hexastores here and here.

Thursday, June 23, 2016

When 'good enough' is good enough - approximate answers with PostgreSQL 9.4999

Approximation in databases seems to be an alien concept at first. But if your application can deal with a known and controllable degree of error, it can help even in cases where conventional tuning is not an option for whatever reason.

Approximation is not evil 


One of the top requirements for database systems is reliability. Whether you run a big bank or a small retail business, you don't want to lose a cent here or there or charge your customer twice for the Pink Fluffy Unicorn he just bought, just because the DBMS gave a wrong answer. Classic OLTP operations have to be always 100% correct.

However, for the case of analytics, things become different. In some cases, it can be desirable to trade a bit of accuracy for a lot of speed. This is called approximation and to many database people (and users), the concept of accepting results with less than 100% accuracy seems strange at first.

But if you know - and can control - the error introduced by approximation, it is not. It can even be very useful, if a 95% accurate answer now is worth more than a 100% accurate answer tomorrow.

Welcome to approximation in PostgreSQL 9.5.

Approximating queries


Approximate queries work on subsets, called samples, of the whole data set, called the population.
If the sampling is done statistically correct, a sample much smaller than the whole population gives answers close to the real answer within a known error range.

A possible application for the hypothetical retail business would be to find which product is currently trending.
Instead of knowing that exactly 500, 1000 and 2000 Pink Fluffy Unicorns were sold in the last three weeks, knowing that 498, 1001 and 1999 Pink Fluffy Unicorns were sold in the last three weeks with let's say 5% error tells the procurement people that Pink Fluffy Unicorns are a trending product just as fine as the exact numbers. Only, they might have to wait a few seconds for the answer instead of a few hours...

PostgreSQL 9.5 has built-in support for approximate queries. Because I'm lazy and already wrote about this I just point to the corresponding post.

Still, all the population data has to be there for approximate queries to work. How about running queries without storing the underlying data at all?


Approximating data structures 


If PostgreSQL has a weakness, it's the comparably poor performance of count() and distinct.
Due to the lock-free multiversion concurrency design of PostgreSQL, count() has to touch each row in a table to check whether it is visible in the current transaction or not. Unlike locking DBMS like Oracle, it can only use an index to count in a few cases . Full table scan.

Distinct always has to sort the table. It can use an index, but only covering indexes, and the larger the index is compared to the table, the less likely PostgreSQL will use it. Sorting can be tuned by raising work_mem, but since this is a per session parameter, it is limited by available RAM.

So count(distinct) is like the worst of both worlds (In the following example distinct alone is slower, because it has to return ten million rows to the client, count(distinct) returns only one value).
Like here (times are w/o Index / w Index):
create table hmr(id serial, value real);

insert into hmr (value) select random()*10000000 from generate_series(1,10000000);

select count (value) from hmr; --826 msec. / 817 msec.

select distinct value from hmr; --33917 msec. / 32420 msec.

select count (distinct value) from hmr; -- 9665 msec. / 9439 msec.

Enter the HyperLogLog cardinality estimator. Some clever people at Google observed, that the cardinality of a multiset of evenly distributed random numbers can be predicted by finding the maximum number of leading zeroes in the binary representation of those numbers: For a maximum of k leading zeroes, the cardinality is 2^k.

HyperLogLog uses a hash function to transform arbitrary input values into such random numbers and thus allows to estimate the cardinality of an input multiset for cardinalities > 10^9 with a 2-3% error, using only 1280 bytes of storage

PostgreSQL has a HyperLogLog extension, hll.
create extension hll;

CREATE TABLE cardinality (
            id      integer,
            set     hll

INSERT INTO cardinality(id, set)
    SELECT 1, (select hll_add_agg(hll_hash_any(value))
    FROM hmr); -- 2267 msec.

SELECT hll_cardinality(set)::int FROM cardinality WHERE id = 1; -- 11 msec.
Since count distinct(value) = 8187749 and hll_cardinality = 8470057, the error is ~3%

Another, not so PostgreSQL specific example would be a database that has a stream table, e.g. holding only one hour worth of events at any given point in time. I showed how to do this with stock PostgreSQL and a bit of Java here and here.

If you also want to know, how many distinct events that stream has seen in total, it's impossible, unless you store all distinct values and update their counts every time a new event arrives. But then, you might end up in storing all events - which is not what you wanted in the first place if you chose to use a stream table.

With HyperLogLog it's easy. Update your HyperLogLog estimator on every new event and you get a good approximation how many distinct values the stream has seen in total.

Approximating indexes


9.5 introduced BRIN indexes for very large tables. Unlike e.g. a btree, BRIN stores only ranges of values and points to the physical pages where a value that falls into that range could possibly be found.

A BRIN index thus only gives precise answers to the question where a certain value could not be found on disk.

9.6 will have Bloom-Filter indexes as an extension. Bloom filters can tell you that a value does not exist in a set with perfect accuracy. But the question if a value exists in the set can only be answered with a probability that increases with the collision resilience of the underlying hash.

So, as BRIN and Bloom indexes both are approximating indexes, every index hit has to be rechecked by the DBMS against the underlying data. But if you know their limitations and use them accordingly, they too can speed up your queries quite a bit.

Monday, March 14, 2016


According to the access statistics, my blog has now more readers from the Czech Republic than from the U.S.A. which had the lead for the last few years.