Thursday, January 28, 2016

ML based prediction with PostgreSQL & PL/R in four rounds - IV

Further digging into the PL/R documentation shows a way to run code at startup and make it globally available.

Round 4

First we need a table called plr_modules:
CREATE TABLE plr_modules (
  modseq int4,
  modsrc text
);
Then we add the necessary entries:

The SVM is now globally available and the predictor function can be reduced to the following:

Let's run this statement three times again:

select s.*, r_predict4(s.*) from generate_series(1,1000) s;

541 ms for the first run. 281 ms for each of the following two. Average: 368 ms.

That's only a 73% improvement compared to the original code. predict3() is faster then predict4().

Since the only difference is, that the mysvm object is now global, that might explain the difference. Maybe it is more expensive to work with global than local objects in R?

But there is another way, suggested by a reader of the first installment of this series.

Push the whole loop into R and wrap it into an SRF.


Let's see and run this statement three times again:

select * from r_predict5(array(select * from generate_series(1,1000)));

341 ms for the first run. 31 ms for each of the following two. Average: 134 ms.

Now that's a 90% improvement compared to the original code. Ten times faster.
If only the post initialization runs are taken into account it's even better: about 45x.

If you process sets anyway and you can live with passing arrays we have a winner here!

Bottom line:
  1. Do expensive initializations only once.
  2. Pre-can R objects that are immutable and expensive to create with saveRDS().
  3. If you process sets, push the whole loop down to R and stay there as long as possible.

ML based prediction with PostgreSQL & PL/R in four rounds - III

The predict2() function initializes the SVM only on first call which improves performance significantly. But it still needs to build the model from scratch.

If the training takes a comparatively long time or the training data cannot be provided along with the code, this is a problem.

Round 3

R has the ability to serialize objects to disk and read them back with saveRDS() and readRDS().

Having saved the SVM object like that, we can restore it from disk instead of rebuilding it each time.

Let's run this statement three times again:

select s.*, r_predict3(s.*) from generate_series(1,1000) s;

484 ms for the first run. 302 ms for each of the following two. Average: 363 ms.

That's a 75% improvement compared to the original code.

Still, the first call is more expensive than the subsequent ones.

Can we do better?

Wednesday, January 27, 2016

ML based prediction with PostgreSQL & PL/R in four rounds - II

The predict1() function from the first post of this series has a performance problem: The svm is trained every time the function is called. Can this be corrected?

Round 2:

Sifting through the PL/R documentation reveals a way to do expensive initializations only once and persist them between function calls.

This leads to the first optimized version of our predictor function:

Let's run this statement three times again:

select s.*, r_predict2(s.*) from generate_series(1,1000) s;

671 ms for the first run. 302 ms for each of the following two. Average: 425 ms.

That's a 60% improvement compared to the original code.

But we still need to provide the training data and run the training once.
What if we can't, because e.g. of sheer size, legal or intellectual property restrictions?

Can we do better?

ML based prediction with PostgreSQL & PL/R in four rounds - I

By means of PL/R,  PostgreSQL can execute R code inside the database server since 2008 or so (welcome MS SQL Server 2016. ;->).

I tried to teach PostgreSQL a new trick lately with the help of PL/R and here are the results...

Round 1:

Let's start with a very simple example of supervised machine learning, using a Support vector machine (SVM) from the e1071 package that I stole from here.

When you run this, the output is something like this

predclasses

ab
a40
b06

This is the confusion matrix of the SVM, and it tells us, that it predicted all classes correctly from the numerical input.

Which in this case is not surprising, because we cross validated with the very same data used for training. Usually you don't do this but split the data into a training and a validation set, but for the sake of brevity this model will do.

The model is now ready to predict classes from numerical input, like so:

And  this is already all we need for a naive implementation of a SVM based predictor function in PostgreSQL.

Let's try...

select r_predict1(7);

a

select r_predict1(2);

b

Whoa, it lives! :-) But what about performance? Let's run this statement three times:

select s.*, r_predict1(s.*) from generate_series(1,1000) s;

1.4 seconds for each run. Average: 1.4 s.

That's not exactly stellar.

Can we do better?

Monday, January 11, 2016

9.5 Feature: ALTER TABLE SET LOGGED / UNLOGGED

With PostgreSQL 9.5 you now can alternate tables between LOGGED and UNLOGGED mode on the fly:

CREATE TABLE demo (id INTEGER, val INTEGER); -- LOGGED

INSERT INTO demo (id,val) select s.*,s.* FROM generate_series(1,1000000) s;


-- Query returned successfully: 1000000 rows affected, 5.4 secs execution time.

TRUNCATE TABLE demo;

ALTER TABLE demo SET UNLOGGED; -- now UNLOGGED

INSERT INTO demo (id,val) select s.*,s.* FROM generate_series(1,1000000) s;


-- Query returned successfully: 1000000 rows affected, 515 msec execution time.

ALTER TABLE demo SET LOGGED;  -- and back again

Noted the INSERT speed difference between LOGGED and UNLOGGED?

Very convenient, e.g. when initially populating large tables and then switch to production mode.

So, 9.5 is not only jsonb, there is much more goodness inside... :-)

One reader pointed out that:

You haven't showed the time for "ALTER TABLE demo SET LOGGED" and haven't said that it gets AccessExclusiveLock on the table so you can't even read from it.

OK:

ALTER TABLE demo SET LOGGED;

-- Query returned successfully with no result in 3.9 secs.

So it takes some time. I still maintain my opinion that this is a useful feature given the fact, that it was impossible before 9.5 to do it in-place.

Especially for tasks that do not require constant availability like: "initially populating [sic] large tables and then switch to production mode".

Thursday, January 7, 2016

pgchem::tigress - bugfix release

I recently discovered a bug in the fingerprinter code of pgchem::tigress which led to the following undesirable behaviour:

select 'ClC1=CC=C(CN2CCN\C2=N\N(=O)=O)C=N1'::molecule = '[O-][N+](=O)N=C1NCCN1Cc1ccc(Cl)nc1'::molecule

true

but:

select fp2string('ClC1=CC=C(CN2CCN\C2=N\N(=O)=O)C=N1'::molecule) = fp2string('[O-][N+](=O)N=C1NCCN1Cc1ccc(Cl)nc1'::molecule)

false

I.e. the fingerprinter code regards the two differently drawn nitro groups as chemically different while the exact matcher correctly recognizes them as chemically equal. So, whenever an index is used for exact match searching, it was overselective, possibly filtering out correct hits.

This has been fixed.

Also, the add_hydrogens() function now accepts a fourth parameter to specify the PH value for protonation. If unspecified, it has a default of 7.4 so existing code using this function won't break.