Tuesday, August 18, 2015

So, what is TABLESAMPLE actually good for?

Q: So, what is TABLESAMPLE actually good for?

A: To speed things up that can be answered from a small sample instead of the whole population!

Like so:

CREATE TABLE public.ts_test
(
  id integer NOT NULL DEFAULT nextval('ts_test_id_seq'::regclass),
  val double precision,
  CONSTRAINT ts_test_pkey PRIMARY KEY (id)
)
WITH (
  OIDS=FALSE
);

Insert 100 million random values in the range 0.0 <= x < 1.0:

INSERT INTO public.ts_test (val)
SELECT
    random()
FROM
    generate_series(1, 100000000);

And now, some queries to get the standard deviation.

For the whole population:

SELECT round(stddev_pop(val)::numeric,5) FROM ts_test;

0.28869 in 15005 ms
Error: 0%
Speedup: 1x

For a sample of size 20000 (good for 1% error and 99% confidence):

SELECT round(stddev_samp(val)::numeric,5) FROM ts_test TABLESAMPLE BERNOULLI(0.02);

0.28867 in 2387 ms
Error: < 1%
Speedup: 6.3x

SELECT round(stddev_samp(val)::numeric,5) FROM ts_test TABLESAMPLE SYSTEM(0.02);

0.28791 in 25 ms
Error: < 1%
Speedup: 600x

With a table that large, both methods BERNOULLI and SYSTEM are accurate enough to give results well below the 1% error target.

(For potential problems with SYSTEM, see this post)

SYSTEM aces BERNOULLI with a speedup of 600x vs. 6.3x relative to a full population query.

And this is hard to beat...

Addendum: TABLESAMPLE only works on Tables and materialized Views!

No comments:

Post a Comment