MFVSketch: Most Frequent Values variant of CountMin sketch, implemented as a UDA.
Produces an n-bucket histogram for a column where each bucket counts one of the most frequent values in the column. The output is an array of doubles {value, count} in descending order of frequency; counts are approximated via CountMin sketches. Ties are handled arbitrarily.
The MFV frequent-value UDA comes in two different versions:
mfvsketch_top_histogram( col_name, n )
mfvsketch_quick_histogram( col_name, n )
In PostgreSQL the two UDAs are identical. In Greenplum, the quick version should produce good results unless the number of values requested is small, or the distribution is flat.
CREATE TABLE data(class INT, a1 INT); INSERT INTO data SELECT 1,1 FROM generate_series(1,10000); INSERT INTO data SELECT 1,2 FROM generate_series(1,15000); INSERT INTO data SELECT 1,3 FROM generate_series(1,10000); INSERT INTO data SELECT 2,5 FROM generate_series(1,1000); INSERT INTO data SELECT 2,6 FROM generate_series(1,1000);
SELECT mfvsketch_top_histogram( a1, 5 ) FROM data;Result:
mfvsketch_top_histogram ------------------------------------------------------------- [0:4][0:1]={{2,15000},{1,10000},{3,10000},{5,1000},{6,1000}} (1 row)
File sketch.sql_in documenting the SQL functions.
Module CountMin (Cormode-Muthukrishnan).