User Documentation
 All Files Functions Groups
CountMin (Cormode-Muthukrishnan)
+ Collaboration diagram for CountMin (Cormode-Muthukrishnan):
Warning
This MADlib method is still in early stage development. There may be some issues that will be addressed in a future version. Interface and implementation is subject to change.
About:
This module implements Cormode-Muthukrishnan CountMin sketches on integer values, implemented as a user-defined aggregate. It also provides scalar functions over the sketches to produce approximate counts, order statistics, and histograms.
Usage:
  • Get a sketch of a selected column specified by col_name.
    SELECT cmsketch(col_name) FROM table_name;
  • Get the number of rows where col_name = p, computed from the sketch obtained from cmsketch.
    SELECT cmsketch_count(cmsketch,p) FROM table_name;
  • Get the number of rows where col_name is between m and n inclusive.
    SELECT cmsketch_rangecount(cmsketch,m,n) FROM table_name;
  • Get the kth percentile of col_name where count specifies number of rows. k should be an integer between 1 to 99.
    SELECT cmsketch_centile(cmsketch,k,count) FROM table_name;
  • Get the median of col_name where count specifies number of rows. This is equivalent to cmsketch_centile(cmsketch,50,count).
    SELECT cmsketch_median(cmsketch,count) FROM table_name;
  • Get an n-bucket histogram for values between min and max for the column where each bucket has approximately the same width. The output is a text string containing triples {lo, hi, count} representing the buckets; counts are approximate.
    SELECT cmsketch_width_histogram(cmsketch,min,max,n) FROM table_name;
  • Get an n-bucket histogram for the column where each bucket has approximately the same count. The output is a text string containing triples {lo, hi, count} representing the buckets; counts are approximate. Note that an equi-depth histogram is equivalent to a spanning set of equi-spaced centiles.
    SELECT cmsketch_depth_histogram(cmsketch,n) FROM table_name;
Examples:
  1. Generate some data
    sql> CREATE TABLE data(class INT, a1 INT);
    sql> INSERT INTO data SELECT 1,1 FROM generate_series(1,10000);
    sql> INSERT INTO data SELECT 1,2 FROM generate_series(1,15000);
    sql> INSERT INTO data SELECT 1,3 FROM generate_series(1,10000);
    sql> INSERT INTO data SELECT 2,5 FROM generate_series(1,1000);
    sql> INSERT INTO data SELECT 2,6 FROM generate_series(1,1000);
    
  2. Count number of rows where a1 = 2 in each class
    sql> SELECT class,cmsketch_count(cmsketch(a1),2) FROM data GROUP BY data.class;
     class | cmsketch_count
    -------+----------------
         2 |              0
         1 |          15000
    (2 rows)
    
  3. Count number of rows where a1 is between 3 and 6
    sql> SELECT class,cmsketch_rangecount(cmsketch(a1),3,6) FROM data GROUP BY data.class;
     class | cmsketch_rangecount
    -------+---------------------
         2 |                2000
         1 |               10000
    (2 rows)
    
  4. Compute the 90th percentile of all of a1
    sql> SELECT cmsketch_centile(cmsketch(a1),90,count(*)) FROM data;
     cmsketch_centile
    ------------------
                    3
    (1 row)
    
  5. Produce an equi-width histogram with 2 bins between 0 and 10
    sql> SELECT cmsketch_width_histogram(cmsketch(a1),0,10,2) FROM data;
          cmsketch_width_histogram
    ------------------------------------
     [[0L, 4L, 35000], [5L, 10L, 2000]]
    (1 row)
    
  6. Produce an equi-depth histogram of a1 with 2 bins of approximately equal depth
    sql> SELECT cmsketch_depth_histogram(cmsketch(a1),2) FROM data;
                           cmsketch_depth_histogram
    -----------------------------------------------------------------------
     [[-9223372036854775807L, 1, 10000], [2, 9223372036854775807L, 27000]]
    (1 row)
    
Literature:

[1] G. Cormode and S. Muthukrishnan. An improved data stream summary: The count-min sketch and its applications. LATIN 2004, J. Algorithm 55(1): 58-75 (2005) . http://dimacs.rutgers.edu/~graham/pubs/html/CormodeMuthukrishnan04CMLatin.html

[2] G. Cormode. Encyclopedia entry on 'Count-Min Sketch'. In L. Liu and M. T. Ozsu, editors, Encyclopedia of Database Systems, pages 511-516. Springer, 2009. http://dimacs.rutgers.edu/~graham/pubs/html/Cormode09b.html

See Also
File sketch.sql_in documenting the SQL functions.

Module Quantile for a different implementation of quantile function.