2.1.0
User Documentation for Apache MADlib
CountMin (Cormode-Muthukrishnan)

This module implements Cormode-Muthukrishnan CountMin sketches on integer values, implemented as a user-defined aggregate. It also provides scalar functions over the sketches to produce approximate counts, order statistics, and histograms.

Syntax
  • Get a sketch of a selected column specified by col_name.
    cmsketch( col_name )
    
  • Get the number of rows where col_name = p, computed from the sketch obtained from cmsketch.
    cmsketch_count( cmsketch,
                    p )
    
  • Get the number of rows where col_name is between m and n inclusive.
    cmsketch_rangecount( cmsketch,
                         m,
                         n )
    
  • Get the kth percentile of col_name where count specifies number of rows. k should be an integer between 1 to 99.
    cmsketch_centile( cmsketch,
                      k,
                      count )
    
  • Get the median of col_name where count specifies number of rows. This is equivalent to cmsketch_centile(cmsketch,50,count).
    cmsketch_median( cmsketch,
                     count )
    
  • Get an n-bucket histogram for values between min and max for the column where each bucket has approximately the same width. The output is a text string containing triples {lo, hi, count} representing the buckets; counts are approximate.
    cmsketch_width_histogram( cmsketch,
                              min,
                              max,
                              n )
    
  • Get an n-bucket histogram for the column where each bucket has approximately the same count. The output is a text string containing triples {lo, hi, count} representing the buckets; counts are approximate. Note that an equi-depth histogram is equivalent to a spanning set of equi-spaced centiles.
    cmsketch_depth_histogram( cmsketch,
                              n )
    
Note
This is a User Defined Aggregate which returns the results when used in a query. Use "CREATE TABLE AS ", with the UDA as subquery if the results are to be stored. This is unlike the usual MADlib stored procedure interface which places the results in a table instead of returning it.

Examples
  1. Generate some data.
    CREATE TABLE data(class INT, a1 INT);
    INSERT INTO data SELECT 1,1 FROM generate_series(1,10000);
    INSERT INTO data SELECT 1,2 FROM generate_series(1,15000);
    INSERT INTO data SELECT 1,3 FROM generate_series(1,10000);
    INSERT INTO data SELECT 2,5 FROM generate_series(1,1000);
    INSERT INTO data SELECT 2,6 FROM generate_series(1,1000);
    
  2. Count number of rows where a1 = 2 in each class. Store results in a table.
    CREATE TABLE sketch_count AS
    SELECT class,
           cmsketch_count( cmsketch( a1 ), 2 )
    FROM data GROUP BY data.class;
    SELECT * FROM sketch_count;
    
    Result:
     class | cmsketch_count
     ------+----------------
         2 |              0
         1 |          15000
    (2 rows)
    
  3. Count number of rows where a1 is between 3 and 6.
    SELECT class,
           cmsketch_rangecount( cmsketch(a1), 3, 6 )
    FROM data GROUP BY data.class;
    
    Result:
     class | cmsketch_rangecount
     ------+---------------------
         2 |                2000
         1 |               10000
    (2 rows)
    
  4. Compute the 90th percentile of all of a1.
    SELECT cmsketch_centile( cmsketch( a1 ), 90, count(*) )
    FROM data;
    
    Result:
     cmsketch_centile
     -----------------
                    3
    (1 row)
    
  5. Produce an equi-width histogram with 2 bins between 0 and 10.
    SELECT cmsketch_width_histogram( cmsketch( a1 ), 0, 10, 2 )
    FROM data;
    
    Result:
          cmsketch_width_histogram
     -----------------------------------
     [[0L, 4L, 35000], [5L, 10L, 2000]]
    (1 row)
    
  6. Produce an equi-depth histogram of a1 with 2 bins of approximately equal depth.
    SELECT cmsketch_depth_histogram( cmsketch( a1 ), 2 )
    FROM data;
    
    Result:
                           cmsketch_depth_histogram
     ----------------------------------------------------------------------
     [[-9223372036854775807L, 1, 10000], [2, 9223372036854775807L, 27000]]
    (1 row)
    

Literature

[1] G. Cormode and S. Muthukrishnan. An improved data stream summary: The count-min sketch and its applications. LATIN 2004, J. Algorithm 55(1): 58-75 (2005) . http://dimacs.rutgers.edu/~graham/pubs/html/CormodeMuthukrishnan04CMLatin.html

[2] G. Cormode. Encyclopedia entry on 'Count-Min Sketch'. In L. Liu and M. T. Ozsu, editors, Encyclopedia of Database Systems, pages 511-516. Springer, 2009. http://dimacs.rutgers.edu/~graham/pubs/html/Cormode09b.html

File sketch.sql_in documenting the SQL functions.

Module grp_quantile for a different implementation of quantile function.