MADlib  1.4.1
User Documentation
 All Files Functions Variables Groups
CountMin (Cormode-Muthukrishnan)
Warning
This MADlib method is still in early stage development. There may be some issues that will be addressed in a future version. Interface and implementation is subject to change.

This module implements Cormode-Muthukrishnan CountMin sketches on integer values, implemented as a user-defined aggregate. It also provides scalar functions over the sketches to produce approximate counts, order statistics, and histograms.

Syntax
  • Get a sketch of a selected column specified by col_name.
    cmsketch( col_name )
    
  • Get the number of rows where col_name = p, computed from the sketch obtained from cmsketch.
    cmsketch_count( cmsketch,
                    p
                  )
    
  • Get the number of rows where col_name is between m and n inclusive. <pre class="syntax" cmsketch_rangecount( cmsketch, m, n )
  • Get the kth percentile of col_name where count specifies number of rows. k should be an integer between 1 to 99.
    cmsketch_centile( cmsketch,
                      k,
                      count
                    )
    
  • Get the median of col_name where count specifies number of rows. This is equivalent to cmsketch_centile(cmsketch,50,count).
    cmsketch_median( cmsketch,
                     count
                   )
    
  • Get an n-bucket histogram for values between min and max for the column where each bucket has approximately the same width. The output is a text string containing triples {lo, hi, count} representing the buckets; counts are approximate.
    cmsketch_width_histogram( cmsketch,
                              min,
                              max,
                              n
                            )
    
  • Get an n-bucket histogram for the column where each bucket has approximately the same count. The output is a text string containing triples {lo, hi, count} representing the buckets; counts are approximate. Note that an equi-depth histogram is equivalent to a spanning set of equi-spaced centiles.
    cmsketch_depth_histogram( cmsketch,
                              n
                            )
    

Examples
  1. Generate some data.
    CREATE TABLE data(class INT, a1 INT);
    INSERT INTO data SELECT 1,1 FROM generate_series(1,10000);
    INSERT INTO data SELECT 1,2 FROM generate_series(1,15000);
    INSERT INTO data SELECT 1,3 FROM generate_series(1,10000);
    INSERT INTO data SELECT 2,5 FROM generate_series(1,1000);
    INSERT INTO data SELECT 2,6 FROM generate_series(1,1000);
    
  2. Count number of rows where a1 = 2 in each class.
    SELECT class, 
           cmsketch_count( 
                           cmsketch( a1 ),
                           2
                          ) 
    FROM data GROUP BY data.class;
    
    Result:
     class | cmsketch_count
     ------+----------------
         2 |              0
         1 |          15000
    (2 rows)
    
  3. Count number of rows where a1 is between 3 and 6.
    SELECT class, 
           cmsketch_rangecount(
                                cmsketch(a1),
                                3,
                                6
                              ) 
    FROM data GROUP BY data.class;
    
    Result:
     class | cmsketch_rangecount
     ------+---------------------
         2 |                2000
         1 |               10000
    (2 rows)
    
  4. Compute the 90th percentile of all of a1.
    SELECT cmsketch_centile(
                             cmsketch( a1 ),
                             90,
                             count(*)
                           ) 
    FROM data;
    
    Result:
     cmsketch_centile
     -----------------
                    3
    (1 row)
    
  5. Produce an equi-width histogram with 2 bins between 0 and 10.
    SELECT cmsketch_width_histogram(
                                     cmsketch( a1 ),
                                     0,
                                     10,
                                     2
                                   )
    FROM data;
    
    Result:
          cmsketch_width_histogram
     -----------------------------------
     [[0L, 4L, 35000], [5L, 10L, 2000]]
    (1 row)
    
  6. Produce an equi-depth histogram of a1 with 2 bins of approximately equal depth.
    SELECT cmsketch_depth_histogram(
                                     cmsketch( a1 ),
                                     2
                                   ) 
    FROM data;
    
    Result:
                           cmsketch_depth_histogram
     ----------------------------------------------------------------------
     [[-9223372036854775807L, 1, 10000], [2, 9223372036854775807L, 27000]]
    (1 row)
    

Literature

[1] G. Cormode and S. Muthukrishnan. An improved data stream summary: The count-min sketch and its applications. LATIN 2004, J. Algorithm 55(1): 58-75 (2005) . http://dimacs.rutgers.edu/~graham/pubs/html/CormodeMuthukrishnan04CMLatin.html

[2] G. Cormode. Encyclopedia entry on 'Count-Min Sketch'. In L. Liu and M. T. Ozsu, editors, Encyclopedia of Database Systems, pages 511-516. Springer, 2009. http://dimacs.rutgers.edu/~graham/pubs/html/Cormode09b.html

File sketch.sql_in documenting the SQL functions.

Module Quantile for a different implementation of quantile function.