User Documentation for Apache MADlib

fmsketch_dcount can be run on a column of any type. It returns an approximation to the number of distinct values (a la COUNT(DISTINCT x)), but faster and approximate. Like any aggregate, it can be combined with a GROUP BY clause to do distinct counts per group.


Get the number of distinct values in a designated column.

fmsketch_dcount( col_name )
This is a User Defined Aggregate which returns the results when used in a query. Use "CREATE TABLE AS ", with the UDA as subquery if the results are to be stored. This is unlike the usual MADlib stored procedure interface which places the results in a table instead of returning it.

  1. Generate some data.
    CREATE TABLE data(class INT, a1 INT);
    INSERT INTO data SELECT 1,1 FROM generate_series(1,10000);
    INSERT INTO data SELECT 1,2 FROM generate_series(1,15000);
    INSERT INTO data SELECT 1,3 FROM generate_series(1,10000);
    INSERT INTO data SELECT 2,5 FROM generate_series(1,1000);
    INSERT INTO data SELECT 2,6 FROM generate_series(1,1000);
  2. Find the distinct number of values for each class.
    SELECT class, fmsketch_dcount(a1)
    FROM data
    GROUP BY data.class;
    class | fmsketch_dcount
        2 |               2
        1 |               3
    (2 rows)

[1] P. Flajolet and N.G. Martin. Probabilistic counting algorithms for data base applications, Journal of Computer and System Sciences 31(2), pp 182-209, 1985. http://algo.inria.fr/flajolet/Publications/FlMa85.pdf

Related Topics
File sketch.sql_in documenting the SQL function.