Cardinality Estimators

A collection of methods to estimate the number of unique values contained in the data.

Sketches (sometimes called "synopsis data structures") are small randomized in-memory data structures that capture statistical properties of a large set of values (e.g., a column of a table). Sketches can be formed in a single pass of the data, and used to approximate a variety of descriptive statistics.

We implement sketches as SQL User-Defined Aggregates (UDAs). Because they are single-pass, small-space and parallelized, a single query can use many sketches to gather summary statistics on many columns of a table efficiently.

This module currently implements user-defined aggregates based on three main sketch methods:

*Count-Min (CM)*sketches, which can be used to approximate a number of descriptive statistics including`COUNT`

of rows whose column value matches a given value in a set`COUNT`

of rows whose column value falls in a range (*)- order statistics including
*median*and*centiles*(*) *histograms*: both*equi-width*and*equi-depth*(*)

*Flajolet-Martin (FM)*sketches for approximating`COUNT(DISTINCT)`

.*Most Frequent Value (MFV)*sketches, which output the most frequently-occuring values in a column, along with their associated counts.

*Note:* Features marked with a star (*) only work for discrete types that can be cast to int8.

The sketch methods consist of a number of SQL UDAs (user-defined aggregates) and UDFs (user-defined functions), to be used directly in SQL queries.

## Modules | |

CountMin (Cormode-Muthukrishnan) | |

Implements Cormode-Mathukrishnan CountMin sketches on integer values as a user-defined aggregate. | |

FM (Flajolet-Martin) | |

Implements Flajolet-Martin's distinct count estimation as a user-defined aggregate. | |

MFV (Most Frequent Values) | |

Implements the most frequent values variant of the CountMin sketch as a user-defined aggregate. | |