Collaboration diagram for Summary:

About:

'summary' is a generic function used to produce summary statistics of any data table. The function invokes particular 'methods' from the MADlib library to provide an overview of the data.

Usage:

The summary function can be invoked in the following way:

SELECT MADlib.summary
(
    source_table            TEXT,       -- Source table name (Required)
    output_table            TEXT,       -- Output table name (Required)
    target_cols             TEXT,       -- Comma separated columns for which summary is desired
                                        --      (Default: NULL - produces result for all columns)
    grouping_cols           TEXT,       -- Comma separated columns on which to group results
                                        --      (Default: NULL - provides summary on complete table)
    get_distinct            BOOLEAN,    -- Are distinct values required?
                                        --      (Default: True)
    get_quartiles           BOOLEAN,    -- Are quartiles required?
                                        --      (Default: True)
    ntile_array             FLOAT8[],   -- Array of quantile values to compute
                                        --      (Default: NULL - Quantile array not included)
    how_many_mfv            INTEGER,    -- How many most-frequent-values to compute?
                                        --      (Default: 10)
    get_estimates           BOOLEAN     -- Should we produce exact or estimated values?
)                                       --      (Default: True)

Note:

Currently, estimated values are only implemented for the distinct values computation.
The 'get_estimates' parameter controls computation for two statistics
- If 'get_estimates' is True then the distinct value computation is estimated. Further, the most frequent values computation is computed using a "quick and dirty" method that does parallel aggregation in GPDB at the expense of missing some of the most frequent values.
- If 'get_estimates' is False then the distinct values are computed in a slow but exact method. The most frequent values are computed using a faithful implementation that preserves the approximation guarantees of the Cormode/Muthukrishnan method (more information in MFV (Most Frequent Values))

The output of the function is a composite type containing: ouput_table TEXT, -- Name of the output table row_count INT4, -- Number of rows in the output table duration FLOAT8 -- Time taken (in seconds) to compute the summary

The summary stastics are stored in the 'output_table' relation provided in the arguments. The relation 'output_table' can contain the following table (presence of some columns depends on the argument values)

    - group_by_column       : Group-by column names (NULL if none provided)
    - group_by_value        : Values of the Group-by column (NULL if no grouping)
    - target_column         : Targeted column values for which summary is requested
    - column_number         : Physical column number for the target column, as described in pg_attribute
    - data_type             : Data type of target column. Standard GPDB descriptors will be displayed
    - row_count             : Number of rows for the target column
    - distinct_values       : Number of distinct values in the target column
    - missing_values        : Number of missing values in the target column
    - blank_values          : Number of blank values (blanks are defined by the regular expression '^\w*$')
    - fraction_missing      : Percentage of total rows that are missing. Will be expressed as a decimal (e.g. 0.3)
    - fraction_blank        : Percentage of total rows that are blank. Will be expressed as a decimal (e.g. 0.3)
    - mean                  : Mean value of target column (if target is numeric, else NULL)
    - variance              : Variance of target columns (if target is numeric, else NULL for strings)
    - min                   : Min value of target column (for strings this is the length of the shortest string)
    - max                   : Max value of target column (for strings this is the length of the longest string)
    - first_quartile        : First quartile (25th percentile, only for numeric columns)
    - median                : Median value of target column (if target is numeric, else NULL)
    - third_quartile        : Third quartile (25th percentile, only for numeric columns)
    - quantile_array        : Percentile values corresponding to ntile_array
    - most_frequent_values  : Most frequent values
    - mfv_frequencies       : Frequency of the most frequent values

The output can be obtained as

sql> SELECT * FROM 'output_table';

The usage information can be obtained at any time directly from the function using

sql> SELECT summary('usage');