User Documentation
summary.sql_in
Go to the documentation of this file.
00001 /* ----------------------------------------------------------------------- *//**
00002  *
00003  * @file summary.sql_in
00004  *
00005  * @brief Summary function for descriptive statistics
00006  * @date Mar 2013
00007  *
00008  *//* ------------------------------------------------------------------------*/
00009 
00010 m4_include(`SQLCommon.m4')
00011 
00012 /**
00013 
00014 @addtogroup grp_summary
00015 
00016 @about
00017 
00018 'summary' is a generic function used to produce summary statistics of any data 
00019 table.  The function invokes particular 'methods' from the MADlib library to 
00020 provide an overview of the data. 
00021 
00022 @usage
00023 The summary function can be invoked in the following way:
00024 @verbatim
00025 SELECT MADLIB_SCHEMA.summary
00026 (
00027     source_table            TEXT,       -- Source table name (Required)
00028     output_table            TEXT,       -- Output table name (Required)
00029     target_cols             TEXT,       -- Comma separated columns for which summary is desired
00030                                         --      (Default: NULL - produces result for all columns)
00031     grouping_cols           TEXT,       -- Comma separated columns on which to group results
00032                                         --      (Default: NULL - provides summary on complete table)
00033     get_distinct            BOOLEAN,    -- Are distinct values required?
00034                                         --      (Default: True)
00035     get_quartiles           BOOLEAN,    -- Are quartiles required?
00036                                         --      (Default: True)
00037     ntile_array             FLOAT8[],   -- Array of quantile values to compute
00038                                         --      (Default: NULL - Quantile array not included)
00039     how_many_mfv            INTEGER,    -- How many most-frequent-values to compute?
00040                                         --      (Default: 10)
00041     get_estimates           BOOLEAN     -- Should we produce exact or estimated values?
00042 )                                       --      (Default: True)
00043 @endverbatim
00044 
00045 Note:
00046 - Currently, estimated values are only implemented for the distinct values 
00047 computation. 
00048 - The '<em>get_estimates</em>' parameter controls computation for two statistics
00049     - If '<em>get_estimates</em>' is True then the distinct value computation is
00050         estimated. Further, the most frequent values computation is computed using 
00051         a "quick and dirty" method that does parallel aggregation in GPDB
00052         at the expense of missing some of the most frequent values.
00053     - If '<em>get_estimates</em>' is False then the distinct values are computed
00054      in a slow but exact method. The most frequent values are computed using a
00055      faithful implementation that preserves the approximation guarantees of 
00056      the Cormode/Muthukrishnan method (more information in \ref grp_mfvsketch) 
00057 
00058 
00059 The output of the function is a composite type containing: 
00060     ouput_table             TEXT,       -- Name of the output table
00061     row_count               INT4,       -- Number of rows in the output table
00062     duration                FLOAT8      -- Time taken (in seconds) to compute the summary 
00063 
00064 The summary stastics are stored in the 'output_table' relation provided in the
00065 arguments. The relation 'output_table' can contain the following table
00066 (presence of some columns depends on the argument values)
00067 @verbatim
00068     - group_by_column       : Group-by column names (NULL if none provided)
00069     - group_by_value        : Values of the Group-by column (NULL if no grouping)
00070     - target_column         : Targeted column values for which summary is requested
00071     - column_number         : Physical column number for the target column, as described in pg_attribute
00072     - data_type             : Data type of target column. Standard GPDB descriptors will be displayed
00073     - row_count             : Number of rows for the target column
00074     - distinct_values       : Number of distinct values in the target column
00075     - missing_values        : Number of missing values in the target column
00076     - blank_values          : Number of blank values (blanks are defined by the regular expression '^\w*$')
00077     - fraction_missing      : Percentage of total rows that are missing. Will be expressed as a decimal (e.g. 0.3)
00078     - fraction_blank        : Percentage of total rows that are blank. Will be expressed as a decimal (e.g. 0.3)
00079     - mean                  : Mean value of target column (if target is numeric, else NULL)
00080     - variance              : Variance of target columns (if target is numeric, else NULL for strings)
00081     - min                   : Min value of target column (for strings this is the length of the shortest string)
00082     - max                   : Max value of target column (for strings this is the length of the longest string)
00083     - first_quartile        : First quartile (25th percentile, only for numeric columns)
00084     - median                : Median value of target column (if target is numeric, else NULL)
00085     - third_quartile        : Third quartile (25th percentile, only for numeric columns)
00086     - quantile_array        : Percentile values corresponding to ntile_array
00087     - most_frequent_values  : Most frequent values
00088     - mfv_frequencies       : Frequency of the most frequent values 
00089 @endverbatim
00090 
00091 The output can be obtained as
00092 @verbatim
00093 sql> SELECT * FROM 'output_table';
00094 @endverbatim
00095 
00096 The usage information can be obtained at any time directly from the
00097 function using
00098 @verbatim
00099 sql> SELECT summary('usage');
00100 @endverbatim
00101 
00102 */
00103 
00104 DROP TYPE IF EXISTS MADLIB_SCHEMA.summary_result;
00105 CREATE TYPE MADLIB_SCHEMA.summary_result AS
00106 (
00107     ouputtable      TEXT,
00108     row_count       INT4,
00109     duration        FLOAT8
00110 );
00111 
00112 
00113 -----------------------------------------------------------------------
00114 -- Main function for summary
00115 -----------------------------------------------------------------------
00116 /*
00117  * @brief Compute a summary statistics on a table with optional grouping support
00118  * 
00119  * @param source_table      Name of source relation containing the data
00120  * @param output_table      Name of output table name to store the summary
00121  * @param target_cols       String with comma separated list of columns on which summary is desired
00122  * @param grouping_cols     String with comma separated list of columns on which to group the data by
00123  * @param get_distinct      Should distinct values count be included in result
00124  * @param get_quartiles     Should first, second (median), and third quartiles be included in result
00125  * @param ntile_array       Array of percentiles to compute
00126  * @param how_many_mfv      How many most frequent values to compute?
00127  * @param get_estimates     Should distinct counts be an estimated (faster) or exact count?
00128  *
00129  * @usage
00130  * 
00131  * <pre> SELECT MADLIB_SCHEMA.summary (
00132  *       '<em>source_table</em>', '<em>output_table</em>', 
00133  *       '<em>target_cols</em>', '<em>grouping_cols</em>',                            
00134  *       '<em>get_distinct</em>', '<em>get_quartiles</em>',
00135  *       '<em>ntile_array</em>', '<em>how_many_mfv</em>',
00136  *       '<em>get_estimates</em>'          
00137  *   );
00138  *   SELECT * FROM '<em>output_table</em>'
00139  *  </pre>                                       
00140  */
00141 CREATE OR REPLACE FUNCTION
00142 MADLIB_SCHEMA.summary
00143 (
00144     source_table            TEXT,       -- source table name
00145     output_table            TEXT,       -- output table name
00146     target_cols             TEXT,       -- comma separated list of output cols
00147     grouping_cols           TEXT,       -- comma separated names of grouping cols
00148     get_distinct            BOOLEAN,    -- Are distinct values required
00149     get_quartiles           BOOLEAN,    -- Are quartiles required
00150     ntile_array             FLOAT8[],   -- Array of quantiles to compute
00151     how_many_mfv            INTEGER,    -- How many most frequent values to compute?
00152     get_estimates           BOOLEAN     -- Should we produce exact or estimated
00153                                         --      values for distinct computation
00154 )
00155 RETURNS MADLIB_SCHEMA.summary_result AS $$
00156     PythonFunctionBodyOnly(`summary', `summary')
00157     return summary.summary(
00158         schema_madlib, source_table, output_table, target_cols, grouping_cols,
00159         get_distinct, get_quartiles, ntile_array, how_many_mfv, get_estimates)
00160 $$ LANGUAGE plpythonu;
00161 
00162 -----------------------------------------------------------------------
00163 --- Overloaded functions to support optional parameters
00164 -----------------------------------------------------------------------
00165 CREATE OR REPLACE FUNCTION
00166 MADLIB_SCHEMA.summary
00167 (
00168     source_table            TEXT,
00169     output_table            TEXT,
00170     target_cols             TEXT,
00171     grouping_cols           TEXT,
00172     get_distinct            BOOLEAN,
00173     get_quartiles           BOOLEAN,
00174     ntile_array             FLOAT8[],
00175     how_many_mfv            INTEGER
00176 )
00177 RETURNS MADLIB_SCHEMA.summary_result AS $$
00178     SELECT MADLIB_SCHEMA.summary(
00179         $1, $2, $3, $4, $5, $6, $7, $8, True)
00180 $$ LANGUAGE sql;
00181 
00182 CREATE OR REPLACE FUNCTION
00183 MADLIB_SCHEMA.summary
00184 (
00185     source_table            TEXT,
00186     output_table            TEXT,
00187     target_cols             TEXT,
00188     grouping_cols           TEXT,
00189     get_distinct            BOOLEAN,
00190     get_quartiles           BOOLEAN,
00191     ntile_array             FLOAT8[]
00192 )
00193 RETURNS MADLIB_SCHEMA.summary_result AS $$
00194     SELECT MADLIB_SCHEMA.summary(
00195         $1, $2, $3, $4, $5, $6, $7, 10, True)
00196 $$ LANGUAGE sql;
00197 
00198 CREATE OR REPLACE FUNCTION
00199 MADLIB_SCHEMA.summary
00200 (
00201     source_table            TEXT,
00202     output_table            TEXT,
00203     target_cols             TEXT,
00204     grouping_cols           TEXT,
00205     get_distinct            BOOLEAN,
00206     get_quartiles           BOOLEAN
00207 )
00208 RETURNS MADLIB_SCHEMA.summary_result AS $$
00209     SELECT MADLIB_SCHEMA.summary(
00210         $1, $2, $3, $4, $5, $6, NULL, 10, True)
00211 $$ LANGUAGE sql;
00212 
00213 CREATE OR REPLACE FUNCTION
00214 MADLIB_SCHEMA.summary
00215 (
00216     source_table            TEXT,
00217     output_table            TEXT,
00218     target_cols             TEXT,
00219     grouping_cols           TEXT,
00220     get_distinct            BOOLEAN
00221 )
00222 RETURNS MADLIB_SCHEMA.summary_result AS $$
00223     SELECT MADLIB_SCHEMA.summary(
00224         $1, $2, $3, $4, $5, True, NULL, 10, True)
00225 $$ LANGUAGE sql;
00226 
00227 CREATE OR REPLACE FUNCTION
00228 MADLIB_SCHEMA.summary
00229 (
00230     source_table            TEXT,
00231     output_table            TEXT,
00232     target_cols             TEXT,
00233     grouping_cols           TEXT
00234 )
00235 RETURNS MADLIB_SCHEMA.summary_result AS $$
00236     SELECT MADLIB_SCHEMA.summary(
00237         $1, $2, $3, $4, True, True, NULL, 10, True)
00238 $$ LANGUAGE sql;
00239 
00240 CREATE OR REPLACE FUNCTION
00241 MADLIB_SCHEMA.summary
00242 (
00243     source_table            TEXT,
00244     output_table            TEXT,
00245     target_cols             TEXT
00246 )
00247 RETURNS MADLIB_SCHEMA.summary_result AS $$
00248     SELECT MADLIB_SCHEMA.summary(
00249         $1, $2, $3, NULL, True, True, NULL, 10, True)
00250 $$ LANGUAGE sql;
00251 
00252 CREATE OR REPLACE FUNCTION
00253 MADLIB_SCHEMA.summary
00254 (
00255     source_table            TEXT,
00256     output_table            TEXT
00257 )
00258 RETURNS MADLIB_SCHEMA.summary_result AS $$
00259     SELECT MADLIB_SCHEMA.summary(
00260         $1, $2, NULL, NULL, True, True, NULL, 10, True)
00261 $$ LANGUAGE sql;
00262 
00263 -----------------------------------------------------------------------
00264 -- Help functions
00265 -----------------------------------------------------------------------
00266 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.summary(
00267     input_message            TEXT
00268 )
00269 RETURNS TEXT AS $$
00270 PythonFunctionBodyOnly(`summary', `summary')
00271     return summary.summary_help_message(schema_madlib, input_message)
00272 $$ LANGUAGE plpythonu;
00273 
00274 
00275 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.summary()
00276 RETURNS TEXT AS $$
00277 PythonFunctionBodyOnly(`summary', `summary')
00278     return summary.summary_help_message(schema_madlib, None)
00279 $$ LANGUAGE plpythonu;