MADlib
0.7 A newer version is available
User Documentation
|
00001 /* ----------------------------------------------------------------------- *//** 00002 * 00003 * @file summary.sql_in 00004 * 00005 * @brief Summary function for descriptive statistics 00006 * @date Mar 2013 00007 * 00008 *//* ------------------------------------------------------------------------*/ 00009 00010 m4_include(`SQLCommon.m4') 00011 00012 /** 00013 00014 @addtogroup grp_summary 00015 00016 @about 00017 00018 'summary' is a generic function used to produce summary statistics of any data 00019 table. The function invokes particular 'methods' from the MADlib library to 00020 provide an overview of the data. 00021 00022 @usage 00023 The summary function can be invoked in the following way: 00024 @verbatim 00025 SELECT MADLIB_SCHEMA.summary 00026 ( 00027 source_table TEXT, -- Source table name (Required) 00028 output_table TEXT, -- Output table name (Required) 00029 target_cols TEXT, -- Comma separated columns for which summary is desired 00030 -- (Default: NULL - produces result for all columns) 00031 grouping_cols TEXT, -- Comma separated columns on which to group results 00032 -- (Default: NULL - provides summary on complete table) 00033 get_distinct BOOLEAN, -- Are distinct values required? 00034 -- (Default: True) 00035 get_quartiles BOOLEAN, -- Are quartiles required? 00036 -- (Default: True) 00037 ntile_array FLOAT8[], -- Array of quantile values to compute 00038 -- (Default: NULL - Quantile array not included) 00039 how_many_mfv INTEGER, -- How many most-frequent-values to compute? 00040 -- (Default: 10) 00041 get_estimates BOOLEAN -- Should we produce exact or estimated values? 00042 ) -- (Default: True) 00043 @endverbatim 00044 00045 Note: 00046 - Currently, estimated values are only implemented for the distinct values 00047 computation. 00048 - The '<em>get_estimates</em>' parameter controls computation for two statistics 00049 - If '<em>get_estimates</em>' is True then the distinct value computation is 00050 estimated. Further, the most frequent values computation is computed using 00051 a "quick and dirty" method that does parallel aggregation in GPDB 00052 at the expense of missing some of the most frequent values. 00053 - If '<em>get_estimates</em>' is False then the distinct values are computed 00054 in a slow but exact method. The most frequent values are computed using a 00055 faithful implementation that preserves the approximation guarantees of 00056 the Cormode/Muthukrishnan method (more information in \ref grp_mfvsketch) 00057 00058 00059 The output of the function is a composite type containing: 00060 ouput_table TEXT, -- Name of the output table 00061 row_count INT4, -- Number of rows in the output table 00062 duration FLOAT8 -- Time taken (in seconds) to compute the summary 00063 00064 The summary stastics are stored in the 'output_table' relation provided in the 00065 arguments. The relation 'output_table' can contain the following table 00066 (presence of some columns depends on the argument values) 00067 @verbatim 00068 - group_by_column : Group-by column names (NULL if none provided) 00069 - group_by_value : Values of the Group-by column (NULL if no grouping) 00070 - target_column : Targeted column values for which summary is requested 00071 - column_number : Physical column number for the target column, as described in pg_attribute 00072 - data_type : Data type of target column. Standard GPDB descriptors will be displayed 00073 - row_count : Number of rows for the target column 00074 - distinct_values : Number of distinct values in the target column 00075 - missing_values : Number of missing values in the target column 00076 - blank_values : Number of blank values (blanks are defined by the regular expression '^\w*$') 00077 - fraction_missing : Percentage of total rows that are missing. Will be expressed as a decimal (e.g. 0.3) 00078 - fraction_blank : Percentage of total rows that are blank. Will be expressed as a decimal (e.g. 0.3) 00079 - mean : Mean value of target column (if target is numeric, else NULL) 00080 - variance : Variance of target columns (if target is numeric, else NULL for strings) 00081 - min : Min value of target column (for strings this is the length of the shortest string) 00082 - max : Max value of target column (for strings this is the length of the longest string) 00083 - first_quartile : First quartile (25th percentile, only for numeric columns) 00084 - median : Median value of target column (if target is numeric, else NULL) 00085 - third_quartile : Third quartile (25th percentile, only for numeric columns) 00086 - quantile_array : Percentile values corresponding to ntile_array 00087 - most_frequent_values : Most frequent values 00088 - mfv_frequencies : Frequency of the most frequent values 00089 @endverbatim 00090 00091 The output can be obtained as 00092 @verbatim 00093 sql> SELECT * FROM 'output_table'; 00094 @endverbatim 00095 00096 The usage information can be obtained at any time directly from the 00097 function using 00098 @verbatim 00099 sql> SELECT summary('usage'); 00100 @endverbatim 00101 00102 */ 00103 00104 DROP TYPE IF EXISTS MADLIB_SCHEMA.summary_result; 00105 CREATE TYPE MADLIB_SCHEMA.summary_result AS 00106 ( 00107 ouputtable TEXT, 00108 row_count INT4, 00109 duration FLOAT8 00110 ); 00111 00112 00113 ----------------------------------------------------------------------- 00114 -- Main function for summary 00115 ----------------------------------------------------------------------- 00116 /* 00117 * @brief Compute a summary statistics on a table with optional grouping support 00118 * 00119 * @param source_table Name of source relation containing the data 00120 * @param output_table Name of output table name to store the summary 00121 * @param target_cols String with comma separated list of columns on which summary is desired 00122 * @param grouping_cols String with comma separated list of columns on which to group the data by 00123 * @param get_distinct Should distinct values count be included in result 00124 * @param get_quartiles Should first, second (median), and third quartiles be included in result 00125 * @param ntile_array Array of percentiles to compute 00126 * @param how_many_mfv How many most frequent values to compute? 00127 * @param get_estimates Should distinct counts be an estimated (faster) or exact count? 00128 * 00129 * @usage 00130 * 00131 * <pre> SELECT MADLIB_SCHEMA.summary ( 00132 * '<em>source_table</em>', '<em>output_table</em>', 00133 * '<em>target_cols</em>', '<em>grouping_cols</em>', 00134 * '<em>get_distinct</em>', '<em>get_quartiles</em>', 00135 * '<em>ntile_array</em>', '<em>how_many_mfv</em>', 00136 * '<em>get_estimates</em>' 00137 * ); 00138 * SELECT * FROM '<em>output_table</em>' 00139 * </pre> 00140 */ 00141 CREATE OR REPLACE FUNCTION 00142 MADLIB_SCHEMA.summary 00143 ( 00144 source_table TEXT, -- source table name 00145 output_table TEXT, -- output table name 00146 target_cols TEXT, -- comma separated list of output cols 00147 grouping_cols TEXT, -- comma separated names of grouping cols 00148 get_distinct BOOLEAN, -- Are distinct values required 00149 get_quartiles BOOLEAN, -- Are quartiles required 00150 ntile_array FLOAT8[], -- Array of quantiles to compute 00151 how_many_mfv INTEGER, -- How many most frequent values to compute? 00152 get_estimates BOOLEAN -- Should we produce exact or estimated 00153 -- values for distinct computation 00154 ) 00155 RETURNS MADLIB_SCHEMA.summary_result AS $$ 00156 PythonFunctionBodyOnly(`summary', `summary') 00157 return summary.summary( 00158 schema_madlib, source_table, output_table, target_cols, grouping_cols, 00159 get_distinct, get_quartiles, ntile_array, how_many_mfv, get_estimates) 00160 $$ LANGUAGE plpythonu; 00161 00162 ----------------------------------------------------------------------- 00163 --- Overloaded functions to support optional parameters 00164 ----------------------------------------------------------------------- 00165 CREATE OR REPLACE FUNCTION 00166 MADLIB_SCHEMA.summary 00167 ( 00168 source_table TEXT, 00169 output_table TEXT, 00170 target_cols TEXT, 00171 grouping_cols TEXT, 00172 get_distinct BOOLEAN, 00173 get_quartiles BOOLEAN, 00174 ntile_array FLOAT8[], 00175 how_many_mfv INTEGER 00176 ) 00177 RETURNS MADLIB_SCHEMA.summary_result AS $$ 00178 SELECT MADLIB_SCHEMA.summary( 00179 $1, $2, $3, $4, $5, $6, $7, $8, True) 00180 $$ LANGUAGE sql; 00181 00182 CREATE OR REPLACE FUNCTION 00183 MADLIB_SCHEMA.summary 00184 ( 00185 source_table TEXT, 00186 output_table TEXT, 00187 target_cols TEXT, 00188 grouping_cols TEXT, 00189 get_distinct BOOLEAN, 00190 get_quartiles BOOLEAN, 00191 ntile_array FLOAT8[] 00192 ) 00193 RETURNS MADLIB_SCHEMA.summary_result AS $$ 00194 SELECT MADLIB_SCHEMA.summary( 00195 $1, $2, $3, $4, $5, $6, $7, 10, True) 00196 $$ LANGUAGE sql; 00197 00198 CREATE OR REPLACE FUNCTION 00199 MADLIB_SCHEMA.summary 00200 ( 00201 source_table TEXT, 00202 output_table TEXT, 00203 target_cols TEXT, 00204 grouping_cols TEXT, 00205 get_distinct BOOLEAN, 00206 get_quartiles BOOLEAN 00207 ) 00208 RETURNS MADLIB_SCHEMA.summary_result AS $$ 00209 SELECT MADLIB_SCHEMA.summary( 00210 $1, $2, $3, $4, $5, $6, NULL, 10, True) 00211 $$ LANGUAGE sql; 00212 00213 CREATE OR REPLACE FUNCTION 00214 MADLIB_SCHEMA.summary 00215 ( 00216 source_table TEXT, 00217 output_table TEXT, 00218 target_cols TEXT, 00219 grouping_cols TEXT, 00220 get_distinct BOOLEAN 00221 ) 00222 RETURNS MADLIB_SCHEMA.summary_result AS $$ 00223 SELECT MADLIB_SCHEMA.summary( 00224 $1, $2, $3, $4, $5, True, NULL, 10, True) 00225 $$ LANGUAGE sql; 00226 00227 CREATE OR REPLACE FUNCTION 00228 MADLIB_SCHEMA.summary 00229 ( 00230 source_table TEXT, 00231 output_table TEXT, 00232 target_cols TEXT, 00233 grouping_cols TEXT 00234 ) 00235 RETURNS MADLIB_SCHEMA.summary_result AS $$ 00236 SELECT MADLIB_SCHEMA.summary( 00237 $1, $2, $3, $4, True, True, NULL, 10, True) 00238 $$ LANGUAGE sql; 00239 00240 CREATE OR REPLACE FUNCTION 00241 MADLIB_SCHEMA.summary 00242 ( 00243 source_table TEXT, 00244 output_table TEXT, 00245 target_cols TEXT 00246 ) 00247 RETURNS MADLIB_SCHEMA.summary_result AS $$ 00248 SELECT MADLIB_SCHEMA.summary( 00249 $1, $2, $3, NULL, True, True, NULL, 10, True) 00250 $$ LANGUAGE sql; 00251 00252 CREATE OR REPLACE FUNCTION 00253 MADLIB_SCHEMA.summary 00254 ( 00255 source_table TEXT, 00256 output_table TEXT 00257 ) 00258 RETURNS MADLIB_SCHEMA.summary_result AS $$ 00259 SELECT MADLIB_SCHEMA.summary( 00260 $1, $2, NULL, NULL, True, True, NULL, 10, True) 00261 $$ LANGUAGE sql; 00262 00263 ----------------------------------------------------------------------- 00264 -- Help functions 00265 ----------------------------------------------------------------------- 00266 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.summary( 00267 input_message TEXT 00268 ) 00269 RETURNS TEXT AS $$ 00270 PythonFunctionBodyOnly(`summary', `summary') 00271 return summary.summary_help_message(schema_madlib, input_message) 00272 $$ LANGUAGE plpythonu; 00273 00274 00275 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.summary() 00276 RETURNS TEXT AS $$ 00277 PythonFunctionBodyOnly(`summary', `summary') 00278 return summary.summary_help_message(schema_madlib, None) 00279 $$ LANGUAGE plpythonu;